Go implementation of Mozilla Readability, aiming for fixture-level behavior
compatibility with mozilla/readability.
This project is at the compatibility porting stage.

- Mozilla `test/test-pages` fixtures are copied into `testdata/test-pages`.
- The upstream fixture source is pinned in `testdata/UPSTREAM`.
- Metadata and content comparison are wired up for all pinned Mozilla fixtures.
- Full compatibility benchmarks can be run with
  `READABILITY_FULL_COMPAT=1 go test -cover -count=1 -run 'TestParseAllMozilla(Metadata|Content)Fixtures'`.
The implementation is intentionally self-contained and does not depend on other Go Readability ports. Current work is focused on general Readability heuristics that are checked against upstream fixtures without hard-coding those fixtures into production logic.
- Run `make all` for the default quality gate.
- Run `make test` for the default test suite, race detector, coverage summary, and full Mozilla compatibility drift report.
- Run `make vet` for static checks.
`tools/compare-upstream.mjs` compares this implementation with the current
Mozilla Readability checkout. Some differences are intentionally left open when
chasing current upstream would either break pinned fixtures or require
site-specific behavior. The machine-readable allowlist lives in
`tools/known-upstream-drift.json`. Pass `--known-drift` to the compare tool to
allow only these documented differences while still failing on new drift:

```sh
node tools/compare-upstream.mjs --all --char-threshold 1 --known-drift
```

Only add or change known-drift entries after confirming the difference is not a general parser bug and documenting why matching current upstream would be less correct for this port or would break pinned fixtures.
- `firefox-nightly-blog` and `medicalnewstoday`: current upstream selects newsletter or print-message blocks, while the pinned fixtures and this port keep the article body.
- `hukumusume`: current upstream now returns a shorter legacy table extraction; the pinned fixture preserves the wider legacy table content.
- `lifehacker-post-comment-load` and `lifehacker-working`: remaining drift is `textContent` whitespace around block boundaries. A global text-content rewrite regresses many other fixtures, so this should wait for a parser-level whitespace model rather than a fixture-specific shortcut.
- `wikipedia`: current upstream serializes the first infobox without the parser-inserted `<tbody>`. Many pinned fixtures contain explicit `<tbody>` markup, so this needs an implicit-vs-explicit table-section strategy before changing serialization.
- `cnn`: current upstream keeps the outer `smartasset` container (with only the "Powered by SmartAsset.com" attribution paragraph) while stripping the nested iframe/script payload. This port's embed cleanup removes the whole subtree. Reviewed under a 30-minute time box: replicating upstream would require a site-specific "attribution-bearing embed wrapper" heuristic that risks regressing other widget/embed fixtures (calculators, social embeds, chart containers), so the drift is intentionally left in place.
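For illustration only, an allowlist entry might look like the sketch below. The key layout and field names (`reason`, `status`) are hypothetical assumptions, not the actual schema of `tools/known-upstream-drift.json`; check that file for the real format:

```json
{
  "cnn": {
    "reason": "upstream keeps the smartasset attribution wrapper; this port's embed cleanup removes the whole subtree",
    "status": "intentional drift after time-boxed review"
  }
}
```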
This port deliberately optimizes for fixture-level compatibility with the pinned mozilla/readability checkout rather than tracking upstream HEAD or matching any other Go port byte-for-byte. Concrete consequences:
- The 130 Mozilla `test/test-pages` fixtures are the regression suite. Any change must keep them green; deviations from current upstream that would break a pinned fixture are documented in `tools/known-upstream-drift.json` instead of being chased blindly.
- Compatibility behaviors that come from a single fixture (a CMS quirk, a news-site template) live in `compat.go` / `legacy.go` and are kept out of the generic parser flow so they cannot leak into other paths.
- No dependency on other Go Readability ports. Algorithms are re-implemented from the upstream source, so behavioral differences are intentional and documented, not inherited.
If you need a port that tracks upstream HEAD aggressively, or one that ships extra heuristics on top, this is not it. If you need predictable behavior pinned to a known mozilla/readability snapshot with a machine-checkable drift report, this is the right project.
The public entry point lives in `article.go`. The parser implementation is
split by responsibility:

- `extract.go` coordinates article extraction and fallback selection.
- `score.go` scores article candidates and builds the final content tree.
- `clean.go`, `condition.go`, `normalize.go`, and `media.go` clean and normalize extracted content.
- `compat.go` and `legacy.go` hold fixture-proven compatibility behavior that is intentionally kept separate from the generic parser flow.
- `metadata.go`, `excerpt.go`, and `byline.go` extract document metadata.
- `dom.go` and `url.go` provide DOM and URL helpers used across the parser.
```mermaid
flowchart TD
    User["Caller / CLI"] --> API["Public API<br/>FromReader / IsProbablyReaderable"]
    API --> Full["Full extraction<br/>FromReader"]
    API --> Probe["Fast pre-check<br/>IsProbablyReaderable"]
    Probe --> Readerable["readerable.go<br/>candidate scan<br/>visibility filters<br/>text-length scoring"]
    Full --> Parse["Parse HTML with goquery<br/>MaxElemsToParse guard"]
    Parse --> Meta["metadata.go<br/>JSON-LD / meta / title<br/>site name / published time"]
    Parse --> Byline["byline.go<br/>capture source byline<br/>before cleanup"]
    Parse --> Extract["extract.go<br/>article extraction coordinator"]
    Extract --> PreClean["Pre-cleanup<br/>scripts/styles/noscript<br/>font to span<br/>br normalization<br/>hidden node removal"]
    PreClean --> URL["url.go<br/>resolve relative URLs"]
    PreClean --> Legacy["legacy.go<br/>legacy table layout path"]
    PreClean --> Explicit["explicit articleBody<br/>description block path"]
    PreClean --> Score["score.go<br/>Readability candidate scoring"]
    Score --> Prepare["prepareArticleScoring<br/>remove unlikely nodes<br/>promote div to p<br/>deduplicate title headers"]
    Prepare --> Candidate["Candidate scoring<br/>paragraph text + commas + length<br/>propagate score to ancestors"]
    Candidate --> Refine["Candidate refinement<br/>shared ancestor promotion<br/>parent promotion<br/>sibling merge"]
    Legacy --> Clean["clean.go<br/>article cleanup pipeline"]
    Explicit --> Clean
    Refine --> Clean
    Clean --> Condition["condition.go<br/>conditional cleanup<br/>link density / media / table checks"]
    Clean --> Normalize["normalize.go<br/>structure normalization<br/>br / table / nested elements"]
    Clean --> Media["media.go<br/>lazy images<br/>embed/video/audio filtering"]
    Clean --> Compat["compat.go<br/>fixture-proven compatibility fixes"]
    Condition --> ArticleTree["Final article DOM<br/>readability-content"]
    Normalize --> ArticleTree
    Media --> ArticleTree
    Compat --> ArticleTree
    ArticleTree --> Serialize["dom.go<br/>HTML serialization<br/>entity normalization"]
    Meta --> Result["Article result"]
    Byline --> Result
    Serialize --> Result
    Result --> Fields["Title / Content / TextContent<br/>Length / Excerpt / Byline<br/>Dir / SiteName / Lang / PublishedTime"]
    CLI["cmd/readability"] --> API
    CLI --> Render["Output formats<br/>text / html / json / markdown"]
```
```mermaid
sequenceDiagram
    participant C as Caller
    participant A as article.go
    participant M as metadata.go
    participant E as extract.go
    participant S as score.go
    participant CL as clean.go
    participant R as Article
    C->>A: FromReader(html, pageURL, options)
    A->>A: Read input and parse goquery Document
    A->>M: Extract metadata, title, excerpt, site name
    A->>A: Capture source byline from pristine DOM
    A->>E: extractArticleContent(doc, pageURL, title, cfg)
    E->>E: Pre-clean, resolve URLs, clone fallbackDoc
    E->>S: Run standard candidate scoring
    S-->>E: readability-content candidate DOM
    E->>CL: Clean candidate article
    CL-->>E: Clean article DOM
    E-->>A: content selection
    A->>A: Build TextContent, Excerpt, Dir, Lang
    A-->>R: Article
    R-->>C: Return content and metadata
```
The `cmd/readability` command extracts the readable article from a URL, an
HTML file, or stdin:

```sh
go run ./cmd/readability https://example.com/post
go run ./cmd/readability article.html --url https://example.com/post --format json
cat article.html | go run ./cmd/readability - --url https://example.com/post --format md --metadata
```

Supported output formats are `text` (default), `html`, `json`, and
`markdown` / `md`. Markdown output targets GitHub Flavored Markdown, and
`--metadata` adds YAML front matter to Markdown output.
```go
package main

import (
	"fmt"
	"log"
	"os"

	readability "github.com/miclle/readability.go"
)

func main() {
	f, err := os.Open("article.html")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	article, err := readability.FromReader(f, "https://example.com/article", nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(article.Title)
	fmt.Println(article.TextContent)
}
```

`FromReader` accepts an optional `*Options` to mirror the upstream parser
configuration knobs:
```go
opts := &readability.Options{
	CharThreshold:       500,                              // skip docs shorter than 500 chars
	ClassesToPreserve:   []string{"caption", "highlight"}, // extra classes kept during cleanup
	KeepClasses:         false,                            // true keeps every class attribute
	NbTopCandidates:     5,                                // candidate pool size during scoring
	DisableJSONLD:       false,                            // skip JSON-LD metadata extraction
	AllowedVideoRegex:   nil,                              // override built-in video allow list
	MaxElemsToParse:     0,                                // > 0 aborts on huge documents with ErrTooManyElements
	LinkDensityModifier: 0,                                // shifts conditional-cleanup link-density thresholds (positive = looser)
}

article, err := readability.FromReader(f, pageURL, opts)
```

When `MaxElemsToParse` is exceeded, the call returns `readability.ErrTooManyElements`.
When the extracted text is shorter than `CharThreshold`, the call returns
`readability.ErrBelowCharThreshold` along with a zero-value `Article`. Use
`errors.Is` to distinguish that case from other failures.
For a fast pre-check that does not run the full extractor:

```go
ok, err := readability.IsProbablyReaderable(f)
if err != nil {
	log.Fatal(err)
}
if !ok {
	return
}
```

`IsProbablyReaderable` accepts an optional `ReaderableOptions` to tune
`MinContentLength` and `MinScore`.
A benchmark suite covering small / medium / large / visibility-heavy fixtures
lives in `bench_test.go`:

```sh
make bench          # quick run, no baseline write
make bench-baseline # refresh testdata/bench-baseline.txt (6 samples)
make bench-compare  # benchstat current vs committed baseline
```

`testdata/bench-baseline.txt` is captured on the maintainer's hardware
(Apple M4 Pro, darwin/arm64) and is intended for local developer reference
only. CI uses a dynamic baseline: the bench-compare job records both
`origin/main` and the PR head on the same runner and posts a benchstat
report as a build artifact. This neutralizes the CPU / scheduler variance
that would otherwise make a hardware-pinned baseline unreliable in CI.
A regression gate (`tools/bench-regression-gate.sh`) parses the benchstat
CSV output and fails the job when any benchmark regresses by ≥ 10% AND
benchstat marks the change as statistically significant (p < 0.05).
Improvements and noise (`~`) are ignored. The threshold is loose on
purpose: GitHub-hosted runners are noisy, and tighter limits produce
false positives more often than they catch real regressions; tune it in
`.github/workflows/ci.yml` if your fork has access to dedicated runners.
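The gating rule itself ("≥ 10% regression AND p < 0.05") can be sketched in a few lines of awk over a simplified CSV. The column layout below (`name,delta_percent,p_value`) is a hypothetical simplification for illustration; it is not the real benchstat CSV format that `tools/bench-regression-gate.sh` actually parses:

```shell
#!/bin/sh
# Sketch of the regression gate: fail only when a benchmark is both
# >= 10% slower and statistically significant (p < 0.05).
# Input: hypothetical CSV "name,delta_percent,p_value" (header + rows).
gate() {
  awk -F, '
    NR > 1 && $2 + 0 >= 10 && $3 + 0 < 0.05 {
      printf "REGRESSION: %s (+%s%%, p=%s)\n", $1, $2, $3
      bad = 1
    }
    END { exit bad }
  '
}

# Demo: Parse regresses significantly; Probe is noise (p = 0.40) and passes.
printf 'name,delta_percent,p_value\nBenchmarkParse,12.3,0.01\nBenchmarkProbe,15.0,0.40\n' \
  | gate || echo "gate would fail the CI job"
```

The `END { exit bad }` makes the function's exit status the CI signal: nonzero only when at least one significant regression was printed.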
Direct `go test` invocation is still supported for ad-hoc runs:

```sh
go test -bench=. -benchmem -benchtime=2s -run=^$
```

`fuzz_test.go` defines fuzz harnesses for both `FromReader` and
`IsProbablyReaderable`. They check that arbitrary HTML byte sequences do not
trigger panics or unexpected (non-sentinel) errors:

```sh
go test -run=^$ -fuzz=FuzzFromReader -fuzztime=30s .
go test -run=^$ -fuzz=FuzzIsProbablyReaderable -fuzztime=30s .
```

Run them in CI or locally before shipping changes that touch parsing, cleanup, or visibility logic.
Compatibility fixtures under `testdata/test-pages` are copied from
Mozilla Readability and are licensed under the Apache License, Version 2.0.
See `NOTICE` and `testdata/UPSTREAM` for source and copyright details.
Apache License 2.0.