readability.go

Chinese documentation

Go implementation of Mozilla Readability, aiming for fixture-level behavior compatibility with mozilla/readability.

Status

This project is at the compatibility porting stage.

  • Mozilla test/test-pages fixtures are copied into testdata/test-pages.
  • The upstream fixture source is pinned in testdata/UPSTREAM.
  • Metadata and content comparisons are wired up for all pinned Mozilla fixtures.
  • Full compatibility benchmarks can be run with READABILITY_FULL_COMPAT=1 go test -cover -count=1 -run 'TestParseAllMozilla(Metadata|Content)Fixtures'.

The implementation is intentionally self-contained and does not depend on other Go Readability ports. Current work is focused on general Readability heuristics that are checked against upstream fixtures without hard-coding those fixtures into production logic.

Development

  • Run make all for the default quality gate.
  • Run make test for the default test suite, race detector, coverage summary, and full Mozilla compatibility drift report.
  • Run make vet for static checks.

Current Upstream Drift

tools/compare-upstream.mjs compares this implementation with the current Mozilla Readability checkout. Some differences are intentionally left open when chasing current upstream would either break pinned fixtures or require site-specific behavior. The machine-readable allowlist lives in tools/known-upstream-drift.json. Pass --known-drift to the compare tool to allow only these documented differences while still failing on new drift:

node tools/compare-upstream.mjs --all --char-threshold 1 --known-drift

Only add or change known drift entries after confirming the difference is not a general parser bug and documenting why matching current upstream would be less correct for this port or would break pinned fixtures.

  • firefox-nightly-blog and medicalnewstoday: current upstream selects newsletter or print-message blocks, while the pinned fixtures and this port keep the article body.
  • hukumusume: current upstream now returns a shorter legacy table extraction; the pinned fixture preserves the wider legacy table content.
  • lifehacker-post-comment-load and lifehacker-working: remaining drift is textContent whitespace around block boundaries. A global text-content rewrite regresses many other fixtures, so this should wait for a parser-level whitespace model rather than a fixture-specific shortcut.
  • wikipedia: current upstream serializes the first infobox without the parser-inserted <tbody>. Many pinned fixtures contain explicit <tbody> markup, so this needs an implicit-vs-explicit table-section strategy before changing serialization.
  • cnn: current upstream keeps the outer smartassetcontainer (with only the "Powered by SmartAsset.com" attribution paragraph) while stripping the nested iframe/script payload. This port's embed cleanup removes the whole subtree. Reviewed under a 30-minute time box: replicating upstream would require a site-specific "attribution-bearing embed wrapper" heuristic that risks regressing other widget/embed fixtures (calculators, social embeds, chart containers), so the drift is intentionally left in place.

Project Position

This port deliberately optimizes for fixture-level compatibility with the pinned mozilla/readability checkout rather than tracking upstream HEAD or matching any other Go port byte-for-byte. Concrete consequences:

  • The 130 Mozilla test/test-pages fixtures are the regression suite. Any change must keep them green; deviations from current upstream that would break a pinned fixture are documented in tools/known-upstream-drift.json instead of being chased blindly.
  • Compatibility behaviors that come from a single fixture (a CMS quirk, a news-site template) live in compat.go / legacy.go and are kept out of the generic parser flow so they cannot leak into other paths.
  • No dependency on other Go Readability ports. Algorithms are re-implemented from the upstream source so behavioral differences are intentional and documented, not inherited.

If you need a port that tracks upstream HEAD aggressively, or one that ships extra heuristics on top, this is not it. If you need predictable behavior pinned to a known mozilla/readability snapshot with a machine-checkable drift report, this is the right project.

Implementation Layout

The public entry point lives in article.go. The parser implementation is split by responsibility:

  • extract.go coordinates article extraction and fallback selection.
  • score.go scores article candidates and builds the final content tree.
  • clean.go, condition.go, normalize.go, and media.go clean and normalize extracted content.
  • compat.go and legacy.go hold fixture-proven compatibility behavior that is intentionally kept separate from the generic parser flow.
  • metadata.go, excerpt.go, and byline.go extract document metadata.
  • dom.go and url.go provide DOM and URL helpers used across the parser.

Architecture

flowchart TD
    User["Caller / CLI"] --> API["Public API<br/>FromReader / IsProbablyReaderable"]

    API --> Full["Full extraction<br/>FromReader"]
    API --> Probe["Fast pre-check<br/>IsProbablyReaderable"]

    Probe --> Readerable["readerable.go<br/>candidate scan<br/>visibility filters<br/>text-length scoring"]

    Full --> Parse["Parse HTML with goquery<br/>MaxElemsToParse guard"]
    Parse --> Meta["metadata.go<br/>JSON-LD / meta / title<br/>site name / published time"]
    Parse --> Byline["byline.go<br/>capture source byline<br/>before cleanup"]
    Parse --> Extract["extract.go<br/>article extraction coordinator"]

    Extract --> PreClean["Pre-cleanup<br/>scripts/styles/noscript<br/>font to span<br/>br normalization<br/>hidden node removal"]
    PreClean --> URL["url.go<br/>resolve relative URLs"]
    PreClean --> Legacy["legacy.go<br/>legacy table layout path"]
    PreClean --> Explicit["explicit articleBody<br/>description block path"]
    PreClean --> Score["score.go<br/>Readability candidate scoring"]

    Score --> Prepare["prepareArticleScoring<br/>remove unlikely nodes<br/>promote div to p<br/>deduplicate title headers"]
    Prepare --> Candidate["Candidate scoring<br/>paragraph text + commas + length<br/>propagate score to ancestors"]
    Candidate --> Refine["Candidate refinement<br/>shared ancestor promotion<br/>parent promotion<br/>sibling merge"]

    Legacy --> Clean["clean.go<br/>article cleanup pipeline"]
    Explicit --> Clean
    Refine --> Clean

    Clean --> Condition["condition.go<br/>conditional cleanup<br/>link density / media / table checks"]
    Clean --> Normalize["normalize.go<br/>structure normalization<br/>br / table / nested elements"]
    Clean --> Media["media.go<br/>lazy images<br/>embed/video/audio filtering"]
    Clean --> Compat["compat.go<br/>fixture-proven compatibility fixes"]

    Condition --> ArticleTree["Final article DOM<br/>readability-content"]
    Normalize --> ArticleTree
    Media --> ArticleTree
    Compat --> ArticleTree

    ArticleTree --> Serialize["dom.go<br/>HTML serialization<br/>entity normalization"]
    Meta --> Result["Article result"]
    Byline --> Result
    Serialize --> Result

    Result --> Fields["Title / Content / TextContent<br/>Length / Excerpt / Byline<br/>Dir / SiteName / Lang / PublishedTime"]

    CLI["cmd/readability"] --> API
    CLI --> Render["Output formats<br/>text / html / json / markdown"]
sequenceDiagram
    participant C as Caller
    participant A as article.go
    participant M as metadata.go
    participant E as extract.go
    participant S as score.go
    participant CL as clean.go
    participant R as Article

    C->>A: FromReader(html, pageURL, options)
    A->>A: Read input and parse goquery Document
    A->>M: Extract metadata, title, excerpt, site name
    A->>A: Capture source byline from pristine DOM
    A->>E: extractArticleContent(doc, pageURL, title, cfg)
    E->>E: Pre-clean, resolve URLs, clone fallbackDoc
    E->>S: Run standard candidate scoring
    S-->>E: readability-content candidate DOM
    E->>CL: Clean candidate article
    CL-->>E: Clean article DOM
    E-->>A: content selection
    A->>A: Build TextContent, Excerpt, Dir, Lang
    A-->>R: Article
    R-->>C: Return content and metadata

Usage

CLI

The cmd/readability command extracts the readable article from a URL, an HTML file, or stdin:

go run ./cmd/readability https://example.com/post
go run ./cmd/readability article.html --url https://example.com/post --format json
cat article.html | go run ./cmd/readability - --url https://example.com/post --format md --metadata

Supported output formats are text (default), html, json, and markdown / md. Markdown output targets GitHub Flavored Markdown; --metadata adds YAML front matter to it.

Library

package main

import (
	"fmt"
	"log"
	"os"

	readability "github.com/miclle/readability.go"
)

func main() {
	f, err := os.Open("article.html")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	article, err := readability.FromReader(f, "https://example.com/article", nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(article.Title)
	fmt.Println(article.TextContent)
}

Options

FromReader accepts an optional *Options to mirror the upstream parser configuration knobs:

opts := &readability.Options{
	CharThreshold:       500,                                    // minimum extracted-text length before ErrBelowCharThreshold
	ClassesToPreserve:   []string{"caption", "highlight"},       // extra classes kept during cleanup
	KeepClasses:         false,                                  // true keeps every class attribute
	NbTopCandidates:     5,                                      // candidate pool size during scoring
	DisableJSONLD:       false,                                  // skip JSON-LD metadata extraction
	AllowedVideoRegex:   nil,                                    // override built-in video allow list
	MaxElemsToParse:     0,                                      // > 0 aborts on huge documents with ErrTooManyElements
	LinkDensityModifier: 0,                                      // shifts conditional-cleanup link-density thresholds (positive = looser)
}
article, err := readability.FromReader(f, pageURL, opts)

When MaxElemsToParse is exceeded the call returns readability.ErrTooManyElements. When the extracted text is shorter than CharThreshold, the call returns readability.ErrBelowCharThreshold along with a zero-value Article. Use errors.Is to distinguish that case from other failures.

Probing readerability

For a fast pre-check that does not run the full extractor:

ok, err := readability.IsProbablyReaderable(f)
if err != nil {
	log.Fatal(err)
}
if !ok {
	return
}

IsProbablyReaderable accepts an optional ReaderableOptions to tune MinContentLength and MinScore.

Benchmarks

A benchmark suite covering small / medium / large / visibility-heavy fixtures lives in bench_test.go:

make bench                       # quick run, no baseline write
make bench-baseline              # refresh testdata/bench-baseline.txt (6 samples)
make bench-compare               # benchstat current vs committed baseline

testdata/bench-baseline.txt is captured on the maintainer's hardware (Apple M4 Pro, darwin/arm64) and is intended for local developer reference only. CI uses a dynamic baseline: the bench-compare job records both origin/main and the PR head on the same runner and posts a benchstat report as a build artifact. This neutralizes the CPU / scheduler variance that would otherwise make a hardware-pinned baseline unreliable in CI.

A regression gate (tools/bench-regression-gate.sh) parses the benchstat CSV output and fails the job when any benchmark regresses by ≥ 10% AND benchstat marks the change as statistically significant (p < 0.05). Improvements and noise (~) are ignored. The threshold is loose on purpose — GitHub-hosted runners are noisy and tighter limits produce false positives more often than they catch real regressions; tune it in .github/workflows/ci.yml if your fork has access to dedicated runners.

Direct go test invocation is still supported for ad-hoc runs:

go test -bench=. -benchmem -benchtime=2s -run=^$

Fuzzing

fuzz_test.go defines fuzz harnesses for both FromReader and IsProbablyReaderable. They check that arbitrary HTML byte sequences do not trigger panics or unexpected (non-sentinel) errors:

go test -run=^$ -fuzz=FuzzFromReader -fuzztime=30s .
go test -run=^$ -fuzz=FuzzIsProbablyReaderable -fuzztime=30s .

Run them in CI or locally before shipping changes that touch parsing, cleanup, or visibility logic.

Upstream Test Data

Compatibility fixtures under testdata/test-pages are copied from Mozilla Readability and are licensed under the Apache License, Version 2.0. See NOTICE and testdata/UPSTREAM for source and copyright details.

License

Apache License 2.0.
