Rules-first · ML-assisted · LLM-optional · Offline-ready
CleanBook is an open-source, offline-first bookmark cleaning and classification tool. It transforms chaotic browser bookmark collections into well-organized, categorized libraries using a hybrid approach that prioritizes rules, enhances with machine learning, and optionally leverages LLM capabilities.
| Feature | Description |
|---|---|
| 🚀 Offline-First | The complete pipeline runs locally without cloud services, which suits local batch processing and long-term maintenance |
| 🤖 Hybrid Classification | Rule engine + ML classifier (91.4% accuracy) + optional LLM fallback. Degrades automatically when a service is unavailable |
| ⚙️ Configuration-Driven | Customize rules, thresholds, and vocabularies via JSON/YAML; no code changes required |
| 📦 Multi-Format Export | Export to HTML (Netscape), Markdown (reports), and JSON (structured data) |
| 🔧 CLI + Wizard | Command-line tool for automation; interactive wizard for a guided experience |
| 🎯 Smart Deduplication | URL normalization and multi-dimensional similarity detection |
| 💾 LRU Caching | Intelligent caching with automatic eviction for optimal performance |
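
To make the "URL normalization" part of Smart Deduplication concrete, here is a minimal sketch of one way to normalize URLs and drop duplicates. It is not CleanBook's actual implementation: the function names `normalize_url` and `deduplicate`, the `TRACKING_PARAMS` list, and the specific normalization steps are all assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of tracking parameters to ignore; the real tool may use a different list.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection (hypothetical helper)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname excludes userinfo and port
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Treat "/docs/" and "/docs" as the same path; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    query_pairs = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # The fragment is discarded because it rarely identifies a different page.
    return urlunsplit((scheme, netloc, path, urlencode(query_pairs), ""))


def deduplicate(urls):
    """Keep the first occurrence of each normalized URL, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

This sketch covers only exact matches after normalization. The "multi-dimensional similarity detection" mentioned in the table would need additional signals, such as title similarity, that are not shown here.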
# Via pipx (recommended for isolated environment)
pipx install cleanbook
# Via pip
pip install cleanbook
# From source
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner
pip install .
# Process a bookmark HTML file
cleanbook -i bookmarks.html -o output/
# Interactive wizard mode
cleanbook-wizard
# With ML training enabled
cleanbook -i bookmarks.html --train
# Health check
cleanbook --health-check
HTML Bookmarks
↓
┌─────────────────────────────────────────────────────┐
│ 1. Rule Engine (Fast, 0.1ms, Priority: 0.3) │
│ Domain/Title/URL pattern matching │
├─────────────────────────────────────────────────────┤
│ 2. ML Classifier (91.4% accuracy, Priority: 0.25) │
│ TF-IDF + Ensemble (RF + LR + Naive Bayes) │
├─────────────────────────────────────────────────────┤
│ 3. Semantic Analysis (Priority: 0.2) │
│ Word vectors, TF-IDF similarity │
├─────────────────────────────────────────────────────┤
│ 4. LLM Classifier (Optional, Priority: 0.15) │
│ OpenAI-compatible API with auto-fallback │
└─────────────────────────────────────────────────────┘
↓
Weighted Voting Fusion → Organized Output
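
The weighted voting step at the bottom of the diagram can be sketched roughly as follows. The weights come from the "Priority" values shown above; everything else is an assumption made for illustration, including the `fuse` function, the classifier keys, and the choice to multiply each weight by the classifier's confidence. CleanBook's real fusion logic may differ.

```python
from collections import defaultdict

# Weights copied from the "Priority" values in the diagram above.
WEIGHTS = {"rule": 0.3, "ml": 0.25, "semantic": 0.2, "llm": 0.15}


def fuse(predictions):
    """Combine per-classifier votes into one category (hypothetical helper).

    `predictions` maps a classifier name to a (category, confidence) tuple,
    or to None when that classifier is disabled or unavailable.
    Returns (category, normalized_score), or (None, 0.0) when nothing voted.
    """
    scores = defaultdict(float)
    for name, prediction in predictions.items():
        if prediction is None:
            continue  # e.g. the optional LLM is switched off
        category, confidence = prediction
        scores[category] += WEIGHTS[name] * confidence
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    # Normalize so the score is the winner's share of all weighted votes.
    return best, scores[best] / total


if __name__ == "__main__":
    votes = {
        "rule": ("Technology/AI", 0.9),
        "ml": ("Technology/AI", 0.8),
        "semantic": ("News", 0.6),
        "llm": None,
    }
    print(fuse(votes))
```

Skipping classifiers that return `None` is one simple way to get the automatic degradation described earlier: the remaining votes are renormalized, so the result stays comparable whether or not the LLM stage ran.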
┌──────────────┐    ┌───────────────────┐    ┌──────────────────┐
│    Input     │───▶│      Process      │───▶│      Output      │
│  bookmarks   │    │  ┌─────────────┐  │    │  bookmarks.html  │
│  .html       │    │  │ Parse       │  │    │  bookmarks.json  │
└──────────────┘    │  │ Deduplicate │  │    │  report.md       │
                    │  │ Classify    │  │    └──────────────────┘
                    │  │ Organize    │  │
                    │  └─────────────┘  │
                    └───────────────────┘
                              ↓
                      ┌──────────────┐
                      │    Config    │
                      │  ├─ Rules    │
                      │  ├─ ML Model │
                      │  └─ Taxonomy │
                      └──────────────┘
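
The Rules part of the configuration feeds the first classification stage. As a rough sketch of how weighted keyword rules might be scored, the snippet below reuses the `Technology/AI` rules from the sample configuration given later in this README. The `score_bookmark` function and its matching semantics (suffix match on the domain, case-insensitive substring match on the title) are assumptions, not the project's documented behavior.

```python
from urllib.parse import urlsplit

# Rule data mirrors the sample configuration in this README.
RULES = {
    "Technology/AI": [
        {"match": "domain",
         "keywords": ["openai.com", "huggingface.co", "arxiv.org"],
         "weight": 15},
        {"match": "title",
         "keywords": ["GPT", "LLM", "neural network"],
         "weight": 10},
    ],
}


def score_bookmark(url, title, rules=RULES):
    """Return {category: total weight of matching rules} (hypothetical helper)."""
    domain = (urlsplit(url).hostname or "").lower()
    title_lower = title.lower()
    scores = {}
    for category, rule_list in rules.items():
        total = 0
        for rule in rule_list:
            if rule["match"] == "domain":
                # Match the domain itself or any subdomain of a keyword.
                hit = any(domain == kw or domain.endswith("." + kw)
                          for kw in rule["keywords"])
            elif rule["match"] == "title":
                # Naive case-insensitive substring match; can over-match short keywords.
                hit = any(kw.lower() in title_lower for kw in rule["keywords"])
            else:
                hit = False  # unknown match types are ignored in this sketch
            if hit:
                total += rule["weight"]
        if total:
            scores[category] = total
    return scores


if __name__ == "__main__":
    print(score_bookmark("https://platform.openai.com/docs", "GPT best practices"))
```

Each rule contributes its weight at most once per bookmark here. Whether the real engine counts multiple keyword hits, or how it converts a raw score into a confidence, is not stated in this README.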
| Resource | Link |
|---|---|
| Homepage | lessup.github.io/bookmarks-cleaner |
| Quick Start | /en/quickstart |
| Best Practices | /en/guide/best-practices |
| Architecture | /en/design/architecture |
| Development | /en/guide/development |
| API Reference | /en/reference/llm-templates |
{
"category_rules": {
"Technology/AI": {
"rules": [
{
"match": "domain",
"keywords": ["openai.com", "huggingface.co", "arxiv.org"],
"weight": 15
},
{
"match": "title",
"keywords": ["GPT", "LLM", "neural network"],
"weight": 10
}
]
}
},
"ai_settings": {
"confidence_threshold": 0.7,
"cache_size": 10000,
"max_workers": 4
},
"llm": {
"enable": false,
"provider": "openai",
"model": "gpt-4o-mini"
}
}
| Metric | Value |
|---|---|
| Classification Accuracy | 91.4% |
| Processing Speed | ~50 bookmarks/second |
| Cache Hit Rate | 87-92% |
| Memory (baseline) | ~45MB |
| Memory (1000 bookmarks) | ~125MB |
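
The cache hit rate in the table above relates to the LRU caching feature and the `cache_size` setting in the configuration example. For readers unfamiliar with the technique, the following is a minimal LRU cache sketch built on `collections.OrderedDict`. The `LRUCache` class and its hit-rate counters are illustrative assumptions and are not taken from CleanBook's source.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry (illustrative)."""

    def __init__(self, capacity=10000):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        """Insert or update an entry, evicting the oldest one if over capacity."""
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

    @property
    def hit_rate(self):
        """Fraction of lookups served from the cache."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("https://example.com/a", "Technology/AI")
    cache.put("https://example.com/b", "News")
    cache.get("https://example.com/a")            # hit, "a" becomes most recent
    cache.put("https://example.com/c", "Tools")   # evicts "b"
    print(cache.get("https://example.com/b"), cache.hit_rate)
```

For caching the result of a pure function, the standard library's `functools.lru_cache` decorator offers the same eviction policy with less code; an explicit class is shown only because it makes the hit-rate bookkeeping visible.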
# Clone repository
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner
# Create virtual environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'feat: add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
This project is licensed under the MIT License.
- Inspired by the need for efficient personal knowledge management
- Built with scikit-learn, BeautifulSoup, and Rich
Made with ❤️ by LessUp