LessUp/bookmarks-cleaner

CleanBook — Smart Bookmark Cleaning & Classification

Python 3.10+ | License: MIT | Platform | CI | Documentation

Rules-first · ML-assisted · LLM-optional · Offline-ready

简体中文 | Documentation | Releases


CleanBook is an open-source, offline-first bookmark cleaning and classification tool. It transforms chaotic browser bookmark collections into well-organized, categorized libraries using a hybrid approach that prioritizes rules, enhances with machine learning, and optionally leverages LLM capabilities.

✨ Features

| Feature | Description |
| --- | --- |
| 🚀 Offline-First | The complete pipeline runs locally without cloud services, which suits local batch processing and long-term maintenance |
| 🤖 Hybrid Classification | Rule engine + ML classifier (91.4% accuracy) + optional LLM fallback. Degrades automatically when a service is unavailable |
| ⚙️ Configuration-Driven | Customize rules, thresholds, and vocabularies via JSON/YAML, with no code changes required |
| 📦 Multi-Format Export | Export to HTML (Netscape), Markdown (reports), and JSON (structured data) |
| 🔧 CLI + Wizard | Command-line tool for automation, interactive wizard for a guided experience |
| 🎯 Smart Deduplication | URL normalization and multi-dimensional similarity detection |
| 💾 LRU Caching | Caching with automatic least-recently-used eviction |
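
To make the deduplication feature concrete, here is a minimal sketch of URL normalization followed by duplicate removal. It is not CleanBook's actual implementation: the function names `normalize_url` and `deduplicate`, the `TRACKING_PARAMS` set, and the specific normalization rules (lowercasing, dropping `www.`, trailing slashes, fragments, and tracking parameters) are assumptions chosen for illustration, and the similarity detection mentioned above is not covered.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters to drop; not taken from CleanBook.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default:
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ))
    # The fragment is discarded on purpose (empty last component).
    return urlunsplit((scheme, host, path, query, ""))


def deduplicate(urls):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Keeping the first occurrence preserves the original bookmark order; a real tool might instead prefer the entry with the best title or the most recent date.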

🚀 Quick Start

Installation

# Via pipx (recommended for isolated environment)
pipx install cleanbook

# Via pip
pip install cleanbook

# From source
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner
pip install .

Basic Usage

# Process a bookmark HTML file
cleanbook -i bookmarks.html -o output/

# Interactive wizard mode
cleanbook-wizard

# With ML training enabled
cleanbook -i bookmarks.html --train

# Health check
cleanbook --health-check

📊 Classification Pipeline

HTML Bookmarks
    ↓
┌─────────────────────────────────────────────────────┐
│  1. Rule Engine (Fast, 0.1ms, Priority: 0.3)       │
│     Domain/Title/URL pattern matching               │
├─────────────────────────────────────────────────────┤
│  2. ML Classifier (91.4% accuracy, Priority: 0.25) │
│     TF-IDF + Ensemble (RF + LR + Naive Bayes)      │
├─────────────────────────────────────────────────────┤
│  3. Semantic Analysis (Priority: 0.2)              │
│     Word vectors, TF-IDF similarity                │
├─────────────────────────────────────────────────────┤
│  4. LLM Classifier (Optional, Priority: 0.15)      │
│     OpenAI-compatible API with auto-fallback       │
└─────────────────────────────────────────────────────┘
    ↓
Weighted Voting Fusion → Organized Output
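
The fusion step at the bottom of the diagram can be sketched as follows. This is one plausible reading of "weighted voting", not the project's confirmed algorithm: the weights are the "Priority" values from the diagram, while the `fuse_votes` function, the vote format (a category plus a confidence in [0, 1]), and the normalization of the winning score are assumptions.

```python
from collections import defaultdict

# Weights copied from the "Priority" values in the pipeline diagram.
WEIGHTS = {"rule": 0.3, "ml": 0.25, "semantic": 0.2, "llm": 0.15}


def fuse_votes(votes, weights=WEIGHTS):
    """Combine per-classifier votes into one category.

    `votes` maps a classifier name to (category, confidence) or to None
    when that classifier is unavailable (for example, the LLM is disabled).
    Returns (category, share), where share is the winner's fraction of the
    total weighted score, or (None, 0.0) when nothing voted.
    """
    scores = defaultdict(float)
    for name, vote in votes.items():
        if vote is None:
            continue  # skip unavailable classifiers instead of failing
        category, confidence = vote
        scores[category] += weights.get(name, 0.0) * confidence
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

Skipping `None` votes is what lets the pipeline degrade gracefully: with the LLM turned off, the remaining three classifiers still produce a result.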

🏗️ Architecture

┌──────────────┐    ┌─────────────────┐    ┌──────────────────┐
│    Input     │───▶│     Process     │───▶│     Output       │
│  bookmarks   │    │ ┌─────────────┐ │    │  bookmarks.html  │
│   .html      │    │ │ Parse       │ │    │  bookmarks.json  │
└──────────────┘    │ │ Deduplicate │ │    │  report.md       │
                    │ │ Classify    │ │    └──────────────────┘
                    │ │ Organize    │ │
                    │ └─────────────┘ │
                    └─────────────────┘
                             ↓
                    ┌─────────────────┐
                    │  Config         │
                    │  ├─ Rules       │
                    │  ├─ ML Model    │
                    │  └─ Taxonomy    │
                    └─────────────────┘
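
As an illustration of the Parse stage, the sketch below extracts links from a Netscape-format bookmark export using only the standard library. It is an assumption about how such a step could look, not CleanBook's parser: the `BookmarkParser` class and `parse_bookmarks` function are hypothetical names, and folder structure and attributes such as `ADD_DATE` are ignored.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect {"url", "title"} records from <A HREF="...">Title</A> entries."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._title = []    # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append(
                {"url": self._href, "title": "".join(self._title).strip()}
            )
            self._href = None


def parse_bookmarks(html_text):
    """Return the bookmarks found in a Netscape bookmark HTML string."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

The resulting list of dictionaries is a convenient shape to hand to the later deduplication and classification stages.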

📖 Documentation

| Resource | Link |
| --- | --- |
| Homepage | lessup.github.io/bookmarks-cleaner |
| Quick Start | /en/quickstart |
| Best Practices | /en/guide/best-practices |
| Architecture | /en/design/architecture |
| Development | /en/guide/development |
| API Reference | /en/reference/llm-templates |

⚙️ Configuration Example

{
  "category_rules": {
    "Technology/AI": {
      "rules": [
        {
          "match": "domain",
          "keywords": ["openai.com", "huggingface.co", "arxiv.org"],
          "weight": 15
        },
        {
          "match": "title",
          "keywords": ["GPT", "LLM", "neural network"],
          "weight": 10
        }
      ]
    }
  },
  "ai_settings": {
    "confidence_threshold": 0.7,
    "cache_size": 10000,
    "max_workers": 4
  },
  "llm": {
    "enable": false,
    "provider": "openai",
    "model": "gpt-4o-mini"
  }
}
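
To show how a `category_rules` block like the one above might be evaluated, here is a hedged sketch that sums the weights of matching rules per category. How CleanBook actually combines weights is not stated in this README; the `score_categories` function, the case-insensitive substring matching, and the additive scoring are all assumptions.

```python
from urllib.parse import urlsplit


def score_categories(bookmark, category_rules):
    """Return {category: summed weight} for every category with a matching rule.

    `bookmark` is a dict with "url" and optionally "title".
    `category_rules` follows the shape of the JSON example above.
    """
    fields = {
        "domain": (urlsplit(bookmark["url"]).hostname or "").lower(),
        "title": bookmark.get("title", "").lower(),
    }
    scores = {}
    for category, spec in category_rules.items():
        total = 0
        for rule in spec.get("rules", []):
            text = fields.get(rule["match"], "")
            # A rule fires once if any of its keywords appears in the field.
            if any(keyword.lower() in text for keyword in rule["keywords"]):
                total += rule["weight"]
        if total:
            scores[category] = total
    return scores
```

With the example configuration, a bookmark on `huggingface.co` whose title mentions "LLM" would match both rules and score 15 + 10 = 25 for `Technology/AI` under this scheme.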

🔬 Performance Benchmarks

| Metric | Value |
| --- | --- |
| Classification Accuracy | 91.4% |
| Processing Speed | ~50 bookmarks/second |
| Cache Hit Rate | 87-92% |
| Memory (baseline) | ~45 MB |
| Memory (1000 bookmarks) | ~125 MB |
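
For readers wondering what a cache hit rate measures, the sketch below shows a small LRU cache that counts hits and misses. It is illustrative only and says nothing about how CleanBook's cache is built or how the figures above were obtained: the `LRUCache` class and its method names are hypothetical, and the default `max_size` simply mirrors the `cache_size` value from the configuration example.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache that tracks its own hit rate."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._data = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        """Store a value, evicting the least recently used entry if full."""
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A hit rate is the number of successful lookups divided by all lookups, so a value in the 87-92% range means roughly nine in ten lookups were served without recomputation.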

🛠️ Development

# Clone repository
git clone https://github.com/LessUp/bookmarks-cleaner.git
cd bookmarks-cleaner

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'feat: add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License.

Made with ❤️ by LessUp
