streetquant/pdf-splitter

PDF/EPUB Chapter Splitter

A powerful CLI tool that automatically splits PDF and EPUB files into chapter-wise sections with smart OCR detection. Perfect for learners using AI-powered study assistants.

Why This Tool?

Learning from books becomes significantly more effective when combined with AI study assistants. This tool enables you to:

  • Feed individual chapters to AI tutors for focused discussions
  • Use ChatGPT's Study Mode on specific sections
  • Leverage Gemini's Guided Learning with curated content
  • Create notebooks in NotebookLM chapter by chapter
  • Generate summaries and quizzes per chapter

AI Learning Workflow

📚 Full Book (PDF/EPUB)
        │
        ▼
┌───────────────────────┐
│  PDF Splitter Tool    │
│  ├── Pre-text         │  → Intro materials
│  ├── Chapter 01.pdf   │  → AI Study Session 1
│  ├── Chapter 02.pdf   │  → AI Study Session 2
│  ├── Chapter 03.pdf   │  → AI Study Session 3
│  └── Post-text        │  → Appendices/References
└───────────────────────┘
        │
        ▼
🤖 AI Learning Tools
    ├── ChatGPT Study Mode
    ├── Gemini Guided Learning
    ├── NotebookLM
    └── Custom AI Tutors

Use with AI Study Tools

ChatGPT Study Mode

ChatGPT Study Mode was launched by OpenAI on July 29, 2025. It offers step-by-step guidance instead of quick answers, helping you learn and retain knowledge better.

Upload individual chapter PDFs to ChatGPT and ask:

"Analyze this chapter about [topic] and create:
1. Key concepts summary
2. 5 quiz questions
3. Real-world examples
4. Connections to previous chapters"

Key Features:

  • Guided learning with questions, hints, and step-by-step explanations
  • Personalized support that adapts to your skill level
  • Knowledge checks with quizzes and open-ended questions
  • Progress tracking to show mastery and areas to focus on

Learn more: chatgpt.com/features/study-mode

Gemini Guided Learning

Google launched Guided Learning in Gemini on August 6, 2025. It acts as a personal learning companion that helps you build deep understanding of subjects.

Feed chapters sequentially:

"Guide me through this chapter using the Feynman technique.
Start with the main thesis, then break down complex concepts,
and end with practical applications."

Key Features:

  • Interactive study partner for deeper understanding
  • Upload course material, debug code, or work through concepts
  • Visual responses and interactive study aids
  • Personalized explanations adapting to your needs

Learn more: blog.google/products/gemini/guided-learning

NotebookLM

Google's NotebookLM continued to add and update AI learning features throughout 2025.

2025 Features:

  • Audio Overviews - Turn sources into engaging "Deep Dive" discussions
  • Video Overviews - Generate video summaries of your documents
  • Flashcards & Quizzes - Create from your documents instantly
  • Learning Guide - Generate tailored study guides
  • Mind Maps - Visualize connections between concepts
  • Presentations - Create polished outlines with talking points

How to Use with Split Chapters:

NotebookLM is used through the web interface at notebooklm.google. Here's how to use it with this tool's output:

  1. Split your book:

    python -m pdfsplitter.cli "book.pdf"
  2. Open NotebookLM at notebooklm.google

  3. Click "Upload" and select the chapter PDFs from the output folder

  4. Use NotebookLM features:

    • Click "Audio Overview" to create AI audio discussions
    • Use "Guide" to generate study guides
    • Ask questions about specific chapters

Example Workflow:

Output folder: book_output/
├── chapter_01.pdf  → Upload to NotebookLM
├── chapter_02.pdf  → Upload to NotebookLM
├── chapter_03.pdf  → Upload to NotebookLM
└── ...

NotebookLM will create Audio Overviews, summaries, and answer questions
about your content using Gemini's AI capabilities.

Note: NotebookLM doesn't have a public Python API for creating notebooks programmatically. The free version is used through the web interface. For enterprise/automated use, Google Cloud offers NotebookLM Enterprise APIs.

Features

  • Automatic Chapter Detection - Uses TOC and text analysis
  • Smart OCR Decision - LLM determines if scanned PDFs need OCR
  • Pre/Post Text Separation - Isolates front matter and appendices
  • Multiple Format Support - Handles both PDF and EPUB
  • CLI Interface - Simple, command-line based
  • Caching - Remembers OCR decisions for speed
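The caching feature can be pictured as a small lookup keyed by file content. The sketch below is only an illustration of that idea: `OcrDecisionCache`, `file_digest`, and the JSON cache file are hypothetical names, and the project's own `cache.py` may work differently.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def file_digest(path: str) -> str:
    """SHA-256 of the file contents, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class OcrDecisionCache:
    """Hypothetical cache: maps a file's digest to a stored 'needs OCR' decision."""

    def __init__(self, cache_file: str):
        self.cache_file = Path(cache_file)
        # Load earlier decisions if the cache file already exists.
        if self.cache_file.is_file():
            self.entries = json.loads(self.cache_file.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    def get(self, pdf_path: str) -> Optional[bool]:
        """Return the cached decision, or None if this file has not been seen."""
        return self.entries.get(file_digest(pdf_path))

    def put(self, pdf_path: str, needs_ocr: bool) -> None:
        """Store a decision and persist the whole cache to disk."""
        self.entries[file_digest(pdf_path)] = needs_ocr
        self.cache_file.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

Keying on the content hash rather than the file name means a renamed copy of the same book reuses the stored decision, while an edited file is evaluated again.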

Installation

# Clone the repository
git clone https://github.com/streetquant/pdf-splitter.git
cd pdf-splitter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

Configuration

Create a .env file with your API key:

OPENROUTER_API_KEY=your-api-key-here
MODEL_NAME=nvidia/nemotron-3-nano-30b-a3b:free

Get a free API key from OpenRouter.
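To check that the two settings are visible before running the tool, a script can parse the file itself. This is a minimal sketch, not the project's `config.py`: `load_env_file` and `get_settings` are hypothetical helpers, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_settings(path: str = ".env") -> dict:
    """Merge file values with the process environment (the environment wins)."""
    settings = load_env_file(path)
    for key in ("OPENROUTER_API_KEY", "MODEL_NAME"):
        if key in os.environ:
            settings[key] = os.environ[key]
    if not settings.get("OPENROUTER_API_KEY"):
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return settings
```

Calling `get_settings()` from the repository root raises early if the key is missing, which is easier to diagnose than a failed API request midway through a book.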

Usage

Basic Usage

# Process a PDF
python -m pdfsplitter.cli book.pdf

# Process an EPUB
python -m pdfsplitter.cli textbook.epub

# Specify output directory
python -m pdfsplitter.cli book.pdf -o ./my-chapters

# Verbose mode
python -m pdfsplitter.cli book.pdf -v
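To split a whole folder of books, the documented command can be wrapped in a short script. The sketch below uses only the invocation and the `-o` flag shown above; `build_command` and `split_all` are hypothetical helpers, and treating a non-zero exit code as failure is an assumption about the CLI's behavior.

```python
import subprocess
import sys
from pathlib import Path


def build_command(book: Path, out_root: Path) -> list:
    """Build the documented CLI call, giving each book its own output folder."""
    return [
        sys.executable, "-m", "pdfsplitter.cli",
        str(book),
        "-o", str(out_root / book.stem),
    ]


def split_all(folder: str, out_root: str) -> list:
    """Run the splitter on every PDF and EPUB in `folder`; return the books that failed."""
    failed = []
    books = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".pdf", ".epub"}
    )
    for book in books:
        result = subprocess.run(build_command(book, Path(out_root)))
        # Assumption: the CLI signals failure with a non-zero exit code.
        if result.returncode != 0:
            failed.append(book)
    return failed
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces, such as "Deep Learning.pdf".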

Output Structure

book_output/
├── metadata.json        # Chapter information
├── pretext.pdf          # Front matter (TOC, preface)
├── chapter_01.pdf       # Chapter 1
├── chapter_02.pdf       # Chapter 2
├── chapter_03.pdf       # Chapter 3
├── ...
└── posttext.pdf         # Appendices, references
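When feeding sections to a study tool in reading order, it helps to list the output files in sequence. The sketch below relies only on the file names shown above; `ordered_sections` is a hypothetical helper, and it deliberately ignores `metadata.json` because its schema is not documented here.

```python
import re
from pathlib import Path

CHAPTER_NAME = re.compile(r"chapter_(\d+)")


def ordered_sections(output_dir: str) -> list:
    """Return pretext, chapters in numeric order, then posttext, skipping missing files."""
    root = Path(output_dir)
    numbered = []
    for path in root.glob("chapter_*.pdf"):
        match = CHAPTER_NAME.fullmatch(path.stem)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort by the parsed number so chapter_10 follows chapter_09 regardless of padding.
    chapters = [path for _, path in sorted(numbered)]
    candidates = [root / "pretext.pdf", *chapters, root / "posttext.pdf"]
    return [p for p in candidates if p.is_file()]
```

For example, `for pdf in ordered_sections("book_output"): print(pdf.name)` prints the upload order for a session that walks through the book from front matter to appendices.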

Example: Learning Workflow

1. Split Your Book

$ python -m pdfsplitter.cli "Deep Learning.pdf"
✓ Processing complete!
  Input: Deep Learning.pdf
  Output: Deep Learning_output
  Chapters: 12
    01. Pre-text
    02. Chapter 1: Math Foundations
    03. Chapter 2: Neural Networks
    ...
    12. Post-text

2. Study Chapter by Chapter with AI

Prompt for ChatGPT Study Mode:

I'm studying Chapter 2 on Neural Networks from a deep learning book.
Please:
1. Explain the key concepts in simple terms
2. Create a concept map connecting to Chapter 1
3. Generate 5 practice problems
4. Suggest real-world applications
5. Recommend which sections to reread

Prompt for Gemini Guided Learning:

Using the Feynman technique, help me understand this chapter:
- Start with the main thesis
- Break down 3 complex concepts
- End with practical applications in real-world scenarios

NotebookLM Integration:

# Upload chapters to NotebookLM for comprehensive learning
# Features: Audio Overviews, Flashcards, Mind Maps, Quizzes

3. Build a Knowledge Base

Combine chapters across multiple AI tools for comprehensive coverage:

| Tool | Best For | Chapter Usage |
|---|---|---|
| ChatGPT Study Mode | Interactive Q&A, step-by-step explanations | Upload 1-2 chapters per session |
| Gemini Guided Learning | Personalized learning paths | Sequential chapter progression |
| NotebookLM | Audio summaries, flashcards, research | Upload entire book |

Supported Patterns

The tool detects chapters using:

  • CHAPTER 1, Chapter 1, CHAPTER ONE
  • Part I, Part 1
  • Table of Contents entries

Posttext detection includes:

  • APPENDIX, REFERENCES, BIBLIOGRAPHY
  • INDEX, GLOSSARY, ACKNOWLEDGMENTS
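The heading styles listed above can be approximated with two regular expressions. This is a sketch of the idea only: the patterns and `classify_heading` are hypothetical, cover number words up to TEN, and may differ from the expressions the tool actually ships in `constants.py`.

```python
import re

# Hypothetical patterns modeled on the headings listed above.
NUMBER_WORDS = "ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN"
CHAPTER_RE = re.compile(
    rf"^\s*(?:chapter\s+(?:\d+|{NUMBER_WORDS})|part\s+(?:\d+|[IVXLC]+))\b",
    re.IGNORECASE,
)
POSTTEXT_RE = re.compile(
    r"^\s*(?:appendix|references|bibliography|index|glossary|acknowledgments)\b",
    re.IGNORECASE,
)


def classify_heading(line: str) -> str:
    """Return 'chapter', 'posttext', or 'body' for a single line of text."""
    if CHAPTER_RE.match(line):
        return "chapter"
    if POSTTEXT_RE.match(line):
        return "posttext"
    return "body"
```

The trailing `\b` keeps ordinary sentences such as "Indexing is fast" or "Partial results" from being mistaken for headings. Line-based matching like this is permissive, so a real splitter would likely combine it with Table of Contents entries, as the list above suggests.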

API Integration

from pdfsplitter.core import split_pdf, split_epub

# Programmatic usage
result = split_pdf("book.pdf", "./output")
for chapter in result.chapters:
    print(f"{chapter.title}: pages {chapter.start_page}-{chapter.end_page}")
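The page ranges printed above follow from where each chapter starts. The sketch below shows one way such ranges could be derived, including the pre-text and post-text sections; `Section` and `build_sections` are hypothetical, and 1-based inclusive page numbers are an assumption, not a documented property of `split_pdf`.

```python
from typing import NamedTuple, Optional


class Section(NamedTuple):
    title: str
    start_page: int  # assumed 1-based, inclusive
    end_page: int    # assumed 1-based, inclusive


def build_sections(chapter_starts, total_pages: int,
                   posttext_start: Optional[int] = None) -> list:
    """Turn (title, start_page) pairs into contiguous, non-overlapping sections."""
    starts = sorted(chapter_starts, key=lambda item: item[1])
    # The last chapter stops just before the post-text, or at the final page.
    body_end = posttext_start - 1 if posttext_start else total_pages
    sections = []
    if starts and starts[0][1] > 1:
        sections.append(Section("Pre-text", 1, starts[0][1] - 1))
    for i, (title, start) in enumerate(starts):
        end = starts[i + 1][1] - 1 if i + 1 < len(starts) else body_end
        sections.append(Section(title, start, end))
    if posttext_start:
        sections.append(Section("Post-text", posttext_start, total_pages))
    return sections
```

Each chapter ends on the page before the next one begins, so every page of the book lands in exactly one output section.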

Testing

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_pdf_processor.py -v

Project Structure

pdf-splitter/
├── src/pdfsplitter/
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration
│   ├── constants.py        # Patterns and constants
│   ├── core/
│   │   ├── pdf_processor.py    # PDF splitting logic
│   │   ├── epub_processor.py   # EPUB splitting logic
│   │   ├── ocr_detector.py     # Smart OCR detection
│   │   └── models.py           # Data models
│   └── utils/
│       ├── llm.py          # OpenRouter integration
│       ├── cache.py        # Caching
│       └── logging.py      # Logging
├── tests/                  # Test suite
├── pyproject.toml          # Project config
└── requirements.txt        # Dependencies

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

MIT License - feel free to use in your projects.

Acknowledgments


Happy Learning! 📚🤖
