Scholarly document workflow automation with bibliography management
docflow is a Python package and CLI tool for automating workflows around scholarly documents (grants, papers, blog posts, technical reports). It helps with:
- Document metadata extraction - Extract comments, citations, track changes from Word/Google Docs
- Bibliography management - Integrate with Zotero for citation management
- Format conversion - Convert between docx, markdown, LaTeX, HTML
- Workflow automation - Sync with Google Drive, Overleaf, and other platforms
- Extract comments from Word documents (.docx) with threading support
- Parse citations (DOI, arXiv, PMID, bioRxiv) from comment text
- Extract citations from spreadsheet columns (Excel/Google Sheets)
- Extract track changes and TODO comments
- Support for Google Docs via Drive API
- Zotero integration with deduplication
- Automatic metadata fetching from CrossRef
- Export to BibTeX for LaTeX/Overleaf
- Batch operations for efficiency
- Word ↔ Markdown (via pandoc)
- Excel/Sheets → TSV/CSV with citation extraction
- Preserve metadata during conversion
- Customizable templates
- Google Drive sync (via rclone)
- Overleaf git submodule support
- Makefile generation for common tasks
- Configuration-driven workflows
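To illustrate the Word comment extraction listed above, here is a minimal sketch using only the Python standard library. It is not docflow's implementation: the function name `read_docx_comments` is hypothetical, and the sketch assumes comments live in the `word/comments.xml` part of the .docx archive. Reply threading, which Word records in a separate `word/commentsExtended.xml` part, is not handled here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/comments.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(docx):
    """Return a list of {id, author, text} dicts from a .docx path or file object.

    Hypothetical helper for illustration; not part of docflow's API.
    """
    with zipfile.ZipFile(docx) as zf:
        if "word/comments.xml" not in zf.namelist():
            return []  # the document has no comments part
        root = ET.fromstring(zf.read("word/comments.xml"))

    comments = []
    for node in root.iter(f"{{{W_NS}}}comment"):
        # A comment body is a sequence of paragraphs (w:p) made of text runs (w:t).
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
            for p in node.iter(f"{{{W_NS}}}p")
        ]
        comments.append({
            "id": node.get(f"{{{W_NS}}}id"),
            "author": node.get(f"{{{W_NS}}}author"),
            "text": "\n".join(paragraphs),
        })
    return comments
```

Calling `read_docx_comments("manuscript.docx")` would return one dictionary per comment, whose `text` field can then be scanned for citations.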
git clone https://github.com/con/docflow.git
cd docflow
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

# Or install the package with uv
uv pip install docflow

# Or with pip
pip install docflow

- Automated Workflow: docflow init + Makefile generation
- CLI Commands: extract comments/spreadsheet-citations, zotero check/push, convert docx-to-md/md-to-docx/xlsx-to-tsv, init
- Configuration System: Pydantic validation, document-type defaults
- Citation Extraction: DOI, arXiv, PMID, bioRxiv, URLs with normalization
- Word Comment Extraction: Threading, automatic citation detection
- Spreadsheet Processing: Excel/Sheets → TSV/CSV with citation extraction
- Zotero Integration: Deduplication, CrossRef metadata fetching
- Document Conversion: Word ↔ Markdown (via pandoc), Excel → TSV/CSV
- Test Coverage: 135 tests passing, 76.7% coverage
- Additional conversions: md → tex → html
- docflow sync: Automated Google Drive sync command
- Google Docs API integration
- BibTeX export command
- Document templates (grant-nih-r01, paper-nature, etc.)
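The citation extraction and normalization mentioned in the list above can be sketched roughly as follows. This is an illustrative approximation, not docflow's actual code: the `PATTERNS` table and `extract_citations` function are hypothetical names, and the regular expressions are deliberately simplified (they cover modern arXiv identifiers and common DOI shapes only, and omit bioRxiv and generic URLs).

```python
import re

# Simplified patterns; real identifiers have more variants than these cover.
PATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE),
    "arxiv": re.compile(r"\barxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "pmid": re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE),
}


def extract_citations(text):
    """Return unique (kind, identifier) pairs found in text, in pattern order."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            ident = match.group(1)
            if kind == "doi":
                # DOIs are case-insensitive; drop trailing punctuation and lowercase
                # so the same work cited twice collapses to one entry.
                ident = ident.rstrip(".,;)").lower()
            key = (kind, ident)
            if key not in found:
                found.append(key)
    return found
```

Normalizing before comparing is what makes later deduplication possible: a DOI written as a `https://doi.org/...` link and the same DOI written with a `doi:` prefix reduce to the same key.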
Initialize your project with automated Makefile:
# Initialize project
docflow init --name "My Paper" \
    --gdrive-remote "gdrive:MyFolder/" \
    --zotero-group-id "12345"
# Run automated workflow
make # Sync, convert to markdown, extract comments
make all # Full workflow + Zotero check

The Makefile automates:
- Sync from Google Drive (rclone)
- Convert .docx → .md and .xlsx → .tsv
- Extract comments and spreadsheet citations
- Push citations to Zotero
Or use CLI commands directly:
# Extract comments with citations
docflow extract comments document.docx
# Extract from multiple files with threading
docflow extract comments --threading gdrive/*.docx
# Highlight specific comments
docflow extract comments --alert "@reviewer" research-strategy.docx

# Convert Word to Markdown for editing
docflow convert docx-to-md manuscript.docx
# Convert Markdown back to Word
docflow convert md-to-docx manuscript.md
# Extract images while converting
docflow convert docx-to-md manuscript.docx --extract-media ./images

Requires pandoc.
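For readers curious how a Word to Markdown conversion can be delegated to pandoc, the sketch below builds and runs a pandoc command from Python. It is an assumption about the general approach, not a description of what docflow runs internally: both function names are hypothetical, and the choice of `gfm` (GitHub-Flavored Markdown) as the output format is illustrative. The pandoc options used (`--from`, `--to`, `--output`, `--extract-media`) are standard pandoc flags.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_docx_to_md_command(docx_path, media_dir=None):
    """Build the pandoc argument list for a .docx to .md conversion.

    Hypothetical helper; the output file sits next to the input with a .md suffix.
    """
    src = Path(docx_path)
    cmd = [
        "pandoc", str(src),
        "--from", "docx",
        "--to", "gfm",  # assumed output flavor, chosen for illustration
        "--output", str(src.with_suffix(".md")),
    ]
    if media_dir is not None:
        # pandoc writes embedded images into this directory and links to them
        cmd.append(f"--extract-media={media_dir}")
    return cmd


def convert_docx_to_md(docx_path, media_dir=None):
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_docx_to_md_command(docx_path, media_dir), check=True)
```

Keeping command construction separate from execution makes the argument list easy to inspect or test without pandoc being present.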
# Convert Excel/Sheets to TSV with citation extraction
docflow convert xlsx-to-tsv datasets.xlsx
# Creates: converted/datasets.tsv
#          converted/datasets_citations.json
# Extract citations from specific columns
docflow extract spreadsheet-citations datasets.xlsx --columns publication_url doi

# Preview what would be added (dry-run)
docflow zotero check gdrive/*_comments.json
# Push citations to Zotero group
docflow zotero push gdrive/*_comments.json
# Export bibliography to BibTeX
docflow zotero export --collection ABCD1234 > references.bib

Create .docflow/config.yaml in your project:
project:
  name: "My Grant Proposal"
  type: "grant-nih-r01"
zotero:
  mode: "group"
  group_id: "1234567"
  collection_key: "ABCD1234"
  api_key_env: "ZOTERO_API_KEY"
extract:
  comments:
    threading: true
    alert_regex: '(@dartmouth.edu|@reviewer)'
    output_format: "json"
sync:
  google_drive:
    enabled: true
    remote: "gdrive:My Documents/"
    local_path: "gdrive/"

# Sync from Google Drive
make sync
# Extract citations from comments
make comments
# Check what would be added to Zotero
make zotero-check
# Push to Zotero
make zotero-push
# Export bibliography to Overleaf
make zotero-sync

# Extract comments from manuscript
docflow extract comments manuscript.docx
# Convert Word to Markdown
docflow convert docx-to-md manuscript.docx > manuscript.md
# Check word count against journal limits
docflow meta wordcount manuscript.docx --limit 4000

# Sync from Google Docs
docflow sync gdrive "gdrive:Blog Posts/" local/
# Convert to HTML
docflow convert md-to-html blog-post.md > blog-post.html
# Check SEO metadata
docflow meta seo-check blog-post.md

# Clone repository
git clone https://github.com/con/docflow.git
cd docflow
# Set up virtual environment with uv
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

# Run all tests
pytest
# Run with coverage
pytest --cov=docflow --cov-report=html
# Run specific test file
pytest tests/test_config.py
# Using tox for multiple Python versions
tox

# Linting
ruff check docflow/ tests/
# Type checking
mypy docflow/
# Run all checks
tox -e lint,type

docflow/
├── cli/           # Command-line interface
├── extract/       # Metadata extraction (comments, citations)
├── integrations/  # External services (Zotero, Google Drive)
├── config/        # Configuration loading and validation
└── util/          # Utilities and helpers
templates/         # Project templates (copier)
├── grant/         # Grant proposal templates
├── paper/         # Scientific paper templates
└── blog/          # Blog post templates

- Grants: NIH R01, NSF CAREER
- Papers: Nature format, arXiv preprints
- Blog Posts: Technical blogs
- Reports: Technical reports
- Book Chapters: Academic books
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE file for details.
If you use docflow in your research, please cite:
@software{docflow,
  author = {Halchenko, Yaroslav O.},
  title = {docflow: Scholarly document workflow automation},
  year = {2026},
  url = {https://github.com/con/docflow}
}

This project follows the LAD (LLM-Assisted Development) framework for systematic, test-driven Python development.
- Zotero - Reference management
- Manubot - Manuscript automation
- Overleaf - LaTeX editor
- rclone - Cloud storage sync
- Documentation: https://docflow.readthedocs.io
- Issues: https://github.com/con/docflow/issues
- Discussions: https://github.com/con/docflow/discussions