docflow

Scholarly document workflow automation with bibliography management

Overview

docflow is a Python package and CLI tool for automating workflows around scholarly documents (grants, papers, blog posts, technical reports). It helps with:

  • Document metadata extraction - Extract comments, citations, and tracked changes from Word/Google Docs
  • Bibliography management - Integrate with Zotero for citation management
  • Format conversion - Convert between docx and Markdown; LaTeX and HTML conversions are planned (see Implementation Status)
  • Workflow automation - Sync with Google Drive, Overleaf, and other platforms

Key Features

🔍 Metadata Extraction

  • Extract comments from Word documents (.docx) with threading support
  • Parse citations (DOI, arXiv, PMID, bioRxiv) from comment text
  • Extract citations from spreadsheet columns (Excel/Google Sheets)
  • Extract tracked changes and TODO comments
  • Support for Google Docs via Drive API (planned; see Implementation Status)
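
To make the comment extraction above concrete, here is a minimal sketch of reading comments out of a .docx file with only the standard library. It relies on the fact that a .docx file is a ZIP archive whose comments live in `word/comments.xml`. The function name `read_docx_comments` is hypothetical and is not docflow's API; reply threading and citation detection are left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/comments.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(docx_file):
    """Return a list of {id, author, text} dicts (hypothetical helper, not docflow's API).

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []  # the document has no comments part
        root = ET.fromstring(archive.read("word/comments.xml"))

    comments = []
    for node in root.iter(f"{{{W_NS}}}comment"):
        # Each comment holds paragraphs (w:p) made of text runs (w:t).
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
            for p in node.iter(f"{{{W_NS}}}p")
        ]
        comments.append({
            "id": node.get(f"{{{W_NS}}}id"),
            "author": node.get(f"{{{W_NS}}}author"),
            "text": "\n".join(paragraphs),
        })
    return comments
```

Calling `read_docx_comments("document.docx")` returns an empty list for a document without comments, so callers do not need a special case.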

📚 Bibliography Management

  • Zotero integration with deduplication
  • Automatic metadata fetching from CrossRef
  • Export to BibTeX for LaTeX/Overleaf (planned; see Implementation Status)
  • Batch operations for efficiency
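
As an illustration of the CrossRef metadata fetching mentioned above, the sketch below queries CrossRef's public REST endpoint `https://api.crossref.org/works/{doi}` and reduces the response to a few fields. How docflow itself talks to CrossRef is not shown in this README, so the function names (`crossref_url`, `summarize_work`, `fetch_work`) and the selected fields are assumptions.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the works URL for a DOI, percent-encoding unusual characters."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def summarize_work(message):
    """Reduce the `message` object of a CrossRef works response to a small record."""
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [None])[0],  # CrossRef returns titles as a list
        "authors": authors,
        "year": date_parts[0][0] if date_parts[0] else None,
    }


def fetch_work(doi, timeout=10):
    """Fetch and summarize one DOI (performs a network request)."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as response:
        return summarize_work(json.load(response)["message"])
```

Keeping `summarize_work` separate from the network call makes the parsing logic testable offline.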

🔄 Format Conversion

  • Word ↔ Markdown (via pandoc)
  • Excel/Sheets → TSV/CSV with citation extraction
  • Preserve metadata during conversion
  • Customizable templates

⚙️ Workflow Integration

  • Google Drive sync (via rclone)
  • Overleaf git submodule support
  • Makefile generation for common tasks
  • Configuration-driven workflows

Installation

From source (development)

git clone https://github.com/con/docflow.git
cd docflow
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

Using uv (recommended)

uv pip install docflow

Using pip

pip install docflow

Implementation Status

✅ Fully Implemented

  • Automated Workflow: docflow init + Makefile generation
  • CLI Commands: extract comments/spreadsheet-citations, zotero check/push, convert docx-to-md/md-to-docx/xlsx-to-tsv, init
  • Configuration System: Pydantic validation, document-type defaults
  • Citation Extraction: DOI, arXiv, PMID, bioRxiv, URLs with normalization
  • Word Comment Extraction: Threading, automatic citation detection
  • Spreadsheet Processing: Excel/Sheets → TSV/CSV with citation extraction
  • Zotero Integration: Deduplication, CrossRef metadata fetching
  • Document Conversion: Word ↔ Markdown (via pandoc), Excel → TSV/CSV
  • Test Coverage: 135 tests passing, 76.7% coverage

🚧 Coming Soon

  • Additional conversions: md ↔ tex ↔ html
  • docflow sync - Automated Google Drive sync command
  • Google Docs API integration
  • BibTeX export command
  • Document templates (grant-nih-r01, paper-nature, etc.)

Quick Start

Automated Workflow (Recommended)

Initialize your project with an automatically generated Makefile:

# Initialize project
docflow init --name "My Paper" \
  --gdrive-remote "gdrive:MyFolder/" \
  --zotero-group-id "12345"

# Run automated workflow
make          # Sync, convert to markdown, extract comments
make all      # Full workflow + Zotero check

The Makefile automates:

  • Sync from Google Drive (rclone)
  • Convert .docx → .md and .xlsx → .tsv
  • Extract comments and spreadsheet citations
  • Push citations to Zotero

Manual CLI Commands

Or use CLI commands directly:

Extract comments from Word documents

# Extract comments with citations
docflow extract comments document.docx

# Extract from multiple files with threading
docflow extract comments --threading gdrive/*.docx

# Highlight specific comments
docflow extract comments --alert "@reviewer" research-strategy.docx
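
For a sense of how citation identifiers can be pulled out of comment text, here is a minimal regex-based sketch. The patterns and the name `find_citations` are illustrative assumptions, not docflow's actual implementation, and real identifiers have more edge cases than these patterns cover.

```python
import re

# Illustrative patterns only. bioRxiv preprints carry DOIs under the
# 10.1101 prefix, so the DOI pattern picks them up as well.
PATTERNS = {
    "doi": re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)'),
    "arxiv": re.compile(r"\barxiv(?:\.org/abs/|[:\s]+)(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "pmid": re.compile(r"\bPMID[:\s]*(\d+)", re.IGNORECASE),
}


def find_citations(text):
    """Return (kind, identifier) pairs found in `text`, grouped by pattern order."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group(1).rstrip(".,;)")  # drop trailing punctuation
            if kind == "doi":
                value = value.lower()  # DOIs are case-insensitive
            found.append((kind, value))
    return found


comment = "See https://doi.org/10.1000/ABC123. Also arXiv:2101.00001v2 and PMID: 12345678."
print(find_citations(comment))
```

Lowercasing DOIs at extraction time is one simple normalization that makes later deduplication a plain string comparison.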

Convert between formats

# Convert Word to Markdown for editing
docflow convert docx-to-md manuscript.docx

# Convert Markdown back to Word
docflow convert md-to-docx manuscript.md

# Extract images while converting
docflow convert docx-to-md manuscript.docx --extract-media ./images

Requires pandoc.
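
Since the conversions rely on pandoc, the following sketch shows roughly what such a call can look like from Python. `-o` and `--extract-media` are real pandoc options, and pandoc infers formats from the file extensions; the helper names `pandoc_command` and `convert` are hypothetical, and the exact options docflow passes are not documented here.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, target, media_dir=None):
    """Build the pandoc argument list; formats are inferred from the extensions."""
    command = ["pandoc", str(source), "-o", str(target)]
    if media_dir is not None:
        command.append(f"--extract-media={media_dir}")
    return command


def convert(source, target=None, media_dir=None):
    """Convert .docx to .md (or anything else to .docx) by shelling out to pandoc."""
    source = Path(source)
    if target is None:
        target = source.with_suffix(".md" if source.suffix == ".docx" else ".docx")
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(pandoc_command(source, target, media_dir), check=True)
    return Path(target)
```

Checking for the executable up front gives a clearer error than the `FileNotFoundError` that `subprocess.run` would otherwise raise.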

Process spreadsheets

# Convert Excel/Sheets to TSV with citation extraction
docflow convert xlsx-to-tsv datasets.xlsx
# Creates: converted/datasets.tsv
#          converted/datasets_citations.json

# Extract citations from specific columns
docflow extract spreadsheet-citations datasets.xlsx --columns publication_url doi
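
The column-based extraction can be sketched as follows, working on TSV text so that the example needs only the standard library (reading .xlsx directly would need a third-party reader). The function name `citations_from_columns` and the output shape are assumptions for illustration; the format of docflow's `*_citations.json` is not specified in this README.

```python
import csv
import io
import re

DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def citations_from_columns(tsv_text, columns):
    """Scan the named columns of a TSV table and report every DOI found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for column in columns:
            cell = row.get(column) or ""
            for match in DOI_RE.finditer(cell):
                doi = match.group(0).rstrip(".,;)").lower()
                hits.append({"row": row_number, "column": column, "doi": doi})
    return hits


# Made-up sample table for illustration.
table = (
    "name\tpublication_url\tdoi\n"
    "set-a\thttps://doi.org/10.1000/AAA\t\n"
    "set-b\t\t10.1000/bbb\n"
)
print(citations_from_columns(table, ["publication_url", "doi"]))
```

Recording the row and column alongside each identifier makes it possible to trace a citation back to the cell it came from.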

Manage citations with Zotero

# Preview what would be added (dry-run)
docflow zotero check gdrive/*_comments.json

# Push citations to Zotero group
docflow zotero push gdrive/*_comments.json

# Export bibliography to BibTeX (listed under Coming Soon; may not be available yet)
docflow zotero export --collection ABCD1234 > references.bib
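
The check-then-push flow depends on deduplication. A minimal sketch of that idea, assuming DOIs are the comparison key: normalize each candidate, then split the batch into items to add and items already present. The names `normalize_doi` and `plan_push` are hypothetical, and docflow's real matching rules may differ.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Lowercase a DOI and strip a resolver URL or `doi:` prefix."""
    doi = value.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def plan_push(candidate_dois, library_dois):
    """Dry run: return (to_add, skipped) without touching the library."""
    existing = {normalize_doi(d) for d in library_dois}
    to_add, skipped = [], []
    for raw in candidate_dois:
        doi = normalize_doi(raw)
        if doi in existing:
            skipped.append(doi)
        else:
            to_add.append(doi)
            existing.add(doi)  # also deduplicates within the batch itself
    return to_add, skipped
```

A dry run in the spirit of `zotero check` would only report the two lists, while a push would go on to create the `to_add` items.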

Configuration

Create .docflow/config.yaml in your project:

project:
  name: "My Grant Proposal"
  type: "grant-nih-r01"

zotero:
  mode: "group"
  group_id: "1234567"
  collection_key: "ABCD1234"
  api_key_env: "ZOTERO_API_KEY"

extract:
  comments:
    threading: true
    alert_regex: '(@dartmouth\.edu|@reviewer)'
    output_format: "json"

sync:
  google_drive:
    enabled: true
    remote: "gdrive:My Documents/"
    local_path: "gdrive/"
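
To show how such a configuration can be validated once loaded, here is a sketch using standard-library dataclasses. This README states that docflow validates configuration with Pydantic; dataclasses are swapped in here only to keep the example dependency-free, and the class names, defaults, and checks are assumptions. The dict stands in for the parsed YAML, and the dot in the alert pattern is escaped so that it matches a literal ".".

```python
import os
import re
from dataclasses import dataclass


@dataclass
class ZoteroConfig:
    mode: str
    group_id: str
    collection_key: str
    api_key_env: str = "ZOTERO_API_KEY"

    def __post_init__(self):
        # Assumed rule: a group library needs a group id.
        if self.mode == "group" and not self.group_id:
            raise ValueError("zotero.group_id is required when mode is 'group'")

    def api_key(self, environ=os.environ):
        """Read the key from the environment variable named in the config."""
        return environ.get(self.api_key_env)


@dataclass
class CommentsConfig:
    threading: bool = False
    alert_regex: str = ""
    output_format: str = "json"

    def __post_init__(self):
        re.compile(self.alert_regex)  # raises re.error for an invalid pattern

    def is_alert(self, comment_text):
        """True when the comment matches the configured alert pattern."""
        return bool(self.alert_regex) and re.search(self.alert_regex, comment_text) is not None


# Stand-in for the parsed contents of .docflow/config.yaml.
raw = {
    "zotero": {"mode": "group", "group_id": "1234567", "collection_key": "ABCD1234"},
    "extract": {"comments": {"threading": True, "alert_regex": r"(@dartmouth\.edu|@reviewer)"}},
}
zotero = ZoteroConfig(**raw["zotero"])
comments = CommentsConfig(**raw["extract"]["comments"])
```

Storing only the name of the environment variable (`api_key_env`) keeps the API key itself out of the config file.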

Usage Examples

Example 1: NIH R01 Grant Workflow

# Sync from Google Drive
make sync

# Extract citations from comments
make comments

# Check what would be added to Zotero
make zotero-check

# Push to Zotero
make zotero-push

# Export bibliography to Overleaf
make zotero-sync

Example 2: Scientific Paper

# Extract comments from manuscript
docflow extract comments manuscript.docx

# Convert Word to Markdown
docflow convert docx-to-md manuscript.docx > manuscript.md

# Check word count against journal limits
# (the meta commands are not listed under Implementation Status and may not be available yet)
docflow meta wordcount manuscript.docx --limit 4000

Example 3: Blog Post

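# Sync from Google Docs (docflow sync is listed under Coming Soon)
docflow sync gdrive "gdrive:Blog Posts/" local/

# Convert to HTML (md-to-html is listed under Coming Soon)
docflow convert md-to-html blog-post.md > blog-post.html

# Check SEO metadata (the meta commands are not listed under Implementation Status)
docflow meta seo-check blog-post.md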

Development

Setup development environment

# Clone repository
git clone https://github.com/con/docflow.git
cd docflow

# Set up virtual environment with uv
uv venv
source .venv/bin/activate
uv pip install -e ".[devel]"

Run tests

# Run all tests
pytest

# Run with coverage
pytest --cov=docflow --cov-report=html

# Run specific test file
pytest tests/test_config.py

# Using tox for multiple Python versions
tox

Code quality

# Linting
ruff check docflow/ tests/

# Type checking
mypy docflow/

# Run all checks
tox -e lint,type

Architecture

docflow/
├── cli/              # Command-line interface
├── extract/          # Metadata extraction (comments, citations)
├── integrations/     # External services (Zotero, Google Drive)
├── config/           # Configuration loading and validation
└── util/             # Utilities and helpers

templates/            # Project templates (copier)
├── grant/            # Grant proposal templates
├── paper/            # Scientific paper templates
└── blog/             # Blog post templates

Document Types Supported

  • Grants: NIH R01, NSF CAREER
  • Papers: Nature format, arXiv preprints
  • Blog Posts: Technical blogs
  • Reports: Technical reports
  • Book Chapters: Academic books

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE file for details.

Citation

If you use docflow in your research, please cite:

@software{docflow,
  author = {Halchenko, Yaroslav O.},
  title = {docflow: Scholarly document workflow automation},
  year = {2026},
  url = {https://github.com/con/docflow}
}

Acknowledgments

This project follows the LAD (LLM-Assisted Development) framework for systematic, test-driven Python development.

Related Projects

Support
