A Python tool for extracting and indexing tagged entities from PageXML files. The tool automatically discovers all tags in PageXML collections, groups them by category, and generates an interactive HTML index for browsing and analysis.
Digital scholarly editions of historical sources increasingly employ Named Entity Recognition (NER) and other annotation approaches to enrich transcriptions. Tools such as Transkribus enable researchers to add semantic tags to TextLine elements within PageXML files, marking persons, locations, organisations, dates, and other entities of historical interest. These annotations typically appear as key-value pairs in the `custom` attribute of TextLine elements, following patterns such as `persoon {offset:12; length:8;}` or `geonames_locations {offset:4; length:6; wikiData:Q12345;}`.
However, once annotation is complete, researchers face the challenge of efficiently accessing and analysing these tagged entities across large document collections. Manual inspection of XML files proves impractical beyond a few dozen pages, and generic XML tools lack the domain-specific functionality needed for historical text analysis. Researchers require the ability to:
- Survey which entities have been tagged throughout a collection
- Identify how frequently particular entities appear
- Locate all contexts in which an entity is mentioned
- Verify annotation consistency and completeness
- Access original page images for validation
The PageXML Indexer addresses these requirements through automated extraction and presentation of all tagged entities. The tool scans PageXML collections, identifies all semantic tags regardless of category, extracts the tagged text and its context, groups identical entities, and generates a self-contained interactive HTML index. This index enables researchers to browse entities alphabetically within categories, expand entries to view all occurrences with context, and link directly to Transkribus page images when available.
- Detection of all tag categories present in PageXML custom attributes
- No pre-configuration required; adapts to project-specific tag schemas
- Recognition of any attribute following the pattern `tagname {param:value;}`
- Exclusion of structural tags (readingOrder, structure, etc.)
- Recursive scanning of directory structures
- Automatic detection of collection directories containing page/ subdirectories
- Processing of all PageXML files across multiple collections
- Preservation of collection identity for each tagged entity
- Page numbers from Transkribus metadata
- Document and page identifiers for reference
- Image URLs for direct linking to scans
- TextLine coordinates for precise location
- WikiData identifiers when present in tags
- Additional parameters from custom attributes
- Case-insensitive clustering of identical entities
- Alphabetical sorting within each category
- Preservation of original spelling for display
- Aggregation of all occurrences per unique entity
- Self-contained single-file index
- Tabbed interface with separate sections per tag category
- Search functionality within active category
- Expandable/collapsible entity entries
- Direct links to Transkribus page images
- Full context display for each occurrence
- Collection, page number, and filename references
- Support for arbitrary tag categories and naming conventions
- Customisable category labels through configuration
- Unicode support for multi-language content
- Standard library dependencies only
- Python 3.7 or higher
- Standard library modules only (no external dependencies required)
The tool uses exclusively Python standard library modules (xml.etree.ElementTree, pathlib, re, json, collections, argparse), ensuring compatibility across different computing environments without dependency management concerns.
Download the script directly or clone the repository containing the PageXML indexer tool. No installation process is required beyond ensuring Python 3.7+ is available on your system.
Place pagexml_indexer.py in a convenient location, such as a tools/ or scripts/ directory within your project structure.
The tool expects PageXML collections organised with the following structure:
```
collections_base/
├── Collection_A/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       ├── 0002_archive_collection_00002.xml
│       └── 0003_archive_collection_00003.xml
├── Collection_B/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       └── 0002_archive_collection_00002.xml
└── Collection_C/
    └── page/
        ├── 0001_archive_collection_00001.xml
        ├── 0002_archive_collection_00002.xml
        └── 0003_archive_collection_00003.xml
```
Each collection directory must contain a `page/` subdirectory housing individual PageXML files. The script identifies collections by scanning for directories containing `page/` subdirectories with XML files.
To generate an index for all collections in a base directory:
```
python pagexml_indexer.py /path/to/collections
```

The tool will:
- Scan for all collections in the specified directory
- Process all PageXML files in each collection
- Extract all tagged entities
- Generate an HTML index file named `pagexml_index.html` in the base directory
Specify a custom output filename:
```
python pagexml_indexer.py /path/to/collections --output my_index.html
```

Suppress progress output:

```
python pagexml_indexer.py /path/to/collections --quiet
```

Interactive path entry (no command-line argument):

```
python pagexml_indexer.py
```

The tool will prompt: `Enter path to collections directory:`
```
positional arguments:
  base_path             Path to base directory containing PageXML collections

optional arguments:
  -h, --help            Show help message and exit
  -o OUTPUT, --output OUTPUT
                        Output HTML file path (default: pagexml_index.html
                        in base directory)
  -q, --quiet           Suppress progress output
```
Complete example demonstrating typical usage:
```
$ python pagexml_indexer.py ~/Documents/Resoluties/PageXML
=== PageXML Indexer ===
Found 3 collection(s):
- 0018
- 0019
- 0020
Processing collection: 0018
Found 156 XML files
Processed 156/156 files
Extracted 234 tags from 0018
Processing collection: 0019
Found 142 XML files
Processed 142/142 files
Extracted 198 tags from 0019
Processing collection: 0020
Found 178 XML files
Processed 178/178 files
Extracted 267 tags from 0020
=== Summary ===
Total tags extracted: 699
Collections processed: 3
Grouping and sorting tags...
Found 8 tag categories:
- Personen: 45 unique items
- Locaties: 38 unique items
- Hoedanigheden: 67 unique items
- Organisaties: 23 unique items
- Gebeurtenissen: 12 unique items
- Data: 89 unique items
- Afkortingen: 156 unique items
- Documenten: 8 unique items
Generating HTML output...
Generated HTML index: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html
✓ Index successfully generated: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html
Open the file in your web browser to view the index
```

The indexer begins by scanning the provided base directory for collection structures. It identifies subdirectories containing a `page/` folder with XML files. This approach accommodates both single-collection and multi-collection directory structures.
The discovery process:
- Iterates through all immediate subdirectories of the base path
- Checks each subdirectory for a `page/` folder
- Verifies that the `page/` folder contains `.xml` files
- Adds qualifying directories to the collection list
Collections are processed in alphabetical order by directory name.
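The discovery process described above can be sketched as follows (a minimal, illustrative reimplementation with a hypothetical function name; the script's internal structure may differ):

```python
from pathlib import Path

def discover_collections(base_path):
    """Return subdirectories containing a page/ folder with XML files,
    sorted alphabetically by directory name."""
    collections = []
    for subdir in sorted(Path(base_path).iterdir()):
        page_dir = subdir / "page"
        # A directory qualifies as a collection only if page/ exists
        # and holds at least one .xml file
        if subdir.is_dir() and page_dir.is_dir() and any(page_dir.glob("*.xml")):
            collections.append(subdir)
    return collections
```

Sorting the subdirectories before filtering yields the alphabetical processing order mentioned above.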
For each XML file, the tool employs Python's ElementTree parser with namespace awareness. The PageXML format uses the namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15, which the tool registers for consistent element lookup.
The parser extracts several levels of metadata:
Document-Level Metadata
- `docId`: Transkribus document identifier
- `pageId`: Transkribus page identifier
- `pageNr`: Sequential page number within the document
- `imgUrl`: Direct URL to page image in Transkribus storage
Region-Level Elements
All TextLine elements are examined regardless of their parent TextRegion. This ensures comprehensive tag extraction even when document structure varies.
Line-Level Attributes
- `id`: Unique identifier for the TextLine
- `custom`: String containing semicolon-separated key-value pairs
- `Coords points`: Polygon coordinates defining the line's bounding box
- `Unicode`: Transcribed text content
The custom attribute contains metadata in a specific format:
```
readingOrder {index:0;} persoon {offset:12; length:8;} geonames_locations {offset:45; length:6; wikiData:Q12345;}
```
The parser uses regular expressions to identify each tag:
- Matches pattern: `tagname {param1:value1; param2:value2;}`
- Extracts tag name
- Splits parameter string on semicolons
- Parses each parameter as key:value pair
- Filters out structural tags (readingOrder, structure, score, type, index, caption)
For each identified tag, the tool extracts the corresponding text from the Unicode content using the offset and length parameters:
```python
offset = 10
length = 7
unicode_text = "Alsoe ons Mathias van Buyren geschreven heeft..."
tagged_text = unicode_text[offset:offset + length]  # "Mathias"
```

This positional extraction ensures accurate correspondence between tags and text, even when multiple tags reference the same TextLine.
Extracted entities are grouped through a multi-stage process:
Stage 1: Category Grouping
All tags are first grouped by their tag category (persoon, geonames_locations, etc.). This separates different entity types for independent processing.
Stage 2: Normalisation
Within each category, entity text is normalised for comparison:
- Whitespace trimming
- Lowercase conversion
- Retention of original diacritics and punctuation
Stage 3: Clustering
Entities with identical normalised forms are clustered together. For display purposes, the original spelling from the first occurrence is preserved:
```
"Mathias" (3 occurrences)
- "Mathias" from page 12
- "mathias" from page 45
- "MATHIAS" from page 67
```
Stage 4: Alphabetical Sorting
Clustered entities are sorted alphabetically by their normalised forms within each category.
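The four stages can be sketched together in a single helper (an illustration only; the record shape with `category` and `text` keys is an assumption, and the script's actual data structures may differ):

```python
from collections import defaultdict

def group_entities(tags):
    """Group tag records into alphabetically sorted clusters per category.

    `tags` is an iterable of dicts with 'category' and 'text' keys
    (a hypothetical record shape, for illustration).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for tag in tags:
        # Stage 2: trim whitespace and lowercase for comparison
        normalised = tag["text"].strip().lower()
        # Stages 1 and 3: bucket by category, then cluster identical forms
        grouped[tag["category"]][normalised].append(tag)
    # Stage 4: sort clusters by normalised form; display text is taken
    # from the first occurrence in each cluster
    return {
        category: [
            (occurrences[0]["text"].strip(), occurrences)
            for _, occurrences in sorted(clusters.items())
        ]
        for category, clusters in grouped.items()
    }
```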
The output HTML file is entirely self-contained, embedding all CSS and JavaScript. This design ensures the index remains functional when moved, archived, or shared without external dependencies.
Structure
The HTML employs a tabbed interface with one tab per tag category. Each tab contains:
- Category heading
- Search input field
- List of unique entities
Entity Display
Each entity appears as a collapsible item with:
- Toggle icon (chevron)
- Entity text
- Occurrence count in parentheses
- WikiData link (when available)
- Hidden details section
When expanded, the details section shows all occurrences with:
- Full TextLine content for context
- Page number
- Collection name
- Filename
- Direct link to Transkribus image (when available)
Interactivity
JavaScript provides three interactive features:
- Tab Switching: Clicking category tabs shows/hides corresponding content
- Entity Expansion: Clicking entity headers toggles visibility of occurrence details
- Search Filtering: Typing in the search box filters visible entities in the active tab
When tags include WikiData identifiers in their parameters:
```
geonames_locations {offset:4; length:6; wikiData:Q12345;}
```
The indexer extracts these identifiers and generates clickable links:
```html
<a href="https://www.wikidata.org/wiki/Q12345" target="_blank">Wikidata</a>
```

These links enable researchers to access additional context, variant names, geographic coordinates, and other linked data resources associated with the entity.
The generated HTML file follows this structure:
```html
<!DOCTYPE html>
<html lang="nl">
<head>
    <meta charset="UTF-8">
    <title>PageXML Index</title>
    <style>
        /* Embedded CSS for all visual styling */
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>PageXML Index</h1>
            <p class="subtitle">Overzicht van alle getagde elementen</p>
        </header>
        <div class="search-box">
            <input type="text" placeholder="Zoek...">
        </div>
        <div class="tabs">
            <div class="tab-header">
                <!-- Tab buttons for each category -->
            </div>
            <div class="tab-content">
                <!-- Entity listings -->
            </div>
        </div>
        <footer>
            <!-- Statistics -->
        </footer>
    </div>
    <script>
        /* JavaScript for interactivity */
    </script>
</body>
</html>
```

A typical entity entry in the HTML:
```html
<div class="item">
    <div class="item-header" onclick="toggleItem(this)">
        <svg class="toggle-icon"><!-- Chevron icon --></svg>
        <h3 class="item-title">
            Mathias van Buyren
            <span class="item-count">(3)</span>
            <a href="https://www.wikidata.org/wiki/Q123456" class="wiki-link">(Wikidata)</a>
        </h3>
    </div>
    <div class="occurrences">
        <div class="occurrence">
            <div class="occurrence-header">
                <div class="occurrence-text">
                    <p class="context-text">
                        <span class="context-label">Context:</span>
                        Alsoe ons Mathias van Buyren geschreven heeft...
                    </p>
                    <p class="occurrence-meta">
                        Pagina 12 • 0018 • 0012_NL-ZlHCO_0003.1_0018_00012.xml
                    </p>
                </div>
                <a href="https://files.transkribus.eu/..." class="scan-link">
                    Bekijk scan
                </a>
            </div>
        </div>
        <!-- Additional occurrences... -->
    </div>
</div>
```

The PageXML indexer serves as an analysis and validation tool within the broader workflow:
Stage 1: Transcription
- Manual transcription in Transkribus
- Export to PageXML format
Stage 2: Entity Annotation
- Manual or automated tagging of entities
- Addition of semantic markup to TextLine custom attributes
Stage 3: Index Generation (this tool)
- Extraction of all tagged entities
- Generation of browsable index
- Identification of annotation patterns
Stage 4: Validation and Refinement
- Review of entity index for consistency
- Identification of annotation errors or omissions
- Correction of tags in Transkribus
- Re-export and re-indexing
Stage 5: Further Analysis
- Meeting header identification
- Linked data conversion
- Statistical analysis
- Network graph generation
Iterative Indexing
Generate indices regularly throughout the annotation process rather than only at completion. Early indices help identify:
- Tag category naming inconsistencies
- Entities requiring WikiData linking
- Annotation coverage gaps
- Systematic transcription errors
Validation Workflow
Use the index systematically for quality control:
- Generate initial index after annotating a sample of pages
- Review each category for unexpected entries
- Check occurrence contexts for annotation accuracy
- Identify entities appearing with variant spellings
- Standardise spelling in Transkribus where appropriate
- Verify WikiData links function correctly
- Re-export corrected PageXML
- Regenerate index to confirm corrections
Documentation
Maintain a record of tag categories and their definitions alongside the index. Document:
- Criteria for entity selection within each category
- Boundary cases and decision rules
- WikiData identifier sources
- Annotation conventions specific to the project
Version Control
Track both PageXML collections and generated indices under version control:
- Commit PageXML after annotation sessions
- Commit generated index HTML after each indexing
- Tag versions corresponding to major milestones
- Include commit messages noting which collections were updated
Sharing and Publication
The self-contained HTML format facilitates sharing:
- Indices can be attached to emails
- Hosted on project websites without server-side processing
- Archived with PageXML collections
- Published as supplementary material with articles
The tool includes default labels for common tag categories. To modify these labels, edit the CATEGORY_LABELS dictionary in the PageXMLIndexer class:
```python
CATEGORY_LABELS = {
    'persoon': 'Personen',
    'geonames_locations': 'Locaties',
    'capaciteit_hoedanigheid': 'Hoedanigheden',
    'organisatie': 'Organisaties',
    'event_gebeurtenis': 'Gebeurtenissen',
    'datum': 'Data',
    'abbrev': 'Afkortingen',
    'Doc': 'Documenten',
    # Add custom categories here
    'custom_tag': 'Custom Label'
}
```

Categories not specified in this dictionary will display using their original tag names.
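The fallback to the original tag name amounts to a dictionary lookup with a default, along these lines (the `label_for` helper is hypothetical, shown only to make the behaviour concrete):

```python
# Hypothetical subset of CATEGORY_LABELS, for illustration
CATEGORY_LABELS = {
    'persoon': 'Personen',
    'geonames_locations': 'Locaties',
}

def label_for(category):
    # Unknown categories fall back to their original tag name
    return CATEGORY_LABELS.get(category, category)
```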
To exclude tag categories from indexing, add them to the SKIP_TAGS set:
```python
SKIP_TAGS = {
    'readingOrder',
    'structure',
    'score',
    'type',
    'index',
    'caption',
    # Add additional tags to skip
    'internal_note',
    'processing_flag'
}
```

The embedded CSS can be modified to change visual appearance. Key style classes include:
- `.tabs`: Main container styling
- `.tab-button`: Category tab appearance
- `.item`: Entity entry styling
- `.item-header`: Clickable entity header
- `.occurrence`: Individual occurrence styling
- `.scan-link`: Button for Transkribus links
Problem: "No collections found in [directory]"
Causes and Solutions:
- Wrong directory level: Ensure you are pointing to the parent directory containing collection folders, not a collection folder itself
- Missing page/ subdirectories: Verify each collection has a `page/` subfolder
- No XML files: Check that XML files exist within `page/` folders
- File permissions: Ensure read access to the directory structure
Problem: "No tags found in any collection"
Causes and Solutions:
- Tags not yet added: Verify that entity annotation has been performed in Transkribus
- Export timing: Ensure PageXML was exported after annotation, not before
- Tag format: Check that tags follow the pattern `tagname {param:value;}`
- All tags excluded: Verify your custom SKIP_TAGS hasn't excluded all relevant tags
- XML parsing issues: Check a sample file manually to confirm tags are present
Problem: Page numbers in index don't match expected values
Causes and Solutions:
- Metadata source: The tool extracts page numbers from TranskribusMetadata pageNr attribute, not from filenames
- Page numbering in Transkribus: Verify page numbering is correctly set in the Transkribus interface
- Re-export required: If page numbers were corrected in Transkribus, re-export the PageXML
Problem: "Bekijk scan" links are absent or non-functional
Causes and Solutions:
- Private collections: Transkribus image URLs may require authentication
- Export settings: Verify that image URLs were included in the PageXML export
- URL expiration: Some Transkribus URLs may be time-limited
- Network access: Links require internet connectivity to function
Problem: Warnings about unparseable XML files
Causes and Solutions:
- Corrupted exports: Re-export affected files from Transkribus
- Manual editing errors: Validate XML syntax if files were manually edited
- Encoding issues: Ensure files are UTF-8 encoded
- Incomplete downloads: Verify file sizes match expected values
Problem: Generated HTML file won't open or displays incorrectly
Causes and Solutions:
- File associations: Ensure `.html` files are associated with a web browser
- Character encoding: Some older browsers may have issues with UTF-8; try a modern browser (Chrome, Firefox, Safari, Edge)
- File size: Very large collections may generate HTML files that are slow to load; be patient
- JavaScript disabled: The index requires JavaScript; ensure it's enabled in browser settings
Problem: Script crashes with memory errors on large collections
Causes and Solutions:
- File-by-file processing: The script processes files sequentially to minimise memory usage
- Large occurrence counts: Entities with thousands of occurrences may require significant memory
- Split processing: Process collections separately and generate individual indices
- System resources: Close other applications to free memory
Processing performance scales approximately linearly with:
- Number of XML files
- Number of tagged entities
- Length of Unicode text in TextLines
Typical performance on modern hardware:
- ~50-100 files per second for parsing
- ~1-2 seconds for grouping and sorting
- ~5-10 seconds for HTML generation
Total processing time for 1000 files: approximately 1-2 minutes
Memory usage remains modest:
- Peak memory: typically under 500 MB
- HTML file size: approximately 1-2 KB per tagged occurrence
The tool registers the PAGE namespace before parsing:
```python
ET.register_namespace('page', 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15')
```

This ensures that namespace prefixes are preserved if the tool were extended to write modified XML files.
All file operations use UTF-8 encoding explicitly:
```python
Path(output_path).write_text(html_content, encoding='utf-8')
```

(Note that `write_text` returns the number of characters written, so its return value should not be assigned back to the content variable.) This ensures correct handling of:
- Early modern Dutch orthography (ij, aen, etc.)
- Diacritical marks (é, ë, ñ)
- Special characters in entity names
- Unicode symbols in generated HTML
Tag parsing employs the pattern:
```python
r'(\w+)\s*\{([^}]+)\}'
```

This matches:

- `\w+`: Tag name (letters, digits, underscore)
- `\s*`: Optional whitespace
- `\{`: Opening brace
- `[^}]+`: Parameters (any characters except closing brace)
- `\}`: Closing brace
Parameter parsing splits on semicolons and colons, handling various spacing conventions in custom attributes.
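Combining the pattern with parameter splitting, the parsing step can be sketched as follows (an illustrative reimplementation; the script's actual function names and return types may differ):

```python
import re

TAG_PATTERN = re.compile(r'(\w+)\s*\{([^}]+)\}')
SKIP_TAGS = {'readingOrder', 'structure', 'score', 'type', 'index', 'caption'}

def parse_custom_attribute(custom):
    """Return a list of (tagname, params) pairs for non-structural tags."""
    results = []
    for name, params in TAG_PATTERN.findall(custom):
        if name in SKIP_TAGS:
            continue
        pairs = {}
        for part in params.split(';'):
            if ':' in part:
                # Split on the first colon only, so values such as URLs
                # containing colons remain intact
                key, _, value = part.partition(':')
                pairs[key.strip()] = value.strip()
        results.append((name, pairs))
    return results
```

Returning a list rather than a dict preserves multiple tags of the same category on one TextLine (e.g. two persons in a single line).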
The tool's architecture facilitates addition of other output formats:
CSV Export
Generate tabular data for spreadsheet analysis:

```python
def generate_csv_output(self, output_path: Path):
    # Write category, entity, count, occurrences to CSV
    ...
```

JSON Export
Create structured data for programmatic access:

```python
def generate_json_output(self, output_path: Path):
    # Serialise grouped_tags to JSON
    ...
```

Markdown Export
Generate human-readable plain text listings:

```python
def generate_markdown_output(self, output_path: Path):
    # Create Markdown with entity lists
    ...
```

Additional methods could calculate:
- Tag category distributions
- Entities per page statistics
- Co-occurrence matrices
- Temporal distributions (using datum tags)
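To make the extension pattern concrete, the CSV export stub could be fleshed out along these lines (a sketch that assumes `grouped_tags` maps each category to a list of (display_text, occurrences) pairs; the real attribute names may differ):

```python
import csv
from pathlib import Path

def generate_csv_output(grouped_tags, output_path):
    """Write one row per unique entity: category, entity, occurrence count."""
    with Path(output_path).open('w', newline='', encoding='utf-8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['category', 'entity', 'count'])
        for category, entities in grouped_tags.items():
            for display_text, occurrences in entities:
                writer.writerow([category, display_text, len(occurrences)])
```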
The grouped_tags data structure could feed:
- Network graphs of entity relationships
- Geographic maps of locations
- Timeline visualisations of events
- Word clouds of entity frequencies
The tool could be extended to:
- Query WikiData API for entity enrichment
- Validate geonames against GeoNames database
- Cross-reference persons with authority files (VIAF, ORCID)
- Link organisations to cultural heritage databases
- No disambiguation: Identical text strings are treated as the same entity even when referring to different individuals
- Case sensitivity: Clustering uses lowercase normalisation, potentially merging entities that should remain distinct
- Context window: Only the containing TextLine is shown as context, not surrounding lines
- Single-page processing: Each page is processed independently; multi-page entities are not connected
- No temporal ordering: Occurrences are listed in processing order, not chronological order
- Memory constraints: Very large collections (10,000+ pages) may approach memory limits on systems with <4GB RAM
- HTML size: Extremely large indices (100,000+ tags) may cause browser performance issues
- No concurrent processing: Files are processed sequentially; parallel processing could improve speed
Potential enhancements for subsequent versions:
- Integration with authority files for person names
- Automated WikiData identifier suggestions
- Geographic coordinate extraction for locations
- Cross-reference resolution across collections
- Interactive network graphs of entity co-occurrences
- Timeline views of dated entities
- Geographic maps of tagged locations
- Statistical dashboards
- CSV format for quantitative analysis
- JSON format for API integration
- RDF/TTL format for linked data
- GraphML format for network analysis tools
- Identification of entities lacking WikiData identifiers
- Detection of potential annotation errors
- Suggestion of standardised spellings
- Flagging of entities with very few occurrences
- Parallel processing of multiple files
- Incremental indexing (process only changed files)
- Database backend for very large collections
- Chunked HTML generation for massive indices
This PageXML Indexer Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].
Development was assisted by Claude (Anthropic) for code implementation and documentation.
- Email: c.a.romein@utwente.nl
This project is licensed under the MIT Licence.
Version 1.0 (2025): Initial release with automatic tag discovery, multi-collection processing, entity grouping, and interactive HTML index generation