PageXML Indexer Tool

A Python tool for extracting and indexing tagged entities from PageXML files. The tool automatically discovers all tags in PageXML collections, groups them by category, and generates an interactive HTML index for browsing and analysis.

Purpose

Digital scholarly editions of historical sources increasingly employ Named Entity Recognition (NER) and other annotation approaches to enrich transcriptions. Tools such as Transkribus enable researchers to add semantic tags to TextLine elements within PageXML files, marking persons, locations, organisations, dates, and other entities of historical interest. These annotations typically appear as key-value pairs in the custom attribute of TextLine elements, following patterns such as persoon {offset:12; length:8;} or geonames_locations {offset:4; length:6; wikiData:Q12345;}.

However, once annotation is complete, researchers face the challenge of efficiently accessing and analysing these tagged entities across large document collections. Manual inspection of XML files proves impractical beyond a few dozen pages, and generic XML tools lack the domain-specific functionality needed for historical text analysis. Researchers require the ability to:

  • Survey which entities have been tagged throughout a collection
  • Identify how frequently particular entities appear
  • Locate all contexts in which an entity is mentioned
  • Verify annotation consistency and completeness
  • Access original page images for validation

The PageXML Indexer addresses these requirements through automated extraction and presentation of all tagged entities. The tool scans PageXML collections, identifies all semantic tags regardless of category, extracts the tagged text and its context, groups identical entities, and generates a self-contained interactive HTML index. This index enables researchers to browse entities alphabetically within categories, expand entries to view all occurrences with context, and link directly to Transkribus page images when available.

Features

Automatic Tag Discovery

  • Detection of all tag categories present in PageXML custom attributes
  • No pre-configuration required; adapts to project-specific tag schemas
  • Recognition of any attribute following the pattern tagname {param:value;}
  • Exclusion of structural tags (readingOrder, structure, etc.)

Multi-Collection Processing

  • Recursive scanning of directory structures
  • Automatic detection of collection directories containing page/ subdirectories
  • Processing of all PageXML files across multiple collections
  • Preservation of collection identity for each tagged entity

Comprehensive Metadata Extraction

  • Page numbers from Transkribus metadata
  • Document and page identifiers for reference
  • Image URLs for direct linking to scans
  • TextLine coordinates for precise location
  • WikiData identifiers when present in tags
  • Additional parameters from custom attributes

Intelligent Grouping and Sorting

  • Case-insensitive clustering of identical entities
  • Alphabetical sorting within each category
  • Preservation of original spelling for display
  • Aggregation of all occurrences per unique entity

Interactive HTML Output

  • Self-contained single-file index
  • Tabbed interface with separate sections per tag category
  • Search functionality within active category
  • Expandable/collapsible entity entries
  • Direct links to Transkribus page images
  • Full context display for each occurrence
  • Collection, page number, and filename references

Extensible Design

  • Support for arbitrary tag categories and naming conventions
  • Customisable category labels through configuration
  • Unicode support for multi-language content
  • Standard library dependencies only

Requirements

  • Python 3.7 or higher
  • Standard library modules only (no external dependencies required)

The tool uses exclusively Python standard library modules (xml.etree.ElementTree, pathlib, re, json, collections, argparse), ensuring compatibility across different computing environments without dependency management concerns.

Installation

Download the script directly or clone the repository containing the PageXML indexer tool. No installation process is required beyond ensuring Python 3.7+ is available on your system.

Place pagexml_indexer.py in a convenient location, such as a tools/ or scripts/ directory within your project structure.

PageXML Folder Structure

The tool expects PageXML collections organised with the following structure:

collections_base/
├── Collection_A/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       ├── 0002_archive_collection_00002.xml
│       └── 0003_archive_collection_00003.xml
├── Collection_B/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       └── 0002_archive_collection_00002.xml
└── Collection_C/
    └── page/
        ├── 0001_archive_collection_00001.xml
        ├── 0002_archive_collection_00002.xml
        └── 0003_archive_collection_00003.xml

Each collection directory must contain a page/ subdirectory housing individual PageXML files. The script identifies collections by scanning for directories containing page/ subdirectories with XML files.

Usage

Basic Invocation

To generate an index for all collections in a base directory:

python pagexml_indexer.py /path/to/collections

The tool will:

  1. Scan for all collections in the specified directory
  2. Process all PageXML files in each collection
  3. Extract all tagged entities
  4. Generate an HTML index file named pagexml_index.html in the base directory

Alternative Invocation Methods

Specify a custom output filename:

python pagexml_indexer.py /path/to/collections --output my_index.html

Suppress progress output:

python pagexml_indexer.py /path/to/collections --quiet

Interactive path entry (no command-line argument):

python pagexml_indexer.py

The tool will prompt: Enter path to collections directory:

Command-Line Options

positional arguments:
  base_path             Path to base directory containing PageXML collections

optional arguments:
  -h, --help            Show help message and exit
  -o OUTPUT, --output OUTPUT
                        Output HTML file path (default: pagexml_index.html in base directory)
  -q, --quiet           Suppress progress output
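The documented interface can be wired up with `argparse` roughly as follows. This is a sketch of the CLI described above, not necessarily the script's exact code; `build_parser` is an illustrative name:

```python
import argparse
from pathlib import Path

def build_parser():
    # Mirrors the documented CLI: one optional positional path plus -o/--output and -q/--quiet
    parser = argparse.ArgumentParser(
        description="Index tagged entities in PageXML collections")
    parser.add_argument("base_path", nargs="?", default=None,
                        help="Path to base directory containing PageXML collections")
    parser.add_argument("-o", "--output", default=None,
                        help="Output HTML file path "
                             "(default: pagexml_index.html in base directory)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress output")
    return parser

args = build_parser().parse_args(["/data/collections", "--output", "my_index.html"])
# Default output location falls back to the base directory
output = Path(args.output) if args.output else Path(args.base_path) / "pagexml_index.html"
```

When `base_path` is omitted, the script falls through to the interactive prompt described above.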

Example Session

Complete example demonstrating typical usage:

$ python pagexml_indexer.py ~/Documents/Resoluties/PageXML

=== PageXML Indexer ===

Found 3 collection(s):
  - 0018
  - 0019
  - 0020

  Processing collection: 0018
  Found 156 XML files
    Processed 156/156 files    
  Extracted 234 tags from 0018

  Processing collection: 0019
  Found 142 XML files
    Processed 142/142 files    
  Extracted 198 tags from 0019

  Processing collection: 0020
  Found 178 XML files
    Processed 178/178 files    
  Extracted 267 tags from 0020

=== Summary ===
Total tags extracted: 699
Collections processed: 3

Grouping and sorting tags...
Found 8 tag categories:
  - Personen: 45 unique items
  - Locaties: 38 unique items
  - Hoedanigheden: 67 unique items
  - Organisaties: 23 unique items
  - Gebeurtenissen: 12 unique items
  - Data: 89 unique items
  - Afkortingen: 156 unique items
  - Documenten: 8 unique items

Generating HTML output...
Generated HTML index: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html

✓ Index successfully generated: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html
  Open the file in your web browser to view the index

How It Works

Collection Discovery

The indexer begins by scanning the provided base directory for collection structures. It identifies subdirectories containing a page/ folder with XML files. This approach accommodates both single-collection and multi-collection directory structures.

The discovery process:

  1. Iterates through all immediate subdirectories of the base path
  2. Checks each subdirectory for a page/ folder
  3. Verifies that the page/ folder contains .xml files
  4. Adds qualifying directories to the collection list

Collections are processed in alphabetical order by directory name.
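The discovery steps above can be sketched with `pathlib` as follows (a minimal sketch; `find_collections` is an illustrative name, not necessarily the script's):

```python
from pathlib import Path

def find_collections(base_path):
    """Return subdirectories that contain a page/ folder with at least one XML file."""
    collections = []
    for candidate in sorted(base_path.iterdir()):  # alphabetical by directory name
        page_dir = candidate / "page"
        if candidate.is_dir() and page_dir.is_dir() and any(page_dir.glob("*.xml")):
            collections.append(candidate)
    return collections
```

Directories without a `page/` subfolder, or whose `page/` folder holds no XML files, are silently skipped.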

PageXML Parsing

For each XML file, the tool employs Python's ElementTree parser with namespace awareness. The PageXML format uses the namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15, which the tool registers for consistent element lookup.

The parser extracts several levels of metadata:

Document-Level Metadata

  • docId: Transkribus document identifier
  • pageId: Transkribus page identifier
  • pageNr: Sequential page number within the document
  • imgUrl: Direct URL to page image in Transkribus storage

Region-Level Elements

All TextLine elements are examined regardless of their parent TextRegion. This ensures comprehensive tag extraction even when document structure varies.

Line-Level Attributes

  • id: Unique identifier for the TextLine
  • custom: String containing semicolon-separated key-value pairs
  • Coords points: Polygon coordinates defining the line's bounding box
  • Unicode: Transcribed text content
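The namespace-aware lookup of these attributes can be sketched with `ElementTree` as follows (the sample document is illustrative, but the namespace URI matches the PageXML 2013-07-15 schema):

```python
import xml.etree.ElementTree as ET

# PageXML 2013-07-15 namespace, as used by Transkribus exports
NS = {"page": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg">
    <TextRegion id="r1">
      <TextLine id="l1" custom="readingOrder {index:0;} persoon {offset:10; length:7;}">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>Alsoe ons Mathias van Buyren geschreven heeft</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
lines = []
# Walk every TextLine regardless of its parent TextRegion
for line in root.iter("{%s}TextLine" % NS["page"]):
    unicode_el = line.find("./page:TextEquiv/page:Unicode", NS)
    coords_el = line.find("./page:Coords", NS)
    lines.append({
        "id": line.get("id"),
        "custom": line.get("custom", ""),
        "coords": coords_el.get("points") if coords_el is not None else None,
        "text": unicode_el.text if unicode_el is not None else "",
    })
```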

Custom Attribute Parsing

The custom attribute contains metadata in a specific format:

readingOrder {index:0;} persoon {offset:12; length:8;} geonames_locations {offset:45; length:6; wikiData:Q12345;}

The parser uses regular expressions to identify each tag:

  1. Matches pattern: tagname {param1:value1; param2:value2;}
  2. Extracts tag name
  3. Splits parameter string on semicolons
  4. Parses each parameter as key:value pair
  5. Filters out structural tags (readingOrder, structure, score, type, index, caption)
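The five steps above can be sketched as follows (a minimal sketch; `parse_custom` is an illustrative name, though the regular expression matches the pattern documented under Technical Details):

```python
import re

SKIP_TAGS = {"readingOrder", "structure", "score", "type", "index", "caption"}
TAG_PATTERN = re.compile(r"(\w+)\s*\{([^}]+)\}")

def parse_custom(custom):
    """Yield (tag_name, params_dict) for each semantic tag in a custom attribute."""
    for name, body in TAG_PATTERN.findall(custom):
        if name in SKIP_TAGS:  # drop structural tags
            continue
        params = {}
        for part in body.split(";"):
            if ":" in part:
                key, _, value = part.partition(":")
                params[key.strip()] = value.strip()
        yield name, params

tags = list(parse_custom(
    "readingOrder {index:0;} persoon {offset:12; length:8;} "
    "geonames_locations {offset:45; length:6; wikiData:Q12345;}"
))
```

Splitting on the first colon of each `key:value` pair keeps values such as URLs (which may themselves contain colons) intact.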

Text Extraction

For each identified tag, the tool extracts the corresponding text from the Unicode content using the offset and length parameters:

offset = 10
length = 7
unicode_text = "Alsoe ons Mathias van Buyren geschreven heeft..."
tagged_text = unicode_text[offset:offset + length]  # "Mathias"

This positional extraction ensures accurate correspondence between tags and text, even when multiple tags reference the same TextLine.

Entity Grouping

Extracted entities are grouped through a multi-stage process:

Stage 1: Category Grouping

All tags are first grouped by their tag category (persoon, geonames_locations, etc.). This separates different entity types for independent processing.

Stage 2: Normalisation

Within each category, entity text is normalised for comparison:

  • Whitespace trimming
  • Lowercase conversion
  • Retention of original diacritics and punctuation

Stage 3: Clustering

Entities with identical normalised forms are clustered together. For display purposes, the original spelling from the first occurrence is preserved:

"Mathias" (3 occurrences)
  - "Mathias" from page 12
  - "mathias" from page 45
  - "MATHIAS" from page 67

Stage 4: Alphabetical Sorting

Clustered entities are sorted alphabetically by their normalised forms within each category.
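The four stages can be sketched together as follows (a minimal sketch; field names such as "category" and "text" are illustrative, not the script's internal structure):

```python
from collections import defaultdict

def group_entities(tags):
    """Cluster tags per category by normalised text; keep first-seen spelling for display."""
    grouped = defaultdict(dict)  # category -> {normalised_text: entity record}
    for tag in tags:
        norm = tag["text"].strip().lower()  # trim whitespace, lowercase; diacritics kept
        record = grouped[tag["category"]].setdefault(
            norm, {"display": tag["text"].strip(), "occurrences": []}
        )
        record["occurrences"].append(tag)
    # Sort entities alphabetically by normalised form within each category
    return {cat: dict(sorted(items.items())) for cat, items in grouped.items()}

sample = [
    {"category": "persoon", "text": "Mathias", "page": 12},
    {"category": "persoon", "text": "mathias", "page": 45},
    {"category": "persoon", "text": "MATHIAS", "page": 67},
    {"category": "persoon", "text": "Anna", "page": 3},
]
grouped = group_entities(sample)
```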

HTML Generation

The output HTML file is entirely self-contained, embedding all CSS and JavaScript. This design ensures the index remains functional when moved, archived, or shared without external dependencies.

Structure

The HTML employs a tabbed interface with one tab per tag category. Each tab contains:

  • Category heading
  • Search input field
  • List of unique entities

Entity Display

Each entity appears as a collapsible item with:

  • Toggle icon (chevron)
  • Entity text
  • Occurrence count in parentheses
  • WikiData link (when available)
  • Hidden details section

When expanded, the details section shows all occurrences with:

  • Full TextLine content for context
  • Page number
  • Collection name
  • Filename
  • Direct link to Transkribus image (when available)

Interactivity

JavaScript provides three interactive features:

  1. Tab Switching: Clicking category tabs shows/hides corresponding content
  2. Entity Expansion: Clicking entity headers toggles visibility of occurrence details
  3. Search Filtering: Typing in the search box filters visible entities in the active tab

WikiData Integration

When tags include WikiData identifiers in their parameters:

geonames_locations {offset:4; length:6; wikiData:Q12345;}

The indexer extracts these identifiers and generates clickable links:

<a href="https://www.wikidata.org/wiki/Q12345" target="_blank">Wikidata</a>

These links enable researchers to access additional context, variant names, geographic coordinates, and other linked data resources associated with the entity.
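Generating the link amounts to a simple lookup of the wikiData parameter, sketched below (`wikidata_link` is an illustrative name, not necessarily the script's):

```python
def wikidata_link(params):
    """Return an HTML anchor for a wikiData parameter, or '' when the tag has none."""
    qid = params.get("wikiData")
    if not qid:
        return ""
    return f'<a href="https://www.wikidata.org/wiki/{qid}" target="_blank">Wikidata</a>'
```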

Output Format

HTML Index Structure

The generated HTML file follows this structure:

<!DOCTYPE html>
<html lang="nl">
<head>
    <meta charset="UTF-8">
    <title>PageXML Index</title>
    <style>
        /* Embedded CSS for all visual styling */
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>PageXML Index</h1>
            <p class="subtitle">Overzicht van alle getagde elementen</p>
        </header>
        
        <div class="search-box">
            <input type="text" placeholder="Zoek...">
        </div>
        
        <div class="tabs">
            <div class="tab-header">
                <!-- Tab buttons for each category -->
            </div>
            
            <div class="tab-content">
                <!-- Entity listings -->
            </div>
        </div>
        
        <footer>
            <!-- Statistics -->
        </footer>
    </div>
    
    <script>
        /* JavaScript for interactivity */
    </script>
</body>
</html>

Entity Entry Example

A typical entity entry in the HTML:

<div class="item">
    <div class="item-header" onclick="toggleItem(this)">
        <svg class="toggle-icon"><!-- Chevron icon --></svg>
        <h3 class="item-title">
            Mathias van Buyren
            <span class="item-count">(3)</span>
            <a href="https://www.wikidata.org/wiki/Q123456" class="wiki-link">(Wikidata)</a>
        </h3>
    </div>
    <div class="occurrences">
        <div class="occurrence">
            <div class="occurrence-header">
                <div class="occurrence-text">
                    <p class="context-text">
                        <span class="context-label">Context:</span>
                        Alsoe ons Mathias van Buyren geschreven heeft...
                    </p>
                    <p class="occurrence-meta">
                        Pagina 12 • 0018 • 0012_NL-ZlHCO_0003.1_0018_00012.xml
                    </p>
                </div>
                <a href="https://files.transkribus.eu/..." class="scan-link">
                    Bekijk scan
                </a>
            </div>
        </div>
        <!-- Additional occurrences... -->
    </div>
</div>

Workflow Integration

Position in Processing Pipeline

The PageXML indexer serves as an analysis and validation tool within the broader workflow:

Stage 1: Transcription

  • Manual transcription in Transkribus
  • Export to PageXML format

Stage 2: Entity Annotation

  • Manual or automated tagging of entities
  • Addition of semantic markup to TextLine custom attributes

Stage 3: Index Generation (this tool)

  • Extraction of all tagged entities
  • Generation of browsable index
  • Identification of annotation patterns

Stage 4: Validation and Refinement

  • Review of entity index for consistency
  • Identification of annotation errors or omissions
  • Correction of tags in Transkribus
  • Re-export and re-indexing

Stage 5: Further Analysis

  • Meeting header identification
  • Linked data conversion
  • Statistical analysis
  • Network graph generation

Recommended Practices

Iterative Indexing

Generate indices regularly throughout the annotation process rather than only at completion. Early indices help identify:

  • Tag category naming inconsistencies
  • Entities requiring WikiData linking
  • Annotation coverage gaps
  • Systematic transcription errors

Validation Workflow

Use the index systematically for quality control:

  1. Generate initial index after annotating a sample of pages
  2. Review each category for unexpected entries
  3. Check occurrence contexts for annotation accuracy
  4. Identify entities appearing with variant spellings
  5. Standardise spelling in Transkribus where appropriate
  6. Verify WikiData links function correctly
  7. Re-export corrected PageXML
  8. Regenerate index to confirm corrections

Documentation

Maintain a record of tag categories and their definitions alongside the index. Document:

  • Criteria for entity selection within each category
  • Boundary cases and decision rules
  • WikiData identifier sources
  • Annotation conventions specific to the project

Version Control

Track both PageXML collections and generated indices under version control:

  • Commit PageXML after annotation sessions
  • Commit generated index HTML after each indexing
  • Tag versions corresponding to major milestones
  • Include commit messages noting which collections were updated

Sharing and Publication

The self-contained HTML format facilitates sharing:

  • Indices can be attached to emails
  • Hosted on project websites without server-side processing
  • Archived with PageXML collections
  • Published as supplementary material with articles

Customisation

Modifying Category Labels

The tool includes default labels for common tag categories. To modify these labels, edit the CATEGORY_LABELS dictionary in the PageXMLIndexer class:

CATEGORY_LABELS = {
    'persoon': 'Personen',
    'geonames_locations': 'Locaties',
    'capaciteit_hoedanigheid': 'Hoedanigheden',
    'organisatie': 'Organisaties',
    'event_gebeurtenis': 'Gebeurtenissen',
    'datum': 'Data',
    'abbrev': 'Afkortingen',
    'Doc': 'Documenten',
    # Add custom categories here
    'custom_tag': 'Custom Label'
}

Categories not specified in this dictionary will display using their original tag names.
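This fallback behaviour amounts to a dictionary lookup with the raw tag name as default, roughly as follows (a sketch; `display_label` is an illustrative name, and the dictionary here is abbreviated):

```python
CATEGORY_LABELS = {
    "persoon": "Personen",
    "geonames_locations": "Locaties",
}

def display_label(tag_name):
    # Fall back to the raw tag name when no label is configured
    return CATEGORY_LABELS.get(tag_name, tag_name)
```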

Excluding Additional Tags

To exclude tag categories from indexing, add them to the SKIP_TAGS set:

SKIP_TAGS = {
    'readingOrder', 
    'structure', 
    'score', 
    'type', 
    'index', 
    'caption',
    # Add additional tags to skip
    'internal_note',
    'processing_flag'
}

Adjusting HTML Styling

The embedded CSS can be modified to change visual appearance. Key style classes include:

  • .tabs: Main container styling
  • .tab-button: Category tab appearance
  • .item: Entity entry styling
  • .item-header: Clickable entity header
  • .occurrence: Individual occurrence styling
  • .scan-link: Button for Transkribus links

Troubleshooting

No Collections Found

Problem: "No collections found in [directory]"

Causes and Solutions:

  • Wrong directory level: Ensure you are pointing to the parent directory containing collection folders, not a collection folder itself
  • Missing page/ subdirectories: Verify each collection has a page/ subfolder
  • No XML files: Check that XML files exist within page/ folders
  • File permissions: Ensure read access to the directory structure

No Tags Found

Problem: "No tags found in any collection"

Causes and Solutions:

  • Tags not yet added: Verify that entity annotation has been performed in Transkribus
  • Export timing: Ensure PageXML was exported after annotation, not before
  • Tag format: Check that tags follow the pattern tagname {param:value;}
  • All tags excluded: Verify your custom SKIP_TAGS hasn't excluded all relevant tags
  • XML parsing issues: Check a sample file manually to confirm tags are present

Incorrect Page Numbers

Problem: Page numbers in index don't match expected values

Causes and Solutions:

  • Metadata source: The tool extracts page numbers from TranskribusMetadata pageNr attribute, not from filenames
  • Page numbering in Transkribus: Verify page numbering is correctly set in the Transkribus interface
  • Re-export required: If page numbers were corrected in Transkribus, re-export the PageXML

Missing Image Links

Problem: "Bekijk scan" links are absent or non-functional

Causes and Solutions:

  • Private collections: Transkribus image URLs may require authentication
  • Export settings: Verify that image URLs were included in the PageXML export
  • URL expiration: Some Transkribus URLs may be time-limited
  • Network access: Links require internet connectivity to function

XML Parsing Errors

Problem: Warnings about unparseable XML files

Causes and Solutions:

  • Corrupted exports: Re-export affected files from Transkribus
  • Manual editing errors: Validate XML syntax if files were manually edited
  • Encoding issues: Ensure files are UTF-8 encoded
  • Incomplete downloads: Verify file sizes match expected values

HTML Not Opening

Problem: Generated HTML file won't open or displays incorrectly

Causes and Solutions:

  • File associations: Ensure .html files are associated with a web browser
  • Character encoding: Some older browsers may have issues with UTF-8; try a modern browser (Chrome, Firefox, Safari, Edge)
  • File size: Very large collections may generate HTML files that are slow to load; be patient
  • JavaScript disabled: The index requires JavaScript; ensure it's enabled in browser settings

Memory Issues

Problem: Script crashes with memory errors on large collections

Causes and Solutions:

  • File-by-file processing: The script processes files sequentially to minimise memory usage
  • Large occurrence counts: Entities with thousands of occurrences may require significant memory
  • Split processing: Process collections separately and generate individual indices
  • System resources: Close other applications to free memory

Technical Details

Performance Characteristics

Processing performance scales approximately linearly with:

  • Number of XML files
  • Number of tagged entities
  • Length of Unicode text in TextLines

Typical performance on modern hardware:

  • ~50-100 files per second for parsing
  • ~1-2 seconds for grouping and sorting
  • ~5-10 seconds for HTML generation

Total processing time for 1,000 files: typically well under a minute (parsing dominates)

Memory usage remains modest:

  • Peak memory: typically under 500 MB
  • HTML file size: approximately 1-2 KB per tagged occurrence

XML Namespace Handling

The tool registers the PAGE namespace before parsing:

ET.register_namespace('page', 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15')

This ensures that namespace prefixes are preserved if the tool were extended to write modified XML files.

Character Encoding

All file operations use UTF-8 encoding explicitly:

Path(output_path).write_text(html_content, encoding='utf-8')

This ensures correct handling of:

  • Early modern Dutch orthography (ij, aen, etc.)
  • Diacritical marks (é, ë, ñ)
  • Special characters in entity names
  • Unicode symbols in generated HTML

Regular Expression Patterns

Tag parsing employs the pattern:

r'(\w+)\s*\{([^}]+)\}'

This matches:

  • \w+: Tag name (letters, digits, underscore)
  • \s*: Optional whitespace
  • \{: Opening brace
  • [^}]+: Parameters (any characters except closing brace)
  • \}: Closing brace

Parameter parsing splits on semicolons and colons, handling various spacing conventions in custom attributes.

Extension Points

Alternative Output Formats

The tool's architecture facilitates addition of other output formats:

CSV Export

Generate tabular data for spreadsheet analysis:

def generate_csv_output(self, output_path: Path):
    # Write category, entity, count, occurrences to CSV

JSON Export

Create structured data for programmatic access:

def generate_json_output(self, output_path: Path):
    # Serialise grouped_tags to JSON

Markdown Export

Generate human-readable plain text listings:

def generate_markdown_output(self, output_path: Path):
    # Create Markdown with entity lists
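As an illustration, the JSON variant could be filled in along these lines. This assumes `grouped_tags` maps category to entity to a list of occurrence dicts, which may differ from the script's actual internal structure:

```python
import json
from pathlib import Path

def generate_json_output(grouped_tags, output_path):
    """Serialise grouped tags to JSON; ensure_ascii=False keeps diacritics readable."""
    Path(output_path).write_text(
        json.dumps(grouped_tags, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```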

Statistical Analysis

Additional methods could calculate:

  • Tag category distributions
  • Entities per page statistics
  • Co-occurrence matrices
  • Temporal distributions (using datum tags)

Visualisation

The grouped_tags data structure could feed:

  • Network graphs of entity relationships
  • Geographic maps of locations
  • Timeline visualisations of events
  • Word clouds of entity frequencies

Integration with External Services

The tool could be extended to:

  • Query WikiData API for entity enrichment
  • Validate geonames against GeoNames database
  • Cross-reference persons with authority files (VIAF, ORCID)
  • Link organisations to cultural heritage databases

Limitations

  • No disambiguation: Identical text strings are treated as the same entity even when referring to different individuals
  • Case sensitivity: Clustering uses lowercase normalisation, potentially merging entities that should remain distinct
  • Context window: Only the containing TextLine is shown as context, not surrounding lines
  • Single-page processing: Each page is processed independently; multi-page entities are not connected
  • No temporal ordering: Occurrences are listed in processing order, not chronological order
  • Memory constraints: Very large collections (10,000+ pages) may approach memory limits on systems with <4GB RAM
  • HTML size: Extremely large indices (100,000+ tags) may cause browser performance issues
  • No concurrent processing: Files are processed sequentially; parallel processing could improve speed

Future Development

Potential enhancements for subsequent versions:

Entity Linking

  • Integration with authority files for person names
  • Automated WikiData identifier suggestions
  • Geographic coordinate extraction for locations
  • Cross-reference resolution across collections

Advanced Visualisations

  • Interactive network graphs of entity co-occurrences
  • Timeline views of dated entities
  • Geographic maps of tagged locations
  • Statistical dashboards

Export Flexibility

  • CSV format for quantitative analysis
  • JSON format for API integration
  • RDF/TTL format for linked data
  • GraphML format for network analysis tools

Annotation Support

  • Identification of entities lacking WikiData identifiers
  • Detection of potential annotation errors
  • Suggestion of standardised spellings
  • Flagging of entities with very few occurrences

Performance Optimisation

  • Parallel processing of multiple files
  • Incremental indexing (process only changed files)
  • Database backend for very large collections
  • Chunked HTML generation for massive indices

Acknowledgements

This PageXML Indexer Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].

Development was assisted by Claude (Anthropic) for code implementation and documentation.

Contact

Licence

This project is licensed under the MIT Licence.

Version History

Version 1.0 (2025): Initial release with automatic tag discovery, multi-collection processing, entity grouping, and interactive HTML index generation
