PageXML Indexer Tool

A Python tool for extracting and indexing tagged entities from PageXML files. The tool automatically discovers all tags in PageXML collections, groups them by category, and generates an interactive HTML index for browsing and analysis.

Purpose

Digital scholarly editions of historical sources increasingly employ Named Entity Recognition (NER) and other annotation approaches to enrich transcriptions. Tools such as Transkribus enable researchers to add semantic tags to TextLine elements within PageXML files, marking persons, locations, organisations, dates, and other entities of historical interest. These annotations typically appear as key-value pairs in the custom attribute of TextLine elements, following patterns such as persoon {offset:12; length:8;} or geonames_locations {offset:4; length:6; wikiData:Q12345;}.

However, once annotation is complete, researchers face the challenge of efficiently accessing and analysing these tagged entities across large document collections. Manual inspection of XML files proves impractical beyond a few dozen pages, and generic XML tools lack the domain-specific functionality needed for historical text analysis. Researchers require the ability to:

  • Survey which entities have been tagged throughout a collection
  • Identify how frequently particular entities appear
  • Locate all contexts in which an entity is mentioned
  • Verify annotation consistency and completeness
  • Access original page images for validation

The PageXML Indexer addresses these requirements through automated extraction and presentation of all tagged entities. The tool scans PageXML collections, identifies all semantic tags regardless of category, extracts the tagged text and its context, groups identical entities, and generates a self-contained interactive HTML index. This index enables researchers to browse entities alphabetically within categories, expand entries to view all occurrences with context, and link directly to Transkribus page images when available.

Features

Automatic Tag Discovery

  • Detection of all tag categories present in PageXML custom attributes
  • No pre-configuration required; adapts to project-specific tag schemas
  • Recognition of any attribute following the pattern tagname {param:value;}
  • Exclusion of structural tags (readingOrder, structure, etc.)

Multi-Collection Processing

  • Recursive scanning of directory structures
  • Automatic detection of collection directories containing page/ subdirectories
  • Processing of all PageXML files across multiple collections
  • Preservation of collection identity for each tagged entity

Comprehensive Metadata Extraction

  • Page numbers from Transkribus metadata
  • Document and page identifiers for reference
  • Image URLs for direct linking to scans
  • TextLine coordinates for precise location
  • WikiData identifiers when present in tags
  • Additional parameters from custom attributes

Intelligent Grouping and Sorting

  • Case-insensitive clustering of identical entities
  • Alphabetical sorting within each category
  • Preservation of original spelling for display
  • Aggregation of all occurrences per unique entity

Interactive HTML Output

  • Self-contained single-file index
  • Tabbed interface with separate sections per tag category
  • Search functionality within active category
  • Expandable/collapsible entity entries
  • Direct links to Transkribus page images
  • Full context display for each occurrence
  • Collection, page number, and filename references

Extensible Design

  • Support for arbitrary tag categories and naming conventions
  • Customisable category labels through configuration
  • Unicode support for multi-language content
  • Standard library dependencies only

Requirements

  • Python 3.7 or higher
  • Standard library modules only (no external dependencies required)

The tool uses exclusively Python standard library modules (xml.etree.ElementTree, pathlib, re, json, collections, argparse), ensuring compatibility across different computing environments without dependency management concerns.

Installation

Download the script directly or clone the repository containing the PageXML indexer tool. No installation process is required beyond ensuring Python 3.7+ is available on your system.

Place pagexml_indexer.py in a convenient location, such as a tools/ or scripts/ directory within your project structure.

PageXML Folder Structure

The tool expects PageXML collections organised with the following structure:

collections_base/
├── Collection_A/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       ├── 0002_archive_collection_00002.xml
│       └── 0003_archive_collection_00003.xml
├── Collection_B/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       └── 0002_archive_collection_00002.xml
└── Collection_C/
    └── page/
        ├── 0001_archive_collection_00001.xml
        ├── 0002_archive_collection_00002.xml
        └── 0003_archive_collection_00003.xml

Each collection directory must contain a page/ subdirectory housing individual PageXML files. The script identifies collections by scanning for directories containing page/ subdirectories with XML files.

Usage

Basic Invocation

To generate an index for all collections in a base directory:

python pagexml_indexer.py /path/to/collections

The tool will:

  1. Scan for all collections in the specified directory
  2. Process all PageXML files in each collection
  3. Extract all tagged entities
  4. Generate an HTML index file named pagexml_index.html in the base directory

Alternative Invocation Methods

Specify a custom output filename:

python pagexml_indexer.py /path/to/collections --output my_index.html

Suppress progress output:

python pagexml_indexer.py /path/to/collections --quiet

Interactive path entry (no command-line argument):

python pagexml_indexer.py

The tool will prompt: Enter path to collections directory:

Command-Line Options

positional arguments:
  base_path             Path to base directory containing PageXML collections

optional arguments:
  -h, --help            Show help message and exit
  -o OUTPUT, --output OUTPUT
                        Output HTML file path (default: pagexml_index.html in base directory)
  -q, --quiet           Suppress progress output
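The documented interface can be wired up with `argparse` roughly as follows. This is a sketch of the CLI described above, not necessarily the script's exact code; `build_parser` is an illustrative name:

```python
import argparse
from pathlib import Path

def build_parser():
    # Mirrors the documented CLI: one optional positional path plus -o/--output and -q/--quiet
    parser = argparse.ArgumentParser(
        description="Index tagged entities in PageXML collections")
    parser.add_argument("base_path", nargs="?", default=None,
                        help="Path to base directory containing PageXML collections")
    parser.add_argument("-o", "--output", default=None,
                        help="Output HTML file path "
                             "(default: pagexml_index.html in base directory)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress output")
    return parser

args = build_parser().parse_args(["/data/collections", "--output", "my_index.html"])
# Default output location falls back to the base directory
output = Path(args.output) if args.output else Path(args.base_path) / "pagexml_index.html"
```

When `base_path` is omitted, the script falls through to the interactive prompt described above.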

Example Session

Complete example demonstrating typical usage:

$ python pagexml_indexer.py ~/Documents/Resoluties/PageXML

=== PageXML Indexer ===

Found 3 collection(s):
  - 0018
  - 0019
  - 0020

  Processing collection: 0018
  Found 156 XML files
    Processed 156/156 files    
  Extracted 234 tags from 0018

  Processing collection: 0019
  Found 142 XML files
    Processed 142/142 files    
  Extracted 198 tags from 0019

  Processing collection: 0020
  Found 178 XML files
    Processed 178/178 files    
  Extracted 267 tags from 0020

=== Summary ===
Total tags extracted: 699
Collections processed: 3

Grouping and sorting tags...
Found 8 tag categories:
  - Personen: 45 unique items
  - Locaties: 38 unique items
  - Hoedanigheden: 67 unique items
  - Organisaties: 23 unique items
  - Gebeurtenissen: 12 unique items
  - Data: 89 unique items
  - Afkortingen: 156 unique items
  - Documenten: 8 unique items

Generating HTML output...
Generated HTML index: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html

✓ Index successfully generated: /Users/researcher/Documents/Resoluties/PageXML/pagexml_index.html
  Open the file in your web browser to view the index

How It Works

Collection Discovery

The indexer begins by scanning the provided base directory for collection structures. It identifies subdirectories containing a page/ folder with XML files. This approach accommodates both single-collection and multi-collection directory structures.

The discovery process:

  1. Iterates through all immediate subdirectories of the base path
  2. Checks each subdirectory for a page/ folder
  3. Verifies that the page/ folder contains .xml files
  4. Adds qualifying directories to the collection list

Collections are processed in alphabetical order by directory name.
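The discovery steps above can be sketched with `pathlib` as follows (a minimal sketch; `find_collections` is an illustrative name, not necessarily the script's):

```python
from pathlib import Path

def find_collections(base_path):
    """Return subdirectories that contain a page/ folder with at least one XML file."""
    collections = []
    for candidate in sorted(base_path.iterdir()):  # alphabetical by directory name
        page_dir = candidate / "page"
        if candidate.is_dir() and page_dir.is_dir() and any(page_dir.glob("*.xml")):
            collections.append(candidate)
    return collections
```

Directories without a `page/` subfolder, or whose `page/` folder holds no XML files, are silently skipped.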

PageXML Parsing

For each XML file, the tool employs Python's ElementTree parser with namespace awareness. The PageXML format uses the namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15, which the tool registers for consistent element lookup.

The parser extracts several levels of metadata:

Document-Level Metadata

  • docId: Transkribus document identifier
  • pageId: Transkribus page identifier
  • pageNr: Sequential page number within the document
  • imgUrl: Direct URL to page image in Transkribus storage

Region-Level Elements

All TextLine elements are examined regardless of their parent TextRegion. This ensures comprehensive tag extraction even when document structure varies.

Line-Level Attributes

  • id: Unique identifier for the TextLine
  • custom: String containing semicolon-separated key-value pairs
  • Coords points: Polygon coordinates defining the line's bounding box
  • Unicode: Transcribed text content
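The namespace-aware lookup of these attributes can be sketched with `ElementTree` as follows (the sample document is illustrative, but the namespace URI matches the PageXML 2013-07-15 schema):

```python
import xml.etree.ElementTree as ET

# PageXML 2013-07-15 namespace, as used by Transkribus exports
NS = {"page": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg">
    <TextRegion id="r1">
      <TextLine id="l1" custom="readingOrder {index:0;} persoon {offset:10; length:7;}">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>Alsoe ons Mathias van Buyren geschreven heeft</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
lines = []
# Walk every TextLine regardless of its parent TextRegion
for line in root.iter("{%s}TextLine" % NS["page"]):
    unicode_el = line.find("./page:TextEquiv/page:Unicode", NS)
    coords_el = line.find("./page:Coords", NS)
    lines.append({
        "id": line.get("id"),
        "custom": line.get("custom", ""),
        "coords": coords_el.get("points") if coords_el is not None else None,
        "text": unicode_el.text if unicode_el is not None else "",
    })
```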

Custom Attribute Parsing

The custom attribute contains metadata in a specific format:

readingOrder {index:0;} persoon {offset:12; length:8;} geonames_locations {offset:45; length:6; wikiData:Q12345;}

The parser uses regular expressions to identify each tag:

  1. Matches pattern: tagname {param1:value1; param2:value2;}
  2. Extracts tag name
  3. Splits parameter string on semicolons
  4. Parses each parameter as key:value pair
  5. Filters out structural tags (readingOrder, structure, score, type, index, caption)
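The five steps above can be sketched as follows (a minimal sketch; `parse_custom` is an illustrative name, though the regular expression matches the pattern documented under Technical Details):

```python
import re

SKIP_TAGS = {"readingOrder", "structure", "score", "type", "index", "caption"}
TAG_PATTERN = re.compile(r"(\w+)\s*\{([^}]+)\}")

def parse_custom(custom):
    """Yield (tag_name, params_dict) for each semantic tag in a custom attribute."""
    for name, body in TAG_PATTERN.findall(custom):
        if name in SKIP_TAGS:  # drop structural tags
            continue
        params = {}
        for part in body.split(";"):
            if ":" in part:
                key, _, value = part.partition(":")
                params[key.strip()] = value.strip()
        yield name, params

tags = list(parse_custom(
    "readingOrder {index:0;} persoon {offset:12; length:8;} "
    "geonames_locations {offset:45; length:6; wikiData:Q12345;}"
))
```

Splitting on the first colon of each `key:value` pair keeps values such as URLs (which may themselves contain colons) intact.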

Text Extraction

For each identified tag, the tool extracts the corresponding text from the Unicode content using the offset and length parameters:

offset = 10
length = 7
unicode_text = "Alsoe ons Mathias van Buyren geschreven heeft..."
tagged_text = unicode_text[offset:offset + length]  # "Mathias"

This positional extraction ensures accurate correspondence between tags and text, even when multiple tags reference the same TextLine.

Entity Grouping

Extracted entities are grouped through a multi-stage process:

Stage 1: Category Grouping

All tags are first grouped by their tag category (persoon, geonames_locations, etc.). This separates different entity types for independent processing.

Stage 2: Normalisation

Within each category, entity text is normalised for comparison:

  • Whitespace trimming
  • Lowercase conversion
  • Retention of original diacritics and punctuation

Stage 3: Clustering

Entities with identical normalised forms are clustered together. For display purposes, the original spelling from the first occurrence is preserved:

"Mathias" (3 occurrences)
  - "Mathias" from page 12
  - "mathias" from page 45
  - "MATHIAS" from page 67

Stage 4: Alphabetical Sorting

Clustered entities are sorted alphabetically by their normalised forms within each category.
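The four stages can be sketched together as follows (a minimal sketch; field names such as "category" and "text" are illustrative, not the script's internal structure):

```python
from collections import defaultdict

def group_entities(tags):
    """Cluster tags per category by normalised text; keep first-seen spelling for display."""
    grouped = defaultdict(dict)  # category -> {normalised_text: entity record}
    for tag in tags:
        norm = tag["text"].strip().lower()  # trim whitespace, lowercase; diacritics kept
        record = grouped[tag["category"]].setdefault(
            norm, {"display": tag["text"].strip(), "occurrences": []}
        )
        record["occurrences"].append(tag)
    # Sort entities alphabetically by normalised form within each category
    return {cat: dict(sorted(items.items())) for cat, items in grouped.items()}

sample = [
    {"category": "persoon", "text": "Mathias", "page": 12},
    {"category": "persoon", "text": "mathias", "page": 45},
    {"category": "persoon", "text": "MATHIAS", "page": 67},
    {"category": "persoon", "text": "Anna", "page": 3},
]
grouped = group_entities(sample)
```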

HTML Generation

The output HTML file is entirely self-contained, embedding all CSS and JavaScript. This design ensures the index remains functional when moved, archived, or shared without external dependencies.

Structure

The HTML employs a tabbed interface with one tab per tag category. Each tab contains:

  • Category heading
  • Search input field
  • List of unique entities

Entity Display

Each entity appears as a collapsible item with:

  • Toggle icon (chevron)
  • Entity text
  • Occurrence count in parentheses
  • WikiData link (when available)
  • Hidden details section

When expanded, the details section shows all occurrences with:

  • Full TextLine content for context
  • Page number
  • Collection name
  • Filename
  • Direct link to Transkribus image (when available)

Interactivity

JavaScript provides three interactive features:

  1. Tab Switching: Clicking category tabs shows/hides corresponding content
  2. Entity Expansion: Clicking entity headers toggles visibility of occurrence details
  3. Search Filtering: Typing in the search box filters visible entities in the active tab

WikiData Integration

When tags include WikiData identifiers in their parameters:

geonames_locations {offset:4; length:6; wikiData:Q12345;}

The indexer extracts these identifiers and generates clickable links:

<a href="https://www.wikidata.org/wiki/Q12345" target="_blank">Wikidata</a>

These links enable researchers to access additional context, variant names, geographic coordinates, and other linked data resources associated with the entity.
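Generating the link amounts to a simple lookup of the wikiData parameter, sketched below (`wikidata_link` is an illustrative name, not necessarily the script's):

```python
def wikidata_link(params):
    """Return an HTML anchor for a wikiData parameter, or '' when the tag has none."""
    qid = params.get("wikiData")
    if not qid:
        return ""
    return f'<a href="https://www.wikidata.org/wiki/{qid}" target="_blank">Wikidata</a>'
```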

Output Format

HTML Index Structure

The generated HTML file follows this structure:

<!DOCTYPE html>
<html lang="nl">
<head>
    <meta charset="UTF-8">
    <title>PageXML Index</title>
    <style>
        /* Embedded CSS for all visual styling */
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>PageXML Index</h1>
            <p class="subtitle">Overzicht van alle getagde elementen</p>
        </header>
        
        <div class="search-box">
            <input type="text" placeholder="Zoek...">
        </div>
        
        <div class="tabs">
            <div class="tab-header">
                <!-- Tab buttons for each category -->
            </div>
            
            <div class="tab-content">
                <!-- Entity listings -->
            </div>
        </div>
        
        <footer>
            <!-- Statistics -->
        </footer>
    </div>
    
    <script>
        /* JavaScript for interactivity */
    </script>
</body>
</html>

Entity Entry Example

A typical entity entry in the HTML:

<div class="item">
    <div class="item-header" onclick="toggleItem(this)">
        <svg class="toggle-icon"><!-- Chevron icon --></svg>
        <h3 class="item-title">
            Mathias van Buyren
            <span class="item-count">(3)</span>
            <a href="https://www.wikidata.org/wiki/Q123456" class="wiki-link">(Wikidata)</a>
        </h3>
    </div>
    <div class="occurrences">
        <div class="occurrence">
            <div class="occurrence-header">
                <div class="occurrence-text">
                    <p class="context-text">
                        <span class="context-label">Context:</span>
                        Alsoe ons Mathias van Buyren geschreven heeft...
                    </p>
                    <p class="occurrence-meta">
                        Pagina 12 • 0018 • 0012_NL-ZlHCO_0003.1_0018_00012.xml
                    </p>
                </div>
                <a href="https://files.transkribus.eu/..." class="scan-link">
                    Bekijk scan
                </a>
            </div>
        </div>
        <!-- Additional occurrences... -->
    </div>
</div>

Workflow Integration

Position in Processing Pipeline

The PageXML indexer serves as an analysis and validation tool within the broader workflow:

Stage 1: Transcription

  • Manual transcription in Transkribus
  • Export to PageXML format

Stage 2: Entity Annotation

  • Manual or automated tagging of entities
  • Addition of semantic markup to TextLine custom attributes

Stage 3: Index Generation (this tool)

  • Extraction of all tagged entities
  • Generation of browsable index
  • Identification of annotation patterns

Stage 4: Validation and Refinement

  • Review of entity index for consistency
  • Identification of annotation errors or omissions
  • Correction of tags in Transkribus
  • Re-export and re-indexing

Stage 5: Further Analysis

  • Meeting header identification
  • Linked data conversion
  • Statistical analysis
  • Network graph generation

Recommended Practices

Iterative Indexing

Generate indices regularly throughout the annotation process rather than only at completion. Early indices help identify:

  • Tag category naming inconsistencies
  • Entities requiring WikiData linking
  • Annotation coverage gaps
  • Systematic transcription errors

Validation Workflow

Use the index systematically for quality control:

  1. Generate initial index after annotating a sample of pages
  2. Review each category for unexpected entries
  3. Check occurrence contexts for annotation accuracy
  4. Identify entities appearing with variant spellings
  5. Standardise spelling in Transkribus where appropriate
  6. Verify WikiData links function correctly
  7. Re-export corrected PageXML
  8. Regenerate index to confirm corrections

Documentation

Maintain a record of tag categories and their definitions alongside the index. Document:

  • Criteria for entity selection within each category
  • Boundary cases and decision rules
  • WikiData identifier sources
  • Annotation conventions specific to the project

Version Control

Track both PageXML collections and generated indices under version control:

  • Commit PageXML after annotation sessions
  • Commit generated index HTML after each indexing
  • Tag versions corresponding to major milestones
  • Include commit messages noting which collections were updated

Sharing and Publication

The self-contained HTML format facilitates sharing:

  • Indices can be attached to emails
  • Hosted on project websites without server-side processing
  • Archived with PageXML collections
  • Published as supplementary material with articles

Customisation

Modifying Category Labels

The tool includes default labels for common tag categories. To modify these labels, edit the CATEGORY_LABELS dictionary in the PageXMLIndexer class:

CATEGORY_LABELS = {
    'persoon': 'Personen',
    'geonames_locations': 'Locaties',
    'capaciteit_hoedanigheid': 'Hoedanigheden',
    'organisatie': 'Organisaties',
    'event_gebeurtenis': 'Gebeurtenissen',
    'datum': 'Data',
    'abbrev': 'Afkortingen',
    'Doc': 'Documenten',
    # Add custom categories here
    'custom_tag': 'Custom Label'
}

Categories not specified in this dictionary will display using their original tag names.
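This fallback behaviour amounts to a dictionary lookup with the raw tag name as default, roughly as follows (a sketch; `display_label` is an illustrative name, and the dictionary here is abbreviated):

```python
CATEGORY_LABELS = {
    "persoon": "Personen",
    "geonames_locations": "Locaties",
}

def display_label(tag_name):
    # Fall back to the raw tag name when no label is configured
    return CATEGORY_LABELS.get(tag_name, tag_name)
```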

Excluding Additional Tags

To exclude tag categories from indexing, add them to the SKIP_TAGS set:

SKIP_TAGS = {
    'readingOrder', 
    'structure', 
    'score', 
    'type', 
    'index', 
    'caption',
    # Add additional tags to skip
    'internal_note',
    'processing_flag'
}

Adjusting HTML Styling

The embedded CSS can be modified to change visual appearance. Key style classes include:

  • .tabs: Main container styling
  • .tab-button: Category tab appearance
  • .item: Entity entry styling
  • .item-header: Clickable entity header
  • .occurrence: Individual occurrence styling
  • .scan-link: Button for Transkribus links

Troubleshooting

No Collections Found

Problem: "No collections found in [directory]"

Causes and Solutions:

  • Wrong directory level: Ensure you are pointing to the parent directory containing collection folders, not a collection folder itself
  • Missing page/ subdirectories: Verify each collection has a page/ subfolder
  • No XML files: Check that XML files exist within page/ folders
  • File permissions: Ensure read access to the directory structure

No Tags Found

Problem: "No tags found in any collection"

Causes and Solutions:

  • Tags not yet added: Verify that entity annotation has been performed in Transkribus
  • Export timing: Ensure PageXML was exported after annotation, not before
  • Tag format: Check that tags follow the pattern tagname {param:value;}
  • All tags excluded: Verify your custom SKIP_TAGS hasn't excluded all relevant tags
  • XML parsing issues: Check a sample file manually to confirm tags are present

Incorrect Page Numbers

Problem: Page numbers in index don't match expected values

Causes and Solutions:

  • Metadata source: The tool extracts page numbers from TranskribusMetadata pageNr attribute, not from filenames
  • Page numbering in Transkribus: Verify page numbering is correctly set in the Transkribus interface
  • Re-export required: If page numbers were corrected in Transkribus, re-export the PageXML

Missing Image Links

Problem: "Bekijk scan" links are absent or non-functional

Causes and Solutions:

  • Private collections: Transkribus image URLs may require authentication
  • Export settings: Verify that image URLs were included in the PageXML export
  • URL expiration: Some Transkribus URLs may be time-limited
  • Network access: Links require internet connectivity to function

XML Parsing Errors

Problem: Warnings about unparseable XML files

Causes and Solutions:

  • Corrupted exports: Re-export affected files from Transkribus
  • Manual editing errors: Validate XML syntax if files were manually edited
  • Encoding issues: Ensure files are UTF-8 encoded
  • Incomplete downloads: Verify file sizes match expected values

HTML Not Opening

Problem: Generated HTML file won't open or displays incorrectly

Causes and Solutions:

  • File associations: Ensure .html files are associated with a web browser
  • Character encoding: Some older browsers may have issues with UTF-8; try a modern browser (Chrome, Firefox, Safari, Edge)
  • File size: Very large collections may generate HTML files that are slow to load; be patient
  • JavaScript disabled: The index requires JavaScript; ensure it's enabled in browser settings

Memory Issues

Problem: Script crashes with memory errors on large collections

Causes and Solutions:

  • File-by-file processing: The script processes files sequentially to minimise memory usage
  • Large occurrence counts: Entities with thousands of occurrences may require significant memory
  • Split processing: Process collections separately and generate individual indices
  • System resources: Close other applications to free memory

Technical Details

Performance Characteristics

Processing performance scales approximately linearly with:

  • Number of XML files
  • Number of tagged entities
  • Length of Unicode text in TextLines

Typical performance on modern hardware:

  • ~50-100 files per second for parsing
  • ~1-2 seconds for grouping and sorting
  • ~5-10 seconds for HTML generation

Total processing time for 1,000 files: typically well under a minute (parsing dominates)

Memory usage remains modest:

  • Peak memory: typically under 500 MB
  • HTML file size: approximately 1-2 KB per tagged occurrence

XML Namespace Handling

The tool registers the PAGE namespace before parsing:

ET.register_namespace('page', 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15')

This ensures that namespace prefixes are preserved if the tool were extended to write modified XML files.

Character Encoding

All file operations use UTF-8 encoding explicitly:

Path(output_path).write_text(html_content, encoding='utf-8')

This ensures correct handling of:

  • Early modern Dutch orthography (ij, aen, etc.)
  • Diacritical marks (é, ë, ñ)
  • Special characters in entity names
  • Unicode symbols in generated HTML

Regular Expression Patterns

Tag parsing employs the pattern:

r'(\w+)\s*\{([^}]+)\}'

This matches:

  • \w+: Tag name (letters, digits, underscore)
  • \s*: Optional whitespace
  • \{: Opening brace
  • [^}]+: Parameters (any characters except closing brace)
  • \}: Closing brace

Parameter parsing splits on semicolons and colons, handling various spacing conventions in custom attributes.

Extension Points

Alternative Output Formats

The tool's architecture facilitates addition of other output formats:

CSV Export

Generate tabular data for spreadsheet analysis:

def generate_csv_output(self, output_path: Path):
    # Write category, entity, count, occurrences to CSV

JSON Export

Create structured data for programmatic access:

def generate_json_output(self, output_path: Path):
    # Serialise grouped_tags to JSON

Markdown Export

Generate human-readable plain text listings:

def generate_markdown_output(self, output_path: Path):
    # Create Markdown with entity lists
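As an illustration, the JSON variant could be filled in along these lines. This assumes `grouped_tags` maps category to entity to a list of occurrence dicts, which may differ from the script's actual internal structure:

```python
import json
from pathlib import Path

def generate_json_output(grouped_tags, output_path):
    """Serialise grouped tags to JSON; ensure_ascii=False keeps diacritics readable."""
    Path(output_path).write_text(
        json.dumps(grouped_tags, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```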

Statistical Analysis

Additional methods could calculate:

  • Tag category distributions
  • Entities per page statistics
  • Co-occurrence matrices
  • Temporal distributions (using datum tags)

Visualisation

The grouped_tags data structure could feed:

  • Network graphs of entity relationships
  • Geographic maps of locations
  • Timeline visualisations of events
  • Word clouds of entity frequencies

Integration with External Services

The tool could be extended to:

  • Query WikiData API for entity enrichment
  • Validate geonames against GeoNames database
  • Cross-reference persons with authority files (VIAF, ORCID)
  • Link organisations to cultural heritage databases

Limitations

  • No disambiguation: Identical text strings are treated as the same entity even when referring to different individuals
  • Case sensitivity: Clustering uses lowercase normalisation, potentially merging entities that should remain distinct
  • Context window: Only the containing TextLine is shown as context, not surrounding lines
  • Single-page processing: Each page is processed independently; multi-page entities are not connected
  • No temporal ordering: Occurrences are listed in processing order, not chronological order
  • Memory constraints: Very large collections (10,000+ pages) may approach memory limits on systems with <4GB RAM
  • HTML size: Extremely large indices (100,000+ tags) may cause browser performance issues
  • No concurrent processing: Files are processed sequentially; parallel processing could improve speed

Future Development

Potential enhancements for subsequent versions:

Entity Linking

  • Integration with authority files for person names
  • Automated WikiData identifier suggestions
  • Geographic coordinate extraction for locations
  • Cross-reference resolution across collections

Advanced Visualisations

  • Interactive network graphs of entity co-occurrences
  • Timeline views of dated entities
  • Geographic maps of tagged locations
  • Statistical dashboards

Export Flexibility

  • CSV format for quantitative analysis
  • JSON format for API integration
  • RDF/TTL format for linked data
  • GraphML format for network analysis tools

Annotation Support

  • Identification of entities lacking WikiData identifiers
  • Detection of potential annotation errors
  • Suggestion of standardised spellings
  • Flagging of entities with very few occurrences

Performance Optimisation

  • Parallel processing of multiple files
  • Incremental indexing (process only changed files)
  • Database backend for very large collections
  • Chunked HTML generation for massive indices

Acknowledgements

This PageXML Indexer Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].

Development was assisted by Claude (Anthropic) for code implementation and documentation.

Contact

Licence

This project is licensed under the MIT Licence.

Version History

Version 1.0 (2025): Initial release with automatic tag discovery, multi-collection processing, entity grouping, and interactive HTML index generation
