55 commits
5a26d52
Initial species analysis, first ported loader.
anjackson May 23, 2024
c0a4bc5
Merge branch 'master' into 2024-refresh
anjackson May 23, 2024
c77ef66
Initial new-style PRONOM parser.
anjackson Jun 20, 2024
2ddf18e
Adding DVC setup.
anjackson Jun 20, 2024
37ac28b
Working dvc repro, added genres.
anjackson Jun 20, 2024
643672c
Clean up PRONOM genres.
anjackson Jun 20, 2024
355ef82
Extend the new model, add NARA grabber.
anjackson Jul 11, 2024
61b3610
Add NARA to DVC workflow.
anjackson Jul 11, 2024
694d4df
Added initial TCDB parser.
anjackson Sep 10, 2024
8f9c5d5
Added in some version support.
anjackson Sep 10, 2024
78a31a0
Updated data hashes.
anjackson Sep 10, 2024
03a2063
Switched to Makefile, extended TCDB support.
anjackson Feb 13, 2025
d4d32a0
Updated pywikibot submodule.
anjackson Feb 13, 2025
4612a82
Notes to update.
anjackson Feb 13, 2025
5e68733
Update submodules.
anjackson Feb 13, 2025
d9a09b7
Merge branch 'master' into temp-merge
anjackson Feb 13, 2025
618d577
Add initial WikiData processor and try workflowing it.
anjackson Feb 13, 2025
8b2fdbb
Use a script instead.
anjackson Feb 13, 2025
0ab49f0
Repair aggregator script.
anjackson Feb 13, 2025
3bfbd7a
Make the dir first.
anjackson Feb 14, 2025
2cb2cb0
Core working WikiData-to-SQLite engine.
anjackson Feb 14, 2025
47f387a
Added the other fields.
anjackson Feb 14, 2025
a0977cc
Added actual SQLModel implementation.
anjackson Feb 14, 2025
f51d143
Three now working.
anjackson Feb 14, 2025
1f42368
LC FDD now included.
anjackson Feb 14, 2025
eb36ac6
Add NARA, fix up build system.
anjackson Feb 14, 2025
3ec38c2
Add dates that had been missed.
anjackson Feb 14, 2025
2ac5827
Couple of fixes, PRONOM prefix and media types.
anjackson Feb 14, 2025
3d1605c
Some NARA fixes.
anjackson Feb 15, 2025
fd1dceb
Switch TCDB data fields.
anjackson Feb 15, 2025
e651e6a
Missed dependency.
anjackson Feb 15, 2025
ef992e3
Split NARA tools.
anjackson Feb 15, 2025
f384631
Added FFW import using current minimal data.
anjackson Feb 15, 2025
76a04ac
Also add more FFW data.
anjackson Feb 15, 2025
3a55a79
Added GitHub Linguist into the new DB.
anjackson Feb 16, 2025
b12fcf1
Added Tika back in.
anjackson Feb 16, 2025
8fd329d
Some more Tika info.
anjackson Feb 16, 2025
3341658
And finally, TrID.
anjackson Feb 16, 2025
7c50d23
Strip globs from Tika entries.
anjackson Feb 16, 2025
4255b4b
Update latest version of digipres repo, inc. new sources.
anjackson Jul 18, 2025
a52a73e
Adding file/libmagic and MediaInfo sources.
anjackson Jul 18, 2025
09afd29
Moving species work out of here.
anjackson Aug 12, 2025
30809a5
Updating sources.
anjackson Aug 12, 2025
6ab7c0c
Use a plain list for extensions.
anjackson Aug 12, 2025
f4ce893
clean up extension sets export.
anjackson Aug 14, 2025
8d37eaa
Added CLI and args.
anjackson Aug 14, 2025
61249ba
Simplified to generate JSON first, use a simpler data model.
anjackson Sep 26, 2025
b252615
Fixed up PK/FK relationships.
anjackson Sep 26, 2025
829c958
Now outputs multiple sorted Parquet files and JSONL works too.
anjackson Sep 26, 2025
bec5459
Added ABC for a bit more rigor.
anjackson Sep 26, 2025
6aa8718
Added back in full Software objects, nested only for now.
anjackson Sep 26, 2025
bc76991
Add more outputs.
anjackson Sep 26, 2025
b737500
Adding some experimental code for talking to COPTR.
anjackson Sep 26, 2025
d289aa3
Now generates extension sets for each registry
anjackson Sep 26, 2025
e11ee6c
Add TFFH source. Add support for release dates. Fix error in Linguist…
anjackson Oct 3, 2025
3 changes: 3 additions & 0 deletions .github/workflows/data-update.yml
Original file line number Diff line number Diff line change
@@ -34,6 +34,9 @@ jobs:
    - name: Update from various data sources...
      run: ./update.sh

    - name: Generate derivatives...
      run: ./derive.sh

    - name: Deploy updated site...
      run: ./deploy.sh
      env:
5 changes: 5 additions & 0 deletions .gitignore
@@ -8,3 +8,8 @@
/bin
/pywikibot.lwp
/passwordfile
*.pyc
/registries.db
/.venv
/build
/data
9 changes: 9 additions & 0 deletions Makefile
@@ -0,0 +1,9 @@

all: registries.db

registries.db: foreging/*.py digipres.github.io/_sources/registries/*
	rm -fr data
	mkdir -p data
	python -m foreging.populate --json data
	cp data/registries.db digipres.github.io/_data/formats/registries.db
	cp data/*.parquet digipres.github.io/_data/formats/index/
2 changes: 2 additions & 0 deletions README.md
@@ -35,6 +35,8 @@ To Do
* http://en.wikipedia.org/wiki/Alphabetical_list_of_filename_extensions_%28M%E2%80%93R%29
* http://www.webarchive.org.uk/aadda-discovery/formats?f[0]=content_type_ext:%22.bmp%22
* https://twitter.com/benfinoradin/status/532212803630039041
* Talk about how to use `git submodule update --recursive --remote` to make sure `pywikibot` and `digipres.github.io` are up to date.
* Using `uvx datasette serve data/registries.db` to quickly poke around in the database.

COPTR Bot
---------
10 changes: 7 additions & 3 deletions aggregates.py
@@ -86,7 +86,6 @@ def addFormat(rid,fid,finfo):
    # And add:
    fmts[rid]['formats'][fid] = finfo

def aggregateFDD():
    rid = "fdd"
    print("Parsing %s..." % rid)
@@ -132,6 +131,7 @@ def aggregateFDD():
            if rid in fmts: # FIXME this needs to be more robust, rather than relying on happening after 'addFormat' is called for the first time.
                fmts[rid]['warnings'].append(f"Error when parsing XML from '{filename}': {e}")


def aggregateTRiD():
    rid = "trid"
    print("Parsing %s..." % rid)
@@ -394,9 +394,9 @@ def aggregateWikiData():
with open("%s/extensions.yml" % data_dir, 'w') as outfile:
    outfile.write( yaml.safe_dump(extensions, default_flow_style=False) )

# Write out Venn data
# Write out Venn data, starting from a list like [extension] -> Registry_ID:
print("Outputting Venn data based on extensions...")
# Key all the RID-to-integer mappings:
# Key all the Registry_ID-to-integer mappings:
vennls = {}
i = 0
for fmt in fmts:
@@ -407,15 +407,19 @@ def aggregateWikiData():
venndsl = defaultdict(list)
vennlt = defaultdict(int)
vennids = {}
# Loop over all extensions:
for extension in exts:
    regs = set()
    regIds = set()
    # Loop over each registry the extension appears in:
    for ridder in exts[extension]['identifiers']:
        regs.add(vennls[ridder['regId']])
        regIds.add(ridder['regId'])
    for rid in regs:
        vennlt[rid] += 1
    # Build a unique key for each registry combination:
    key = ','.join(sorted(regs))
    # Use the key to build up each overlap set:
    vennids[key] = sorted(regIds)
    venndsl[key].append(extension)
    vennds[key] += 1
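
To make the grouping step in this hunk concrete, here is a minimal, self-contained sketch of the same idea on toy data. The sample `exts` dictionary, the registry IDs, the helper name `build_overlaps`, and the use of string indices in `vennls` are all assumptions for illustration; the real mapping is built in lines this hunk does not show.

```python
from collections import defaultdict

# Hypothetical sample data: extension -> identifiers, each naming a registry.
exts = {
    "pdf": {"identifiers": [{"regId": "pronom"}, {"regId": "fdd"}]},
    "png": {"identifiers": [{"regId": "pronom"}, {"regId": "fdd"}]},
    "xyz": {"identifiers": [{"regId": "trid"}]},
}

# Assumed registry-to-index mapping (string indices so they can be joined).
vennls = {"fdd": "0", "pronom": "1", "trid": "2"}


def build_overlaps(exts, vennls):
    """Group extensions by the exact combination of registries that list them."""
    counts = defaultdict(int)    # key -> number of extensions in that overlap
    members = defaultdict(list)  # key -> extensions in that overlap
    key_registries = {}          # key -> sorted registry IDs behind the key
    totals = defaultdict(int)    # registry index -> extensions seen overall
    for extension, info in exts.items():
        reg_ids = {ident["regId"] for ident in info["identifiers"]}
        indices = {vennls[reg_id] for reg_id in reg_ids}
        for idx in indices:
            totals[idx] += 1
        # One unique key per registry combination:
        key = ",".join(sorted(indices))
        key_registries[key] = sorted(reg_ids)
        members[key].append(extension)
        counts[key] += 1
    return dict(counts), dict(members), key_registries, dict(totals)


counts, members, key_registries, totals = build_overlaps(exts, vennls)
```

Here `pdf` and `png` land in the same overlap set because both appear in exactly the same pair of registries, while `xyz` forms a set of its own.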
10 changes: 10 additions & 0 deletions derive.sh
@@ -0,0 +1,10 @@
#!/bin/bash
set -e

source venv/bin/activate

make

cp data/registries.db digipres.github.io/_data/formats/
cp data/*.parquet digipres.github.io/_data/formats/index

2 changes: 1 addition & 1 deletion digipres.github.io
1 change: 1 addition & 0 deletions foreging/__init__.py
@@ -0,0 +1 @@
# FOrmat REGistry INdexinG
72 changes: 72 additions & 0 deletions foreging/coptr.py
@@ -0,0 +1,72 @@
import mwclient as mw
from mwclient.listing import Category, PageList
import mwparserfromhell

from .models import Software

import logging
logging.basicConfig(level=logging.WARNING)


coptr_host = 'coptr.digipres.org'
user_agent = 'DigiPresFormatIndexClient/0.1 (andrew.jackson@dpconline.org)'
site = mw.Site(coptr_host, path='/', clients_useragent=user_agent)

#for tool_page in site.allpages():
# pass

#category = site.categories[u"Tool Grid"]
#for page in category:
# print(page.name)


# {{Infobox tool
# |image=JHOVE.gif
# |purpose=JHOVE provides functions to perform format-specific identification, validation, and characterization of digital objects.
# |homepage=http://jhove.openpreservation.org/
# |license=GNU Lesser General Public License (LGPL)
# |platforms=JHOVE should be usable on any UNIX, Windows, or OS X platform with an appropriate J2SE installation. It should run on any operating system that supports Java 1.5 and has a directory-based file system.
# |formats_in=EPUB, GIF, JP2, JPEG, PDF, PNG, PREMIS (Preservation Metadata Implementation Strategies), TIFF, WARC, XML, AIFF, WAVE, GZIP, ASCII, UTF-8, HTML, MP3
# |function=Encryption Detection, File Format Identification, Metadata Extraction, Validation
# }}

# FIXME this does both at once! One should write the page info needed to JSON. The other should use it.
# But, we don't know everything we need yet, I guess?

category: PageList = site.categories[u"Tools"]
for page in category:
    print(page.name)
    text = page.text()
    wikicode = mwparserfromhell.parse(text)
    templates = wikicode.filter_templates(matches='infobox tool')
    # Skip pages without an "Infobox tool" template rather than failing on templates[0]:
    if not templates:
        continue
    template = templates[0]
    formats = template.get("formats_in", None)
    if formats:
        formats = [f.strip() for f in formats.value.split(",")]
        print(f" < {formats}")
    formats = template.get("formats_out", None)
    if formats:
        formats = [f.strip() for f in formats.value.split(",")]
        print(f" > {formats}")
    print(page.pageid)
    if isinstance(page, Category):
        for member in page.members():
            print(f"{page.name} > {member.name}")
    else:
        pass
    s = Software(
        id=f"coptr:pageid:{page.pageid}",
        name=page.name,
        version=None,
        license=None,
        registry_url=f"https://{coptr_host}/Special:Redirect/page/{page.pageid}"
    )
    license = template.get('license', None)
    if license:
        s.license = license.value.strip()
    print(s)



# Workflows in Workflow namespace
# Formats is another potential category, but needs patching in via external IDs.
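
The comma-splitting applied to `formats_in` and `formats_out` above can be tried without a live wiki. The sketch below uses only the standard library; `split_field` is a hypothetical helper, not part of this PR, and the sample string is abbreviated from the JHOVE infobox quoted in the comment.

```python
def split_field(value):
    """Split a comma-separated infobox value into trimmed, non-empty items."""
    if value is None:
        return []
    return [item.strip() for item in str(value).split(",") if item.strip()]


# Abbreviated from the JHOVE example in the comment above.
formats_in = "EPUB, GIF, JP2, JPEG, PDF, PNG, TIFF, WARC, XML"
formats = split_field(formats_in)
```

Dropping empty items guards against trailing commas in hand-edited wiki pages. A plain comma split would still break any format name that itself contains a comma, so this is a starting point rather than a complete parser.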
Empty file added foreging/db/__init__.py
Empty file.
43 changes: 43 additions & 0 deletions foreging/db/extension_sets.py
@@ -0,0 +1,43 @@
import json
import sqlite3
import logging
import argparse
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def generate_ext_sets(db):
    con = sqlite3.connect(db)

    cur = con.cursor()

    ext_sets = defaultdict(set)
    ext_counts = defaultdict(int)
    for row in cur.execute("SELECT registry_id, format.id, e.value FROM format, json_each(extensions) AS e ORDER BY e.value ASC"):
        ext_sets[row[0]].add(row[2].lower().strip())
        ext_counts[row[0]] += 1

    for source, ext_set in ext_sets.items():
        ext_sets[source] = list(ext_set)
        logger.info(f"Registry {source} has {ext_counts[source]} extensions, of which {len(ext_set)} are unique. Ratio: {ext_counts[source]/len(ext_set)}")
    return ext_sets, ext_counts


if __name__ == "__main__":
    # Args setup:
    parser = argparse.ArgumentParser()
    parser.add_argument('input_db')
    parser.add_argument('output_json')
    args = parser.parse_args()

    # Query and return the sets of extensions:
    ext_sets, ext_counts = generate_ext_sets(args.input_db)

    # Output the sets of extensions:
    with open(args.output_json, 'w') as f:
        json.dump(ext_sets, f)
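
The query above relies on SQLite's `json_each` table-valued function to expand the JSON `extensions` column into one row per extension. As a minimal sketch of that behaviour, the example below runs the same query against an in-memory database. The cut-down `format` table and its rows are invented for illustration, and it assumes a SQLite build with the JSON functions available.

```python
import json
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
# Cut-down stand-in for the real table: only the columns the query touches.
con.execute(
    "CREATE TABLE format (id TEXT PRIMARY KEY, registry_id TEXT, extensions TEXT)"
)
rows = [
    ("pronom:fmt/1", "pronom", json.dumps(["PDF", "pdf "])),
    ("pronom:fmt/2", "pronom", json.dumps(["png"])),
    ("trid:1", "trid", json.dumps(["pdf"])),
]
con.executemany("INSERT INTO format VALUES (?, ?, ?)", rows)

ext_sets = defaultdict(set)
ext_counts = defaultdict(int)
# format.id must be qualified because json_each also exposes an "id" column.
query = (
    "SELECT registry_id, format.id, e.value "
    "FROM format, json_each(extensions) AS e ORDER BY e.value ASC"
)
for registry_id, _format_id, ext in con.execute(query):
    ext_sets[registry_id].add(ext.lower().strip())  # normalise before deduplicating
    ext_counts[registry_id] += 1
con.close()

# Sorted lists are JSON-serialisable and give stable output between runs.
summary = {registry: sorted(values) for registry, values in ext_sets.items()}
```

In this toy data the `pronom` registry contributes three extension rows but only two unique values once case and whitespace are normalised, which is the ratio the log message reports.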



117 changes: 117 additions & 0 deletions foreging/db/models.py
@@ -0,0 +1,117 @@
from datetime import date
from sqlmodel import Field, Relationship, Session, SQLModel, create_engine, JSON, Column


class Registry(SQLModel, table=True):
    id: str | None = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    url: str | None = Field()
    id_prefix: str | None = Field()
    index_data_url: str | None = Field()

    data_log: list["RegistryDataLogEntry"] = Relationship()


class RegistryDataLogEntry(SQLModel, table=True):
    __tablename__ = 'registry_data_log'
    id: int | None = Field(default=None, primary_key=True)
    level: str = Field(index=True)
    message: str = Field()
    url: str | None = Field()

    registry_id: str | None = Field(default=None, foreign_key="registry.id")
    registry: Registry | None = Relationship(back_populates="data_log")

    # Define how to spot unique entries in a set
    def __hash__(self):
        return hash(self.message)
    def __eq__(self,other):
        return self.message == other.message

class SoftwareReadsFormatLink(SQLModel, table=True):
    __tablename__ = "formats_read_by_software"
    format_id: str | None = Field(default=None, foreign_key="format.id", primary_key=True)
    software_id: str | None = Field(default=None, foreign_key="software.id", primary_key=True)

class SoftwareWritesFormatLink(SQLModel, table=True):
    __tablename__ = "formats_written_by_software"
    format_id: str | None = Field(default=None, foreign_key="format.id", primary_key=True)
    software_id: str | None = Field(default=None, foreign_key="software.id", primary_key=True)

class Software(SQLModel, table=True):
    id: str | None = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    version: str | None = Field(index=True)
    summary: str | None = Field(index=True)
    license: str | None = Field(index=True)
    registry_url: str | None = Field(index=True)

    reads: list["Format"] = Relationship(back_populates="readers", link_model=SoftwareReadsFormatLink)
    writes: list["Format"] = Relationship(back_populates="writers", link_model=SoftwareWritesFormatLink)

    registry_id: str | None = Field(default=None, foreign_key="registry.id")
    registry: Registry | None = Relationship()

    # Define how to spot unique entries in a set
    def __hash__(self):
        return hash(self.id)
    def __eq__(self,other):
        return self.id == other.id

class FormatGenresLink(SQLModel, table=True):
    __tablename__ = "format_genres"
    format_id: str | None = Field(default=None, foreign_key="format.id", primary_key=True)
    genre_id: str | None = Field(default=None, foreign_key="genre.id", primary_key=True)

class Genre(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)
    name: str = Field(index=True)
    #
    formats: list["Format"] = Relationship(back_populates="genres", link_model=FormatGenresLink)

    # Define how to spot unique entries in a set
    def __hash__(self):
        return hash(self.name)
    def __eq__(self,other):
        return self.name == other.name

class MediaTypesFormatsLink(SQLModel, table=True):
    __tablename__ = "format_media_types"
    format_id: str | None = Field(default=None, foreign_key="format.id", primary_key=True)
    media_type_id: str | None = Field(default=None, foreign_key="media_type.id", primary_key=True)

class MediaType(SQLModel, table=True):
    __tablename__ = "media_type"
    id: str | None = Field(default=None, primary_key=True)
    #
    formats: list["Format"] = Relationship(back_populates="media_types", link_model=MediaTypesFormatsLink)

    # Define how to spot unique entries in a set
    def __hash__(self):
        return hash(self.id)
    def __eq__(self,other):
        return self.id == other.id

class Format(SQLModel, table=True):
    id: str | None = Field(default=None, primary_key=True)
    name: str | None = Field(index=True)
    version: str | None = Field(index=True)
    summary: str | None = Field(index=True)
    genres: list["Genre"] = Relationship(back_populates="formats", link_model=FormatGenresLink)
    extensions: list[str] | None = Field(default=None, sa_column=Column(JSON))
    media_types: list["MediaType"] = Relationship(back_populates="formats", link_model=MediaTypesFormatsLink)
    has_magic: bool = Field(default=False)
    primary_media_type: str | None = Field(index=True)
    parent_media_type: str | None = Field(index=True)
    registry_url: str | None = Field(index=True)
    registry_source_data_url: str | None = Field(index=True)
    registry_index_data_url: str | None = Field(index=True)
    created: date | None = Field(index=True)
    last_modified: date | None = Field(index=True)

    readers: list["Software"] = Relationship(back_populates="reads", link_model=SoftwareReadsFormatLink)
    writers: list["Software"] = Relationship(back_populates="writes", link_model=SoftwareWritesFormatLink)

    registry_id: str | None = Field(default=None, foreign_key="registry.id")
    registry: Registry | None = Relationship()
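
Several models in this file override `__hash__` and `__eq__` so that instances can be deduplicated in a `set` by a chosen key. The effect can be sketched in plain Python without SQLModel. `LogEntry` below is a hypothetical stand-in for `RegistryDataLogEntry`, reduced to two fields, and its `isinstance` guard is an addition the models above do not have.

```python
class LogEntry:
    """Hypothetical stand-in for a model that is deduplicated by its message."""

    def __init__(self, level, message):
        self.level = level
        self.message = message

    # Entries count as "the same" when their messages match.
    def __hash__(self):
        return hash(self.message)

    def __eq__(self, other):
        if not isinstance(other, LogEntry):
            return NotImplemented
        return self.message == other.message


entries = [
    LogEntry("warning", "Missing extension"),
    LogEntry("error", "Missing extension"),  # same message, different level
    LogEntry("warning", "Bad media type"),
]
unique = set(entries)
messages = sorted(entry.message for entry in unique)
```

Because only `message` takes part in the comparison, two entries with the same text but different levels collapse into one. The same consideration applies to any other fields that are left out of the key.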
