Move recommended models into a Hugging Face dataset #262
Conversation
Pull request overview
This PR migrates the “recommended models” catalog from a bundled JSON file to a Hugging Face dataset, and adds CLI support to publish/share recommended entries back to that dataset.
Changes:
- Replaces the bundled `catalog.json` loader with a dataset-backed loader (plus index/metadata serialization helpers).
- Adds `mesh-llm models recommended share` to publish a recommendation entry and update `index.json` via the Hugging Face datasets commit API.
- Updates catalog resolution to match recommended models by stored `id` as well as display name, and removes the tracked `mesh-llm/src/models/catalog.json` file.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| mesh-llm/src/models/resolve.rs | Expands exact-catalog lookup to match by id and exposes GGUF quant selector helper within the crate. |
| mesh-llm/src/models/catalog.rs | Implements dataset-backed recommended catalog loading + index/metadata schema and serialization utilities. |
| mesh-llm/src/models/catalog.json | Removes the previously bundled recommended catalog entries. |
| mesh-llm/src/cli/models.rs | Introduces models recommended subcommands (including share) in the CLI schema. |
| mesh-llm/src/cli/commands/models/mod.rs | Implements dataset preload, recommended listing, and the dataset “share” workflow (NDJSON commit w/ PR creation). |
```diff
 pub static MODEL_CATALOG: LazyLock<Vec<CatalogModel>> = LazyLock::new(load_catalog);

 fn load_catalog() -> Vec<CatalogModel> {
-    let raw: Vec<CatalogModelJson> =
-        serde_json::from_str(include_str!("catalog.json")).expect("parse bundled catalog.json");
-    raw.into_iter().map(CatalogModel::from_json).collect()
+    load_catalog_from_dataset(DEFAULT_RECOMMENDED_MODELS_DATASET_REPO, None).unwrap_or_else(|err| {
+        eprintln!(
+            "⚠️ Failed to load recommended model dataset {}: {err:#}",
+            DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
+        );
+        Vec::new()
+    })
 }
```
MODEL_CATALOG is now populated by downloading/parsing a Hugging Face dataset during static initialization (LazyLock::new(load_catalog)). This makes unit tests that call find_model(...)/find_catalog_model_exact(...) depend on network/HF hub availability and can cause nondeterministic failures (the repo already has tests that unwrap catalog lookups, e.g. in this file’s test module). Consider adding a test-time/offline override (e.g. load from an embedded fixture under cfg(test) or env var), and/or avoid doing network I/O in a global initializer so tests and non-CLI code paths don’t unexpectedly block on remote fetches.
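One shape this could take (a minimal sketch, assuming a `MESH_LLM_OFFLINE_CATALOG` env var and a fixture loader — both hypothetical, not part of this PR; `String` stands in for `CatalogModel`):

```rust
use std::sync::LazyLock;

// Hypothetical stand-in for the crate's CatalogModel type.
type CatalogModel = String;

// In the real crate this could parse an embedded fixture,
// e.g. include_str!("catalog_fixture.json"), instead of fetching anything.
fn load_from_fixture() -> Vec<CatalogModel> {
    vec!["fixture/model:Q4_K_M".to_string()]
}

fn load_from_dataset() -> Vec<CatalogModel> {
    // The network-backed dataset load would go here.
    Vec::new()
}

// Core decision factored out so it is trivially unit-testable.
fn load_catalog_with(offline: bool) -> Vec<CatalogModel> {
    if offline {
        load_from_fixture()
    } else {
        load_from_dataset()
    }
}

// Check the override *before* any network I/O runs in the global initializer,
// so tests and offline runs never block on the HF hub.
fn load_catalog() -> Vec<CatalogModel> {
    let offline = cfg!(test) || std::env::var_os("MESH_LLM_OFFLINE_CATALOG").is_some();
    load_catalog_with(offline)
}

static MODEL_CATALOG: LazyLock<Vec<CatalogModel>> = LazyLock::new(load_catalog);

fn main() {
    assert_eq!(load_catalog_with(true).len(), 1);
    assert!(load_catalog_with(false).is_empty());
    println!("offline fixture entries: {}", load_catalog_with(true).len());
}
```

The key point is that the branch is decided inside the initializer, so the `LazyLock` wiring itself stays unchanged.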
```rust
Ok(index) => index
    .entries
    .into_iter()
    .map(|entry| entry.path)
    .collect::<Vec<_>>(),
```
When index.json is present, the loader trusts entry.path values and downloads them without validating they are actually recommended-model metadata paths (e.g. start with models/ and end with metadata.json). Validating paths before download would prevent accidentally (or maliciously) pulling unrelated/large files and makes the loader more robust to a corrupted index.
Suggested change:

```diff
-Ok(index) => index
-    .entries
-    .into_iter()
-    .map(|entry| entry.path)
-    .collect::<Vec<_>>(),
+Ok(index) => {
+    let mut metadata_paths = Vec::with_capacity(index.entries.len());
+    for entry in index.entries {
+        if !is_recommended_model_metadata_path(&entry.path) {
+            return Err(anyhow::anyhow!(
+                "recommended model index contains invalid metadata path: {}",
+                entry.path
+            ));
+        }
+        metadata_paths.push(entry.path);
+    }
+    metadata_paths
+}
```
```rust
let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
None
```
dataset_entry_id_from_metadata_path() is currently a stub that always returns None. As a result, legacy metadata entries with an empty id will be assigned an empty string (unwrap_or_default()), which can break id-based lookups and produce non-unique/blank ids in the in-memory catalog. Implement id extraction from the hashed metadata path (or remove this fallback and require id to be present).
Suggested change:

```diff
-let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
-None
+let relative_path = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
+let entry_id = relative_path.strip_suffix(&format!("/{}", RECOMMENDED_MODELS_METADATA_FILE))?;
+if entry_id.is_empty() || entry_id.contains('/') {
+    return None;
+}
+Some(entry_id.to_string())
```
```rust
fn parse_catalog_metadata(text: &str, metadata_path: &str, path: &Path) -> Result<CatalogModel> {
    let value: serde_json::Value =
        serde_json::from_str(text).with_context(|| format!("parse {}", path.display()))?;
    if value
        .get("schema_version")
        .and_then(|value| value.as_u64())
        .is_some()
    {
        let raw: CatalogMetadataV1 =
            serde_json::from_value(value).with_context(|| format!("parse {}", path.display()))?;
        return CatalogModel::from_metadata_v1(raw);
    }
```
parse_catalog_metadata() treats the mere presence of schema_version as “V1”, but it doesn’t check the actual value. If a future dataset entry uses schema_version: 2, this will attempt to deserialize into CatalogMetadataV1 and fail (or misinterpret fields). Consider explicitly validating schema_version == 1 and returning a clear error for unsupported versions.
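A minimal sketch of explicit version dispatch (names are illustrative; the real code would run this check on the parsed `schema_version` before choosing a deserialization target):

```rust
// Illustrative: validate the version number instead of only checking presence.
fn check_schema_version(version: Option<u64>) -> Result<(), String> {
    match version {
        Some(1) => Ok(()), // only V1 is currently understood
        Some(v) => Err(format!("unsupported recommended-model schema_version: {v}")),
        None => Err("metadata is missing schema_version".to_string()),
    }
}

fn main() {
    assert!(check_schema_version(Some(1)).is_ok());
    assert!(check_schema_version(Some(2)).is_err());
    assert!(check_schema_version(None).is_err());
    println!("ok");
}
```

The `Some(v)` arm gives future dataset entries a clear "unsupported version" error instead of a confusing V1 deserialization failure.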
```rust
pub fn run_model_recommended(json_output: bool) -> Result<()> {
    if !json_output {
        eprintln!(
            "🔎 Fetching recommended models from Hugging Face dataset {}...",
            catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
        );
        catalog::preload_catalog_dataset_with_progress(
            catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO,
            |progress| match progress {
                catalog::CatalogLoadProgress::ListingEntries => {}
                catalog::CatalogLoadProgress::LoadingEntry { completed, total } => {
                    if total == 0 {
                        return;
                    }
                    eprint!("\r  Loaded {completed}/{total} recommended entries...");
                    let _ = std::io::stderr().flush();
                    if completed == total {
                        eprintln!();
                    }
                }
            },
        )?;
    }
    let formatter = models_formatter(json_output);
    let models: Vec<_> = catalog::MODEL_CATALOG.iter().collect();
    formatter.render_recommended(&models)
}
```
run_model_recommended() preloads the dataset with progress, but it then reads from catalog::MODEL_CATALOG, which triggers load_catalog() and re-parses the entire dataset again. Even if hf_hub hits the local cache, this doubles the work and can noticeably slow the command for larger catalogs. Consider having the preload path populate the global catalog (so it’s reused), or restructure so the progress-capable load is the one used to produce the final models vector.
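One way to make the preload feed the same global (a sketch with `std::sync::OnceLock`; `String` stands in for `CatalogModel`, and the function names here are hypothetical):

```rust
use std::sync::OnceLock;

type CatalogModel = String;

static MODEL_CATALOG: OnceLock<Vec<CatalogModel>> = OnceLock::new();

// The progress-capable preload stores its result in the global;
// later readers reuse it instead of re-parsing the dataset.
fn preload_catalog(models: Vec<CatalogModel>) -> &'static [CatalogModel] {
    MODEL_CATALOG.get_or_init(|| models)
}

// Non-preloading callers fall back to a plain load, performed exactly once.
fn model_catalog() -> &'static [CatalogModel] {
    MODEL_CATALOG.get_or_init(|| {
        // load_catalog_from_dataset(...) would run here in the real crate.
        Vec::new()
    })
}

fn main() {
    // Simulate the progress-driven load producing the entries.
    preload_catalog(vec!["preloaded/model".to_string()]);
    // The subsequent read hits the already-populated global: no second parse.
    assert_eq!(model_catalog().len(), 1);
    assert_eq!(model_catalog()[0], "preloaded/model");
    println!("entries: {}", model_catalog().len());
}
```

`OnceLock::get_or_init` guarantees the initializer runs at most once, so whichever path wins (preload or lazy read) becomes the single source for both.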
```rust
let metadata_path = catalog::dataset_metadata_path_for_model_id(model);
let metadata_body = catalog::serialize_recommended_model_metadata(&catalog_model)?;
let mut index_entries = catalog::load_catalog_index(dataset_repo).unwrap_or_default();
if index_entries.iter().any(|entry| entry.id == model) {
    if json_output {
        formatters::print_json(json!({
            "status": "already_published",
            "dataset_repo": dataset_repo,
            "path": metadata_path,
            "id": model,
        }))?;
    } else {
        println!("✅ Already published");
        println!("  repo: {dataset_repo}");
        println!("  id: {model}");
    }
    return Ok(());
}
index_entries.push(catalog::build_catalog_index_entry(&catalog_model));
let index_body = catalog::serialize_recommended_catalog_index(&index_entries)?;
```
load_catalog_index(dataset_repo).unwrap_or_default() will silently treat any error (network issues, auth, parse failure) as “empty index”, and the subsequent commit will write an index.json containing only the new entry. That can unintentionally clobber the dataset’s index in the contribution PR. It’d be safer to fail on index load errors (or handle only a specific “not found” case) before generating and committing a replacement index.
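A sketch of the safer behavior, with a hypothetical error type that separates a genuinely missing index from every other failure:

```rust
type IndexEntry = String;

// Hypothetical error type; the real loader would map hf_hub/HTTP/parse
// errors into these variants.
enum IndexLoadError {
    NotFound,
    Other(String),
}

// Only "index.json does not exist yet" starts from an empty index.
// Network, auth, and parse failures abort the publish instead of
// silently committing a one-entry replacement index.
fn index_for_publish(
    loaded: Result<Vec<IndexEntry>, IndexLoadError>,
) -> Result<Vec<IndexEntry>, String> {
    match loaded {
        Ok(entries) => Ok(entries),
        Err(IndexLoadError::NotFound) => Ok(Vec::new()),
        Err(IndexLoadError::Other(msg)) => {
            Err(format!("refusing to rewrite index.json: {msg}"))
        }
    }
}

fn main() {
    assert_eq!(index_for_publish(Ok(vec!["a".into()])).unwrap().len(), 1);
    assert!(index_for_publish(Err(IndexLoadError::NotFound)).unwrap().is_empty());
    assert!(index_for_publish(Err(IndexLoadError::Other("timeout".into()))).is_err());
    println!("ok");
}
```

This keeps the "first ever entry" bootstrap case working while making a transient network error fail loudly rather than clobber the dataset's index in the contribution PR.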
```diff
@@ -87,27 +103,258 @@ struct CatalogModelJson {
     mmproj: Option<CatalogAsset>,
 }
```
LegacyCatalogModelJson requires an id field (id: String) but legacy catalog entries (including the removed catalog.json) did not have an id property. As written, serde_json::from_value::<LegacyCatalogModelJson> will fail before the later if raw.id.trim().is_empty() fallback runs. If you intend to support legacy entries, make id optional or add #[serde(default)] so missing ids deserialize as an empty string and can be filled from the metadata path.
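A stdlib sketch of that fallback, treating `id` as optional and deriving it from the hashed metadata path when absent (constants and helper names mirror the PR but are assumptions here; in the real struct the field would carry `#[serde(default)]` or be an `Option<String>`):

```rust
const RECOMMENDED_MODELS_PREFIX: &str = "models/";
const RECOMMENDED_MODELS_METADATA_FILE: &str = "metadata.json";

// Derive the entry id segment from models/<id>/metadata.json.
fn dataset_entry_id_from_metadata_path(path: &str) -> Option<String> {
    let rest = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
    let id = rest.strip_suffix(&format!("/{RECOMMENDED_MODELS_METADATA_FILE}"))?;
    (!id.is_empty() && !id.contains('/')).then(|| id.to_string())
}

// With the field optional, a legacy entry deserializes with id == None
// and is filled from its path instead of failing deserialization outright.
fn resolve_entry_id(raw_id: Option<&str>, metadata_path: &str) -> Option<String> {
    match raw_id {
        Some(id) if !id.trim().is_empty() => Some(id.to_string()),
        _ => dataset_entry_id_from_metadata_path(metadata_path),
    }
}

fn main() {
    let path = "models/sha256-deadbeef/metadata.json";
    assert_eq!(resolve_entry_id(None, path).as_deref(), Some("sha256-deadbeef"));
    assert_eq!(
        resolve_entry_id(Some("org/model:Q4"), path).as_deref(),
        Some("org/model:Q4")
    );
    println!("ok");
}
```

Returning `None` when neither source yields an id lets the caller skip or reject the entry explicitly rather than storing a blank id in the in-memory catalog.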
@copilot apply changes based on the comments in this thread
Summary
Users can now read and publish the recommended model catalog from the Hugging Face dataset `meshllm/recommended-models` instead of relying on a bundled `catalog.json` file.

This change:
- Adds `mesh-llm models recommended share` to publish recommended entries and maintain `index.json`
- Matches recommended models by stored `id` as well as display name
- Removes the tracked `mesh-llm/src/models/catalog.json` file

Dataset
The recommended model catalog now lives in `meshllm/recommended-models`.

Current layout:

```
index.json
models/sha256-<hash>/metadata.json
```

CLI Examples
List recommended models:

Publish a recommended model entry:

```
mesh-llm models recommended share unsloth/Qwen3-8B-GGUF:Q4_K_M \
  --description "Qwen3 mid-tier, strong for its size"
```

Simulated Output
Console mode:

JSON mode:

```json
{
  "status": "opened_pr",
  "dataset_repo": "meshllm/recommended-models",
  "path": "models/sha256-ce097731219bfb95091ea1767631e2c718b1585e3392353326fe24c8e7ec1bcb/metadata.json",
  "id": "unsloth/Qwen3-8B-GGUF:Q4_K_M",
  "commit_oid": "0123456789abcdef0123456789abcdef01234567",
  "commit_url": "https://huggingface.co/datasets/meshllm/recommended-models/commit/0123456789abcdef0123456789abcdef01234567",
  "pull_request_url": "https://huggingface.co/datasets/meshllm/recommended-models/discussions/12"
}
```

Already published:
Testing
- `cargo check -p mesh-llm`
- `cargo test -p mesh-llm --lib --tests`