Skip to content

Move recommended models into a Hugging Face dataset#262

Open
i386 wants to merge 2 commits intomainfrom
move-catalogue-to-hf-dataset
Open

Move recommended models into a Hugging Face dataset#262
i386 wants to merge 2 commits intomainfrom
move-catalogue-to-hf-dataset

Conversation

@i386
Copy link
Copy Markdown
Collaborator

@i386 i386 commented Apr 13, 2026

Summary

Users can now read and publish the recommended model catalog from the Hugging Face dataset meshllm/recommended-models instead of relying on a bundled catalog.json file.

This change:

  • replaces the bundled recommended model catalog with a dataset-backed loader
  • adds mesh-llm models recommended share to publish recommended entries and maintain index.json
  • migrates dataset entries to hashed metadata paths with structured recommendation metadata
  • updates catalog resolution to match recommended models by stored id as well as display name
  • removes the tracked mesh-llm/src/models/catalog.json file

Dataset

The recommended model catalog now lives in:

Current layout:

  • index.json
  • models/sha256-<hash>/metadata.json

CLI Examples

List recommended models:

mesh-llm models recommended
mesh-llm models recommended --json

Publish a recommended model entry:

mesh-llm models recommended share unsloth/Qwen3-8B-GGUF:Q4_K_M \
  --description "Qwen3 mid-tier, strong for its size"

Simulated Output

Console mode:

📤 Recommended model share
📦 Qwen3-8B-Q4_K_M
   id: unsloth/Qwen3-8B-GGUF:Q4_K_M
   source: unsloth/Qwen3-8B-GGUF:Q4_K_M
☁️ Dataset contribution
   repo: meshllm/recommended-models
   path: models/sha256-ce097731219bfb95091ea1767631e2c718b1585e3392353326fe24c8e7ec1bcb/metadata.json
⬆️ Opening contribution PR...
✅ Opened recommended model contribution
   commit: 0123456789abcdef0123456789abcdef01234567
   url: https://huggingface.co/datasets/meshllm/recommended-models/commit/0123456789abcdef0123456789abcdef01234567
   pr: https://huggingface.co/datasets/meshllm/recommended-models/discussions/12

JSON mode:

{
  "status": "opened_pr",
  "dataset_repo": "meshllm/recommended-models",
  "path": "models/sha256-ce097731219bfb95091ea1767631e2c718b1585e3392353326fe24c8e7ec1bcb/metadata.json",
  "id": "unsloth/Qwen3-8B-GGUF:Q4_K_M",
  "commit_oid": "0123456789abcdef0123456789abcdef01234567",
  "commit_url": "https://huggingface.co/datasets/meshllm/recommended-models/commit/0123456789abcdef0123456789abcdef01234567",
  "pull_request_url": "https://huggingface.co/datasets/meshllm/recommended-models/discussions/12"
}

Already published:

✅ Already published
   repo: meshllm/recommended-models
   id: unsloth/Qwen3-8B-GGUF:Q4_K_M

Testing

  • cargo check -p mesh-llm
  • cargo test -p mesh-llm --lib --tests

Copilot AI review requested due to automatic review settings April 13, 2026 01:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates the “recommended models” catalog from a bundled JSON file to a Hugging Face dataset, and adds CLI support to publish/share recommended entries back to that dataset.

Changes:

  • Replaces the bundled catalog.json loader with a dataset-backed loader (plus index/metadata serialization helpers).
  • Adds mesh-llm models recommended share to publish a recommendation entry and update index.json via the Hugging Face datasets commit API.
  • Updates catalog resolution to match recommended models by stored id as well as display name, and removes the tracked mesh-llm/src/models/catalog.json file.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
mesh-llm/src/models/resolve.rs Expands exact-catalog lookup to match by id and exposes GGUF quant selector helper within the crate.
mesh-llm/src/models/catalog.rs Implements dataset-backed recommended catalog loading + index/metadata schema and serialization utilities.
mesh-llm/src/models/catalog.json Removes the previously bundled recommended catalog entries.
mesh-llm/src/cli/models.rs Introduces models recommended subcommands (including share) in the CLI schema.
mesh-llm/src/cli/commands/models/mod.rs Implements dataset preload, recommended listing, and the dataset “share” workflow (NDJSON commit w/ PR creation).

Comment on lines 177 to +186
pub static MODEL_CATALOG: LazyLock<Vec<CatalogModel>> = LazyLock::new(load_catalog);

fn load_catalog() -> Vec<CatalogModel> {
let raw: Vec<CatalogModelJson> =
serde_json::from_str(include_str!("catalog.json")).expect("parse bundled catalog.json");
raw.into_iter().map(CatalogModel::from_json).collect()
load_catalog_from_dataset(DEFAULT_RECOMMENDED_MODELS_DATASET_REPO, None).unwrap_or_else(|err| {
eprintln!(
"⚠️ Failed to load recommended model dataset {}: {err:#}",
DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
);
Vec::new()
})
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MODEL_CATALOG is now populated by downloading/parsing a Hugging Face dataset during static initialization (LazyLock::new(load_catalog)). This makes unit tests that call find_model(...)/find_catalog_model_exact(...) depend on network/HF hub availability and can cause nondeterministic failures (the repo already has tests that unwrap catalog lookups, e.g. in this file’s test module). Consider adding a test-time/offline override (e.g. load from an embedded fixture under cfg(test) or env var), and/or avoid doing network I/O in a global initializer so tests and non-CLI code paths don’t unexpectedly block on remote fetches.

Copilot uses AI. Check for mistakes.
Comment on lines +210 to +214
Ok(index) => index
.entries
.into_iter()
.map(|entry| entry.path)
.collect::<Vec<_>>(),
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When index.json is present, the loader trusts entry.path values and downloads them without validating they are actually recommended-model metadata paths (e.g. start with models/ and end with metadata.json). Validating paths before download would prevent accidentally (or maliciously) pulling unrelated/large files and makes the loader more robust to a corrupted index.

Suggested change
Ok(index) => index
.entries
.into_iter()
.map(|entry| entry.path)
.collect::<Vec<_>>(),
Ok(index) => {
let mut metadata_paths = Vec::with_capacity(index.entries.len());
for entry in index.entries {
if !is_recommended_model_metadata_path(&entry.path) {
return Err(anyhow::anyhow!(
"recommended model index contains invalid metadata path: {}",
entry.path
));
}
metadata_paths.push(entry.path);
}
metadata_paths
}

Copilot uses AI. Check for mistakes.
Comment on lines +283 to +284
let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
None
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset_entry_id_from_metadata_path() is currently a stub that always returns None. As a result, legacy metadata entries with an empty id will be assigned an empty string (unwrap_or_default()), which can break id-based lookups and produce non-unique/blank ids in the in-memory catalog. Implement id extraction from the hashed metadata path (or remove this fallback and require id to be present).

Suggested change
let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
None
let relative_path = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
let entry_id = relative_path.strip_suffix(&format!("/{}", RECOMMENDED_MODELS_METADATA_FILE))?;
if entry_id.is_empty() || entry_id.contains('/') {
return None;
}
Some(entry_id.to_string())

Copilot uses AI. Check for mistakes.
Comment on lines +439 to +450
fn parse_catalog_metadata(text: &str, metadata_path: &str, path: &Path) -> Result<CatalogModel> {
let value: serde_json::Value =
serde_json::from_str(text).with_context(|| format!("parse {}", path.display()))?;
if value
.get("schema_version")
.and_then(|value| value.as_u64())
.is_some()
{
let raw: CatalogMetadataV1 =
serde_json::from_value(value).with_context(|| format!("parse {}", path.display()))?;
return CatalogModel::from_metadata_v1(raw);
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_catalog_metadata() treats the mere presence of schema_version as “V1”, but it doesn’t check the actual value. If a future dataset entry uses schema_version: 2, this will attempt to deserialize into CatalogMetadataV1 and fail (or misinterpret fields). Consider explicitly validating schema_version == 1 and returning a clear error for unsupported versions.

Copilot uses AI. Check for mistakes.
Comment on lines 95 to 121
pub fn run_model_recommended(json_output: bool) -> Result<()> {
if !json_output {
eprintln!(
"🔎 Fetching recommended models from Hugging Face dataset {}...",
catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
);
catalog::preload_catalog_dataset_with_progress(
catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO,
|progress| match progress {
catalog::CatalogLoadProgress::ListingEntries => {}
catalog::CatalogLoadProgress::LoadingEntry { completed, total } => {
if total == 0 {
return;
}
eprint!("\r Loaded {completed}/{total} recommended entries...");
let _ = std::io::stderr().flush();
if completed == total {
eprintln!();
}
}
},
)?;
}
let formatter = models_formatter(json_output);
let models: Vec<_> = catalog::MODEL_CATALOG.iter().collect();
formatter.render_recommended(&models)
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_model_recommended() preloads the dataset with progress, but it then reads from catalog::MODEL_CATALOG, which triggers load_catalog() and re-parses the entire dataset again. Even if hf_hub hits the local cache, this doubles the work and can noticeably slow the command for larger catalogs. Consider having the preload path populate the global catalog (so it’s reused), or restructure so the progress-capable load is the one used to produce the final models vector.

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +165
let metadata_path = catalog::dataset_metadata_path_for_model_id(model);
let metadata_body = catalog::serialize_recommended_model_metadata(&catalog_model)?;
let mut index_entries = catalog::load_catalog_index(dataset_repo).unwrap_or_default();
if index_entries.iter().any(|entry| entry.id == model) {
if json_output {
formatters::print_json(json!({
"status": "already_published",
"dataset_repo": dataset_repo,
"path": metadata_path,
"id": model,
}))?;
} else {
println!("✅ Already published");
println!(" repo: {dataset_repo}");
println!(" id: {model}");
}
return Ok(());
}
index_entries.push(catalog::build_catalog_index_entry(&catalog_model));
let index_body = catalog::serialize_recommended_catalog_index(&index_entries)?;

Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_catalog_index(dataset_repo).unwrap_or_default() will silently treat any error (network issues, auth, parse failure) as “empty index”, and the subsequent commit will write an index.json containing only the new entry. That can unintentionally clobber the dataset’s index in the contribution PR. It’d be safer to fail on index load errors (or handle only a specific “not found” case) before generating and committing a replacement index.

Copilot uses AI. Check for mistakes.
Comment on lines 91 to 104
@@ -87,27 +103,258 @@ struct CatalogModelJson {
mmproj: Option<CatalogAsset>,
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LegacyCatalogModelJson requires an id field (id: String) but legacy catalog entries (including the removed catalog.json) did not have an id property. As written, serde_json::from_value::<LegacyCatalogModelJson> will fail before the later if raw.id.trim().is_empty() fallback runs. If you intend to support legacy entries, make id optional or add #[serde(default)] so missing ids deserialize as an empty string and can be filled from the metadata path.

Copilot uses AI. Check for mistakes.
@i386
Copy link
Copy Markdown
Collaborator Author

i386 commented Apr 13, 2026

@copilot apply changes based on the comments in this thread

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants