Move recommended models into a Hugging Face dataset by i386 · Pull Request #262 · Mesh-LLM/mesh-llm

i386 · 2026-04-13T01:39:42Z

Summary

Users can now read and publish the recommended model catalog from the Hugging Face dataset meshllm/recommended-models instead of relying on a bundled catalog.json file.

This change:

replaces the bundled recommended model catalog with a dataset-backed loader
adds the `mesh-llm models recommended share` command to publish recommended entries and maintain index.json
migrates dataset entries to hashed metadata paths with structured recommendation metadata
updates catalog resolution to match recommended models by stored id as well as display name
removes the tracked mesh-llm/src/models/catalog.json file

Dataset

The recommended model catalog now lives in:

https://huggingface.co/datasets/meshllm/recommended-models

Current layout:

index.json
models/sha256-<hash>/metadata.json

CLI Examples

List recommended models:

mesh-llm models recommended
mesh-llm models recommended --json

Publish a recommended model entry:

mesh-llm models recommended share unsloth/Qwen3-8B-GGUF:Q4_K_M \
  --description "Qwen3 mid-tier, strong for its size"

Simulated Output

Console mode:

📤 Recommended model share
📦 Qwen3-8B-Q4_K_M
   id: unsloth/Qwen3-8B-GGUF:Q4_K_M
   source: unsloth/Qwen3-8B-GGUF:Q4_K_M
☁️ Dataset contribution
   repo: meshllm/recommended-models
   path: models/sha256-ce097731219bfb95091ea1767631e2c718b1585e3392353326fe24c8e7ec1bcb/metadata.json
⬆️ Opening contribution PR...
✅ Opened recommended model contribution
   commit: 0123456789abcdef0123456789abcdef01234567
   url: https://huggingface.co/datasets/meshllm/recommended-models/commit/0123456789abcdef0123456789abcdef01234567
   pr: https://huggingface.co/datasets/meshllm/recommended-models/discussions/12

JSON mode:

{
  "status": "opened_pr",
  "dataset_repo": "meshllm/recommended-models",
  "path": "models/sha256-ce097731219bfb95091ea1767631e2c718b1585e3392353326fe24c8e7ec1bcb/metadata.json",
  "id": "unsloth/Qwen3-8B-GGUF:Q4_K_M",
  "commit_oid": "0123456789abcdef0123456789abcdef01234567",
  "commit_url": "https://huggingface.co/datasets/meshllm/recommended-models/commit/0123456789abcdef0123456789abcdef01234567",
  "pull_request_url": "https://huggingface.co/datasets/meshllm/recommended-models/discussions/12"
}

Already published:

✅ Already published
   repo: meshllm/recommended-models
   id: unsloth/Qwen3-8B-GGUF:Q4_K_M

Testing

cargo check -p mesh-llm
cargo test -p mesh-llm --lib --tests

Copilot

Pull request overview

This PR migrates the “recommended models” catalog from a bundled JSON file to a Hugging Face dataset, and adds CLI support to publish/share recommended entries back to that dataset.

Changes:

Replaces the bundled catalog.json loader with a dataset-backed loader (plus index/metadata serialization helpers).
Adds the `mesh-llm models recommended share` command to publish a recommendation entry and update index.json via the Hugging Face datasets commit API.
Updates catalog resolution to match recommended models by stored id as well as display name, and removes the tracked mesh-llm/src/models/catalog.json file.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

File	Description
mesh-llm/src/models/resolve.rs	Expands exact-catalog lookup to match by `id` and exposes GGUF quant selector helper within the crate.
mesh-llm/src/models/catalog.rs	Implements dataset-backed recommended catalog loading + index/metadata schema and serialization utilities.
mesh-llm/src/models/catalog.json	Removes the previously bundled recommended catalog entries.
mesh-llm/src/cli/models.rs	Introduces `models recommended` subcommands (including `share`) in the CLI schema.
mesh-llm/src/cli/commands/models/mod.rs	Implements dataset preload, recommended listing, and the dataset “share” workflow (NDJSON commit w/ PR creation).

Copilot · 2026-04-13T01:44:24Z

 pub static MODEL_CATALOG: LazyLock<Vec<CatalogModel>> = LazyLock::new(load_catalog);

 fn load_catalog() -> Vec<CatalogModel> {
-    let raw: Vec<CatalogModelJson> =
-        serde_json::from_str(include_str!("catalog.json")).expect("parse bundled catalog.json");
-    raw.into_iter().map(CatalogModel::from_json).collect()
+    load_catalog_from_dataset(DEFAULT_RECOMMENDED_MODELS_DATASET_REPO, None).unwrap_or_else(|err| {
+        eprintln!(
+            "⚠️ Failed to load recommended model dataset {}: {err:#}",
+            DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
+        );
+        Vec::new()
+    })


MODEL_CATALOG is now populated by downloading/parsing a Hugging Face dataset during static initialization (LazyLock::new(load_catalog)). This makes unit tests that call find_model(...)/find_catalog_model_exact(...) depend on network/HF hub availability and can cause nondeterministic failures (the repo already has tests that unwrap catalog lookups, e.g. in this file’s test module). Consider adding a test-time/offline override (e.g. load from an embedded fixture under cfg(test) or env var), and/or avoid doing network I/O in a global initializer so tests and non-CLI code paths don’t unexpectedly block on remote fetches.
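One possible shape for the offline override suggested above, as a minimal sketch. `CatalogOrigin`, `choose_origin`, `fixture_catalog`, `load_catalog_from`, and the environment variable name `MESH_LLM_CATALOG_OFFLINE` are all hypothetical and not part of this PR; `CatalogModel` is reduced to two fields for illustration.

```rust
/// Simplified stand-in for the crate's `CatalogModel` (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogModel {
    pub id: String,
    pub name: String,
}

/// Where catalog entries come from (hypothetical type, not in the PR).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CatalogOrigin {
    /// Small in-memory fixture; never touches the network.
    Fixture,
    /// The Hugging Face dataset.
    Remote,
}

/// Picks the origin. `offline_flag` would typically hold the value of an
/// environment variable such as `MESH_LLM_CATALOG_OFFLINE` (hypothetical name).
/// Any non-empty value other than "0" forces the fixture.
pub fn choose_origin(is_test_build: bool, offline_flag: Option<&str>) -> CatalogOrigin {
    let forced_offline = matches!(offline_flag, Some(v) if !v.is_empty() && v != "0");
    if is_test_build || forced_offline {
        CatalogOrigin::Fixture
    } else {
        CatalogOrigin::Remote
    }
}

/// Entries that tests can rely on without network access.
pub fn fixture_catalog() -> Vec<CatalogModel> {
    vec![CatalogModel {
        id: "unsloth/Qwen3-8B-GGUF:Q4_K_M".to_string(),
        name: "Qwen3-8B-Q4_K_M".to_string(),
    }]
}

/// Loads the catalog. `fetch_remote` is only invoked for `CatalogOrigin::Remote`;
/// a remote failure falls back to an empty catalog, mirroring the PR's behavior.
pub fn load_catalog_from<F>(origin: CatalogOrigin, fetch_remote: F) -> Vec<CatalogModel>
where
    F: FnOnce() -> Result<Vec<CatalogModel>, String>,
{
    match origin {
        CatalogOrigin::Fixture => fixture_catalog(),
        CatalogOrigin::Remote => fetch_remote().unwrap_or_else(|err| {
            eprintln!("⚠️ Failed to load recommended model dataset: {err}");
            Vec::new()
        }),
    }
}
```

Under these assumptions, the `LazyLock` initializer could pass `cfg!(test)` and the environment variable's value to `choose_origin`, so unit tests that unwrap catalog lookups would resolve against the fixture and never reach the hub.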

Copilot · 2026-04-13T01:44:25Z

+        Ok(index) => index
+            .entries
+            .into_iter()
+            .map(|entry| entry.path)
+            .collect::<Vec<_>>(),


When index.json is present, the loader trusts entry.path values and downloads them without validating they are actually recommended-model metadata paths (e.g. start with models/ and end with metadata.json). Validating paths before download would prevent accidentally (or maliciously) pulling unrelated/large files and makes the loader more robust to a corrupted index.

Suggested change

-        Ok(index) => index
-            .entries
-            .into_iter()
-            .map(|entry| entry.path)
-            .collect::<Vec<_>>(),
+        Ok(index) => {
+            let mut metadata_paths = Vec::with_capacity(index.entries.len());
+            for entry in index.entries {
+                if !is_recommended_model_metadata_path(&entry.path) {
+                    return Err(anyhow::anyhow!(
+                        "recommended model index contains invalid metadata path: {}",
+                        entry.path
+                    ));
+                }
+                metadata_paths.push(entry.path);
+            }
+            metadata_paths
+        }
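The suggestion calls `is_recommended_model_metadata_path` without showing its body. A minimal sketch of what such a check might look like follows, assuming the documented layout `models/sha256-<hash>/metadata.json`. The constant values are assumptions (the PR references the names but does not show them), and `validated_metadata_paths` is a hypothetical helper that returns a plain `String` error instead of `anyhow::Error` to keep the sketch dependency-free.

```rust
// Assumed values; the PR references these constants but does not show them.
const RECOMMENDED_MODELS_PREFIX: &str = "models/";
const RECOMMENDED_MODELS_METADATA_FILE: &str = "metadata.json";

/// Returns true only for paths of the form `models/<single-segment>/metadata.json`.
pub fn is_recommended_model_metadata_path(path: &str) -> bool {
    let Some(rest) = path.strip_prefix(RECOMMENDED_MODELS_PREFIX) else {
        return false;
    };
    let Some(dir) = rest
        .strip_suffix(RECOMMENDED_MODELS_METADATA_FILE)
        .and_then(|r| r.strip_suffix('/'))
    else {
        return false;
    };
    // Exactly one non-empty directory segment, and no `.`/`..` traversal.
    !dir.is_empty() && !dir.contains('/') && dir != "." && dir != ".."
}

/// Hypothetical helper: validates every index path before anything is downloaded,
/// failing on the first invalid entry.
pub fn validated_metadata_paths<I>(paths: I) -> Result<Vec<String>, String>
where
    I: IntoIterator<Item = String>,
{
    paths
        .into_iter()
        .map(|path| {
            if is_recommended_model_metadata_path(&path) {
                Ok(path)
            } else {
                Err(format!(
                    "recommended model index contains invalid metadata path: {path}"
                ))
            }
        })
        .collect()
}
```

Failing the whole load on one bad entry matches the suggested change; skipping invalid entries with a warning would be a more lenient alternative.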

Copilot · 2026-04-13T01:44:25Z

+    let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
+    None


dataset_entry_id_from_metadata_path() is currently a stub that always returns None. As a result, legacy metadata entries with an empty id will be assigned an empty string (unwrap_or_default()), which can break id-based lookups and produce non-unique/blank ids in the in-memory catalog. Implement id extraction from the hashed metadata path (or remove this fallback and require id to be present).

Suggested change

-    let _ = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
-    None
+    let relative_path = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
+    let entry_id = relative_path.strip_suffix(&format!("/{}", RECOMMENDED_MODELS_METADATA_FILE))?;
+    if entry_id.is_empty() || entry_id.contains('/') {
+        return None;
+    }
+    Some(entry_id.to_string())
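A standalone version of the suggested extraction, plus the blank-id fallback it is meant to serve, can be sketched as follows. The constant values are assumptions (only their names appear in the PR), and `resolve_entry_id` is a hypothetical helper standing in for the `unwrap_or_default()` call site.

```rust
// Assumed values; the PR references these constants but does not show them.
const RECOMMENDED_MODELS_PREFIX: &str = "models/";
const RECOMMENDED_MODELS_METADATA_FILE: &str = "metadata.json";

/// Extracts the single directory segment between `models/` and `/metadata.json`.
pub fn dataset_entry_id_from_metadata_path(path: &str) -> Option<String> {
    let relative_path = path.strip_prefix(RECOMMENDED_MODELS_PREFIX)?;
    let suffix = format!("/{}", RECOMMENDED_MODELS_METADATA_FILE);
    let entry_id = relative_path.strip_suffix(suffix.as_str())?;
    if entry_id.is_empty() || entry_id.contains('/') {
        return None;
    }
    Some(entry_id.to_string())
}

/// Hypothetical fallback: keep a non-blank stored id, otherwise derive one
/// from the metadata path. Returns `None` rather than an empty string when
/// neither source yields an id.
pub fn resolve_entry_id(stored_id: &str, metadata_path: &str) -> Option<String> {
    let trimmed = stored_id.trim();
    if !trimmed.is_empty() {
        return Some(trimmed.to_string());
    }
    dataset_entry_id_from_metadata_path(metadata_path)
}
```

Note that with the hashed layout the derived value is the directory name (for example `sha256-<hash>`), not a model reference such as `unsloth/Qwen3-8B-GGUF:Q4_K_M`, so derived ids would be unique but would not match lookups by model reference.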

Copilot · 2026-04-13T01:44:25Z

+fn parse_catalog_metadata(text: &str, metadata_path: &str, path: &Path) -> Result<CatalogModel> {
+    let value: serde_json::Value =
+        serde_json::from_str(text).with_context(|| format!("parse {}", path.display()))?;
+    if value
+        .get("schema_version")
+        .and_then(|value| value.as_u64())
+        .is_some()
+    {
+        let raw: CatalogMetadataV1 =
+            serde_json::from_value(value).with_context(|| format!("parse {}", path.display()))?;
+        return CatalogModel::from_metadata_v1(raw);
+    }


parse_catalog_metadata() treats the mere presence of schema_version as “V1”, but it doesn’t check the actual value. If a future dataset entry uses schema_version: 2, this will attempt to deserialize into CatalogMetadataV1 and fail (or misinterpret fields). Consider explicitly validating schema_version == 1 and returning a clear error for unsupported versions.
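The explicit version check could be sketched like this. `MetadataFormat` and `detect_metadata_format` are hypothetical names; in the PR's code the input would come from `value.get("schema_version").and_then(|v| v.as_u64())`, and the error would presumably be an `anyhow` error rather than a `String`.

```rust
/// Which deserialization path a metadata document should take (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum MetadataFormat {
    /// No `schema_version` field: treat as a legacy entry.
    Legacy,
    /// `schema_version: 1`: deserialize as `CatalogMetadataV1`.
    V1,
}

/// Dispatches on the parsed `schema_version` value and rejects unknown versions
/// with a clear message instead of attempting a V1 parse.
pub fn detect_metadata_format(schema_version: Option<u64>) -> Result<MetadataFormat, String> {
    match schema_version {
        None => Ok(MetadataFormat::Legacy),
        Some(1) => Ok(MetadataFormat::V1),
        Some(other) => Err(format!(
            "unsupported recommended model schema_version {other}; this build supports version 1"
        )),
    }
}
```

One caveat with this input shape: `as_u64()` yields `None` for a non-integer `schema_version` (for example a string), so such a document would still be routed to the legacy path unless the caller checks for the field's presence separately.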

Copilot · 2026-04-13T01:44:25Z

 pub fn run_model_recommended(json_output: bool) -> Result<()> {
+    if !json_output {
+        eprintln!(
+            "🔎 Fetching recommended models from Hugging Face dataset {}...",
+            catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO
+        );
+        catalog::preload_catalog_dataset_with_progress(
+            catalog::DEFAULT_RECOMMENDED_MODELS_DATASET_REPO,
+            |progress| match progress {
+                catalog::CatalogLoadProgress::ListingEntries => {}
+                catalog::CatalogLoadProgress::LoadingEntry { completed, total } => {
+                    if total == 0 {
+                        return;
+                    }
+                    eprint!("\r   Loaded {completed}/{total} recommended entries...");
+                    let _ = std::io::stderr().flush();
+                    if completed == total {
+                        eprintln!();
+                    }
+                }
+            },
+        )?;
+    }
    let formatter = models_formatter(json_output);
    let models: Vec<_> = catalog::MODEL_CATALOG.iter().collect();
    formatter.render_recommended(&models)
 }


run_model_recommended() preloads the dataset with progress, but it then reads from catalog::MODEL_CATALOG, which triggers load_catalog() and re-parses the entire dataset again. Even if hf_hub hits the local cache, this doubles the work and can noticeably slow the command for larger catalogs. Consider having the preload path populate the global catalog (so it’s reused), or restructure so the progress-capable load is the one used to produce the final models vector.
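One way to let the preload populate the shared catalog is a `OnceLock`-backed cache, sketched below. `CatalogCache` and its methods are hypothetical and not part of the PR; `CatalogModel` is reduced to one field for illustration.

```rust
use std::sync::OnceLock;

/// Simplified stand-in for the crate's `CatalogModel` (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogModel {
    pub id: String,
}

/// Hypothetical cache that is filled at most once, either by an explicit
/// preload or lazily on first access.
pub struct CatalogCache {
    cell: OnceLock<Vec<CatalogModel>>,
}

impl CatalogCache {
    pub const fn new() -> Self {
        Self { cell: OnceLock::new() }
    }

    /// Stores the result of the progress-capable load.
    /// Returns false (and drops `models`) if the cache was already filled.
    pub fn preload(&self, models: Vec<CatalogModel>) -> bool {
        self.cell.set(models).is_ok()
    }

    /// Returns the cached catalog, running `load` only if nothing was preloaded.
    pub fn get_or_load<F>(&self, load: F) -> &[CatalogModel]
    where
        F: FnOnce() -> Vec<CatalogModel>,
    {
        self.cell.get_or_init(load)
    }
}
```

With a `static CATALOG: CatalogCache = CatalogCache::new();`, `run_model_recommended()` could hand the progress-reported result to `preload` and then render from `get_or_load`, so the dataset would be parsed once per process under this design.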

Copilot · 2026-04-13T01:44:26Z

+    let metadata_path = catalog::dataset_metadata_path_for_model_id(model);
+    let metadata_body = catalog::serialize_recommended_model_metadata(&catalog_model)?;
+    let mut index_entries = catalog::load_catalog_index(dataset_repo).unwrap_or_default();
+    if index_entries.iter().any(|entry| entry.id == model) {
+        if json_output {
+            formatters::print_json(json!({
+                "status": "already_published",
+                "dataset_repo": dataset_repo,
+                "path": metadata_path,
+                "id": model,
+            }))?;
+        } else {
+            println!("✅ Already published");
+            println!("   repo: {dataset_repo}");
+            println!("   id: {model}");
+        }
+        return Ok(());
+    }
+    index_entries.push(catalog::build_catalog_index_entry(&catalog_model));
+    let index_body = catalog::serialize_recommended_catalog_index(&index_entries)?;
+


load_catalog_index(dataset_repo).unwrap_or_default() will silently treat any error (network issues, auth, parse failure) as “empty index”, and the subsequent commit will write an index.json containing only the new entry. That can unintentionally clobber the dataset’s index in the contribution PR. It’d be safer to fail on index load errors (or handle only a specific “not found” case) before generating and committing a replacement index.
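A sketch of the narrower handling: only a missing index is treated as empty, and every other failure aborts before a replacement index.json is generated. `IndexEntry`, `IndexLoadError`, and `existing_index_or_empty` are hypothetical; how the real loader would distinguish "not found" from other errors depends on the hub client and is not shown in the PR.

```rust
/// Simplified stand-in for an index.json entry (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct IndexEntry {
    pub id: String,
    pub path: String,
}

/// Hypothetical error classification for loading the dataset index.
#[derive(Debug, PartialEq)]
pub enum IndexLoadError {
    /// index.json does not exist yet in the dataset.
    NotFound,
    /// Network, auth, or parse failure.
    Other(String),
}

/// Treats only a missing index as empty; any other failure is returned as an
/// error so the caller never commits an index containing just the new entry.
pub fn existing_index_or_empty(
    loaded: Result<Vec<IndexEntry>, IndexLoadError>,
) -> Result<Vec<IndexEntry>, String> {
    match loaded {
        Ok(entries) => Ok(entries),
        Err(IndexLoadError::NotFound) => Ok(Vec::new()),
        Err(IndexLoadError::Other(reason)) => Err(format!(
            "refusing to rewrite index.json: could not load the existing index: {reason}"
        )),
    }
}
```

The share command would then propagate the error with `?` in place of `unwrap_or_default()`.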

Copilot · 2026-04-13T01:44:26Z

@@ -87,27 +103,258 @@ struct CatalogModelJson {
    mmproj: Option<CatalogAsset>,
 }


LegacyCatalogModelJson requires an id field (id: String) but legacy catalog entries (including the removed catalog.json) did not have an id property. As written, serde_json::from_value::<LegacyCatalogModelJson> will fail before the later if raw.id.trim().is_empty() fallback runs. If you intend to support legacy entries, make id optional or add #[serde(default)] so missing ids deserialize as an empty string and can be filled from the metadata path.

i386 · 2026-04-13T02:00:10Z

@copilot apply changes based on the comments in this thread

Move recommended models catalog to dataset-backed schema

1dd92c0

Copilot AI review requested due to automatic review settings April 13, 2026 01:39

Copilot started reviewing on behalf of i386 April 13, 2026 01:40

Copilot started work on behalf of i386 April 13, 2026 02:00

Fix catalog tests for dataset-backed model metadata

394a050

i386 wants to merge 2 commits into main from move-catalogue-to-hf-dataset

2 participants
