An evaluation framework for legal judgment prediction that integrates multi-agent graph retrieval and supports reproducible comparisons across multiple models and baselines.
- ✅ Automated Evaluation: Computes `Accuracy (Acc)` and `Micro-F1` automatically for legal judgment prediction tasks.
- ✅ Multi-Model Support: Supports Qwen, DeepSeek, GPT, InternLM, GLM, Gemma, and more.
- ✅ Dataset Coverage: Includes legal datasets such as CAIL and CMDL.
- ✅ Baseline Comparison: Enables direct comparison with `HippoRAG2`, `RAPTOR`, `LightRAG`, `LegalΔ`, and `ADAPT`.
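Micro-F1 aggregates true positives, false positives, and false negatives over all cases before computing a single F1 score. The sketch below illustrates the metric for multi-label charge predictions; it is a minimal reference implementation, not necessarily the one used in this repo:

```python
def micro_f1(preds, golds):
    """Micro-averaged F1 over per-case sets of predicted/gold labels.

    Counts are pooled globally across all cases, then F1 is computed once:
    F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        p, g = set(p), set(g)
        tp += len(p & g)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, predicting only `"theft"` when the gold labels are `"theft"` and `"robbery"` pools TP=1, FP=0, FN=1, giving a micro-F1 of 2/3.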
```
LegalGraphRAG/
├── core/                     # Core modules
│   ├── LegalGraphRAG.py      # Main LegalGraphRAG class
│   ├── models/               # Model implementations
│   │   ├── transformers/     # Transformers-based models (Qwen, InternLM, GLM, Gemma)
│   │   └── openai/           # OpenAI-compatible models (DeepSeek, GPT)
│   ├── graph_construct/      # Graph construction and management
│   ├── judge/                # Legal judgment modules
│   ├── preprocess/           # Data preprocessing
│   ├── prompt/               # Prompt templates
│   └── utils/                # Utility functions
├── scripts/                  # Data preparation scripts
├── raw_data/                 # User-provided source files for preprocessing
├── datas/                    # Generated preprocessing outputs
├── run.py                    # Main evaluation script
├── env.example               # Configuration file template
└── README.md                 # Project documentation
```
```bash
# Install dependencies
pip install -r requirements.txt

# Copy and configure environment file
cp env.example .env
# Edit .env with model paths, API keys, and runtime settings
```

Put these source files under `./raw_data/`:
- `final_test.json`: raw CAIL case records used to build the case corpus.
- `law_to_crime.json`: base mapping from law article ids to candidate crimes.
- `criminal_law_processed.json`: structured criminal law articles (article id + item texts).
- `judicial_explanations.json`: judicial interpretation snippets linked to law article ids.
- `law_corpus.jsonl`: full law text corpus used as a fallback when law text is missing.
Use one command to prepare all required data:
```bash
python scripts/prepare_data.py --dotenv-path .env --raw-data-dir ./raw_data
```

This pipeline does five things in order:
- Builds sampled CAIL cases from raw records.
- Generates the evaluation input file under `datasets/`.
- Uses an LLM to extract structured case features.
- Uses an LLM to generate law judgment dependency hints.
- Merges law resources into final project-ready law mapping data.
After these steps, make sure all three files exist:

- `datas/cases_with_feature.json`
- `datasets/crime_data_CAIL_small.json`
- `datas/law_to_crime.json`
```bash
python run.py --model qwen3 --datasets CAIL --devices cuda:2 cuda:3
```

Main arguments:
- `--model`: `qwen3`, `qwen2_5`, `gemma3`, `internlm3`, `glm4`, `deepseek_v3`, `gpt4o_mini`
- `--datasets`: dataset name, e.g. `CAIL`, `CMDL`
- `--dotenv_path`: path to `.env` (default: `.env`)
- `--datasets_path`: path to datasets (default: `./datasets`)
- `--devices`: GPU devices, e.g. `cuda:0 cuda:1`
- `--no-build-graph`: skip graph construction when the graph already exists
- `--force-rebuild`: force graph rebuild even if artifacts already exist
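For illustration, a hypothetical `argparse` sketch showing how these flags could map onto Python options; the real parser lives in `run.py` and may differ:

```python
import argparse

# Hypothetical parser mirroring the documented run.py flags.
parser = argparse.ArgumentParser(description="LegalGraphRAG evaluation")
parser.add_argument("--model", required=True,
                    choices=["qwen3", "qwen2_5", "gemma3", "internlm3",
                             "glm4", "deepseek_v3", "gpt4o_mini"])
parser.add_argument("--datasets", required=True, help="e.g. CAIL, CMDL")
parser.add_argument("--dotenv_path", default=".env")
parser.add_argument("--datasets_path", default="./datasets")
parser.add_argument("--devices", nargs="+", default=["cuda:0"],
                    help="e.g. cuda:0 cuda:1")
parser.add_argument("--no-build-graph", action="store_true",
                    help="skip graph construction when the graph already exists")
parser.add_argument("--force-rebuild", action="store_true",
                    help="force graph rebuild even if artifacts already exist")

# Parse the example invocation from above.
args = parser.parse_args(["--model", "qwen3", "--datasets", "CAIL",
                          "--devices", "cuda:2", "cuda:3"])
```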
- Prediction outputs: `{output_dir}/{dataset}/{model}_results_combined.json`
- Statistics: `{output_dir}/{dataset}/{model}_stats.json`
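Accuracy can be recovered from a stats file as `correct_count / total_cases`. A minimal sketch; the field names are taken from the example summary, but the helper itself is illustrative:

```python
def accuracy_from_stats(stats: dict) -> float:
    """Recompute accuracy from the fields stored in a {model}_stats.json dict."""
    return stats["correct_count"] / stats["total_cases"]

# Values from the example summary: 850 correct out of 1000 cases.
stats = {"correct_count": 850, "total_cases": 1000}
print(accuracy_from_stats(stats))  # → 0.85
```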
Example output summary:
```json
{
  "model_name": "qwen3",
  "dataset": "CAIL",
  "total_cases": 1000,
  "correct_count": 850,
  "elapsed_time": 3600.0,
  "output_file": "./outputs/CAIL/qwen3_results_combined.json"
}
```

Configuration is managed via `.env`. Key groups include:
- Model Configuration: model names, devices, API keys, generation parameters
- Data Configuration: dataset paths and output directory
- Graph Configuration: graph construction and retrieval settings
See `env.example` for the full configuration list.
- Qwen3-8B
- Qwen2.5-7B-Instruct
- DeepSeek-V3
- GPT-4o-mini
- InternLM3
- GLM-4
Run on multiple GPUs by passing several devices:
```bash
python run.py --model qwen3 --datasets CAIL --devices cuda:0 cuda:1 cuda:2 cuda:3
```

Cases are automatically distributed across the selected devices.
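One plausible distribution strategy is round-robin assignment; the sketch below is illustrative only, and the actual scheduling in `run.py` may differ:

```python
def distribute_cases(cases, devices):
    """Round-robin assignment of cases to devices (illustrative only)."""
    buckets = {d: [] for d in devices}
    for i, case in enumerate(cases):
        buckets[devices[i % len(devices)]].append(case)
    return buckets

# 10 cases spread across 4 GPUs: the first two devices get one extra case.
buckets = distribute_cases(list(range(10)),
                           ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])
print({d: len(v) for d, v in buckets.items()})
# → {'cuda:0': 3, 'cuda:1': 3, 'cuda:2': 2, 'cuda:3': 2}
```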
