This directory contains examples demonstrating various features of the statqa library, including multimodal Q/A database generation with rich visual metadata and publication-quality plots.
Start with basic_usage.py for a simple introduction to the core concepts.
A standalone script showing the fundamental statqa workflow:
- Creating codebooks from text
- Running univariate and bivariate analyses
- Generating insights and Q/A pairs
Best for: First-time users, understanding core concepts
Generate and analyze synthetic survey data.
- Quick setup (no downloads required)
- Demonstrates end-to-end pipeline
- Good for testing and experimentation
Best for: Learning the full pipeline, prototyping
Classic dataset with flower measurements.
- Small, well-understood dataset
- Continuous numeric variables
- Multi-class classification context
- 39 multimodal Q/A pairs with 15 visualizations
Best for: Testing statistical analyses, visualization, multimodal Q/A generation
Synthetic employee workplace survey.
- Mix of categorical and numeric variables
- Realistic survey structure
- Common workplace metrics
- 35 multimodal Q/A pairs with 15 visualizations
Best for: Understanding survey analysis, practicing with mixed data types, multimodal datasets
Historical passenger survival data.
- Binary outcome (survival)
- Mix of demographics and ticket information
- Well-known dataset for comparison
- 28 multimodal Q/A pairs with 15 visualizations
Best for: Classification analysis, survival analysis, CLIP-style training data
American National Election Studies political survey (1948-2020).
- Large-scale real-world dataset
- Complex codebook with 1000+ variables
- Demonstrates advanced features
Best for: Production use cases, large-scale analysis, research applications
python examples/basic_usage.pycd examples/simple_survey
python quick_start.pyQuick Analysis:
# Iris
cd examples/iris
python run_analysis.pyCustom Analysis:
cd examples/iris
python -c "
import pandas as pd
import json
from statqa.analysis.univariate import UnivariateAnalyzer
from statqa.interpretation.formatter import InsightFormatter
from statqa.metadata.model import Codebook
data = pd.read_csv('data.csv')
with open('codebook.json') as f:
codebook = Codebook.from_dict(json.load(f))
analyzer = UnivariateAnalyzer()
formatter = InsightFormatter()
for var_name, variable in codebook.variables.items():
if var_name in data.columns:
result = analyzer.analyze(data[var_name], variable)
print(formatter.format_univariate(result))
"Generated Output: Each example produces:
qa_pairs.jsonl: Enhanced Q/A pairs with visual metadataplots/: Publication-quality visualizations (PNG files)insights.json: Statistical analysis results
cd examples/anes
# Step 1: Parse metadata (requires PDF files)
python parse_metadata.py \
--codebook data/raw/anes_timeseries_cdf_codebook_var_20220916.pdf \
--output-metadata data/anes_metadata.csv \
--skip-questions
# Step 2: Extract insights
python extract_insights.py \
--data-zip data/raw/anes_timeseries_cdf_csv_20220916.csv.zip \
--metadata data/anes_metadata.csv \
--output-dir output \
--max-vars 50| Example | Complexity | Setup Time | Best For |
|---|---|---|---|
basic_usage.py |
Beginner | < 1 min | Learning basics |
simple_survey/ |
Beginner | < 1 min | Full pipeline |
iris/ |
Beginner | < 1 min | Statistics |
employee/ |
Intermediate | < 1 min | Survey analysis |
titanic/ |
Intermediate | < 1 min | Classification |
anes/ |
Advanced | ~10 min | Research |
from statqa.qa.generator import QAGenerator
qa_gen = QAGenerator(use_llm=False) # Template-based
qa_pairs = qa_gen.generate_qa_pairs(result, answer)
for qa in qa_pairs:
print(f"Q: {qa['question']}")
print(f"A: {qa['answer']}\n")import json
from statqa.utils.io import export_insights
# Save to JSON
with open('insights.json', 'w') as f:
json.dump(insights, f, indent=2)from statqa.visualization.plots import PlotFactory
plotter = PlotFactory(style='seaborn')
plotter.plot_univariate(result, output_path='distribution.png')
plotter.plot_bivariate(result, output_path='relationship.png')examples/
├── README.md # This file
├── basic_usage.py # Simple starting point
├── iris/ # Iris flowers dataset
│ ├── README.md
│ ├── data.csv
│ ├── codebook.json
│ ├── run_analysis.py # Complete multimodal analysis
│ ├── qa_pairs.jsonl # 39 Q/A pairs with visual metadata
│ ├── insights.json # Statistical results
│ └── plots/ # 15 visualizations (PNG files)
├── employee/ # Employee survey data
│ ├── README.md
│ ├── data.csv
│ ├── codebook.json
│ ├── run_analysis.py # Complete multimodal analysis
│ ├── qa_pairs.jsonl # 35 Q/A pairs with visual metadata
│ ├── insights.json # Statistical results
│ └── plots/ # 15 visualizations (PNG files)
├── titanic/ # Titanic survival data
│ ├── README.md
│ ├── data.csv
│ ├── codebook.json
│ ├── run_analysis.py # Complete multimodal analysis
│ ├── qa_pairs.jsonl # 28 Q/A pairs with visual metadata
│ ├── insights.json # Statistical results
│ └── plots/ # 15 visualizations (PNG files)
├── anes/ # ANES political survey (advanced)
│ ├── README.md
│ ├── parse_metadata.py
│ ├── extract_insights.py
│ ├── data/
│ └── output/
└── simple_survey/ # Synthetic survey generation
├── README.md
└── quick_start.py
- Documentation: https://gojiplus.github.io/statqa
- Issues: https://github.com/gojiplus/statqa/issues
- Discussions: https://github.com/gojiplus/statqa/discussions
- Start with
basic_usage.pyto learn the fundamentals - Try
simple_survey/for a complete workflow - Run multimodal examples:
cd iris && python run_analysis.pyfor enhanced Q/A generation - Explore visual metadata: Check the
plots/directories andqa_pairs.jsonlformat - Explore domain-specific examples (iris, employee, titanic) for different data types
- Build CLIP-style datasets: Use the multimodal Q/A pairs for AI training
- Tackle
anes/for advanced, real-world usage - Adapt examples to your own datasets with visualization requirements