Spanish documentation: README_SPANISH.md
Lightweight library to convert your SQL database schema into a vector database, so LLMs can use real table/column context to generate more accurate SQL from natural language.
I wanted to interact with SQL databases through an LLM without pulling in large frameworks with many unrelated features.
NaturalSQL extracts your database schema, vectorizes it using a configurable backend (chroma or sqlite) and configurable embedding provider (local or gemini), then performs semantic retrieval to return relevant schema context for your LLM.
```bash
# Base installation
pip install naturalsql

# Chroma + local embeddings
pip install "naturalsql[chroma-local]"

# SQLite + local embeddings
pip install "naturalsql[sqlite-local]"

# SQLite + Gemini embeddings
pip install "naturalsql[sqlite-gemini]"

# Chroma + Gemini embeddings
pip install "naturalsql[chroma-gemini]"

# PostgreSQL driver support
pip install "naturalsql[postgresql]"

# MySQL driver support
pip install "naturalsql[mysql]"

# SQL Server driver support
pip install "naturalsql[sqlserver]"

# All DB drivers
pip install "naturalsql[all-db]"
```

SQLite is included in the Python standard library. For Gemini, the current SDK is `google-genai` (import path: `google.genai`). Use environment variables for API keys (e.g. `GEMINI_API_KEY`); never hardcode secrets.
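As a minimal sketch of the environment-variable advice above, the snippet below reads a key and fails early with a readable message when it is missing. The helper name `require_env` is hypothetical and not part of NaturalSQL.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value


# Hypothetical usage: the returned value could be passed as gemini_api_key=...
try:
    api_key = require_env("GEMINI_API_KEY")
    print("Key loaded, length:", len(api_key))
except RuntimeError as exc:
    print(exc)
```

Failing at startup keeps a missing key from surfacing later as a less obvious error in the middle of indexing.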
- PostgreSQL
- MySQL
- SQL Server
- SQLite
```python
from naturalsql import NaturalSQL

# 1) Create an instance
nsql = NaturalSQL(
    db_url="postgresql://user:password@localhost:5432/mydb",
    db_type="postgresql",
)

# 2) Build vector DB from schema
result = nsql.build_vector_db()
print(f"Indexed tables: {result['indexed_documents']}")
print(f"From cache: {result['from_cache']}")

# 3) Retrieve relevant tables for a question
tables = nsql.search("Show me sales from last month")

# 4) Use returned tables as LLM context
for table in tables:
    print(table)
```

The first `search()` call loads the embedding model (~2-5 s). Subsequent calls reuse the in-memory instance and typically run in ~10-15 ms.
Similarly, `build_vector_db()` reuses existing vector storage when available (`forced_reset=False`) to avoid unnecessary reindexing.
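The load-once behavior described above can be illustrated with a small sketch. This is not NaturalSQL's actual implementation: `LazyEmbedder` and `fake_loader` are hypothetical names, and the fake model only stands in for a real embedding model.

```python
import time
from typing import Callable, List, Optional

EmbedFn = Callable[[str], List[float]]


class LazyEmbedder:
    """Loads an expensive model on first use and reuses it afterwards."""

    def __init__(self, loader: Callable[[], EmbedFn]) -> None:
        self._loader = loader
        self._model: Optional[EmbedFn] = None
        self.load_count = 0

    def embed(self, text: str) -> List[float]:
        if self._model is None:  # only the first call pays the loading cost
            self._model = self._loader()
            self.load_count += 1
        return self._model(text)


def fake_loader() -> EmbedFn:
    """Stand-in for a real model load; sleeps briefly to mimic the cost."""
    time.sleep(0.05)
    return lambda text: [float(len(text)), float(text.count(" "))]


embedder = LazyEmbedder(fake_loader)
first = embedder.embed("sales from last month")
second = embedder.embed("users with recent purchases")
print(embedder.load_count)  # the loader ran once for two embed() calls
```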
```python
from naturalsql import NaturalSQL

# A) Chroma + local (practical tuning for Chroma 1.5.x)
nsql = NaturalSQL(
    db_url="postgresql://user:pass@localhost:5432/mydb",
    db_type="postgresql",
    vector_backend="chroma",
    embedding_provider="local",
    vector_distance_threshold=1.6,  # If search() returns [], try 1.4-1.6
)
nsql.build_vector_db(storage_path="./metadata_vdb", forced_reset=False)
tables = nsql.search("sales for the last month", limit=3)
print(tables)
```

```python
import os

from naturalsql import NaturalSQL

# B) SQLite + Gemini
nsql = NaturalSQL(
    db_url="sqlite:///./app.db",
    db_type="sqlite",
    vector_backend="sqlite",
    embedding_provider="gemini",
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    gemini_embedding_model="gemini-embedding-2-preview",
)
nsql.build_vector_db(storage_path="./metadata_vdb_sqlite", forced_reset=False)
tables = nsql.search("users with recent purchases", limit=3)
print(tables)
```

`NaturalSQL(...)` creates an instance with DB and embedding configuration.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `db_url` | `str \| None` | `None` | Database connection URL |
| `db_type` | `str` | `""` | Engine: `postgresql`, `mysql`, `sqlite`, `sqlserver` |
| `db_normalize_embeddings` | `bool` | `True` | Normalize embedding vectors |
| `device` | `str` | `"cpu"` | Embedding device: `cpu` or `cuda` |
| `vector_backend` | `Literal["chroma", "sqlite"]` | `"chroma"` | Vector backend |
| `embedding_provider` | `Literal["local", "gemini"]` | `"local"` | Embedding provider |
| `gemini_api_key` | `str \| None` | `None` | Required when `embedding_provider="gemini"` |
| `gemini_embedding_model` | `str` | `"gemini-embedding-2-preview"` | Gemini embedding model |
| `vector_distance_threshold` | `float` | `1.0` | Max distance threshold used by `search()` filtering. For Chroma 1.5.x, if you get `[]`, try 1.4-1.6 (validated with 1.6). |
`build_vector_db()` connects to the DB, extracts the schema, and indexes it in the configured vector backend (`chroma` or `sqlite`).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `storage_path` | `str` | `"./metadata_vdb"` | Vector storage path |
| `forced_reset` | `bool` | `False` | Rebuild vector collection from scratch |
Returns:
| Key | Type | Description |
|---|---|---|
| `storage_path` | `str` | Storage path used |
| `indexed_documents` | `int` | Number of indexed tables |
| `from_cache` | `bool` | `True` if existing vector store was reused |
`search()` retrieves semantically relevant tables for a natural-language question.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `request` | `str` | required | Natural-language question |
| `storage_path` | `str` | `"./metadata_vdb"` | Vector storage path |
| `limit` | `int` | `3` | Maximum number of tables to return |
`build_prompt` is a helper function in `naturalsql.utils.prompt` that creates an LLM prompt from the relevant tables and the user question.

```python
from naturalsql.utils.prompt import build_prompt

prompt = build_prompt(tables, "Show me sales from last month")
```

Think of NaturalSQL as a 5-step pipeline:
- Read DB structure with simple SQL/system queries.
- Normalize that structure into one common Python dictionary.
- Convert dictionary entries into semantic text documents.
- Turn those texts into embedding vectors.
- Store vectors in Chroma or SQLite and retrieve best matches for RAG.
NaturalSQL does not parse SQL files. It asks the live database for metadata.
- PostgreSQL: `information_schema.columns` for columns + joins over `information_schema.table_constraints`, `key_column_usage`, `constraint_column_usage` for foreign keys.
- MySQL: `information_schema.columns` and `information_schema.key_column_usage` (scoped by `DATABASE()`).
- SQL Server: `INFORMATION_SCHEMA.COLUMNS` for columns + `sys.foreign_key_columns` with `sys.tables`, `sys.schemas`, `sys.columns` for FK relationships.
- SQLite: `sqlite_master` (table list), `PRAGMA table_info('<table>')` for columns, `PRAGMA foreign_key_list('<table>')` for relationships.
Essential idea: most engines use information_schema; SQLite uses PRAGMA.
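To make the SQLite path concrete, here is a minimal, self-contained sketch that reads the same metadata sources with the standard-library `sqlite3` module. The in-memory tables and the variable names are illustrative assumptions, not NaturalSQL's extractor code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id)
    );
    """
)

# The table list comes from sqlite_master.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

columns = {}
foreign_keys = []
for table in tables:
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    columns[table] = [(row[1], row[2]) for row in info]
    # PRAGMA foreign_key_list rows:
    # (id, seq, table, from, to, on_update, on_delete, match)
    for fk in conn.execute(f"PRAGMA foreign_key_list('{table}')"):
        foreign_keys.append((table, fk[3], fk[2], fk[4]))

print(columns)
print(foreign_keys)
```

Table names are interpolated into the `PRAGMA` statements here because they come straight from `sqlite_master`; names from untrusted input would need validation first.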
After extraction, data is normalized to one shape:
```python
{
    "tables": {
        "public.users": {
            "schema": "public",
            "table": "users",
            "columns": [("id", "integer"), ("email", "text")]
        }
    },
    "relationships": [
        {
            "from_schema": "public",
            "from_table": "orders",
            "from_column": "user_id",
            "to_schema": "public",
            "to_table": "users",
            "to_column": "id"
        }
    ]
}
```

This is what makes the next steps backend-agnostic.
The extractor then generates documents for two kinds of knowledge:
- `kind=table`: table + columns + data types
- `kind=relationship`: FK direction between tables
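A rough sketch of how a normalized bundle could be turned into these two kinds of documents is shown here. The function name `schema_to_documents` is hypothetical; the text templates copy the payload examples in this section, but the library's real formatter may differ in detail.

```python
from typing import Any, Dict, List


def schema_to_documents(bundle: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one document per table and one per foreign-key relationship."""
    docs: List[Dict[str, Any]] = []
    for full_name, info in bundle["tables"].items():
        cols = ", ".join(f"{name} ({dtype})" for name, dtype in info["columns"])
        docs.append({
            "id": f"table::{full_name}",
            "content": (
                f"Schema: {info['schema']}. Table name: {info['table']}. "
                f"It has the following columns: {cols}"
            ),
            "metadata": {
                "kind": "table",
                "schema": info["schema"],
                "table": info["table"],
            },
        })
    for rel in bundle["relationships"]:
        src = f"{rel['from_schema']}.{rel['from_table']}.{rel['from_column']}"
        dst = f"{rel['to_schema']}.{rel['to_table']}.{rel['to_column']}"
        docs.append({
            "id": f"rel::{src}->{dst}",
            "content": f"Relationship: {src} -> {dst}",
            "metadata": {"kind": "relationship"},
        })
    return docs


bundle = {
    "tables": {
        "public.users": {
            "schema": "public",
            "table": "users",
            "columns": [("id", "integer"), ("email", "text")],
        }
    },
    "relationships": [
        {
            "from_schema": "public", "from_table": "orders", "from_column": "user_id",
            "to_schema": "public", "to_table": "users", "to_column": "id",
        }
    ],
}
documents = schema_to_documents(bundle)
for doc in documents:
    print(doc["id"])
```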
Example payload item:
```json
{
  "id": "table::public.users",
  "content": "Schema: public. Table name: users. It has the following columns: id (integer), email (text)",
  "metadata": {
    "kind": "table",
    "schema": "public",
    "table": "users"
  }
}
```

Relationship example:

```json
{
  "id": "rel::public.orders.user_id->public.users.id",
  "content": "Relationship: public.orders.user_id -> public.users.id",
  "metadata": {
    "kind": "relationship"
  }
}
```

NaturalSQL supports two embedding providers:

- `local`: `sentence-transformers` (`all-MiniLM-L6-v2`)
- `gemini`: `google-genai` (`gemini-embedding-2-preview` by default)
The system embeds:
- documents with retrieval-document mode
- user query with retrieval-query mode
This gives a vector for each schema document and one vector for each user question.
Chroma backend:

- Uses `chromadb.PersistentClient(path=storage_path)`.
- Stores `ids`, `documents`, `embeddings`, `metadatas` in collection `db_schema`.
- Retrieval uses native vector query + metadata filter, for example `where={"kind": "table"}`.

SQLite backend:

- Creates `vectors.db` under `storage_path`.
- Table schema: `id`, `content`, `embedding` (JSON text), `metadata_json` (JSON text).
- Retrieval loads rows and computes cosine distance in Python (NumPy if available, pure Python fallback if not).
- Also filters by `kind` from `metadata_json`.
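The SQLite retrieval path can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the library's code: the table name `vectors`, the function `search_by_kind`, and the two-dimensional toy embeddings are all made up for the example, and only the pure-Python cosine distance is shown.

```python
import json
import math
import sqlite3
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero vectors get distance 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vectors ("
    "id TEXT PRIMARY KEY, content TEXT, embedding TEXT, metadata_json TEXT)"
)
rows = [
    ("table::public.users", "users table",
     json.dumps([1.0, 0.0]), json.dumps({"kind": "table"})),
    ("table::public.orders", "orders table",
     json.dumps([0.0, 1.0]), json.dumps({"kind": "table"})),
    ("rel::orders->users", "orders to users",
     json.dumps([1.0, 1.0]), json.dumps({"kind": "relationship"})),
]
conn.executemany("INSERT INTO vectors VALUES (?, ?, ?, ?)", rows)


def search_by_kind(
    conn: sqlite3.Connection, query_vec: Sequence[float], kind: str, limit: int
) -> List[Tuple[float, str]]:
    """Load rows, keep those whose metadata kind matches, rank by distance."""
    scored: List[Tuple[float, str]] = []
    cursor = conn.execute("SELECT id, embedding, metadata_json FROM vectors")
    for doc_id, embedding, metadata_json in cursor:
        if json.loads(metadata_json).get("kind") != kind:
            continue
        scored.append((cosine_distance(query_vec, json.loads(embedding)), doc_id))
    scored.sort()  # smaller distance first
    return scored[:limit]


table_hits = search_by_kind(conn, [1.0, 0.0], "table", limit=2)
rel_hits = search_by_kind(conn, [1.0, 0.0], "relationship", limit=2)
print(table_hits)
print(rel_hits)
```

Because every row is loaded and scored in Python, the cost grows linearly with the number of stored documents.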
When you call search("...", limit=N), the flow is:
- Embed query text.
- Query `kind=table` and `kind=relationship` separately.
- Merge both result sets.
- Sort by distance (smaller is better).
- Apply `vector_distance_threshold`.
- Return top `N` schema contexts for your prompt.
So the LLM receives real schema context instead of guessing table/column names.
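The merge, sort, threshold, and limit steps of that flow can be sketched as a small pure function. The name `merge_and_filter`, the `(distance, id)` tuple shape, and the inclusive `<=` comparison are assumptions made for illustration; the library's internal representation may differ.

```python
from typing import List, Tuple

Hit = Tuple[float, str]  # (distance, document id)


def merge_and_filter(
    table_hits: List[Hit],
    relationship_hits: List[Hit],
    threshold: float,
    limit: int,
) -> List[Hit]:
    """Merge both result sets, sort by distance, drop far hits, keep top N."""
    merged = sorted(table_hits + relationship_hits, key=lambda hit: hit[0])
    kept = [hit for hit in merged if hit[0] <= threshold]
    return kept[:limit]


table_hits = [(0.42, "table::public.sales"), (1.10, "table::public.users")]
relationship_hits = [(0.95, "rel::public.sales.user_id->public.users.id")]

strict = merge_and_filter(table_hits, relationship_hits, threshold=1.0, limit=3)
relaxed = merge_and_filter(table_hits, relationship_hits, threshold=1.6, limit=3)
print(strict)
print(relaxed)
```

With these made-up distances, the stricter threshold drops the `users` table while the relaxed one keeps all three documents. An empty `search()` result is the extreme case of the same effect, where every hit falls above the threshold.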
```python
from naturalsql.sql.sqlschema import SQLSchemaExtractor
from naturalsql.controller.controllervector import VectorManager

# `connection` (an open DB connection) and `config` (vector/embedding settings)
# are assumed to be created elsewhere; they are not defined in this snippet.

# 1) extract
extractor = SQLSchemaExtractor(connection, db_type="postgresql")
schema_bundle = extractor.extract_schema()

# 2) normalize -> semantic docs
documents_payload = extractor.formated_for_ia(schema_bundle)

# 3) embed + persist (chroma or sqlite depending on config)
vm = VectorManager(storage_path="./metadata_vdb", force_reset=False, config=config)
vm.index_documents(documents_payload)

# 4) retrieve for RAG
context_docs = vm.search_relevant_tables("sales from last month", limit=3)
```

| Aspect | Chroma | SQLite |
|---|---|---|
| Storage files | Chroma persistent directory | `vectors.db` file |
| Vector query | Native Chroma ANN query | Python cosine distance over stored vectors |
| Metadata filtering | Native `where` filter | JSON metadata filter in Python |
| Dependency profile | Requires `chromadb` | Uses stdlib `sqlite3` + optional NumPy |
| Operational style | Dedicated vector DB behavior | Lightweight embedded local store |
These points are based on current implementation behavior:
- No hybrid ranking strategy yet: retrieval is distance-based over embeddings only (no lexical/BM25 rerank).
- Relationship text is concise (`A.col -> B.col`), so very complex semantics are not explicitly encoded.
- SQLite backend computes similarity in Python, so very large corpora may be slower than native vector engines.
- `build_vector_db()` returns `indexed_documents` as total indexed documents (tables + relationships), while the wording may be interpreted as only tables.
- Schema extraction is structural (tables/columns/FKs). It does not ingest business definitions, comments, or curated ontology unless you add them as extra documents.
In practice, this still provides strong RAG context for SQL generation because table fields and foreign-key topology are the highest-value grounding signals.
See test/README.md for testing instructions.
