
rewrite application for RAG using Langchain#1

Open
demonkillerr wants to merge 15 commits into main from langchain

Conversation

demonkillerr (Owner) commented Jan 27, 2026

Overview of this pull request

Implemented sentiment analysis for academic documents with a three-stage Retrieval-Augmented Generation (RAG) pipeline built on Langchain.

  1. First stage: PDF text is extracted and split into semantic chunks with metadata using Langchain's text splitter (RecursiveCharacterTextSplitter).
  2. Second stage: Chunks are embedded with Google's embedding model (GoogleGenerativeAIEmbeddings) and stored in a ChromaDB vector database via Langchain.
  3. Third stage: When the user queries a keyword, a semantic search retrieves the top 5 most relevant chunks, which are passed with a structured prompt to Gemini (analyze_sentiment()).

The model used is Gemini 3 Flash (model identifier: gemini-3-flash-preview).
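The first stage can be approximated in miniature. Below is a dependency-free sketch of the recursive splitting idea behind RecursiveCharacterTextSplitter; it illustrates the strategy (coarse separators first, finer ones as fallback, then greedy re-merging), not Langchain's actual implementation:

```python
# Sketch of recursive character splitting: try the coarsest separator first,
# recurse into oversized pieces with finer separators, then greedily merge
# adjacent pieces back up toward chunk_size.

def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part.strip():
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge adjacent pieces so chunks approach chunk_size
    chunks, buf = [], ""
    for piece in pieces:
        candidate = f"{buf}{sep}{piece}" if buf else piece
        if len(candidate) <= chunk_size:
            buf = candidate
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks

text = "Short intro.\n\nSecond paragraph, considerably longer than the limit."
for chunk in recursive_split(text, 40):
    print(repr(chunk))
```

The real splitter additionally tracks chunk overlap and metadata; this sketch only shows why a separator hierarchy keeps paragraph and sentence boundaries intact where possible.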

Key changes from the older NLP implementation

  • Backend: rewritten from Flask to FastAPI
  • PDF parsing: replaced GROBID with Langchain text splitting and processing

Why Langchain?

Langchain abstracts away the complexity of the RAG pipeline: it handles chunking with smart separators, manages embedding API calls, integrates with ChromaDB for vector operations, provides similarity search with filtering, and formats prompts with templates. Without it, all of this would have to be implemented manually.
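What the similarity-search step amounts to can be shown without any of these libraries. The toy top-k retrieval below illustrates the idea behind the vector search that Langchain delegates to ChromaDB; the vectors and function names are invented for the example:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d "embeddings"; real ones have hundreds of dimensions
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], chunks, k=2))  # [0, 1]
```

A production vector store replaces this linear scan with an approximate nearest-neighbor index, but the retrieval contract (query embedding in, top-k chunk ids out) is the same.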


@demonkillerr demonkillerr requested a review from Copilot January 27, 2026 19:12
@demonkillerr demonkillerr added the enhancement New feature or request label Jan 27, 2026
@demonkillerr demonkillerr self-assigned this Jan 27, 2026

Copilot AI left a comment


Pull request overview

This pull request represents a complete rewrite of the sentiment analysis application, transitioning from a Flask + custom RAG implementation to a FastAPI + Langchain-based architecture. The changes include a 3-stage RAG pipeline for processing academic PDFs, embedding them using Google's embedding model, and performing sentiment analysis via Gemini.

Changes:

  • Complete backend rewrite: Flask → FastAPI with Langchain integration for RAG pipeline
  • Frontend simplification: Replaced the complex multi-page Next.js app with a simple two-tab interface
  • Infrastructure removal: Deleted all Kubernetes deployment configurations and GROBID integration

Reviewed changes

Copilot reviewed 100 out of 111 changed files in this pull request and generated 9 comments.

Summary per file:

  • apps/backend/main.py: New FastAPI entry point replacing Flask app
  • apps/backend/services/rag_service.py: Langchain-based RAG implementation with ChromaDB and Gemini
  • apps/backend/services/pdf_processor.py: PDF text extraction using PyMuPDF instead of GROBID
  • apps/backend/routers/*.py: New API route handlers for upload and chat
  • apps/backend/config.py: New centralized configuration module
  • apps/backend/requirements.txt: Updated dependencies (FastAPI, Langchain, etc.)
  • apps/frontend/components/*.tsx: New simplified upload and chat components
  • apps/frontend/tsconfig.json: Modified TypeScript configuration
  • apps/frontend/package.json: Updated Next.js and dependencies
  • k8s/*: All Kubernetes files deleted


Comment thread apps/backend/config.py
load_dotenv()

class Settings:
    GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")

Copilot AI Jan 27, 2026


The environment variable name "GOOGLE_API_KEY" is used throughout the codebase, but Google's official SDK expects "GEMINI_API_KEY" or uses the google.generativeai.configure() method. This inconsistency could lead to authentication failures. Ensure the environment variable name matches what the LangChain Google integration expects.

Suggested change
GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
GOOGLE_API_KEY: str = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY", "")
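The fallback in the suggested change can be factored into a small helper that is easy to test in isolation; the helper name resolve_api_key is illustrative:

```python
import os

def resolve_api_key(env=os.environ):
    """Prefer GEMINI_API_KEY, fall back to GOOGLE_API_KEY, else empty string."""
    return env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY", "")

# A plain dict stands in for the process environment in tests
print(resolve_api_key({"GOOGLE_API_KEY": "abc"}))                            # abc
print(resolve_api_key({"GEMINI_API_KEY": "xyz", "GOOGLE_API_KEY": "abc"}))   # xyz
```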

Copilot uses AI. Check for mistakes.
Comment on lines +139 to +157
sentiment = "Neutral"
analysis_lower = analysis_text.lower() if isinstance(analysis_text, str) else ''

# Look for sentiment indicators throughout the text, with priority to earlier mentions
if "sentiment: positive" in analysis_lower or "sentiment:** positive" in analysis_lower:
    sentiment = "Positive"
elif "sentiment: negative" in analysis_lower or "sentiment:** negative" in analysis_lower:
    sentiment = "Negative"
elif "sentiment: mixed" in analysis_lower or "sentiment:** mixed" in analysis_lower:
    sentiment = "Mixed"
elif "sentiment: neutral" in analysis_lower or "sentiment:** neutral" in analysis_lower:
    sentiment = "Neutral"
# Fallback: check first 500 chars for keywords
elif "positive" in analysis_lower[:500] and "negative" not in analysis_lower[:500]:
    sentiment = "Positive"
elif "negative" in analysis_lower[:500] and "positive" not in analysis_lower[:500]:
    sentiment = "Negative"
elif "positive" in analysis_lower[:500] and "negative" in analysis_lower[:500]:
    sentiment = "Mixed"

Copilot AI Jan 27, 2026


The sentiment extraction logic in lines 139-157 uses fragile string parsing to extract sentiment from LLM output. This approach is unreliable as the LLM response format may vary. Consider using structured output from the LLM (JSON mode or function calling) or more robust parsing with regex patterns that match the actual prompt structure defined in line 116.
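As a sketch of the regex alternative, a single case-insensitive pattern that tolerates markdown emphasis around the label would cover the literal string checks above; the exact label format still depends on the prompt template:

```python
import re

def extract_sentiment(analysis_text: str) -> str:
    """Pull the sentiment label out of free-form LLM output.

    Tolerates optional markdown emphasis and varying case, e.g.
    "Sentiment: Positive", "**Sentiment:** negative", "sentiment: MIXED".
    Falls back to "Neutral" when no label is found.
    """
    match = re.search(
        r"sentiment[\s:*]*(positive|negative|mixed|neutral)",
        analysis_text,
        re.IGNORECASE,
    )
    return match.group(1).capitalize() if match else "Neutral"

print(extract_sentiment("**Sentiment:** negative, due to weak results"))  # Negative
```

Structured output (JSON mode or function calling) remains the more robust option, since it removes the parsing step entirely.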

Comment on lines +20 to +24

# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
with open(file_path, "wb") as buffer:
    shutil.copyfileobj(file.file, buffer)

Copilot AI Jan 27, 2026


Missing error handling for file operations. If the file save fails due to disk space, permissions, or other I/O errors, the exception will propagate but the uploaded file data may be lost. Consider adding specific error handling with appropriate cleanup and user-friendly error messages.

Suggested change
# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
with open(file_path, "wb") as buffer:
    shutil.copyfileobj(file.file, buffer)
# Ensure upload directory exists
try:
    os.makedirs(settings.UPLOAD_DIR, exist_ok=True)
except OSError as e:
    raise HTTPException(
        status_code=500,
        detail=f"Failed to prepare upload directory: {str(e)}",
    )
# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
try:
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
except (OSError, shutil.Error) as e:
    # Best-effort cleanup of any partially written file
    if os.path.exists(file_path):
        try:
            os.remove(file_path)
        except OSError:
            # Ignore cleanup errors to avoid masking the original issue
            pass
    raise HTTPException(
        status_code=500,
        detail=f"Failed to save uploaded file: {str(e)}",
    )

Comment on lines +13 to +41
@router.post("/upload")
async def upload_pdf(file: UploadFile = File(...)):
    """Upload and process a PDF document"""
    try:
        # Validate file type
        if not file.filename.endswith('.pdf'):
            raise HTTPException(status_code=400, detail="Only PDF files are allowed")

        # Save uploaded file
        file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
        with open(file_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)

        # Process PDF
        documents = pdf_processor.extract_text_with_metadata(file_path)
        chunks = pdf_processor.chunk_documents(documents)

        # Create/update vector store
        rag_service.create_vectorstore(chunks, file.filename)

        return {
            "message": "PDF uploaded and processed successfully",
            "filename": file.filename,
            "pages": len(documents),
            "chunks": len(chunks)
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")

Copilot AI Jan 27, 2026


Security vulnerability: No file size validation before processing. An attacker could upload extremely large PDF files that consume excessive memory and CPU resources during processing, potentially causing denial of service. Add MAX_FILE_SIZE validation before processing.
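One way to add such a limit is to enforce it while streaming the upload to disk rather than trusting a Content-Length header. In the sketch below, MAX_FILE_SIZE, the helper name, and the 20 MiB value are illustrative, not part of the PR:

```python
import io

MAX_FILE_SIZE = 20 * 1024 * 1024  # 20 MiB; illustrative limit

def copy_with_limit(src, dst, max_bytes=MAX_FILE_SIZE, chunk_size=64 * 1024):
    """Stream src into dst, raising ValueError once max_bytes is exceeded."""
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return written
        written += len(chunk)
        if written > max_bytes:
            raise ValueError("Uploaded file exceeds the size limit")
        dst.write(chunk)

src, dst = io.BytesIO(b"x" * 100), io.BytesIO()
print(copy_with_limit(src, dst, max_bytes=1024))  # 100
```

In the route this would replace shutil.copyfileobj, with the ValueError mapped to an HTTPException (status 413) and the partial file removed.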

}
{
  "compilerOptions": {
    "target": "es5",

Copilot AI Jan 27, 2026


The tsconfig.json target has been downgraded from "ES2017" to "es5", which may cause compatibility issues with modern JavaScript features used by Next.js 14 and its dependencies. ES5 is quite old (2009) and Next.js 14 requires at least ES2017. This configuration is likely incorrect and could cause build failures or runtime issues.

Comment on lines +21 to +22
"paths": {
  "@/*": ["./*"]

Copilot AI Jan 27, 2026


Path mapping has changed from "@/*": ["./src/*"] to "@/*": ["./*"], which means imports like "@/components/..." now resolve relative to the project root instead of the src directory. The new components (UploadTab.tsx, ChatTab.tsx) import via "@/components/...", so this mapping only works if the components/ directory sits at the project root; if the mapping does not match the actual file layout, module resolution will fail.

Comment thread apps/backend/main.py
Comment on lines +8 to +14
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Copilot AI Jan 27, 2026


The CORS configuration only allows requests from "http://localhost:3000", which will break in production or when the frontend is deployed elsewhere. Consider using an environment variable to configure allowed origins for different deployment environments.
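A sketch of the env-driven alternative: parse a comma-separated variable into the list that allow_origins expects. The CORS_ORIGINS variable name is an assumption:

```python
import os

def allowed_origins(env=os.environ, default="http://localhost:3000"):
    """Parse a comma-separated CORS_ORIGINS variable into a clean origin list."""
    raw = env.get("CORS_ORIGINS", default)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]

# The result feeds CORSMiddleware's allow_origins parameter
print(allowed_origins({"CORS_ORIGINS": "http://localhost:3000, https://app.example.com"}))
```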

Comment on lines +20 to +38

# Use PyMuPDF for text extraction
doc = fitz.open(pdf_path)

for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text()

    if text.strip():
        # Try to identify section headers
        section = self._identify_section(text)

        documents.append({
            "text": text,
            "page_number": page_num + 1,
            "section": section
        })

doc.close()

Copilot AI Jan 27, 2026


The PDF processor lacks validation for malformed or encrypted PDFs. If fitz.open() encounters an encrypted or corrupted PDF, it will raise an exception that should be handled gracefully with informative error messages to the user.

Suggested change
# Use PyMuPDF for text extraction
doc = fitz.open(pdf_path)
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text()
    if text.strip():
        # Try to identify section headers
        section = self._identify_section(text)
        documents.append({
            "text": text,
            "page_number": page_num + 1,
            "section": section
        })
doc.close()
# Use PyMuPDF for text extraction with basic validation and error handling
try:
    doc = fitz.open(pdf_path)
except fitz.FileDataError as e:
    # Raised for corrupted or invalid PDF data
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': file is corrupted or not a valid PDF."
    ) from e
except RuntimeError as e:
    # PyMuPDF may raise RuntimeError for encrypted or otherwise problematic PDFs
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': the file may be encrypted or corrupted."
    ) from e
# Check if the PDF is encrypted and requires a password
if getattr(doc, "needs_pass", False):
    doc.close()
    raise ValueError(
        f"Failed to process PDF file '{pdf_path}': the file is encrypted and a password was not provided."
    )
try:
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()
        if text.strip():
            # Try to identify section headers
            section = self._identify_section(text)
            documents.append({
                "text": text,
                "page_number": page_num + 1,
                "section": section
            })
finally:
    doc.close()

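For context, the extraction loop calls self._identify_section, which is not shown in this excerpt. A hypothetical keyword-based sketch of such a section classifier might look like:

```python
def identify_section(text: str) -> str:
    """Guess a coarse academic-paper section from page text (hypothetical helper).

    Checks the first few lines of the page against common section headers.
    """
    headers = ["abstract", "introduction", "methods", "results",
               "discussion", "conclusion", "references"]
    first_lines = text.lower().splitlines()[:5]
    for line in first_lines:
        for header in headers:
            if line.strip().startswith(header):
                return header.capitalize()
    return "Body"

print(identify_section("Introduction\nThis paper studies..."))  # Introduction
```

The actual helper in the PR may use a different heuristic; this sketch only illustrates why the chunks carry a "section" metadata field.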
@@ -0,0 +1,85 @@
import fitz # PyMuPDF
import pdfplumber

Copilot AI Jan 27, 2026


Import of 'pdfplumber' is not used.

Suggested change
import pdfplumber

