rewrite application for RAG using Langchain #1
Pull request overview
This pull request represents a complete rewrite of the sentiment analysis application, transitioning from a Flask + custom RAG implementation to a FastAPI + Langchain-based architecture. The changes include a 3-stage RAG pipeline for processing academic PDFs, embedding them using Google's embedding model, and performing sentiment analysis via Gemini.
Changes:
- Complete backend rewrite: Flask → FastAPI with Langchain integration for RAG pipeline
- Frontend simplification: Removed complex multi-page Next.js app, replaced with simple 2-tab interface
- Infrastructure removal: Deleted all Kubernetes deployment configurations and GROBID integration
Reviewed changes
Copilot reviewed 100 out of 111 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/backend/main.py | New FastAPI entry point replacing Flask app |
| apps/backend/services/rag_service.py | Langchain-based RAG implementation with ChromaDB and Gemini |
| apps/backend/services/pdf_processor.py | PDF text extraction using PyMuPDF instead of GROBID |
| apps/backend/routers/*.py | New API route handlers for upload and chat |
| apps/backend/config.py | New centralized configuration module |
| apps/backend/requirements.txt | Updated dependencies (FastAPI, Langchain, etc.) |
| apps/frontend/components/*.tsx | New simplified upload and chat components |
| apps/frontend/tsconfig.json | Modified TypeScript configuration |
| apps/frontend/package.json | Updated Next.js and dependencies |
| k8s/* | All Kubernetes files deleted |
```python
load_dotenv()


class Settings:
    GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
```
The environment variable name "GOOGLE_API_KEY" is used throughout the codebase, but Google's official SDK expects "GEMINI_API_KEY" or uses the google.generativeai.configure() method. This inconsistency could lead to authentication failures. Ensure the environment variable name matches what the LangChain Google integration expects.
Suggested change:

```diff
-GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
+GOOGLE_API_KEY: str = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY", "")
```
```python
sentiment = "Neutral"
analysis_lower = analysis_text.lower() if isinstance(analysis_text, str) else ''

# Look for sentiment indicators throughout the text, with priority to earlier mentions
if "sentiment: positive" in analysis_lower or "sentiment:** positive" in analysis_lower:
    sentiment = "Positive"
elif "sentiment: negative" in analysis_lower or "sentiment:** negative" in analysis_lower:
    sentiment = "Negative"
elif "sentiment: mixed" in analysis_lower or "sentiment:** mixed" in analysis_lower:
    sentiment = "Mixed"
elif "sentiment: neutral" in analysis_lower or "sentiment:** neutral" in analysis_lower:
    sentiment = "Neutral"
# Fallback: check first 500 chars for keywords
elif "positive" in analysis_lower[:500] and "negative" not in analysis_lower[:500]:
    sentiment = "Positive"
elif "negative" in analysis_lower[:500] and "positive" not in analysis_lower[:500]:
    sentiment = "Negative"
elif "positive" in analysis_lower[:500] and "negative" in analysis_lower[:500]:
    sentiment = "Mixed"
```
The sentiment extraction logic in lines 139-157 uses fragile string parsing to extract sentiment from LLM output. This approach is unreliable as the LLM response format may vary. Consider using structured output from the LLM (JSON mode or function calling) or more robust parsing with regex patterns that match the actual prompt structure defined in line 116.
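A more robust fallback, if full structured output is not adopted, could combine the two: try to parse a JSON payload first (assuming the prompt is updated to request one), then fall back to a single regex that covers all the markdown-decorated "Sentiment: &lt;label&gt;" variants the string checks enumerate by hand. The function name and JSON schema below are illustrative, not from the PR:

```python
import json
import re


def extract_sentiment(analysis_text: str) -> str:
    """Extract a sentiment label from LLM output.

    Prefers a JSON payload (if the model was asked for structured output),
    then falls back to a regex over "Sentiment: <label>" patterns.
    """
    # Structured path: the prompt could ask Gemini to return
    # {"sentiment": "...", "reasoning": "..."} and we parse it here.
    try:
        payload = json.loads(analysis_text)
        label = str(payload.get("sentiment", "")).capitalize()
        if label in {"Positive", "Negative", "Mixed", "Neutral"}:
            return label
    except (json.JSONDecodeError, AttributeError):
        pass

    # Fallback: one case-insensitive regex replaces the hand-written checks
    # for "sentiment: positive", "sentiment:** positive", etc.
    match = re.search(
        r"sentiment:?\s*\**\s*(positive|negative|mixed|neutral)",
        analysis_text,
        re.IGNORECASE,
    )
    if match:
        return match.group(1).capitalize()
    return "Neutral"
```

This keeps a single source of truth for the label set and tolerates formatting drift in the model's answer.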
```python
# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
with open(file_path, "wb") as buffer:
    shutil.copyfileobj(file.file, buffer)
```
Missing error handling for file operations. If the file save fails due to disk space, permissions, or other I/O errors, the exception will propagate but the uploaded file data may be lost. Consider adding specific error handling with appropriate cleanup and user-friendly error messages.
Suggested change:

```python
# Ensure upload directory exists
try:
    os.makedirs(settings.UPLOAD_DIR, exist_ok=True)
except OSError as e:
    raise HTTPException(
        status_code=500,
        detail=f"Failed to prepare upload directory: {str(e)}",
    )

# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
try:
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
except (OSError, shutil.Error) as e:
    # Best-effort cleanup of any partially written file
    if os.path.exists(file_path):
        try:
            os.remove(file_path)
        except OSError:
            # Ignore cleanup errors to avoid masking the original issue
            pass
    raise HTTPException(
        status_code=500,
        detail=f"Failed to save uploaded file: {str(e)}",
    )
```
```python
@router.post("/upload")
async def upload_pdf(file: UploadFile = File(...)):
    """Upload and process a PDF document"""
    try:
        # Validate file type
        if not file.filename.endswith('.pdf'):
            raise HTTPException(status_code=400, detail="Only PDF files are allowed")

        # Save uploaded file
        file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
        with open(file_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)

        # Process PDF
        documents = pdf_processor.extract_text_with_metadata(file_path)
        chunks = pdf_processor.chunk_documents(documents)

        # Create/update vector store
        rag_service.create_vectorstore(chunks, file.filename)

        return {
            "message": "PDF uploaded and processed successfully",
            "filename": file.filename,
            "pages": len(documents),
            "chunks": len(chunks)
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")
```
Security vulnerability: No file size validation before processing. An attacker could upload extremely large PDF files that consume excessive memory and CPU resources during processing, potentially causing denial of service. Add MAX_FILE_SIZE validation before processing.
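One way to enforce such a limit without buffering the whole upload in memory is to stream the file to disk in chunks and abort once a byte budget is exceeded. The helper below is a sketch; the `MAX_FILE_SIZE` name and 20 MB value are assumptions, not settings from the PR:

```python
import os

# Hypothetical limit; the name MAX_FILE_SIZE and the 20 MB value are
# assumptions for illustration, not taken from the PR.
MAX_FILE_SIZE = 20 * 1024 * 1024


def save_with_size_limit(src, dest_path: str, max_bytes: int = MAX_FILE_SIZE) -> int:
    """Stream a binary file-like object to disk, rejecting oversized uploads.

    Returns the number of bytes written; raises ValueError (and removes the
    partial file) when the stream exceeds max_bytes.
    """
    written = 0
    try:
        with open(dest_path, "wb") as buffer:
            while True:
                chunk = src.read(1024 * 1024)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError("Uploaded file exceeds the size limit")
                buffer.write(chunk)
    except ValueError:
        # Clean up the partially written file before propagating the error.
        if os.path.exists(dest_path):
            os.remove(dest_path)
        raise
    return written
```

In the route handler this could replace the `shutil.copyfileobj(...)` call, e.g. `save_with_size_limit(file.file, file_path)`, with the ValueError mapped to an HTTP 413 response.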
```json
{
  "compilerOptions": {
    "target": "es5",
```
The tsconfig.json target has been downgraded from "ES2017" to "es5", which may cause compatibility issues with modern JavaScript features used by Next.js 14 and its dependencies. ES5 is quite old (2009) and Next.js 14 requires at least ES2017. This configuration is likely incorrect and could cause build failures or runtime issues.
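If the downgrade was unintentional, restoring the previous value is a one-line change (fragment shown for illustration; the rest of the file is unchanged):

```json
{
  "compilerOptions": {
    "target": "ES2017"
  }
}
```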
```json
    "paths": {
      "@/*": ["./*"]
```
Path mapping has changed from `"@/*": ["./src/*"]` to `"@/*": ["./*"]`, so imports like `@/components/...` now resolve relative to the project root instead of `src/`. The new components (UploadTab.tsx, ChatTab.tsx) import via `@/components/...`; verify that the `components/` directory actually sits at the project root, otherwise these imports will fail module resolution.
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
The CORS configuration only allows requests from "http://localhost:3000", which will break in production or when the frontend is deployed elsewhere. Consider using an environment variable to configure allowed origins for different deployment environments.
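A minimal sketch of that approach, assuming a comma-separated `ALLOWED_ORIGINS` variable (the variable name is an assumption, not from the PR):

```python
import os


def parse_allowed_origins(default: str = "http://localhost:3000") -> list:
    """Read a comma-separated list of CORS origins from the environment.

    The variable name ALLOWED_ORIGINS is an assumption for illustration,
    not taken from the PR.
    """
    raw = os.getenv("ALLOWED_ORIGINS", default)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

The middleware call would then use `allow_origins=parse_allowed_origins()`, with e.g. `ALLOWED_ORIGINS="https://app.example.com,http://localhost:3000"` set per deployment.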
```python
# Use PyMuPDF for text extraction
doc = fitz.open(pdf_path)

for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text()

    if text.strip():
        # Try to identify section headers
        section = self._identify_section(text)

        documents.append({
            "text": text,
            "page_number": page_num + 1,
            "section": section
        })

doc.close()
```
The PDF processor lacks validation for malformed or encrypted PDFs. If fitz.open() encounters an encrypted or corrupted PDF, it will raise an exception that should be handled gracefully with informative error messages to the user.
Suggested change:

```python
# Use PyMuPDF for text extraction with basic validation and error handling
try:
    doc = fitz.open(pdf_path)
except fitz.FileDataError as e:
    # Raised for corrupted or invalid PDF data
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': file is corrupted or not a valid PDF."
    ) from e
except RuntimeError as e:
    # PyMuPDF may raise RuntimeError for encrypted or otherwise problematic PDFs
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': the file may be encrypted or corrupted."
    ) from e

# Check if the PDF is encrypted and requires a password
if getattr(doc, "needs_pass", False):
    doc.close()
    raise ValueError(
        f"Failed to process PDF file '{pdf_path}': the file is encrypted and a password was not provided."
    )

try:
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()

        if text.strip():
            # Try to identify section headers
            section = self._identify_section(text)

            documents.append({
                "text": text,
                "page_number": page_num + 1,
                "section": section
            })
finally:
    doc.close()
```
```python
import fitz  # PyMuPDF
import pdfplumber
```

Import of `pdfplumber` is not used.

Suggested change:

```diff
-import pdfplumber
```
Overview of this pull request

Implemented sentiment analysis for academic documents with a 3-stage Retrieval-Augmented Generation (RAG) pipeline using Langchain:

1. Chunking (`RecursiveCharacterTextSplitter`)
2. Embedding (`GoogleGenerativeAIEmbeddings`)
3. Sentiment analysis (`analyze_sentiment()`)

The model used is Gemini-3-flash. The model's code is: `gemini-3-flash-preview`

Key Changes from older NLP implementation
Why Langchain?
Langchain abstracts away the complexity of the RAG pipeline: it handles chunking with smart separators, manages embedding API calls, integrates with ChromaDB for vector operations, provides similarity search with metadata filtering, and formats prompts with templates. Without it, all of this would have to be implemented manually.