rewrite application for RAG using Langchain #1
Pull request overview
This pull request represents a complete rewrite of the sentiment analysis application, transitioning from a Flask + custom RAG implementation to a FastAPI + Langchain-based architecture. The changes include a 3-stage RAG pipeline for processing academic PDFs, embedding them using Google's embedding model, and performing sentiment analysis via Gemini.
Changes:
- Complete backend rewrite: Flask → FastAPI with Langchain integration for RAG pipeline
- Frontend simplification: Removed complex multi-page Next.js app, replaced with simple 2-tab interface
- Infrastructure removal: Deleted all Kubernetes deployment configurations and GROBID integration
Reviewed changes
Copilot reviewed 100 out of 111 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/backend/main.py | New FastAPI entry point replacing Flask app |
| apps/backend/services/rag_service.py | Langchain-based RAG implementation with ChromaDB and Gemini |
| apps/backend/services/pdf_processor.py | PDF text extraction using PyMuPDF instead of GROBID |
| apps/backend/routers/*.py | New API route handlers for upload and chat |
| apps/backend/config.py | New centralized configuration module |
| apps/backend/requirements.txt | Updated dependencies (FastAPI, Langchain, etc.) |
| apps/frontend/components/*.tsx | New simplified upload and chat components |
| apps/frontend/tsconfig.json | Modified TypeScript configuration |
| apps/frontend/package.json | Updated Next.js and dependencies |
| k8s/* | All Kubernetes files deleted |
```python
load_dotenv()


class Settings:
    GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
```
The environment variable name "GOOGLE_API_KEY" is used throughout the codebase, but Google's official SDK expects "GEMINI_API_KEY" or uses the google.generativeai.configure() method. This inconsistency could lead to authentication failures. Ensure the environment variable name matches what the LangChain Google integration expects.
Suggested change:

```diff
-GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
+GOOGLE_API_KEY: str = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY", "")
```
```python
sentiment = "Neutral"
analysis_lower = analysis_text.lower() if isinstance(analysis_text, str) else ''

# Look for sentiment indicators throughout the text, with priority to earlier mentions
if "sentiment: positive" in analysis_lower or "sentiment:** positive" in analysis_lower:
    sentiment = "Positive"
elif "sentiment: negative" in analysis_lower or "sentiment:** negative" in analysis_lower:
    sentiment = "Negative"
elif "sentiment: mixed" in analysis_lower or "sentiment:** mixed" in analysis_lower:
    sentiment = "Mixed"
elif "sentiment: neutral" in analysis_lower or "sentiment:** neutral" in analysis_lower:
    sentiment = "Neutral"
# Fallback: check first 500 chars for keywords
elif "positive" in analysis_lower[:500] and "negative" not in analysis_lower[:500]:
    sentiment = "Positive"
elif "negative" in analysis_lower[:500] and "positive" not in analysis_lower[:500]:
    sentiment = "Negative"
elif "positive" in analysis_lower[:500] and "negative" in analysis_lower[:500]:
    sentiment = "Mixed"
```
The sentiment extraction logic in lines 139-157 uses fragile string parsing to extract sentiment from LLM output. This approach is unreliable as the LLM response format may vary. Consider using structured output from the LLM (JSON mode or function calling) or more robust parsing with regex patterns that match the actual prompt structure defined in line 116.
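A more robust fallback, if full structured output is not adopted, could combine the two: try to parse a JSON payload first (assuming the prompt is updated to request one), then fall back to a single regex that covers all the markdown-decorated "Sentiment: &lt;label&gt;" variants the string checks enumerate by hand. The function name and JSON schema below are illustrative, not from the PR:

```python
import json
import re


def extract_sentiment(analysis_text: str) -> str:
    """Extract a sentiment label from LLM output.

    Prefers a JSON payload (if the model was asked for structured output),
    then falls back to a regex over "Sentiment: <label>" patterns.
    """
    # Structured path: the prompt could ask Gemini to return
    # {"sentiment": "...", "reasoning": "..."} and we parse it here.
    try:
        payload = json.loads(analysis_text)
        label = str(payload.get("sentiment", "")).capitalize()
        if label in {"Positive", "Negative", "Mixed", "Neutral"}:
            return label
    except (json.JSONDecodeError, AttributeError):
        pass

    # Fallback: one case-insensitive regex replaces the hand-written checks
    # for "sentiment: positive", "sentiment:** positive", etc.
    match = re.search(
        r"sentiment:?\s*\**\s*(positive|negative|mixed|neutral)",
        analysis_text,
        re.IGNORECASE,
    )
    if match:
        return match.group(1).capitalize()
    return "Neutral"
```

This keeps a single source of truth for the label set and tolerates formatting drift in the model's answer.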
```python
# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
with open(file_path, "wb") as buffer:
    shutil.copyfileobj(file.file, buffer)
```
Missing error handling for file operations. If the file save fails due to disk space, permissions, or other I/O errors, the exception will propagate but the uploaded file data may be lost. Consider adding specific error handling with appropriate cleanup and user-friendly error messages.
Suggested change:

```python
# Ensure upload directory exists
try:
    os.makedirs(settings.UPLOAD_DIR, exist_ok=True)
except OSError as e:
    raise HTTPException(
        status_code=500,
        detail=f"Failed to prepare upload directory: {str(e)}",
    )

# Save uploaded file
file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
try:
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
except (OSError, shutil.Error) as e:
    # Best-effort cleanup of any partially written file
    if os.path.exists(file_path):
        try:
            os.remove(file_path)
        except OSError:
            # Ignore cleanup errors to avoid masking the original issue
            pass
    raise HTTPException(
        status_code=500,
        detail=f"Failed to save uploaded file: {str(e)}",
    )
```
```python
@router.post("/upload")
async def upload_pdf(file: UploadFile = File(...)):
    """Upload and process a PDF document"""
    try:
        # Validate file type
        if not file.filename.endswith('.pdf'):
            raise HTTPException(status_code=400, detail="Only PDF files are allowed")

        # Save uploaded file
        file_path = os.path.join(settings.UPLOAD_DIR, file.filename)
        with open(file_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)

        # Process PDF
        documents = pdf_processor.extract_text_with_metadata(file_path)
        chunks = pdf_processor.chunk_documents(documents)

        # Create/update vector store
        rag_service.create_vectorstore(chunks, file.filename)

        return {
            "message": "PDF uploaded and processed successfully",
            "filename": file.filename,
            "pages": len(documents),
            "chunks": len(chunks)
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error processing PDF: {str(e)}")
```
Security vulnerability: No file size validation before processing. An attacker could upload extremely large PDF files that consume excessive memory and CPU resources during processing, potentially causing denial of service. Add MAX_FILE_SIZE validation before processing.
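One way to enforce such a limit without buffering the whole upload in memory is to stream the file to disk in chunks and abort once a byte budget is exceeded. The helper below is a sketch; the `MAX_FILE_SIZE` name and 20 MB value are assumptions, not settings from the PR:

```python
import os

# Hypothetical limit; the name MAX_FILE_SIZE and the 20 MB value are
# assumptions for illustration, not taken from the PR.
MAX_FILE_SIZE = 20 * 1024 * 1024


def save_with_size_limit(src, dest_path: str, max_bytes: int = MAX_FILE_SIZE) -> int:
    """Stream a binary file-like object to disk, rejecting oversized uploads.

    Returns the number of bytes written; raises ValueError (and removes the
    partial file) when the stream exceeds max_bytes.
    """
    written = 0
    try:
        with open(dest_path, "wb") as buffer:
            while True:
                chunk = src.read(1024 * 1024)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError("Uploaded file exceeds the size limit")
                buffer.write(chunk)
    except ValueError:
        # Clean up the partially written file before propagating the error.
        if os.path.exists(dest_path):
            os.remove(dest_path)
        raise
    return written
```

In the route handler this could replace the `shutil.copyfileobj(...)` call, e.g. `save_with_size_limit(file.file, file_path)`, with the ValueError mapped to an HTTP 413 response.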
```json
{
  "compilerOptions": {
    "target": "es5",
```
The tsconfig.json target has been downgraded from "ES2017" to "es5", which may cause compatibility issues with modern JavaScript features used by Next.js 14 and its dependencies. ES5 is quite old (2009) and Next.js 14 requires at least ES2017. This configuration is likely incorrect and could cause build failures or runtime issues.
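If the downgrade was unintentional, restoring the previous value is a one-line change (fragment shown for illustration; the rest of the file is unchanged):

```json
{
  "compilerOptions": {
    "target": "ES2017"
  }
}
```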
```json
    "paths": {
      "@/*": ["./*"]
```
Path mapping has changed from `"@/*": ["./src/*"]` to `"@/*": ["./*"]`, so imports like `@/components/...` now resolve relative to the project root instead of `src/`. The new components (UploadTab.tsx, ChatTab.tsx) import via `@/components/...`; verify that the `components/` directory actually sits at the project root, otherwise these imports will fail module resolution.
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
The CORS configuration only allows requests from "http://localhost:3000", which will break in production or when the frontend is deployed elsewhere. Consider using an environment variable to configure allowed origins for different deployment environments.
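A minimal sketch of that approach, assuming a comma-separated `ALLOWED_ORIGINS` variable (the variable name is an assumption, not from the PR):

```python
import os


def parse_allowed_origins(default: str = "http://localhost:3000") -> list:
    """Read a comma-separated list of CORS origins from the environment.

    The variable name ALLOWED_ORIGINS is an assumption for illustration,
    not taken from the PR.
    """
    raw = os.getenv("ALLOWED_ORIGINS", default)
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

The middleware call would then use `allow_origins=parse_allowed_origins()`, with e.g. `ALLOWED_ORIGINS="https://app.example.com,http://localhost:3000"` set per deployment.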
```python
# Use PyMuPDF for text extraction
doc = fitz.open(pdf_path)

for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text()

    if text.strip():
        # Try to identify section headers
        section = self._identify_section(text)

        documents.append({
            "text": text,
            "page_number": page_num + 1,
            "section": section
        })

doc.close()
```
The PDF processor lacks validation for malformed or encrypted PDFs. If fitz.open() encounters an encrypted or corrupted PDF, it will raise an exception that should be handled gracefully with informative error messages to the user.
Suggested change:

```python
# Use PyMuPDF for text extraction with basic validation and error handling
try:
    doc = fitz.open(pdf_path)
except fitz.FileDataError as e:
    # Raised for corrupted or invalid PDF data
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': file is corrupted or not a valid PDF."
    ) from e
except RuntimeError as e:
    # PyMuPDF may raise RuntimeError for encrypted or otherwise problematic PDFs
    raise ValueError(
        f"Failed to open PDF file '{pdf_path}': the file may be encrypted or corrupted."
    ) from e

# Check if the PDF is encrypted and requires a password
if getattr(doc, "needs_pass", False):
    doc.close()
    raise ValueError(
        f"Failed to process PDF file '{pdf_path}': the file is encrypted and a password was not provided."
    )

try:
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()

        if text.strip():
            # Try to identify section headers
            section = self._identify_section(text)

            documents.append({
                "text": text,
                "page_number": page_num + 1,
                "section": section
            })
finally:
    doc.close()
```
```python
import fitz  # PyMuPDF
import pdfplumber
```

Import of `pdfplumber` is not used.

Suggested change:

```diff
-import pdfplumber
```
Overview of this pull request

Implemented sentiment analysis for academic documents with a 3-stage Retrieval-Augmented Generation (RAG) pipeline using Langchain:

1. Chunking (`RecursiveCharacterTextSplitter`)
2. Embedding (`GoogleGenerativeAIEmbeddings`)
3. Sentiment analysis (`analyze_sentiment()`)

The model used is Gemini-3-flash. The model's code is: `gemini-3-flash-preview`

Key Changes from older NLP implementation
Why Langchain?
Langchain abstracts away the complexity of the RAG pipeline: it handles chunking with smart separators, manages embedding API calls, integrates with ChromaDB for vector operations, provides similarity search with metadata filtering, and formats prompts with templates. Without it, all of this would have to be implemented manually.