Ministry of Statistics and Programme Implementation (MoSPI), Government of India

AI-Based Legacy Data Extraction & Processing

Automated extraction and structuring of legacy statistical data from PDFs, CSVs, and Excel files — with a human-in-the-loop Feeder system, semantic table discovery, and natural language data analytics via Text2SQL.

Data Engineering · OCR · Text2SQL · Government · Legacy Data

Impact Metrics

490+
Tables Indexed
Statistical tables extracted, verified, and made searchable
>80%
Extraction Accuracy
Docling-powered OCR with human verification
4-stage
Query Pipeline
Doc Search → Table Retrieval → Semantic Filter → SQL Gen
4
Formats Supported
PDF, Excel, CSV, and images with merged cell handling
3
Languages
English, Hindi, and Kannada for queries and responses
172 tables
Stress Test
Successfully processed a single report with ~172 tables
The Challenge

Vast Legacy Data Locked in Unstructured Formats

MoSPI holds vast amounts of legacy data in PDFs, CSVs, and Excel files. Extracting meaningful insights currently requires extensive manual effort, specialized coding knowledge, and handling complex formats including merged cells and Hindi text.

OCR engines are powerful but not infallible — noise, split headers, and garbage characters can corrupt a database. The challenge was not just extraction, but ensuring only verified, accurate data enters the system. A naive pipeline would produce garbage-in, garbage-out results that undermine analyst confidence.

Even after extraction, tables with identical names recur across years (e.g., 'Table 2' in both the 2022 and 2023 editions of a report), making it impossible for an AI system to distinguish them without human-curated metadata. Cross-table analysis, trend comparison, and statistical operations all required manual data wrangling.

Key Pain Points

Extensive manual effort to extract data from PDFs, CSVs, and Excel files
Standard OCR produces noise, split headers, and garbage characters corrupting databases
Identical table names across years make AI disambiguation impossible
No way to query structured data using natural language
Complex formats (merged cells, Hindi text) break standard extraction tools
No audit trail for data modifications during extraction
The Solution

Human-in-the-Loop Data Pipeline with Semantic Discovery and Text2SQL

We built a custom pipeline that automatically extracts tables from Excel, CSV, and PDF files and stores them in a relational (SQL) database. The Feeder system implements a human-in-the-loop approach: after automated OCR extraction, admins can edit tables, merge tables that share common headers, rename captions for disambiguation, and approve data before it enters the vector store. Every change is tracked in audit logs.
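A minimal sketch of what such an audit-logged approval workflow could look like (all class and method names here are illustrative, not the production code):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    actor: str
    action: str
    detail: str = ""
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class ExtractedTable:
    caption: str
    rows: list
    status: str = "extracted"  # extracted -> edited -> approved
    audit: list = field(default_factory=list)

    def _log(self, actor: str, action: str, detail: str = "") -> None:
        # Every mutation appends an immutable audit record.
        self.audit.append(AuditEntry(actor, action, detail))

    def rename_caption(self, actor: str, new_caption: str) -> None:
        self._log(actor, "rename", f"{self.caption!r} -> {new_caption!r}")
        self.caption = new_caption
        self.status = "edited"

    def approve(self, actor: str) -> None:
        # Only approved tables proceed to indexing in the vector store.
        self._log(actor, "approve")
        self.status = "approved"

t = ExtractedTable("Table 2", rows=[["Coke plants", 3521]])
t.rename_caption("admin", "Table 2: Coke Plant Production, 2023")
t.approve("admin")
```

The key design point is that state transitions and audit entries are inseparable: there is no code path that edits data without leaving a log record.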

For data discovery, we built the MoSPI Data Intelligence Hub — a semantic search layer over 490+ indexed tables. Users describe what they're looking for in natural language, and a 4-stage retrieval pipeline (Doc Search → Table Retrieval → Semantic Filter → SQL Gen) identifies the most relevant tables from thousands of candidates.
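Conceptually, each stage narrows the candidate set before SQL is ever generated. A toy skeleton of that funnel (scoring is stubbed with keyword overlap; the production system uses embedding search and an LLM-based filter, and all names below are hypothetical):

```python
# Toy 4-stage retrieval funnel: Doc Search -> Table Retrieval ->
# Semantic Filter -> SQL Gen. Keyword overlap stands in for
# vector similarity (e.g., BGE embeddings served from Qdrant).

def overlap(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))

def doc_search(query, docs, k=2):
    # Stage 1: pick the most relevant source documents.
    return sorted(docs, key=lambda d: overlap(query, d["title"]), reverse=True)[:k]

def table_retrieval(query, docs, k=3):
    # Stage 2: rank tables within the shortlisted documents.
    tables = [t for d in docs for t in d["tables"]]
    return sorted(tables, key=lambda t: overlap(query, t["caption"]), reverse=True)[:k]

def semantic_filter(query, tables):
    # Stage 3: drop tables with no semantic relation to the query.
    return [t for t in tables if overlap(query, t["caption"]) > 0]

def sql_gen(query, table):
    # Stage 4: placeholder; a real system prompts an LLM with the
    # table's schema and curated metadata.
    return f'SELECT * FROM "{table["name"]}" -- for: {query}'

docs = [
    {"title": "Index of Industrial Production 2024",
     "tables": [{"name": "iip_coke_2024", "caption": "Production of coke plants 2024"}]},
    {"title": "Annual Survey of Industries 2023",
     "tables": [{"name": "asi_wages_2023", "caption": "Wages by state 2023"}]},
]

query = "production of coke plants in 2024"
candidates = semantic_filter(query, table_retrieval(query, doc_search(query, docs)))
sql = sql_gen(query, candidates[0])
```

Staging the search this way keeps the expensive steps (semantic filtering, SQL generation) operating on a handful of candidates rather than thousands.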

The system enables natural language analytics via Text2SQL — users can ask questions like 'mean production of coke plants in November 2024' and get precise SQL-backed answers with chart generation. We enriched each table with metadata (caption, column names, Q&A pairs) and built a dedicated table catalog for table-level semantic search.
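The end-to-end idea can be illustrated with Python's built-in sqlite3. The generated SQL below is hand-written to stand in for the LLM's output, and the table name and figures are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE coke_plant_production (
        month TEXT, year INTEGER, production_kt REAL
    )
""")
conn.executemany(
    "INSERT INTO coke_plant_production VALUES (?, ?, ?)",
    [("November", 2024, 410.5), ("November", 2024, 398.1),
     ("October", 2024, 402.0)],
)

# In production, an LLM prompted with the table's schema and curated
# metadata (caption, column names, Q&A pairs) would produce this query
# from "mean production of coke plants in November 2024":
generated_sql = """
    SELECT AVG(production_kt) FROM coke_plant_production
    WHERE month = 'November' AND year = 2024
"""
mean = conn.execute(generated_sql).fetchone()[0]
```

Because the answer is computed by the database rather than the model, the result is exact and reproducible; the LLM's job is limited to translating the question into SQL.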

Our Approach

1
Custom extraction pipeline for PDFs, CSVs, and Excel to SQL
2
Human-in-the-loop Feeder with admin edit, merge, and approval workflows
3
Docling OCR with table structure preservation
4
Audit logs tracking every data modification
5
Dedicated table catalog with semantic summaries for each table
6
4-stage query pipeline: Doc Search → Table Retrieval → Semantic Filter → SQL Gen
7
Text2SQL for natural language data analytics
8
Chart mode for visual data analysis
9
Multilingual chat support (English, Hindi, Kannada)

Key Features Delivered

Automated data extraction from PDFs, Excel, CSV, and images
Human-in-the-loop Feeder with admin approval workflow
MoSPI Data Intelligence Hub with 490+ indexed tables
Natural language querying via Text2SQL
4-stage semantic retrieval pipeline for precise table identification
Chart generation and cross-table comparison
Table caption editing for AI disambiguation across years
Audit logs for all data modifications
Export to SQL dump
Multilingual support (English, Hindi, Kannada)
Technology Stack

Built With

SvelteKit (Frontend)
Tailwind CSS + shadcn/ui
FastAPI + Django (Backend)
Socket.IO (Real-time)
Qdrant (Vector DB)
PostgreSQL
Docling (OCR)
BGE Large (Embeddings)
OSS 120B LLM (Inference)
LangChain
Ollama + DeepSeek
LiteLLM
Docker + Nginx
Azure Cloud
Redis
Results

Outcomes Achieved

The system transformed MoSPI's legacy data into a searchable, queryable intelligence hub — enabling analysts to find, compare, and analyze statistical tables across decades of reports using natural language, with complete audit trails and human-verified data accuracy.

490+
Tables Indexed
>80%
Extraction Accuracy
4-stage
Query Pipeline
4
Formats Supported
3
Languages
172 tables
Stress Test

Want Similar Results?

Let's discuss how we can build a similar solution for your organization — with the same certified quality and production-grade delivery.