Ministry of Statistics and Programme Implementation (MoSPI), Government of India

AI-Based Legacy Data Extraction & Processing

Automated extraction and structuring of legacy statistical data from PDFs, CSVs, and Excel files — with a human-in-the-loop Feeder system, semantic table discovery, and natural language data analytics via Text2SQL.

Data Engineering · OCR · Text2SQL · Government · Legacy Data

Impact Metrics

490+
Tables Indexed
Statistical tables extracted, verified, and made searchable
>80%
Extraction Accuracy
Docling-powered OCR with human verification
4-stage
Query Pipeline
Doc Search → Table Retrieval → Semantic Filter → SQL Gen
4
Formats Supported
PDF, Excel, CSV, and images with merged cell handling
3
Languages
English, Hindi, and Kannada for queries and responses
172 tables
Stress Test
Successfully processed a single report with ~172 tables
The Challenge

Vast Legacy Data Locked in Unstructured Formats

MoSPI holds vast amounts of legacy data in PDFs, CSVs, and Excel files. Extracting meaningful insights currently requires extensive manual effort, specialized coding knowledge, and handling complex formats including merged cells and Hindi text.

OCR engines are powerful but not infallible — noise, split headers, and garbage characters can corrupt a database. The challenge was not just extraction, but ensuring only verified, accurate data enters the system. A naive pipeline would produce garbage-in, garbage-out results that undermine analyst confidence.

Even after extraction, tables with identical names recur across years (e.g., 'Table 2' in both the 2022 and 2023 editions of a report), making it impossible for an AI system to distinguish them without human-curated metadata. Cross-table analysis, trend comparison, and statistical operations all required manual data wrangling.

Key Pain Points

Extensive manual effort to extract data from PDFs, CSVs, and Excel files
Standard OCR produces noise, split headers, and garbage characters corrupting databases
Identical table names across years make AI disambiguation impossible
No way to query structured data using natural language
Complex formats (merged cells, Hindi text) break standard extraction tools
No audit trail for data modifications during extraction
The Solution

Human-in-the-Loop Data Pipeline with Semantic Discovery and Text2SQL

We built a custom pipeline that automatically extracts tables from Excel, CSV, and PDF files and stores them in a relational (SQL) database. The Feeder system implements a human-in-the-loop approach: after automated OCR extraction, admins can edit tables, merge tables that share common headers, rename captions for disambiguation, and approve data before it enters the vector store. Every change is tracked in audit logs.
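A minimal sketch of what such an audit-logged approval workflow could look like (all class and method names here are illustrative, not the production code):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    actor: str
    action: str
    detail: str = ""
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class ExtractedTable:
    caption: str
    rows: list
    status: str = "extracted"  # extracted -> edited -> approved
    audit: list = field(default_factory=list)

    def _log(self, actor: str, action: str, detail: str = "") -> None:
        # Every mutation appends an immutable audit record.
        self.audit.append(AuditEntry(actor, action, detail))

    def rename_caption(self, actor: str, new_caption: str) -> None:
        self._log(actor, "rename", f"{self.caption!r} -> {new_caption!r}")
        self.caption = new_caption
        self.status = "edited"

    def approve(self, actor: str) -> None:
        # Only approved tables proceed to indexing in the vector store.
        self._log(actor, "approve")
        self.status = "approved"

t = ExtractedTable("Table 2", rows=[["Coke plants", 3521]])
t.rename_caption("admin", "Table 2: Coke Plant Production, 2023")
t.approve("admin")
```

The key design point is that state transitions and audit entries are inseparable: there is no code path that edits data without leaving a log record.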

For data discovery, we built the MoSPI Data Intelligence Hub — a semantic search layer over 490+ indexed tables. Users describe what they're looking for in natural language, and a 4-stage retrieval pipeline (Doc Search → Table Retrieval → Semantic Filter → SQL Gen) identifies the most relevant tables from thousands of candidates.
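Conceptually, each stage narrows the candidate set before SQL is ever generated. A toy skeleton of that funnel (scoring is stubbed with keyword overlap; the production system uses embedding search and an LLM-based filter, and all names below are hypothetical):

```python
# Toy 4-stage retrieval funnel: Doc Search -> Table Retrieval ->
# Semantic Filter -> SQL Gen. Keyword overlap stands in for
# vector similarity (e.g., BGE embeddings served from Qdrant).

def overlap(query: str, text: str) -> int:
    return len(set(query.lower().split()) & set(text.lower().split()))

def doc_search(query, docs, k=2):
    # Stage 1: pick the most relevant source documents.
    return sorted(docs, key=lambda d: overlap(query, d["title"]), reverse=True)[:k]

def table_retrieval(query, docs, k=3):
    # Stage 2: rank tables within the shortlisted documents.
    tables = [t for d in docs for t in d["tables"]]
    return sorted(tables, key=lambda t: overlap(query, t["caption"]), reverse=True)[:k]

def semantic_filter(query, tables):
    # Stage 3: drop tables with no semantic relation to the query.
    return [t for t in tables if overlap(query, t["caption"]) > 0]

def sql_gen(query, table):
    # Stage 4: placeholder; a real system prompts an LLM with the
    # table's schema and curated metadata.
    return f'SELECT * FROM "{table["name"]}" -- for: {query}'

docs = [
    {"title": "Index of Industrial Production 2024",
     "tables": [{"name": "iip_coke_2024", "caption": "Production of coke plants 2024"}]},
    {"title": "Annual Survey of Industries 2023",
     "tables": [{"name": "asi_wages_2023", "caption": "Wages by state 2023"}]},
]

query = "production of coke plants in 2024"
candidates = semantic_filter(query, table_retrieval(query, doc_search(query, docs)))
sql = sql_gen(query, candidates[0])
```

Staging the search this way keeps the expensive steps (semantic filtering, SQL generation) operating on a handful of candidates rather than thousands.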

The system enables natural language analytics via Text2SQL — users can ask questions like 'mean production of coke plants in November 2024' and get precise SQL-backed answers with chart generation. We enriched each table with metadata (caption, column names, Q&A pairs) and built a dedicated table catalog for table-level semantic search.
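The end-to-end idea can be illustrated with Python's built-in sqlite3. The generated SQL below is hand-written to stand in for the LLM's output, and the table name and figures are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE coke_plant_production (
        month TEXT, year INTEGER, production_kt REAL
    )
""")
conn.executemany(
    "INSERT INTO coke_plant_production VALUES (?, ?, ?)",
    [("November", 2024, 410.5), ("November", 2024, 398.1),
     ("October", 2024, 402.0)],
)

# In production, an LLM prompted with the table's schema and curated
# metadata (caption, column names, Q&A pairs) would produce this query
# from "mean production of coke plants in November 2024":
generated_sql = """
    SELECT AVG(production_kt) FROM coke_plant_production
    WHERE month = 'November' AND year = 2024
"""
mean = conn.execute(generated_sql).fetchone()[0]
```

Because the answer is computed by the database rather than the model, the result is exact and reproducible; the LLM's job is limited to translating the question into SQL.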

Our Approach

1
Custom extraction pipeline for PDFs, CSVs, and Excel to SQL
2
Human-in-the-loop Feeder with admin edit, merge, and approval workflows
3
Docling OCR with table structure preservation
4
Audit logs tracking every data modification
5
Dedicated table catalog with semantic summaries for each table
6
4-stage query pipeline: Doc Search → Table Retrieval → Semantic Filter → SQL Gen
7
Text2SQL for natural language data analytics
8
Chart mode for visual data analysis
9
Multilingual chat support (English, Hindi, Kannada)

Key Features Delivered

Automated data extraction from PDFs, Excel, CSV, and images
Human-in-the-loop Feeder with admin approval workflow
MoSPI Data Intelligence Hub with 490+ indexed tables
Natural language querying via Text2SQL
4-stage semantic retrieval pipeline for precise table identification
Chart generation and cross-table comparison
Table caption editing for AI disambiguation across years
Audit logs for all data modifications
Export to SQL dump
Multilingual support (English, Hindi, Kannada)
Technology Stack

Built With

SvelteKit (Frontend)
Tailwind CSS + shadcn/ui
FastAPI + Django (Backend)
Socket.IO (Real-time)
Qdrant (Vector DB)
PostgreSQL
Docling (OCR)
BGE Large (Embeddings)
OSS 120B LLM (Inference)
LangChain
Ollama + DeepSeek
LiteLLM
Docker + Nginx
Azure Cloud
Redis
Results

Outcomes Achieved

The system transformed MoSPI's legacy data into a searchable, queryable intelligence hub — enabling analysts to find, compare, and analyze statistical tables across decades of reports using natural language, with complete audit trails and human-verified data accuracy.

490+
Tables Indexed
>80%
Extraction Accuracy
4-stage
Query Pipeline
4
Formats Supported
3
Languages
172 tables
Stress Test

Want Similar Results?

Let's discuss how we can build a similar solution for your organization — with the same certified quality and production-grade delivery.