GST Automation in India: From OCR to Compliance in 5 Weeks
Indian GST compliance sounds straightforward until you're in it. HSN/SAC code classification alone has thousands of entries. GSTIN validation requires checksum logic. Multi-rate invoices — one line item at 12%, another at 18%, a third exempt — are the norm, not the exception.
Building a reliable automation layer on top of this required us to be precise about what we were automating and humble about what still needed human review.
The OCR Problem
Not all invoices are created equal. We receive PDFs generated by modern accounting software, scanned paper invoices from the 1990s, handwritten challan slips, and everything in between.
We built a pre-processing pipeline that assesses each invoice's quality, denoises it where needed, and routes it to the right extraction path. Clean PDFs go to our structured parser. Scanned images go to Google Document AI for layout-aware extraction. Low-quality scans get flagged for human review with a pre-filled partial extraction.
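The routing step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the `DocumentProfile` fields, the `quality_score` scale, and the `MIN_OCR_QUALITY` threshold are all hypothetical names and values chosen for the example, and the actual parser and Document AI calls are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    STRUCTURED_PARSER = "structured_parser"  # digitally generated PDFs
    LAYOUT_OCR = "layout_ocr"                # layout-aware OCR for scans


@dataclass
class DocumentProfile:
    has_text_layer: bool  # True when the PDF carries embedded text
    quality_score: float  # hypothetical scale: 0.0 unreadable, 1.0 clean


@dataclass
class RoutingDecision:
    engine: Engine
    needs_review: bool    # True sends the partial extraction to a human


MIN_OCR_QUALITY = 0.5     # assumed threshold, to be tuned on real data


def choose_route(profile: DocumentProfile) -> RoutingDecision:
    """Pick an extraction path from a document's quality profile."""
    if profile.has_text_layer:
        return RoutingDecision(Engine.STRUCTURED_PARSER, needs_review=False)
    # Scans are always OCR'd; poor ones are additionally flagged so a
    # reviewer starts from a pre-filled partial extraction.
    low_quality = profile.quality_score < MIN_OCR_QUALITY
    return RoutingDecision(Engine.LAYOUT_OCR, needs_review=low_quality)
```

Keeping the decision separate from the extraction call makes the threshold easy to test and adjust without touching any OCR integration.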
GSTIN Extraction & Validation
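Once text is extracted, we locate GSTINs with a regex pattern. Every GSTIN is 15 alphanumeric characters: a two-digit state code, the holder's 10-character PAN, an entity number, a default "Z", and a final check character. But extraction alone isn't enough, because a string can match the pattern and still be mistyped or misread. We recompute the check character for each GSTIN and optionally verify the number against the GST portal API.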
Invalid GSTINs are flagged immediately. Input tax credit claims on invalid GSTINs are a common audit trigger, so catching them upstream saves clients significant headaches.
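As a rough sketch of the regex match and checksum validation described above: the pattern below is an assumption that covers regular taxpayer GSTINs only, and the check character uses the commonly described base-36 weighted-sum scheme. The function names are illustrative, and the optional GST portal lookup is not shown.

```python
import re

# Base-36 alphabet: a character's position is its numeric value.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumed layout: state code (2 digits), PAN (5 letters, 4 digits,
# 1 letter), entity number, a literal "Z", then the check character.
GSTIN_RE = re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b")


def check_character(first_14: str) -> str:
    """Compute the check character from the first 14 characters."""
    total = 0
    for index, char in enumerate(first_14):
        value = ALPHABET.index(char)
        factor = 2 if index % 2 else 1  # weights alternate 1, 2, 1, 2, ...
        product = value * factor
        total += product // 36 + product % 36
    return ALPHABET[(36 - total % 36) % 36]


def is_valid_gstin(candidate: str) -> bool:
    """Check both the format and the trailing check character."""
    candidate = candidate.strip().upper()
    if not GSTIN_RE.fullmatch(candidate):
        return False
    return check_character(candidate[:14]) == candidate[14]


def extract_gstins(text: str) -> list[tuple[str, bool]]:
    """Return every GSTIN-shaped string in text with its validity."""
    return [(match, is_valid_gstin(match))
            for match in GSTIN_RE.findall(text.upper())]
```

For example, `is_valid_gstin("27AAPFU0939F1ZV")` returns `True` for this sample value, while changing the last character to `A` makes it return `False`. A passing checksum only shows the string is internally consistent, which is why a portal lookup is still useful for confirming the registration exists.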
What We Learned
The edge cases are where real systems earn their keep. Invoices with multiple GSTINs (inter-state transactions). Credit notes with negative line items. Reverse charge mechanism invoices. Each required explicit handling logic that no generic OCR vendor provides.
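To make those edge cases concrete, here is a simplified sketch of how a multi-rate invoice might be summarized. The data model and the `summarize_tax` function are hypothetical, rounding and cess are ignored, and the inter-state test (supplier state code versus place of supply) is a simplification of the actual place-of-supply rules.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    taxable_value: Decimal  # negative on credit notes
    rate_percent: Decimal   # Decimal("0") for exempt lines


@dataclass
class Invoice:
    supplier_gstin: str
    place_of_supply: str    # two-digit state code
    lines: list[LineItem]
    reverse_charge: bool = False


def summarize_tax(invoice: Invoice) -> dict:
    """Total the tax per rate and split it by transaction type."""
    tax_by_rate: dict[Decimal, Decimal] = {}
    for line in invoice.lines:
        tax = line.taxable_value * line.rate_percent / 100
        previous = tax_by_rate.get(line.rate_percent, Decimal("0"))
        tax_by_rate[line.rate_percent] = previous + tax
    total_tax = sum(tax_by_rate.values(), Decimal("0"))
    taxable_total = sum((l.taxable_value for l in invoice.lines), Decimal("0"))

    # The first two GSTIN characters are the supplier's state code.
    inter_state = invoice.supplier_gstin[:2] != invoice.place_of_supply
    if inter_state:
        split = {"IGST": total_tax}
    else:
        # Intra-state tax is shared equally between centre and state.
        split = {"CGST": total_tax / 2, "SGST": total_tax / 2}

    return {
        "tax_by_rate": tax_by_rate,
        "split": split,
        "is_credit_note": taxable_total < 0,
        # Under reverse charge the recipient, not the supplier, pays the tax.
        "liable_party": "recipient" if invoice.reverse_charge else "supplier",
    }
```

Using `Decimal` rather than floats keeps the per-rate totals exact, which matters when the figures are reconciled against filed returns.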
The automation now handles 96% of invoices without human intervention. The remaining 4% get routed to a review queue with extracted data pre-filled and the ambiguity highlighted. That's the right tradeoff.