
Generative AI for Document Classification: Transforming Automated Tasks


Document classification (i.e. identifying the type of a document, e.g. a driving license, a passport, a USA tax form such as W2 or W9, or any business-specific document) is a crucial first step for many complex automated tasks, such as auto-filling forms, email classification, information extraction from documents and data labelling for AI model training.

A wide range of traditional methods is available for document classification, each with its own pros and cons. The most popular is the use of CNNs (Convolutional Neural Networks), which capture an image’s visual and spatial features more effectively than many other machine learning algorithms or neural networks. While effective, this technique has its own disadvantages: it requires a large volume of training data, long training times and higher-end infrastructure for good performance.


Figure 1: CNN architecture

GenAI: Document Classification Beyond CNN

The GenAI-based solution offers an interesting alternative route for document classification that aims to address the challenges faced by CNNs and other traditional document classifiers. The advantages of the GenAI-based approach over the CNN-based method are:

  • Good performance even with small training samples: Good performance with very little data (4-5 samples per class) makes it a good choice for few-shot text classification.
  • No specific training required: Leveraging state-of-the-art OCR and large language models eliminates the need for conventional model training, as LLMs excel at discerning semantic similarities thanks to extensive pre-training on vast textual datasets.
  • Fewer false positives due to a text- and layout-focused approach: By analyzing both content and layout, this method helps reduce false positives, which is especially useful when documents from different classes share a similar layout or text content.


Working of GenAI-based document classifier

The basic idea is to use OCR (optical character recognition) to extract text from the images and pass it to an LLM (large language model, e.g. GPT, LLaMA etc.) to generate embeddings, which are then used for document classification. Let’s look at the implementation of this approach using a real-world use case.

Problem statement – Classify scanned copies of USA tax forms

Available classes: W2, W9 and 1040 tax forms. 

Description: USA tax forms (like W2, W9 & 1040) vary in structure and are text heavy, making them difficult to classify accurately with common CV-based image classification algorithms. With the GenAI-based approach, we can classify them even with very little training data.



     Figure 2: Solution Architecture

Technology used: 

  • Optical character recognition with Google Tesseract (pip install pytesseract==0.3.10)
  • OpenAI API key for the embedding model (pip install openai==0.28.0)
  • Python libraries like NumPy, OpenCV (pip install numpy==1.24.4 opencv-python==

Step #1 – Data Collection

First, we need to build training data which will have shape N (Number of classes) * K (Number of training samples for each class). Let’s collect a few samples (e.g. 2 samples) of W2, W9 & 1040 forms from the web.


Figure 5: Dataset collection

In this case, parameters are

  • Classes: [w9, w2, 1040] 
  • N = 3 and K = 2

Step #1A – Project Structure

This is the training data directory structure. Training data needs to be placed in one sub-directory per class under the training data directory (support_set/ by default in the training script of Step #5).
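A minimal sketch of the layout that the training script in Step #5 appears to expect, assuming one sub-directory per class under support_set/. The file names are hypothetical placeholders, and the tree is built in a temporary directory only to show how N and K fall out of it.

```python
import os
import tempfile

# Hypothetical file names; only the <root>/<class>/<file> layout matters.
layout = {
    "w2": ["w2_sample_1.png", "w2_sample_2.png"],
    "w9": ["w9_sample_1.png", "w9_sample_2.png"],
    "1040": ["1040_sample_1.png", "1040_sample_2.png"],
}

def build_support_set(root, layout):
    """Create <root>/<class>/<file> with empty placeholder files."""
    for cls, files in layout.items():
        os.makedirs(os.path.join(root, cls), exist_ok=True)
        for name in files:
            open(os.path.join(root, cls, name), "w").close()

def describe_support_set(root):
    """Return (class names, N, K per class) by listing the directory tree."""
    classes = sorted(os.listdir(root))
    k_per_class = {c: len(os.listdir(os.path.join(root, c))) for c in classes}
    return classes, len(classes), k_per_class

root = os.path.join(tempfile.mkdtemp(), "support_set")
build_support_set(root, layout)
classes, n, k_per_class = describe_support_set(root)
print(classes, n, k_per_class)
```

Each class directory contributes K files, and the number of class directories is N, which matches the N * K shape described in Step #1.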

Step #2 – Text Extraction

We will use Tesseract (through the pytesseract wrapper) as the OCR engine to extract each word and its bounding box coordinates from the document image.

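import pytesseract, numpy as np

# point pytesseract at the local Tesseract executable
pytesseract.pytesseract.tesseract_cmd = "<path_to_tesseract.exe>"

def apply_ocr(img):
    # run OCR and return the detected elements as a pandas DataFrame
    ocr_df = pytesseract.image_to_data(img, output_type='data.frame')
    return ocr_df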

Note: The output of this code snippet is a pandas DataFrame; the columns used in the following steps are text, left, top, width and height.

Step #3 – Convert Data frame to String Representation

As we have extracted each word from document with its bounding boxes, let’s convert the data frame representation into string representation. Using this, embeddings can be generated for each document.

Instead of passing only concatenated text string to LLM, we will try to encode it in a format which will retain text context and how these texts are spread across the document. Let’s look at an example (Figure 3): 


Figure 3: Represent Text documents as string

Let’s write code to convert the OCR results into a formatted string:

def text_to_string_encode(coordinates, min_x, min_y, max_x, max_y):
    # 50 x 50 grid; cells without a word keep the value 0
    matrix = [[0] * 50 for _ in range(50)]

    # fill matrix with values
    for x, y, z in coordinates:
        # skip rows where OCR returned no text (NaN)
        if isinstance(z, float) and np.isnan(z):
            continue
        # scale pixel coordinates to grid indices; use index 0 if the range is degenerate
        matrix_x = int((x - min_x) / (max_x - min_x) * (50 - 1)) if max_x != min_x else 0
        matrix_y = int((y - min_y) / (max_y - min_y) * (50 - 1)) if max_y != min_y else 0
        matrix[matrix_y][matrix_x] = z

    # return matrix as string: cells concatenated, rows separated by '/'
    return '/'.join([''.join(map(str, line)) for line in matrix])

Note: Considering the input length constraint of GPT models, we always scale the document to a 50 * 50 grid layout, assuming the number of words on a page will not exceed 2,500.
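To see what this encoding produces, here is a self-contained toy version. It is a sketch rather than the article's function: it adds a hypothetical `grid` parameter (the article fixes the grid at 50) and uses made-up word coordinates so that the output stays small enough to read.

```python
import math

def encode_on_grid(coordinates, min_x, min_y, max_x, max_y, grid=4):
    """Place each word on a grid x grid matrix by scaled position (toy variant)."""
    matrix = [[0] * grid for _ in range(grid)]
    for x, y, word in coordinates:
        # Skip missing OCR text (NaN) so it never reaches the string.
        if isinstance(word, float) and math.isnan(word):
            continue
        col = int((x - min_x) / (max_x - min_x) * (grid - 1)) if max_x != min_x else 0
        row = int((y - min_y) / (max_y - min_y) * (grid - 1)) if max_y != min_y else 0
        matrix[row][col] = word
    # Rows are joined with '/', cells are concatenated without a separator.
    return "/".join("".join(str(cell) for cell in line) for line in matrix)

# Made-up (left, top, text) triples, as if taken from an OCR result.
words = [(0, 0, "Form"), (300, 0, "W-9"), (0, 150, "Name"), (150, 300, float("nan"))]
encoded = encode_on_grid(words, min_x=0, min_y=0, max_x=300, max_y=300)
print(encoded)  # Form00W-9/Name000/0000/0000
```

Empty cells show up as 0, so the position of each word within a row and the row it falls on both survive in the string that is later embedded.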

Step #4 – Generate Embeddings

Let’s write code to generate embeddings, in both the training phase and the inference phase, for a given document’s text string.

import openai

# set up the OpenAI key
api_key = '<your api key>'
openai.api_key = api_key

def get_embedding(formatted_string):
    return openai.Embedding.create(
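The listing above is cut off in the source. A minimal sketch of how such a function could look with openai==0.28.0 follows. The model name "text-embedding-ada-002" is an assumption (the article does not say which embedding model was used), and the `create_fn` parameter is a hypothetical hook added here so the function can be exercised without a network call.

```python
def get_embedding_sketch(formatted_string, model="text-embedding-ada-002", create_fn=None):
    """Return the embedding vector (list of floats) for one document string.

    ASSUMPTIONS: the model name is a guess, not taken from the article;
    create_fn is a hypothetical test hook, defaulting to openai.Embedding.create.
    """
    if create_fn is None:
        import openai  # openai==0.28.x interface; needs openai.api_key to be set
        create_fn = openai.Embedding.create
    response = create_fn(input=[formatted_string], model=model)
    # The response carries one entry per input under "data".
    return response["data"][0]["embedding"]

# Offline usage with a stand-in for the API call (fake values, for illustration only).
def fake_create(input, model):
    return {"data": [{"embedding": [float(len(text)) for text in input]}]}

vector = get_embedding_sketch("Form00W-9", create_fn=fake_create)
print(vector)
```

With a real API key and no `create_fn`, the same function would send the formatted string to the embeddings endpoint and return the vector from the first entry of the response.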

Step #5 – AI Model Training

Now, with everything in place, we are ready to write a model training script and train our model.

import os, sys
from embeddings import *
from text_extraction import *
import pandas as pd

def fit_model(train_data='support_set/'):
    # each sub-directory of train_data is one class
    class_names = os.listdir(train_data)
    support_set = dict.fromkeys(class_names)
    for cls in support_set:
        # list the sample files of this class
        support_set[cls] = os.listdir(train_data + cls)

Let’s process the individual files in our training dataset using the “apply_ocr”, “text_to_string_encode” and “get_embedding” functions. Please note that this is also part of the fit_model function.

    label = []
    embds = []
    for cls in support_set:
        for file in support_set[cls]:
            print(f'Processing file {file} of class {cls}')
            ocr_data = apply_ocr(train_data + cls + '/' + file)
            # bounding box of all detected text on the page
            min_x = ocr_data['left'].min()
            min_y = ocr_data['top'].min()
            max_x = (ocr_data['left'] + ocr_data['width']).max()
            max_y = (ocr_data['top'] + ocr_data['height']).max()
            formatted_string = text_to_string_encode(
                zip(ocr_data['left'], ocr_data['top'], ocr_data['text']),
                min_x, min_y, max_x, max_y)
            embd = get_embedding(formatted_string)
            # keep the class label and its embedding side by side
            label.append(cls)
            embds.append(embd)

Note: This generates a mapping of class label to embedding for all our training files, which serves as our trained model. We can store this mapping as a CSV file to refer to in the inference phase.

    # let’s export our csv file which will serve as a trained model (still inside fit_model)
    pd.DataFrame({'Label': label, 'Embedding': embds}).to_csv('support_set.csv')
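Because a CSV cell can only hold text, each embedding ends up stored as a string and has to be parsed back into a list at inference time. A minimal sketch of that round trip follows, using only the standard library and made-up three-dimensional vectors in place of real embeddings. `save_support_set` and `load_support_set` are hypothetical helper names, and JSON is used for the cell format here as an alternative to `eval`-based parsing.

```python
import csv
import io
import json

def save_support_set(rows, handle):
    """Write (label, embedding) pairs as CSV, embedding serialized as JSON."""
    writer = csv.writer(handle)
    writer.writerow(["Label", "Embedding"])
    for label, embedding in rows:
        writer.writerow([label, json.dumps(embedding)])

def load_support_set(handle):
    """Read the CSV back into a list of (label, list-of-floats) pairs."""
    reader = csv.DictReader(handle)
    return [(row["Label"], json.loads(row["Embedding"])) for row in reader]

# Made-up vectors; real embeddings have far more dimensions.
rows = [("w2", [0.1, 0.2, 0.3]), ("w9", [0.4, 0.5, 0.6])]
buffer = io.StringIO()
save_support_set(rows, buffer)
buffer.seek(0)
restored = load_support_set(buffer)
print(restored)
```

An in-memory buffer stands in for the file here; passing an open file handle instead would give the same label-to-embedding table on disk.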

Step #6 – Getting Inference

Now we are ready to run inference on a test image. The flow for a test image is: image -> apply_ocr -> text_to_string_encode -> get_embedding

import os, sys, numpy as np
from numpy.linalg import norm
from embeddings import *
from text_extraction import *
from train import *
import pandas as pd

def test(image):
    ocr_data = apply_ocr(image)
    # bounding box of all detected text on the page
    min_x = ocr_data['left'].min()
    min_y = ocr_data['top'].min()
    max_x = (ocr_data['left'] + ocr_data['width']).max()
    max_y = (ocr_data['top'] + ocr_data['height']).max()
    formatted_string = text_to_string_encode(
        zip(ocr_data['left'], ocr_data['top'], ocr_data['text']),
        min_x, min_y, max_x, max_y)
    test_embd = get_embedding(formatted_string)
    return test_embd

Let’s calculate the cosine similarity of the test image with every training set image and predict the class based on the mean similarity score of each class. The higher the similarity score with a document, the more similar the test document is to that document’s class.

# train the model and save it as support_set.csv
fit_model()

# load test image and predict
test_embd = test('w9-test-image.png')
model = pd.read_csv('support_set.csv')
for n in range(len(model)):
    label = model.loc[n, 'Label']
    i = eval(model.loc[n, 'Embedding'])
    # cosine similarity between the stored embedding and the test embedding
    print(np.dot(i, test_embd)/(norm(i)*norm(test_embd)), label)
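The loop above prints one similarity score per training sample. A minimal sketch of the remaining step, averaging the scores per class and picking the class with the highest mean, is shown below. It uses plain Python and made-up two-dimensional vectors so that it runs without an API key; `predict_class` is a hypothetical helper name, not part of the article's code.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict_class(support_set, test_embedding):
    """Return (best label, mean similarity per label) for one test embedding."""
    scores = defaultdict(list)
    for label, embedding in support_set:
        scores[label].append(cosine_similarity(embedding, test_embedding))
    means = {label: sum(vals) / len(vals) for label, vals in scores.items()}
    return max(means, key=means.get), means

# Made-up support set: two samples per class (K = 2), three classes (N = 3).
support_set = [
    ("w9", [1.0, 0.0]), ("w9", [0.9, 0.1]),
    ("w2", [0.0, 1.0]), ("w2", [0.1, 0.9]),
    ("1040", [0.7, 0.7]), ("1040", [0.6, 0.8]),
]
best, means = predict_class(support_set, test_embedding=[1.0, 0.05])
print(best, means)
```

Averaging over the K samples of a class makes the prediction less sensitive to a single unusual training scan than taking the single nearest sample would be.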

Step #7 – Analyze Inference output


Figure 4 – Inference on test image, which is a W9 form

Note: Above is the similarity score of test image with each image in training set

In Figure 4, we can see that the mean score of the W9 class label is higher than that of the W2 and 1040 class labels, so the model classifies this test image correctly.


In this blog we have shared an approach to classifying text-rich documents using a combination of an OCR engine and text embeddings. We hope the post was helpful to Generative AI enthusiasts like you.


Aman Savaria & Vinay Verma
