Pretrained Models for Document Extraction

Cloud-Based Pretrained Models

  • Google Cloud Document AI: Offers pretrained models for various document types (invoices, receipts, IDs, etc.) covering key-value pair extraction, table extraction, and classification.
  • Amazon Textract: Provides pretrained models for OCR, key-value pair extraction, and table extraction from documents and images.
  • Azure AI Document Intelligence (formerly Form Recognizer): Offers pretrained models for common document types and supports custom model training.
  • Oracle Document Understanding: Provides pretrained models for invoices, document analysis, and HR document processing.

Open-Source Libraries and Models (Suitable for Non-Cloud Implementation)

  • Tesseract OCR: Open-source OCR engine for extracting text from images.
  • spaCy: Python library for NLP with named entity recognition (NER) capabilities that can be used to extract information from OCR output. Offers pretrained language models.
  • Transformers (Hugging Face): Library with thousands of pretrained transformer models that can be fine-tuned for document processing tasks.
  • Unstructured.io: Open-source library for preprocessing and extracting elements from various document types.
  • DeepPavlov: Open-source NLP library with components for entity recognition and information extraction.
  • GROBID (GeneRation Of BIbliographic Data): Specifically for extracting bibliographic information from academic PDFs.
  • Camelot: Python library for extracting tables from PDF files.
  • PDF-Extract-Kit: Toolkit for PDF content extraction, including layout detection, OCR, and table recognition models.
  • Docling: Library for parsing diverse document formats for GenAI applications.
  • LLMAIx: GitHub project for document information extraction and anonymization using local LLMs.
  • Zerox: SDK for OCR and structured data extraction from documents using vision models.
  • deepdoctection: Python library orchestrating document extraction and layout analysis using deep learning models.
  • DocQuery: Library and CLI tool using LLMs to analyze semi-structured and unstructured documents.
  • Infosys/Document-Extraction-Libraries: Python/Java libraries for extracting information from various document formats.
  • Parsee: Open-source framework for data extraction and structuring from unstructured data using LLMs and custom models.

Choosing Tools for Non-Cloud Use

When working without cloud services, focus on the open-source options. The best choice depends on the document types, complexity of extraction, computational resources, and your skills.
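
One way to make that choice concrete is to start from the tasks you need and shortlist tools from the open-source list above. The sketch below is only an illustration: the task names and the `shortlist` helper are hypothetical, and the mapping simply restates the one-line descriptions given in the list rather than any benchmark or ranking.

```python
from typing import Dict, Iterable, List

# Task -> candidate tools, restating the descriptions in the list above.
# The task labels are made up for this example.
CANDIDATES: Dict[str, List[str]] = {
    "ocr": ["Tesseract OCR", "PDF-Extract-Kit", "Zerox"],
    "entity extraction": ["spaCy", "Transformers (Hugging Face)", "DeepPavlov"],
    "pdf tables": ["Camelot", "PDF-Extract-Kit"],
    "bibliographic data": ["GROBID"],
    "layout analysis": ["deepdoctection", "PDF-Extract-Kit"],
    "llm-based extraction": ["LLMAIx", "DocQuery", "Parsee"],
}


def shortlist(tasks: Iterable[str]) -> List[str]:
    """Return candidate tools for the given tasks, in order, without duplicates."""
    seen: List[str] = []
    for task in tasks:
        # Unknown task names contribute nothing instead of raising.
        for tool in CANDIDATES.get(task.lower(), []):
            if tool not in seen:
                seen.append(tool)
    return seen


if __name__ == "__main__":
    print(shortlist(["OCR", "pdf tables"]))
```

A shortlist like this only narrows the field; document complexity, available hardware, and team skills still decide between the remaining candidates.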

Non-Cloud Example

  1. OCR: Use Tesseract or a local OCR engine.
  2. Preprocessing: Clean and prepare the extracted text.
  3. Model Application: Load and run a pretrained or fine-tuned local model (e.g., Transformers for NER).
  4. Post-processing: Structure the model’s output.
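
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, step 1 assumes the Tesseract binary plus the `pytesseract` and `Pillow` packages are installed, and step 3 assumes the `transformers` package with its default token-classification model, which is downloaded on first use and then runs locally. The OCR and model steps are passed in as plain functions so either can be swapped for another local engine.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# An entity is a dict shaped like the output of a Hugging Face
# token-classification pipeline with aggregation_strategy="simple",
# e.g. {"entity_group": "ORG", "word": "Acme Corp", "score": 0.99}.
Entity = Dict[str, object]


def ocr_with_tesseract(image_path: str) -> str:
    """Step 1 (OCR): read text from an image with Tesseract."""
    import pytesseract  # imported lazily so the other steps work without it
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))


def clean_text(raw: str) -> str:
    """Step 2 (preprocessing): rejoin words split by a line-end hyphen,
    then collapse all runs of whitespace into single spaces."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", text).strip()


def ner_with_transformers(text: str) -> List[Entity]:
    """Step 3 (model application): run a pretrained NER model.
    Assumption: the library's default model is acceptable; in practice
    you would pass model="..." to pin a specific or fine-tuned one."""
    from transformers import pipeline
    ner = pipeline("token-classification", aggregation_strategy="simple")
    return ner(text)


def group_entities(entities: List[Entity], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Step 4 (post-processing): drop low-confidence entities and
    group the remaining unique strings by label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if float(ent["score"]) < min_score:
            continue
        word = str(ent["word"]).strip()
        label = str(ent["entity_group"])
        if word not in grouped[label]:
            grouped[label].append(word)
    return dict(grouped)


def extract(
    image_path: str,
    ocr: Callable[[str], str] = ocr_with_tesseract,
    model: Callable[[str], List[Entity]] = ner_with_transformers,
) -> Dict[str, List[str]]:
    """Run all four steps in order on one image."""
    return group_entities(model(clean_text(ocr(image_path))))
```

With the dependencies installed, `extract("invoice.png")` (a hypothetical file name) would return a dictionary mapping entity labels to the strings found. Because `ocr` and `model` are parameters, the same skeleton works with a spaCy model or a different OCR engine, as long as the model function returns entities in the shape noted at the top of the block.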
