Pretrained Models for Document Extraction

Cloud-Based Pretrained Models

Google Cloud Document AI: Offers pretrained models (which it calls processors) for various document types (invoices, receipts, IDs, etc.), covering key-value pair extraction, table extraction, and classification.
Amazon Textract (AWS): Provides pretrained models for OCR, key-value pair extraction, and table extraction from documents and images.
Azure AI Document Intelligence (formerly Form Recognizer): Offers pretrained models for common document types and supports custom model training.
Oracle Document Understanding: Provides pretrained models for invoice automation, document analysis, and HR document processing.
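
Key-value results from these services usually arrive as a graph of linked blocks rather than a flat dictionary. As an illustration, here is a minimal sketch of flattening an Amazon Textract form response into key-value pairs. It assumes the documented response shape (KEY_VALUE_SET blocks linked to WORD blocks through Relationships); the helper names and the sample response are hypothetical and hand-written, not real service output.

```python
from typing import Dict, List


def related_ids(block: dict, rel_type: str) -> List[str]:
    """Return the IDs a block points to through relationships of one type."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == rel_type
        for block_id in rel["Ids"]
    ]


def text_of(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a block into a single string."""
    children = (by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(response: dict) -> Dict[str, str]:
    """Flatten the KEY and VALUE blocks of a form response into a dict."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if is_key:
            values = (by_id[i] for i in related_ids(block, "VALUE"))
            pairs[text_of(block, by_id)] = " ".join(text_of(v, by_id) for v in values)
    return pairs


# Hand-written sample (assumption): one "Invoice Number:" field.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "INV-001"},
    ]
}

if __name__ == "__main__":
    print(key_value_pairs(SAMPLE_RESPONSE))
```

With AWS credentials configured, a real response could be obtained with boto3's analyze_document call on a Textract client, passing FeatureTypes=["FORMS"]. The other services return their own response formats, so this parser would not carry over unchanged.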

Open-Source Libraries and Models (Suitable for Non-Cloud Implementation)

Tesseract OCR: Open-source OCR engine for extracting text from images.
spaCy: Python library for NLP with NER capabilities that can be used for information extraction from OCR output. Offers pretrained language models.
Transformers (Hugging Face): Library with thousands of pretrained transformer models that can be fine-tuned for document processing tasks.
Unstructured.io: Open-source library for preprocessing and extracting elements from various document types.
DeepPavlov: Open-source NLP library with components for entity recognition and information extraction.
GROBID (GeneRation Of BIbliographic Data): Specifically for extracting bibliographic information and document structure from academic PDFs.
Camelot: Python library for extracting tables from text-based PDF files (scanned PDFs need OCR first).
PDF-Extract-Kit: Toolkit for PDF content extraction, including layout detection, OCR, and table recognition models.
Docling: Library for parsing diverse document formats for GenAI applications.
LLMAIx: GitHub project for document information extraction and anonymization using local LLMs.
Zerox: SDK (available as Node.js and Python packages) for OCR and structured data extraction from documents using vision models.
deepdoctection: Python library orchestrating document extraction and layout analysis using deep learning models.
DocQuery: Library and CLI tool using LLMs to analyze semi-structured and unstructured documents.
Infosys/Document-Extraction-Libraries: Python/Java libraries for extracting information from various document formats.
Parsee: Open-source framework for data extraction and structuring from unstructured data using LLMs and custom models.

Choosing for Non-Cloud

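When working without cloud services, focus on the open-source options. The best choice depends on the document types (scanned images need OCR, while text-based PDFs can often be parsed directly), the complexity of the extraction, the available compute (transformer and LLM-based tools generally benefit from a GPU), and the programming languages you are comfortable with (most of the libraries above are Python).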

Non-Cloud Workflow Example

OCR: Use Tesseract or a local OCR engine.
Preprocessing: Clean and prepare the extracted text.
Model Application: Load and run a pretrained or fine-tuned local model (e.g., Transformers for NER).
Post-processing: Structure the model’s output.
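
The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes pytesseract (with the Tesseract binary installed) for OCR and the Hugging Face Transformers token-classification pipeline for NER. The model name dslim/bert-base-NER, the function names, and the 0.5 score threshold are illustrative choices. The heavy libraries are imported lazily so the cleaning and structuring steps can be tried on their own.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List


def ocr_image(path: str) -> str:
    """Step 1 (OCR): read text from an image with Tesseract."""
    # Lazy imports: needs the Tesseract binary plus pytesseract and Pillow.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


def clean_text(raw: str) -> str:
    """Step 2 (preprocessing): undo common OCR layout artifacts."""
    # Re-join words split across lines (may also join genuine hyphenated words).
    text = re.sub(r"-\n(?=\w)", "", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim whitespace around line breaks and drop blank lines.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


def load_ner(model_name: str = "dslim/bert-base-NER") -> Callable[[str], List[dict]]:
    """Step 3 (model application): build a local NER callable."""
    # Lazy import; the model is downloaded once, then served from the local cache.
    from transformers import pipeline

    return pipeline(
        "token-classification", model=model_name, aggregation_strategy="simple"
    )


def structure_entities(
    entities: List[dict], min_score: float = 0.5
) -> Dict[str, List[str]]:
    """Step 4 (post-processing): group unique entity strings by label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop low-confidence predictions
        words = grouped[ent["entity_group"]]
        if ent["word"] not in words:
            words.append(ent["word"])
    return dict(grouped)


def extract(raw_text: str, ner: Callable[[str], List[dict]]) -> Dict[str, List[str]]:
    """Run steps 2 to 4 on text that has already been through OCR."""
    return structure_entities(ner(clean_text(raw_text)))
```

With the optional dependencies installed, a full run would look like extract(ocr_image("invoice.png"), load_ner()), where invoice.png is a placeholder path. Passing the NER step in as a callable keeps the pipeline independent of any one model, so a spaCy pipeline or a fine-tuned model could be swapped in by wrapping it to return the same entity dictionaries.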
