Cloud-Based Pretrained Models
- Google Cloud Document AI: Offers pretrained models for various document types (invoices, receipts, IDs, etc.), covering key-value pair extraction, table extraction, and classification.
- Amazon Textract (AWS): Provides pretrained models for OCR, key-value pair extraction, and table extraction from documents and images.
- Azure AI Document Intelligence (formerly Form Recognizer): Offers pretrained models for common document types and supports custom model training.
- Oracle Document Understanding: Provides pretrained models for invoice automation, document analysis, and HR document processing.
Open-Source Libraries and Models (Suitable for Non-Cloud Implementation)
- Tesseract OCR: Open-source OCR engine for extracting text from images.
- spaCy: Python library for NLP with NER capabilities that can be used for information extraction from OCR output. Offers pretrained language models.
- Transformers (Hugging Face): Library with thousands of pretrained transformer models that can be fine-tuned for document processing tasks.
- Unstructured (from Unstructured.io): Open-source library for preprocessing and extracting elements from various document types.
- DeepPavlov: Open-source NLP library with components for entity recognition and information extraction.
- GROBID (GeneRation Of BIbliographic Data): Tool built specifically for extracting bibliographic information from academic PDFs.
- Camelot: Python library for extracting tables from PDF files.
- PDF-Extract-Kit: Toolkit for PDF content extraction, including layout detection, OCR, and table recognition models.
- Docling: Library for parsing diverse document formats for GenAI applications.
- LLMAIx: GitHub project for document information extraction and anonymization using local LLMs.
- Zerox: Node.js SDK for OCR and structured data extraction from documents using vision models.
- deepdoctection: Python library orchestrating document extraction and layout analysis using deep learning models.
- DocQuery: Library and CLI tool using LLMs to analyze semi-structured and unstructured documents.
- Infosys/Document-Extraction-Libraries: Python/Java libraries for extracting information from various document formats.
- Parsee: Open-source framework for data extraction and structuring from unstructured data using LLMs and custom models.
Choosing for Non-Cloud
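When working without cloud services, focus on the open-source options. The best choice depends on your document types, how complex the extraction is, the compute you have available (the deep-learning and LLM-based tools are generally heavier than plain OCR or rule-based table extraction), and the language you work in (most of the tools above are Python libraries, while Zerox is a Node.js SDK).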
Non-Cloud Workflow Example
- OCR: Use Tesseract or a local OCR engine.
- Preprocessing: Clean and prepare the extracted text.
- Model Application: Load and run a pretrained or fine-tuned local model (e.g., Transformers for NER).
- Post-processing: Structure the model’s output.
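The four steps above can be sketched as a small pipeline. This is a minimal illustration, not a definitive implementation: the function names (`clean_ocr_text`, `structure_entities`, `extract`) and the stub OCR and NER steps are hypothetical and exist only so the example runs without extra dependencies. In a real setup, the `ocr` callable might wrap `pytesseract.image_to_string`, and the `ner` callable might wrap a Hugging Face `pipeline("token-classification", aggregation_strategy="simple")`, whose results are dicts with `entity_group`, `word`, and `score` keys. The stub below assumes that same shape.

```python
import re
import unicodedata
from typing import Callable, Dict, List

# Pluggable steps (hypothetical aliases): swap in real OCR / model wrappers.
OcrFn = Callable[[str], str]         # image path -> raw OCR text
NerFn = Callable[[str], List[dict]]  # text -> list of entity dicts


def clean_ocr_text(raw: str) -> str:
    """Preprocessing: normalize Unicode, rejoin words split by a
    hyphen at a line break, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Simplification: this also joins genuine hyphenated compounds
    # that happen to break across lines.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def structure_entities(entities: List[dict],
                       min_score: float = 0.5) -> Dict[str, List[str]]:
    """Post-processing: group entity text by label, skipping
    low-confidence hits and duplicates."""
    grouped: Dict[str, List[str]] = {}
    for ent in entities:
        if ent["score"] < min_score:
            continue
        words = grouped.setdefault(ent["entity_group"], [])
        if ent["word"] not in words:
            words.append(ent["word"])
    return grouped


def extract(image_path: str, ocr: OcrFn, ner: NerFn) -> Dict[str, List[str]]:
    """Run OCR -> preprocessing -> model -> post-processing."""
    text = clean_ocr_text(ocr(image_path))
    return structure_entities(ner(text))


# Stand-ins so the sketch runs without Tesseract or a model installed.
def fake_ocr(_path: str) -> str:
    return "Invoice from Acme\u00a0Corp,  Ber-\nlin"


def fake_ner(text: str) -> List[dict]:
    known = {"Acme Corp": "ORG", "Berlin": "LOC"}
    return [
        {"entity_group": label, "word": name, "score": 0.99}
        for name, label in known.items()
        if name in text
    ]


if __name__ == "__main__":
    print(extract("invoice.png", fake_ocr, fake_ner))
    # {'ORG': ['Acme Corp'], 'LOC': ['Berlin']}
```

Passing the OCR and model steps in as callables keeps the cleaning and structuring logic independent of any particular engine, so Tesseract or the NER model can be swapped out without touching the rest of the pipeline.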