Non-Cloud LLM for Document Extraction
This guide explains how to use a non-cloud version of a pretrained Large Language Model (LLM) for document extraction, focusing on open-source models and local execution.
Phase 1: Setting Up Your Local Environment
1. Hardware Requirements
Ensure your system meets the following recommendations:
- CPU/GPU: An NVIDIA GPU with sufficient VRAM (8GB+) is highly recommended for faster inference. Otherwise, a powerful multi-core CPU is necessary.
- RAM: Adequate system RAM to load the model and process documents. Requirements vary by model size.
- Storage: Sufficient disk space to store the LLM weights (can range from GBs to hundreds of GBs).
2. Software Installation
Install the necessary software and Python libraries:
pip install pip --upgrade
pip install torch torchvision torchaudio # For PyTorch (if your model uses it)
pip install tensorflow # For TensorFlow (if your model uses it)
pip install transformers sentencepiece accelerate # Hugging Face Transformers
pip install PyPDF2 Pillow opencv-python # For document handling
pip install pytesseract # For OCR (if needed)
# Optional:
# Follow instructions for llama.cpp or Ollama from their respective sites.
3. Downloading a Pretrained LLM
Download a suitable open-source LLM from the Hugging Face Hub. Consider models like Mixtral, Llama 2, or smaller models depending on your resources.
Using transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Replace with your chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to the GPU if one is available
if torch.cuda.is_available():
    model = model.to("cuda")

print(f"Model '{model_name}' loaded successfully.")
Using Ollama:
ollama pull mistral:latest
echo "Mistral model pulled successfully (if Ollama is installed)."
Phase 2: Implementing the Document Extraction Workflow
1. Loading Your Document
Load the document content based on its format:
PDF:
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)
        for page in reader.pages:
            # extract_text() can return None for pages without an extractable text layer
            text += page.extract_text() or ""
    return text

pdf_text = extract_text_from_pdf("your_document.pdf")
print("PDF text extracted.")
Image (Scanned Document):
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed on the system

def extract_text_from_image(image_path):
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img)
    return text

image_text = extract_text_from_image("scanned_document.png")
print("Text extracted from image.")
Text File:
with open("document.txt", "r", encoding="utf-8") as f:
text_file_content = f.read()
print("Text file content loaded.")
2. Preparing the Input Prompt
Create a clear and concise prompt to instruct the LLM on what information to extract and the desired output format (e.g., JSON).
# Use whichever document source was loaded above
if 'pdf_text' in locals():
    document_text = pdf_text
elif 'image_text' in locals():
    document_text = image_text
elif 'text_file_content' in locals():
    document_text = text_file_content
else:
    document_text = ""
prompt = f"""Extract the following information from the document:
- Invoice Number
- Date
- Customer Name
- Total Amount
Document:
{document_text}
Output the information as a JSON object.
"""
print("Prompt created.")
3. Interacting with the Local LLM
Send the prompt to your loaded LLM for processing.
Using transformers:
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = inputs.to("cuda")

outputs = model.generate(**inputs, max_new_tokens=500)
# Note: the decoded output contains the original prompt followed by the model's answer
extracted_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nLLM Output:")
print(extracted_text)

import json
try:
    # Try to pull a JSON object out of a fenced code block in the model output
    extracted_data = json.loads(extracted_text.split("```json")[-1].split("```")[0].strip())
    print("\nExtracted Data (JSON):")
    print(extracted_data)
except (json.JSONDecodeError, IndexError):
    print("\nCould not parse JSON from the output.")
    print("Raw output:", extracted_text)
Using Ollama:
import requests
import json

data = {
    "prompt": prompt,
    "model": "mistral:latest",  # Ensure Ollama is running and this model is pulled
    "stream": False
}

response = requests.post('http://localhost:11434/api/generate', json=data)

if response.status_code == 200:
    extracted_text = response.json()['response'].strip()
    print("\nLLM Output:")
    print(extracted_text)
    try:
        extracted_data = json.loads(extracted_text)
        print("\nExtracted Data (JSON):")
        print(extracted_data)
    except json.JSONDecodeError:
        print("\nCould not parse JSON from the output.")
        print("Raw output:", extracted_text)
else:
    print(f"Ollama API error: {response.status_code} - {response.text}")
Phase 3: Considerations for Non-Cloud LLM Usage
- Resource Management: Monitor your system resources, especially GPU and RAM usage.
- Model Selection: Experiment with different open-source models to find the best balance of performance and resource consumption.
- Prompt Engineering: Refine your prompts for better accuracy and desired output format.
- Fine-tuning: Consider fine-tuning on your specific document types for improved results.
- Chunking: For large documents, implement chunking strategies to fit within the model’s context window (see the sketch after this list).
- Error Handling: Add error handling for file operations, OCR, and JSON parsing.
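As a starting point for chunking, here is a minimal sketch; the 3,000-character chunk size and 200-character overlap are arbitrary assumptions that should be tuned to your model's context window:

def chunk_text(text, chunk_size=3000, overlap=200):
    """Split text into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

results = []
for chunk in chunk_text(document_text):
    chunk_prompt = f"""Extract the following information from the document:
- Invoice Number
- Date
- Customer Name
- Total Amount
Document:
{chunk}
Output the information as a JSON object."""
    # Send chunk_prompt to the model as in Phase 2, step 3,
    # then append the parsed JSON (if any) to results.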