Non-Cloud LLM for Document Extraction
This guide explains how to use a non-cloud version of a pretrained Large Language Model (LLM) for document extraction, focusing on open-source models and local execution.
Phase 1: Setting Up Your Local Environment
1. Hardware Requirements
Ensure your system meets the following recommendations:
- CPU/GPU: An NVIDIA GPU with sufficient VRAM (8GB+) is highly recommended for faster inference. Otherwise, a powerful multi-core CPU is necessary.
- RAM: Adequate system RAM to load the model and process documents. Requirements vary by model size.
- Storage: Sufficient disk space to store the LLM weights (can range from GBs to hundreds of GBs).
2. Software Installation
Install the necessary software and Python libraries:
pip install pip --upgrade
pip install torch torchvision torchaudio # For PyTorch (if your model uses it)
pip install tensorflow # For TensorFlow (if your model uses it)
pip install transformers sentencepiece accelerate # Hugging Face Transformers
pip install PyPDF2 Pillow opencv-python # For document handling
pip install pytesseract # For OCR (if needed)
# Optional:
# Follow instructions for llama.cpp or Ollama from their respective sites.
3. Downloading a Pretrained LLM
Download a suitable open-source LLM from the Hugging Face Hub. Consider models like Mixtral, Llama 2, or smaller models depending on your resources.
Using transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Replace with your chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move the model to the GPU if one is available
if torch.cuda.is_available():
    model = model.to("cuda")

print(f"Model '{model_name}' loaded successfully.")
Using Ollama:
ollama pull mistral:latest
echo "Mistral model pulled successfully (if Ollama is installed)."
Phase 2: Implementing the Document Extraction Workflow
1. Loading Your Document
Load the document content based on its format:
PDF:
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PdfReader(file)
        for page in reader.pages:
            # extract_text() can return None for pages without an extractable text layer
            text += page.extract_text() or ""
    return text

pdf_text = extract_text_from_pdf("your_document.pdf")
print("PDF text extracted.")
Image (Scanned Document):
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed on the system

def extract_text_from_image(image_path):
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img)
    return text

image_text = extract_text_from_image("scanned_document.png")
print("Text extracted from image.")
Text File:
with open("document.txt", "r", encoding="utf-8") as f:
text_file_content = f.read()
print("Text file content loaded.")
2. Preparing the Input Prompt
Create a clear and concise prompt to instruct the LLM on what information to extract and the desired output format (e.g., JSON).
# Use whichever document source was loaded above
if 'pdf_text' in locals():
    document_text = pdf_text
elif 'image_text' in locals():
    document_text = image_text
elif 'text_file_content' in locals():
    document_text = text_file_content
else:
    document_text = ""
prompt = f"""Extract the following information from the document:
- Invoice Number
- Date
- Customer Name
- Total Amount
Document:
{document_text}
Output the information as a JSON object.
"""
print("Prompt created.")
3. Interacting with the Local LLM
Send the prompt to your loaded LLM for processing.
Using transformers:
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = inputs.to("cuda")

outputs = model.generate(**inputs, max_new_tokens=500)
# Note: the decoded output contains the original prompt followed by the model's answer
extracted_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nLLM Output:")
print(extracted_text)

import json
try:
    # Try to pull a JSON object out of a fenced code block in the model output
    extracted_data = json.loads(extracted_text.split("```json")[-1].split("```")[0].strip())
    print("\nExtracted Data (JSON):")
    print(extracted_data)
except (json.JSONDecodeError, IndexError):
    print("\nCould not parse JSON from the output.")
    print("Raw output:", extracted_text)
Using Ollama:
import requests
import json

data = {
    "prompt": prompt,
    "model": "mistral:latest",  # Ensure Ollama is running and this model is pulled
    "stream": False
}

response = requests.post('http://localhost:11434/api/generate', json=data)

if response.status_code == 200:
    extracted_text = response.json()['response'].strip()
    print("\nLLM Output:")
    print(extracted_text)
    try:
        extracted_data = json.loads(extracted_text)
        print("\nExtracted Data (JSON):")
        print(extracted_data)
    except json.JSONDecodeError:
        print("\nCould not parse JSON from the output.")
        print("Raw output:", extracted_text)
else:
    print(f"Ollama API error: {response.status_code} - {response.text}")
Phase 3: Considerations for Non-Cloud LLM Usage
- Resource Management: Monitor your system resources, especially GPU and RAM usage.
- Model Selection: Experiment with different open-source models to find the best balance of performance and resource consumption.
- Prompt Engineering: Refine your prompts for better accuracy and desired output format.
- Fine-tuning: Consider fine-tuning on your specific document types for improved results.
- Chunking: For large documents, implement chunking strategies to fit within the model’s context window (see the sketch after this list).
- Error Handling: Add error handling for file operations, OCR, and JSON parsing.
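As a starting point for chunking, here is a minimal sketch; the 3,000-character chunk size and 200-character overlap are arbitrary assumptions that should be tuned to your model's context window:

def chunk_text(text, chunk_size=3000, overlap=200):
    """Split text into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

results = []
for chunk in chunk_text(document_text):
    chunk_prompt = f"""Extract the following information from the document:
- Invoice Number
- Date
- Customer Name
- Total Amount
Document:
{chunk}
Output the information as a JSON object."""
    # Send chunk_prompt to the model as in Phase 2, step 3,
    # then append the parsed JSON (if any) to results.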