by Lazy Sloth
Extract Text from PDF Images
import logging
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from io import BytesIO
import fitz # PyMuPDF
from PIL import Image
import pytesseract
from abilities import upload_file_to_storage
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
# Ensure tesseract is installed in the system and available in the PATH
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
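def perform_ocr(image_bytes):
    try:
        image = Image.open(BytesIO(image_bytes))
        text = pytesseract.image_to_string(image)
        return text
    except Exception as e:
        logger.error(f"OCR failed: {e}")
        raise HTTPException(status_code=500, detail="OCR failed.")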
Frequently Asked Questions
What are some business applications for the Extract Text from PDF Images app?
The Extract Text from PDF Images app has several business applications, including:
- Document digitization for easier searching and archiving
- Converting scanned contracts or invoices into editable text
- Extracting data from financial reports or statements
- Automating data entry from paper forms
- Improving accessibility of PDF documents for visually impaired users
How can this app improve efficiency in a legal or administrative setting?
In legal or administrative settings, the Extract Text from PDF Images app can significantly improve efficiency by:
- Quickly converting scanned legal documents into searchable text
- Enabling faster review and analysis of large volumes of PDF files
- Facilitating easier extraction of key information from contracts or agreements
- Streamlining the process of creating digital archives of paper documents
- Reducing manual data entry errors and saving time in document processing
What industries could benefit most from using the Extract Text from PDF Images app?
Several industries can benefit from using this app, including:
- Legal firms for processing case documents and contracts
- Healthcare organizations for digitizing patient records and medical reports
- Financial institutions for analyzing statements and reports
- Government agencies for processing forms and official documents
- Academic institutions for digitizing research papers and archival materials
How can I modify the Extract Text from PDF Images app to save the extracted text to a database instead of cloud storage?
To save the extracted text to a database, you can modify the app by replacing the `upload_file_to_storage` function call with a database insertion. Here's an example using SQLAlchemy with PostgreSQL:
```python
from sqlalchemy import create_engine, Column, Integer, String, Text
# declarative_base lives in sqlalchemy.orm as of SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()
engine = create_engine('postgresql://username:password@localhost/dbname')
Session = sessionmaker(bind=engine)


class ExtractedText(Base):
    __tablename__ = 'extracted_texts'
    id = Column(Integer, primary_key=True)
    filename = Column(String)
    content = Column(Text)


# Create the table if it does not exist yet
Base.metadata.create_all(engine)

# In the upload_pdf function, replace the storage upload with:
session = Session()
try:
    new_entry = ExtractedText(filename=file.filename, content=text_from_pdf)
    session.add(new_entry)
    session.commit()
finally:
    # Always release the connection, even if the insert fails
    session.close()
```
This code creates a table to store the extracted text and inserts a new record for each processed PDF.
Can the Extract Text from PDF Images app be modified to support multiple languages for OCR?
Yes, the app can be modified to support multiple languages for OCR. You can use Tesseract's language support by modifying the `perform_ocr` function. Here's an example:
```python
def perform_ocr(image_bytes, lang='eng'):
    try:
        image = Image.open(BytesIO(image_bytes))
        text = pytesseract.image_to_string(image, lang=lang)
        return text
    except Exception as e:
        logger.error(f"OCR failed: {e}")
        raise HTTPException(status_code=500, detail="OCR failed.")


# In the upload_pdf function, add a language parameter:
@app.post("/upload_pdf")
async def upload_pdf(file: UploadFile = File(...), lang: str = 'eng'):
    # ... existing code ...
    text_from_pdf += perform_ocr(image_bytes, lang)
    # ... rest of the function ...
```
This modification allows users to specify the language for OCR processing. Make sure to install the appropriate language data files for Tesseract OCR to support additional languages.
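Tesseract accepts several languages joined with `+` (for example `eng+fra`), so it can help to check a user-supplied `lang` value before running OCR. The following is a minimal sketch, not part of the template: the helper name `validate_lang` is hypothetical, and the set of installed languages is passed in as an argument (in a real deployment it could plausibly come from `pytesseract.get_languages()`).

```python
def validate_lang(lang, available):
    """Return a normalized Tesseract language string, or raise ValueError.

    `lang` may combine several codes with '+', e.g. 'eng+fra'.
    `available` is the set of installed language codes (assumed to be
    supplied by the caller; this helper does not query Tesseract itself).
    """
    # Split on '+' and drop empty or whitespace-only pieces
    codes = [code.strip() for code in lang.split("+") if code.strip()]
    if not codes:
        raise ValueError("No OCR language given.")
    # Reject any code without installed language data
    missing = [code for code in codes if code not in available]
    if missing:
        raise ValueError(f"Unsupported OCR language(s): {', '.join(missing)}")
    return "+".join(codes)
```

Inside `upload_pdf`, a `ValueError` from this helper could be turned into an `HTTPException` with status code 400 so that callers get a clear message instead of a generic OCR failure.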
Introduction to the PDF Text Extractor Template
Welcome to the PDF Text Extractor template guide. This template is designed to help you build an application that allows users to upload PDF files and extract text from both images and text content within the PDF. The extracted text can then be downloaded as a plain text file. This is particularly useful for digitizing documents and automating data extraction processes.
Getting Started
To begin using this template, simply click on "Start with this Template" on the Lazy platform. This will pre-populate the code in the Lazy Builder interface, so you won't need to copy, paste, or delete any code manually.
Test: Deploying the App
Once you have started with the template, you can deploy the app by pressing the "Test" button. This will initiate the deployment process and launch the Lazy CLI. The Lazy platform handles all the deployment details, so you don't need to worry about installing libraries or setting up your environment.
Using the App
After deployment, the Lazy CLI will provide you with a dedicated server link to access the app's interface. Because this template is built on FastAPI, you will also receive a link to the automatically generated API documentation.
The app's interface is a simple web page with a form where users can upload their PDF files. Once a file is uploaded, the app processes the PDF and extracts text from it. If the PDF contains images, the app will use OCR (Optical Character Recognition) to convert the image text to digital text.
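The processing step described above can be sketched roughly as follows. This is an illustration under assumptions, not the template's actual code: the function name `extract_text_from_document` is hypothetical, `doc` is assumed to behave like a PyMuPDF document (iterable pages with `get_text()` and `get_images()`, plus `extract_image()` on the document), and `ocr` is any callable that turns image bytes into text, such as `perform_ocr`.

```python
def extract_text_from_document(doc, ocr):
    """Collect native text and OCR text from every page of a PDF.

    `doc` is expected to behave like a PyMuPDF document; `ocr` maps
    image bytes to a string. Both are supplied by the caller.
    """
    parts = []
    for page in doc:
        # Text stored directly in the PDF
        native = page.get_text()
        if native.strip():
            parts.append(native.strip())
        # Images embedded in the page; the xref is the first tuple item
        for image_info in page.get_images(full=True):
            xref = image_info[0]
            image_bytes = doc.extract_image(xref)["image"]
            ocr_text = ocr(image_bytes)
            if ocr_text.strip():
                parts.append(ocr_text.strip())
    return "\n".join(parts)
```

With PyMuPDF installed, a document could be opened from the uploaded bytes with `fitz.open(stream=pdf_bytes, filetype="pdf")` and passed in together with `perform_ocr`. Passing the OCR function as an argument keeps the page loop easy to test without Tesseract.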
Upon successful text extraction, the user will be presented with a download link to retrieve the extracted text as a plain text file.
Integrating the App
If you wish to integrate this app into an external service or frontend, you can use the server link provided by Lazy. For example, you could embed the link in another web application to allow users to access the PDF Text Extractor directly from that application.
Additionally, if you want to use the extracted text in other applications or services, you can download the text file and upload it to the desired platform, or automate the process by calling the app's API endpoints.
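As a rough sketch of that kind of automation, the snippet below posts a PDF to the app using only the Python standard library. It assumes the endpoint is `/upload_pdf` with a form field named `file`, as in the FAQ example; the helper names `build_multipart` and `upload_pdf_file` are hypothetical, and the response format is not known here, so the raw response body is returned as text. Check the API documentation link from the Lazy CLI for the actual contract.

```python
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns a tuple of (body bytes, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf_file(base_url, pdf_path):
    """POST a PDF to the app and return the raw response body as text.

    Assumes an /upload_pdf endpoint that accepts a form field named `file`.
    """
    with open(pdf_path, "rb") as handle:
        body, header = build_multipart("file", "document.pdf", handle.read())
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/upload_pdf",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Here `base_url` would be the server link provided by the Lazy CLI after deployment.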
That's all there is to it! With these simple steps, you can deploy and use the PDF Text Extractor app on the Lazy platform, making it easy to extract text from PDF files without any technical setup or deployment concerns.
Template Benefits
This template offers five key business benefits:
- Efficient Document Digitization: This template enables businesses to quickly convert image-based PDFs into searchable text, streamlining document management processes and improving information accessibility.
- Enhanced Data Extraction: By combining OCR technology with direct text extraction, the template ensures comprehensive data capture from various PDF types, supporting better data analysis and decision-making.
- Improved Compliance and Auditing: The ability to extract and store text from PDFs facilitates easier compliance checks and auditing processes, particularly useful in industries with strict documentation requirements.
- Cost-Effective Information Processing: By automating the text extraction process, businesses can significantly reduce manual data entry costs and minimize human errors associated with transcription.
- Seamless Integration Potential: The template's API-based approach allows for easy integration with existing business systems, enabling automated workflows for document processing and information management across various departments.