Extract Text from PDF Images

import logging
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import HTMLResponse
from pydantic import BaseModel
from io import BytesIO
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
from abilities import upload_file_to_storage

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Ensure tesseract is installed in the system and available in the PATH
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

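def perform_ocr(image_bytes):
    """Run Tesseract OCR on raw image bytes and return the recognized text."""
    try:
        image = Image.open(BytesIO(image_bytes))
        text = pytesseract.image_to_string(image)
        return text
    except Exception:
        # The code preview is truncated here; this error handling is an assumed completion.
        logger.exception("OCR failed")
        raise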

This app lets users upload a PDF file and extracts text from both the embedded images and the native text content of the PDF. The extracted text is uploaded to cloud storage, and a link to the resulting file is returned. Extraction prioritizes text found in images and avoids duplicating content that appears in both sources. The HTML response is styled with CSS, and users can download the result file from the link shown once the PDF has been processed.
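The template's own de-duplication logic is not shown in the preview, so the following is only a minimal sketch of one way to prioritize image text and avoid repeated content. The function name `merge_extracted_text` and the line-level, case-insensitive comparison are assumptions made for illustration.

```python
from typing import Iterable, List


def merge_extracted_text(ocr_texts: Iterable[str], native_text: str) -> str:
    """Combine OCR text and native PDF text, keeping each distinct line once.

    OCR chunks are processed first, so when the same line appears in both
    sources the OCR version is the one that is kept.
    """
    seen = set()
    merged: List[str] = []
    for chunk in list(ocr_texts) + [native_text]:
        for line in chunk.splitlines():
            # Normalize whitespace and case so near-identical lines match.
            key = " ".join(line.split()).lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(line.strip())
    return "\n".join(merged)
```

A stricter or looser notion of "duplicate" (for example, fuzzy matching to tolerate OCR misreads) may suit real documents better than exact line matching.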

Introduction to the PDF Text Extractor Template

Welcome to the PDF Text Extractor template guide. This template is designed to help you build an application that allows users to upload PDF files and extract text from both images and text content within the PDF. The extracted text can then be downloaded as a plain text file. This is particularly useful for digitizing documents and automating data extraction processes.

Getting Started

To begin using this template, simply click on "Start with this Template" on the Lazy platform. This will pre-populate the code in the Lazy Builder interface, so you won't need to copy, paste, or delete any code manually.

Test: Deploying the App

Once you have started with the template, you can deploy the app by pressing the "Test" button. This will initiate the deployment process and launch the Lazy CLI. The Lazy platform handles all the deployment details, so you don't need to worry about installing libraries or setting up your environment.

Using the App

After deployment, the Lazy CLI will provide you with a dedicated server link to access the app's interface. Because this template is built with FastAPI, you will also receive a link to the API documentation.

The app's interface is a simple web page with a form where users can upload their PDF files. Once a file is uploaded, the app processes the PDF and extracts text from it. If the PDF contains images, the app will use OCR (Optical Character Recognition) to convert the image text to digital text.
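The processing step described above can be sketched roughly as follows, using the same libraries the template imports (PyMuPDF as `fitz`, Pillow, and pytesseract). This is an illustration under assumptions, not the template's actual code: the function names `ocr_image_bytes` and `extract_pdf_text` and the shape of the returned dictionaries are hypothetical, and the third-party imports are done inside the functions only so the sketch can be loaded without the OCR stack installed.

```python
from io import BytesIO
from typing import Callable, Dict, List


def ocr_image_bytes(image_bytes: bytes) -> str:
    """OCR a single image with Tesseract (requires Pillow and pytesseract)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(BytesIO(image_bytes)))


def extract_pdf_text(
    pdf_bytes: bytes,
    ocr: Callable[[bytes], str] = ocr_image_bytes,
) -> List[Dict[str, object]]:
    """Return, for each page, its native text and the OCR text of its images."""
    import fitz  # PyMuPDF

    pages: List[Dict[str, object]] = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            native_text = page.get_text()
            ocr_texts = []
            # Each entry describes one embedded image; item 0 is its xref.
            for image_info in page.get_images(full=True):
                xref = image_info[0]
                image_bytes = doc.extract_image(xref)["image"]
                ocr_texts.append(ocr(image_bytes))
            pages.append(
                {"page": page.number + 1, "native": native_text, "ocr": ocr_texts}
            )
    return pages
```

Passing the OCR function as a parameter keeps the PDF traversal separate from Tesseract, which makes it easier to swap in a different OCR engine or a stub during testing.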

Upon successful text extraction, the user will be presented with a download link to retrieve the extracted text as a plain text file.

Integrating the App

If you wish to integrate this app into an external service or frontend, you can use the server link provided by Lazy. For example, you could embed the link in another web application to allow users to access the PDF Text Extractor directly from that application.

Additionally, if you want to use the extracted text in other applications or services, you can download the text file and upload it to the desired platform manually, or automate the process by calling the app's API endpoints.
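As a rough sketch of that kind of automation, the snippet below posts a PDF to the deployed app using only the Python standard library. The endpoint path `/upload` and the form field name `file` are assumptions, since the template's routes are not shown here; check the API documentation link provided by Lazy for the actual values. The helper names `build_multipart` and `upload_pdf` are likewise hypothetical.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(
    field: str, filename: str, data: bytes, content_type: str = "application/pdf"
) -> Tuple[bytes, str]:
    """Build a single-file multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(
    server_url: str,
    pdf_path: str,
    endpoint: str = "/upload",  # assumed route, not confirmed by the template
    field: str = "file",        # assumed form field name
) -> str:
    """POST a PDF to the app and return the raw response body as text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            field, os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        server_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

The response is returned as-is because its format (HTML page or JSON containing the download link) depends on how the app's endpoint is implemented.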

That's all there is to it! With these simple steps, you can deploy and use the PDF Text Extractor app on the Lazy platform, making it easy to extract text from PDF files without any technical setup or deployment concerns.
