Implement reading from pdf on s3 bucket
anna-lybid committed Nov 26, 2023
1 parent b88fb36 commit 6690336
Showing 6 changed files with 69 additions and 2 deletions.
Binary file added Images/s3-bucket-with-cv.jpg
23 changes: 23 additions & 0 deletions README.md
@@ -1,2 +1,25 @@
# ocr-textbook
Optical Character Recognition using Python

# Technologies

- Python
- Amazon S3 - AWS
- Git
- Tesseract OCR
- Pillow
- PyMuPDF

# Installation

Python 3.7+ is required.

```
git clone https://github.com/anna-lybid/planetarium-api-project.git
cd planetarium-api-project
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python manage.py migrate
python manage.py runserver
```
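
The installation steps above do not cover the Tesseract binary itself, which `pytesseract` calls as an external program. A minimal sketch of a pre-flight check, assuming the executable is named `tesseract` and is expected on `PATH`; the helper name `find_tesseract` is hypothetical and not part of this repository:

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # pytesseract only wraps the binary; OCR calls fail without it.
        print("Tesseract was not found on PATH; install it before running the scripts.")
    else:
        print(f"Using Tesseract at {path}")
```

If the binary lives outside `PATH`, `pytesseract.pytesseract.tesseract_cmd` can be set to its location instead.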
2 changes: 0 additions & 2 deletions app/main.py

This file was deleted.

20 changes: 20 additions & 0 deletions main.py
@@ -0,0 +1,20 @@
import cv2 as cv
from PIL import Image
import pytesseract

image_file = "Images/s3-bucket-with-cv.jpg"

image = cv.imread(image_file)

if image is not None:
    cv.imshow("Image", image)
    cv.waitKey(0)
    cv.destroyAllWindows()
else:
    print("Image is not found or could not be opened.")

img = Image.open(image_file)

text = pytesseract.image_to_string(img)

print(text)
26 changes: 26 additions & 0 deletions read_cv.py
@@ -0,0 +1,26 @@
import boto3
import fitz
import pytesseract
from PIL import Image

bucket_name = "anna-lybid-s3-demo"
key = "My CV/CV. Anna Lybid. Python developer.pdf"

s3 = boto3.client("s3")

response = s3.get_object(Bucket=bucket_name, Key=key)

pdf_data = response["Body"].read()

pdf_document = fitz.open(stream=pdf_data, filetype='pdf')

first_page = pdf_document.load_page(0)
image = first_page.get_pixmap()

image_path = "Images/converted_from_pdf.png"
image.save(image_path)

img = Image.open(image_path)
text = pytesseract.image_to_string(img)

print(text)
Binary file modified requirements.txt
Binary file not shown.
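
The flow in `read_cv.py` (fetch the PDF bytes, render a page, run OCR) could be extended to every page and kept in memory rather than going through a PNG on disk. A minimal sketch, assuming a PyMuPDF release whose `get_pixmap` accepts the `dpi` argument; the function names `join_page_texts` and `pdf_bytes_to_text` and the 300 DPI default are illustrative choices, not part of this commit:

```python
import io
from typing import Iterable


def join_page_texts(texts: Iterable[str]) -> str:
    """Join per-page OCR output, dropping pages that produced only whitespace."""
    cleaned = [t.strip() for t in texts]
    return "\n\n".join(t for t in cleaned if t)


def pdf_bytes_to_text(pdf_data: bytes, dpi: int = 300) -> str:
    """OCR every page of an in-memory PDF and return the combined text."""
    # Imported here so join_page_texts stays usable without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    page_texts = []
    with fitz.open(stream=pdf_data, filetype="pdf") as document:
        for page in document:
            # Render at a higher resolution than the default; assumed to help OCR quality.
            pixmap = page.get_pixmap(dpi=dpi)
            # Hand the rendered page to Pillow as PNG bytes, without a temporary file.
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            page_texts.append(pytesseract.image_to_string(image))
    return join_page_texts(page_texts)
```

With the variables from `read_cv.py`, this would be called as `text = pdf_bytes_to_text(response["Body"].read())`.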