Feat: Find bounding box for each section in the image #44

Open. Wants to merge 24 commits into base: main.
Commits
db381b5
feat: Add prompt for getting segmented markdown
getwithashish Sep 20, 2024
b27294c
refactor: Rename the prompt for segmenting markdown
getwithashish Sep 20, 2024
2a0e133
feat: Append system prompt for getting segmented markdown, if boundin…
getwithashish Sep 20, 2024
1916b16
feat: Clean OCR data from image
getwithashish Sep 20, 2024
6fe805a
feat: Get OCR data from the image
getwithashish Sep 20, 2024
75e9230
feat: Find matching substring from OCR data using Levenshtein distance
getwithashish Sep 20, 2024
a965cd1
feat: Calculate bounding box enclosing the matched substring
getwithashish Sep 20, 2024
c01af47
feat: Specify the section delimiter used in the markdown
getwithashish Sep 20, 2024
c8dce48
feat: Add error messages for OCR and Bounding Box
getwithashish Sep 20, 2024
2821f86
feat: Add Section to include the various sections in the markdown, al…
getwithashish Sep 20, 2024
57fdcc6
feat: Functionality to remove markdown format from the specified text
getwithashish Sep 20, 2024
4944d07
feat: Perform OCR and find the bounding box for each section, if boun…
getwithashish Sep 20, 2024
edc09ef
feat: Add sections to the Page, if the bounding_box param is set to True
getwithashish Sep 20, 2024
327dc89
chore: Add dependencies in pyproject.toml
getwithashish Sep 20, 2024
93b8bab
docs: Update README.md with bounding_box details
getwithashish Sep 20, 2024
72e245e
docs: Update README.md by adding `bounding_box` param
getwithashish Sep 20, 2024
75e42e9
build: Script for pre-installing Tesseract
getwithashish Sep 20, 2024
94bbee6
build: Update package metadata for py-zerox
getwithashish Sep 20, 2024
5403b34
docs: Update README.md to install Tesseract
getwithashish Sep 20, 2024
cd57ff0
feat: Normalize the bounding box coordinates
getwithashish Sep 20, 2024
5585137
feat: Include image dimensions in the OCR data
getwithashish Sep 20, 2024
ac5e0de
docs: Update docstring in bounding_box.py
getwithashish Sep 20, 2024
ab90e2f
docs: Update README.md for normalized bounding box
getwithashish Sep 20, 2024
be1a099
refactor: Specify the correct return type of process_page() method
getwithashish Sep 24, 2024
feat: Clean OCR data from image
getwithashish committed Sep 20, 2024
commit 1916b1681cec689f079f02ee027575c46e0c71ac
62 changes: 62 additions & 0 deletions py_zerox/pyzerox/processor/ocr.py
@@ -0,0 +1,62 @@
from typing import Dict
from PIL import Image
import pytesseract

from py_zerox.pyzerox.constants.messages import Messages


def enhance_image_for_ocr(image: Image.Image) -> Image.Image:
    """
    Enhances the given image for Optical Character Recognition.
    Converts the image to grayscale.

    Args:
        image (Image.Image): The input image to be enhanced.

    Returns:
        Image.Image: The enhanced grayscale image ready for OCR processing.
    """
    # "L" is Pillow's 8-bit single-channel (grayscale) mode.
    return image.convert("L")


async def _clean_ocr_text(data: Dict[str, list]) -> Dict[str, list]:
    """
    Processes the input data dictionary containing OCR results,
    filtering out entries with a non-positive confidence score or blank text.

    Args:
        data (dict): A dictionary containing OCR results:
            - 'text': A list of recognized text strings.
            - 'conf': A list of confidence scores corresponding to each text.
            - 'left': A list of x-coordinates for the text bounding boxes.
            - 'top': A list of y-coordinates for the text bounding boxes.
            - 'width': A list of widths for the text bounding boxes.
            - 'height': A list of heights for the text bounding boxes.

    Returns:
        dict: A dictionary containing filtered lists of text and attributes:
            - 'text_list': A list of valid text strings.
            - 'left_list': A list of x-coordinates for the text bounding boxes.
            - 'top_list': A list of y-coordinates for the text bounding boxes.
            - 'width_list': A list of widths for the text bounding boxes.
            - 'height_list': A list of heights for the text bounding boxes.
    """
    data_lists = {
        "text_list": [],
        "left_list": [],
        "top_list": [],
        "width_list": [],
        "height_list": [],
    }

    for i, text in enumerate(data["text"]):
        # float() rather than int(): the confidence may arrive as a float or
        # a string such as "96.5", which int() would reject with ValueError.
        if float(data["conf"][i]) > 0 and text.strip():
            data_lists["text_list"].append(text)
            data_lists["left_list"].append(data["left"][i])
            data_lists["top_list"].append(data["top"][i])
            data_lists["width_list"].append(data["width"][i])
            data_lists["height_list"].append(data["height"][i])

    return data_lists
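
To make the filtering rule in `_clean_ocr_text` easier to review, here is a minimal standalone sketch of the same idea. It is not the PR's code: `clean_ocr_data` is a hypothetical synchronous helper, and `sample` is hand-written data that only imitates the shape of the dictionary pytesseract's `image_to_data` returns with `output_type=Output.DICT`. The sketch assumes that confidence values may arrive as integers, floats, or numeric strings, so it parses them with `float()`.

```python
from typing import Dict

# Hypothetical helper for illustration; mirrors the filter in _clean_ocr_text.
def clean_ocr_data(data: Dict[str, list]) -> Dict[str, list]:
    """Keep only entries with a positive confidence and non-blank text."""
    fields = ("text", "left", "top", "width", "height")
    cleaned: Dict[str, list] = {f"{name}_list": [] for name in fields}
    for i, text in enumerate(data["text"]):
        # Assumption: conf may be an int, a float, or a numeric string.
        if float(data["conf"][i]) > 0 and text.strip():
            for name in fields:
                cleaned[f"{name}_list"].append(data[name][i])
    return cleaned


# Hand-written sample, not real OCR output.
sample = {
    "text": ["", "Hello", "   ", "world", "noise"],
    "conf": ["-1", "96.5", "95", 91, "0"],
    "left": [0, 10, 60, 70, 130],
    "top": [0, 12, 12, 12, 12],
    "width": [200, 45, 5, 50, 8],
    "height": [50, 14, 14, 14, 14],
}

cleaned = clean_ocr_data(sample)
print(cleaned["text_list"])  # ['Hello', 'world']
print(cleaned["left_list"])  # [10, 70]
```

The empty entry (confidence -1), the whitespace-only entry, and the zero-confidence entry are all dropped, so the five parallel lists stay aligned index by index for the bounding-box step that follows.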