- Date
Before you begin, make sure Anaconda is installed. If it is not, download it from the Anaconda website.
To get started with this project, first clone the repository to your local machine. You have two options: the command line or GitHub Desktop. From the command line, run:
git clone https://github.com/Emidouni/Invoice-Information-Extraction
For a more graphical interface, you can use GitHub Desktop:
- 1. Download and install GitHub Desktop from desktop.github.com.
- 2. Open GitHub Desktop and sign in to your GitHub account.
- 3. Click on File > Clone Repository. In the "URL" tab, enter the repository URL https://github.com/Emidouni/Invoice-Information-Extraction and choose the local path where you want to clone the repository.
- 4. Click Clone to start the cloning process.
After installing Anaconda, create a new Conda environment specifically for this project. This helps manage dependencies and avoids conflicts with other projects.
Open the Anaconda Prompt or your terminal (make sure Conda is added to your PATH) and run:
conda create --name myenv python=3.9.20
Replace myenv with your preferred name for the environment. This command creates a new Conda environment named myenv with Python version 3.9.20. Activate the environment with:
conda activate myenv
After activating the environment, install the other required packages listed in the project's requirements.txt:
pip install -r requirements.txt
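As an optional sanity check after the install, the sketch below reports the active Python version and lists any packages named in requirements.txt that are not installed in the current environment. It is not part of the project: the helper names `requirement_names` and `missing_packages` are hypothetical, and the parsing only handles simple `name==version` style lines.

```python
import re
import sys
from importlib import metadata
from pathlib import Path


def requirement_names(lines):
    """Extract bare distribution names from requirements-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        # The name is the leading run of characters before any version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    req = Path("requirements.txt")
    if req.exists():
        absent = missing_packages(requirement_names(req.read_text(encoding="utf-8").splitlines()))
        print("Missing packages:", ", ".join(absent) if absent else "none")
    else:
        print("requirements.txt not found; run this from the repository root")
```

Run it from the repository root with the Conda environment activated. If anything is reported as missing, rerunning the pip command above is usually enough.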
In addition to the libraries listed in `requirements.txt`, you need to install Tesseract OCR to test the different OCR tools and to run the OCR.ipynb Jupyter notebook.
Tesseract OCR is an open-source Optical Character Recognition (OCR) engine used for text recognition in images.
- Download Tesseract OCR:
  - Windows Users: Download the installer from UB Mannheim and follow the installation instructions provided on the website.
- Locate the Tesseract Installation Path:
  - After installation, locate where Tesseract OCR has been installed on your machine. The default installation path on Windows is usually `C:\Program Files\Tesseract-OCR\tesseract.exe`.
- Update the Script with Your Tesseract Path:
  - In the project's script `utils.py`, where Tesseract OCR is used, locate the line that sets the `tesseract_cmd` property of `pytesseract`. Replace the existing path with the actual path to your Tesseract installation. For example, change the line from:
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    to:
    pytesseract.pytesseract.tesseract_cmd = r"C:\Path\To\Your\tesseract.exe"
  - Make sure to replace `C:\Path\To\Your\tesseract.exe` with the path where Tesseract OCR is installed on your system.
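If you would rather not hard-code the path, the sketch below shows one possible way to locate the executable before assigning `tesseract_cmd`. It is an illustration, not how the project's `utils.py` works: the function `find_tesseract` and the `TESSERACT_CMD` environment variable are hypothetical names introduced here.

```python
import os
import shutil

# Default Windows location mentioned above.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return the first usable Tesseract executable path, or None."""
    # An explicit override wins. TESSERACT_CMD is a hypothetical variable name.
    override = os.environ.get("TESSERACT_CMD")
    if override and os.path.isfile(override):
        return override
    # Next, try the known candidate locations in order.
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Finally, fall back to whatever "tesseract" resolves to on PATH.
    return shutil.which("tesseract")


if __name__ == "__main__":
    cmd = find_tesseract()
    if cmd is None:
        print("Tesseract not found; set the path manually in utils.py")
    else:
        try:
            import pytesseract
        except ImportError:
            print("Found Tesseract at", cmd, "but pytesseract is not installed")
        else:
            pytesseract.pytesseract.tesseract_cmd = cmd
            print("Tesseract version:", pytesseract.get_tesseract_version())
```

Running the script prints the detected Tesseract version when both the executable and `pytesseract` are available, which is a quick way to confirm that the path is correct.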
To launch the Streamlit application, run:
streamlit run app2.py