python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
- Create a folder raw_data
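For example, from the project root:

mkdir raw_data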
If you have a CSV file with at least two columns, one giving the filename for each row and one giving the URL of the file for each row, you can scrape the CSV and download the files into raw_data, naming them according to the filename column (see the example after the options below):
python3 utils/scraping_files.py [-h] [-csv CSV] [-column_url COLUMN_URL] [-column_filename COLUMN_FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  -csv CSV              CSV file containing the URLs of the files
  -column_url COLUMN_URL
                        column of the CSV giving the URL of each PDF (one URL per row)
  -column_filename COLUMN_FILENAME
                        column of the CSV giving the filename of each PDF (one filename per row)
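For instance, assuming a file named files.csv whose columns are called filename and url (the file and column names are just placeholders for this example):

filename,url
invoice_001.pdf,https://example.com/docs/invoice_001.pdf
invoice_002.pdf,https://example.com/docs/invoice_002.pdf

you would run:

python3 utils/scraping_files.py -csv files.csv -column_url url -column_filename filename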
- Create a JSON file for each file in raw_data containing the features you want to train the model on, along with their values, and put these JSON files in the train_data folder (a sketch of what one such file could look like follows below).
If you have a CSV containing the features, feel free to use utils/retrieve_data.py!
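As a minimal sketch of what one training pair could look like, the snippet below writes a JSON of hypothetical features for a document named invoice_001.pdf (the field names vendor_name, invoice_date, and total_amount are placeholders; use whatever features you want to train on):

```python
import json
from pathlib import Path

# Hypothetical features for one document; replace them with your own fields.
features = {
    "vendor_name": "ACME Corp",
    "invoice_date": "2020-01-15",
    "total_amount": "1234.56",
}

# Name the JSON after the PDF it describes, e.g. raw_data/invoice_001.pdf.
Path("train_data").mkdir(exist_ok=True)
with open("train_data/invoice_001.json", "w") as f:
    json.dump(features, f, indent=2)
```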
- Now you have your documents in raw_data and your JSON files in train_data. What you want to do is merge the two folders into train_data, so that the folder ends up with two files per filename: the original PDF and its JSON dictionary of features.
You can use utils/delete_json_add_pdf.py to copy a PDF from raw_data into train_data when its associated JSON exists in train_data, and to delete a JSON from train_data when its associated PDF doesn't exist in raw_data.
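That behavior boils down to something like the sketch below (not the actual script, just an illustration, assuming each PDF and its JSON share the same base filename):

```python
import shutil
from pathlib import Path

raw, train = Path("raw_data"), Path("train_data")

for json_file in train.glob("*.json"):
    pdf = raw / (json_file.stem + ".pdf")
    if pdf.exists():
        # A matching PDF exists in raw_data: copy it next to its JSON.
        shutil.copy(pdf, train / pdf.name)
    else:
        # No matching PDF in raw_data: delete the orphaned JSON.
        json_file.unlink()
```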
# Build the docker image
docker build . -t invoicenet
# Verify that the docker image has been correctly built
docker images
# Run a container from the image, publishing port 8501
docker run -p 8501:8501 invoicenet
Go to [localhost:8501](http://localhost:8501) to use the app!