[Feature #600]: Use multiprocessing to speed up the parsing #601

Open
wants to merge 27 commits into
base: develop
Commits
0c0c945
Move Mastr initialization
AlexandraImbrisca Jan 22, 2025
4a5e427
Use a ProcessPoolExecutor to process multiple files at once
AlexandraImbrisca Jan 25, 2025
470a90c
Move the tables creation inside of the process_xml_file
AlexandraImbrisca Jan 26, 2025
418a554
Set the maximum number of processes to 3 as it was shown optimal in t…
AlexandraImbrisca Jan 26, 2025
76502cf
Add unit test for new function
AlexandraImbrisca Jan 26, 2025
436bb54
Remove print
AlexandraImbrisca Jan 26, 2025
ccf6b55
Set default number of processes to 1 and add warning for values too l…
AlexandraImbrisca Jan 28, 2025
6719194
Fix PostgreSQL incompatibility and add comments for database options
AlexandraImbrisca Jan 28, 2025
98cff60
Add USE_RECOMMENDED_NUMBER_OF_PROCESSES
AlexandraImbrisca Jan 28, 2025
e1a199d
Add check to introduce the new column only if not defined already
AlexandraImbrisca Jan 28, 2025
ee9b1af
Catch & ignore "duplicate column name" exception
AlexandraImbrisca Jan 28, 2025
5aecc8f
Add message introducing parallelized processing
AlexandraImbrisca Jan 28, 2025
d76db32
Update CHANGELOG.md
AlexandraImbrisca Jan 28, 2025
8efc3d8
Add missing columns in comments
AlexandraImbrisca Jan 28, 2025
2b196e8
Update docs to include new environment variables
AlexandraImbrisca Jan 28, 2025
139bedd
Remove unnecessary import
AlexandraImbrisca Jan 28, 2025
506e270
Separate SQLite-only options
AlexandraImbrisca Jan 28, 2025
1d90469
Adapted timeout option per engine
AlexandraImbrisca Jan 28, 2025
d844c2b
Replace obfuscated password
AlexandraImbrisca Jan 30, 2025
136abe8
Add quotes to comments
AlexandraImbrisca Jan 30, 2025
351faaa
Use regex to generalize password replacement
AlexandraImbrisca Jan 30, 2025
bfe1285
Fix regex in pw replacement
nesnoj Jan 31, 2025
572349c
Fix processing of NUMBER_OF_PROCESSES
nesnoj Jan 31, 2025
fd08980
Add try catch in process_xml_file
AlexandraImbrisca Feb 7, 2025
68ddeac
Use ProcessPoolExecutor only if the user has opted for parallelisation
AlexandraImbrisca Feb 25, 2025
30df437
Add note about if __name__ == "__main__"
AlexandraImbrisca Feb 25, 2025
26a3e82
Escape __ in documentation
AlexandraImbrisca Feb 25, 2025
Replace obfuscated password
AlexandraImbrisca committed Jan 30, 2025
commit d844c2b64b80c4f11ce1789036a394ce6809aaac
9 changes: 6 additions & 3 deletions open_mastr/xml_download/utils_write_to_database.py
@@ -48,6 +48,7 @@ def write_mastr_xml_to_database(
 xml_table_name,
 sql_table_name,
 str(engine.url),
+engine.url.password,
 zipped_xml_file_path,
 bulk_download_date,
 bulk_cleansing,
@@ -88,14 +89,16 @@ def process_xml_file(
     file_name: str,
     xml_table_name: str,
     sql_table_name: str,
-    db_connection_url: str,
+    connection_url: str,
+    password: str,
     zipped_xml_file_path: str,
-    bulk_cleansing: bool,
     bulk_download_date: str,
+    bulk_cleansing: bool,
 ) -> None:
     """Process a single xml file and write it to the database."""
     # Each process will create its own engine to ensure isolation and efficient resource management.
-    engine = create_efficient_engine(db_connection_url)
+    # The connection url obfuscates the password. We must replace the masked password with the actual password.
+    engine = create_efficient_engine(connection_url.replace("****", password))
     with ZipFile(zipped_xml_file_path, "r") as f:
         print(f"Processing file '{file_name}'...")
         if is_first_file(file_name):
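
Review note on the `connection_url.replace("****", password)` line: a literal replace only works if the mask in the URL string has exactly that many asterisks, and it inserts the password without URL-escaping. The sketch below shows one more tolerant way to restore the password using only the standard library. It is an illustration, not the code from this PR: the helper name `restore_password` is hypothetical, and it assumes the masked URL has the usual `scheme://user:MASK@host` shape with a mask made up solely of asterisks.

```python
import re
from typing import Optional
from urllib.parse import quote

# Matches "user:<one or more asterisks>@" directly after the "://" of a URL.
# Assumption: the mask consists only of asterisks, whatever their count.
_MASKED_PASSWORD = re.compile(r"(?<=://)([^:/@]+):\*+@")


def restore_password(masked_url: str, password: Optional[str]) -> str:
    """Return masked_url with the asterisk mask replaced by the real password.

    Hypothetical helper for illustration. URLs without a masked password
    (for example SQLite file URLs) are returned unchanged.
    """
    if password is None:
        return masked_url
    # Percent-encode characters such as "@", ":" and "/" so the resulting
    # URL still parses into the same user, password and host parts.
    quoted = quote(password, safe="")
    # A function replacement avoids backslash escapes in the password being
    # interpreted by re.sub.
    return _MASKED_PASSWORD.sub(
        lambda match: f"{match.group(1)}:{quoted}@", masked_url, count=1
    )
```

With this helper, the engine creation in `process_xml_file` could read `create_efficient_engine(restore_password(connection_url, password))`. If the project's SQLAlchemy version provides `URL.render_as_string(hide_password=False)`, passing that string to the worker processes may avoid the masking step altogether; whether that fits this PR is for the maintainers to decide.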