[Feature #600]: Use multiprocessing to speed up the parsing #601
base: develop
If this is not done inside the if __name__ == "__main__" block, it will be re-executed inside every new process on macOS/Windows
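For illustration, a minimal sketch of the guard pattern this comment refers to. The names parse_file and main and the file list are hypothetical and not taken from the PR. The point is only that the pool is created behind the guard, because the "spawn" start method (the default on macOS and Windows) re-imports the main module in every child process.

```python
import os
from multiprocessing import Pool


def parse_file(file_name):
    """Hypothetical stand-in for parsing one XML file in a worker process."""
    return (file_name, os.getpid())


def main():
    # Hypothetical file list; the real project derives it from the downloaded data.
    file_names = ["a.xml", "b.xml", "c.xml"]
    with Pool(processes=2) as pool:
        results = pool.map(parse_file, file_names)
    return [name for name, _pid in results]


if __name__ == "__main__":
    # Child processes started with "spawn" import this module again.
    # Without the guard, each import would try to create a pool of its own.
    print(main())
```

Module-level code outside the guard (imports, function definitions, constants) still runs in every child, so it should stay free of side effects.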
Since the processing is now asynchronous, this print statement might confuse users
Thank you @AlexandraImbrisca for the implementation and for sending the detailed, coherent report! Here is what I have stumbled across so far:
The speedup stalls from about 5 cores onward. I can imagine this plateau is caused by a) write contention, b) other processes running on my laptop, or c) the number of parallel processes decreasing once most of the tasks are done?
(The column
Thanks a lot for the detailed review and suggestions @nesnoj!
Hey @AlexandraImbrisca !
Sounds good to me.
An alternative way could be to create separate SQLite DBs and finally merge them. Dunno if this is a viable option..
It terminates :(
Sounds good to me as well!
Instead of "timeout", we can use "connect_timeout" which works for both SQLite and PostgreSQL
Awesome, thanks a lot both! A few updates from my side:
About merging the DBs: that might work, but it could get quite messy with many processes (e.g., we could end up with 10+ temporary DBs), and we would have to make sure that everything is cleaned up eventually 🤔 Using temporary tables performed better than I expected (source)
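For reference, the write-separately-and-merge alternative discussed here could look roughly like the sketch below, using only the standard-library sqlite3 module. This is not what the PR implements. The table name units, its columns, and the helper names are hypothetical; the sketch only shows the merge and the cleanup step that the comment worries about.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Hypothetical schema, used only for this sketch.
CREATE_UNITS = "CREATE TABLE IF NOT EXISTS units (id TEXT PRIMARY KEY, name TEXT)"


def write_partial_db(path, rows):
    """Stand-in for one worker writing its parsed rows to its own DB file."""
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(CREATE_UNITS)
        conn.executemany("INSERT INTO units VALUES (?, ?)", rows)
        conn.commit()


def merge_partial_dbs(target_path, partial_paths):
    """Copy each partial DB into the target, delete it, and return the row count."""
    with closing(sqlite3.connect(target_path)) as conn:
        conn.execute(CREATE_UNITS)
        conn.commit()
        for path in partial_paths:
            conn.execute("ATTACH DATABASE ? AS part", (path,))
            # Rows whose primary key already exists in the target are skipped.
            conn.execute("INSERT OR IGNORE INTO units SELECT id, name FROM part.units")
            conn.commit()
            conn.execute("DETACH DATABASE part")
            os.remove(path)  # clean up the temporary DB once it is merged
        return conn.execute("SELECT COUNT(*) FROM units").fetchone()[0]


def run_demo():
    """Create two partial DBs with one overlapping key and merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        parts = [os.path.join(tmp, f"part_{i}.db") for i in range(2)]
        write_partial_db(parts[0], [("unit-1", "A")])
        write_partial_db(parts[1], [("unit-2", "B"), ("unit-1", "duplicate")])
        return merge_partial_dbs(os.path.join(tmp, "merged.db"), parts)
```

Every worker needs its own file, and each file has to be attached, copied and removed, which is the bookkeeping overhead mentioned above.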
Thx for the quick update!
I'll get back to this later
The column issue seems to be solved, but now I keep getting a privileges error in PostgreSQL, see below for full log. The user has all privileges for the DB (superuser) and the tables are created, but no data is written. I think it is related not to the actual privileges but to the implementation, but I wasn't able to track it down further right now.
Great that you already did some testing in the past! The write-temp-and-merge strategy was just a quick thought; it probably comes with other consequences I cannot foresee and also requires more testing. I'm also fine with the current implementation but open to discussion ;).
Thanks a bunch for finding this bug! I was using an unauthenticated database and I didn't realise that this could be an issue. The connection_url obfuscates the password, so I updated the code to properly set the password. Could you please try again and let me know if you see the same issue?
Ensure correct type of NUMBER_OF_PROCESSES and add error handling for non-numeric types
These two small things needed a fix, so I patched them.
Now it works fine with psql, thank you!
variable."""
if "NUMBER_OF_PROCESSES" in os.environ:
    number_of_processes = os.environ.get("NUMBER_OF_PROCESSES")
    if number_of_processes >= cpu_count():
One more thing I forgot to mention in my previous posts:
The env var NUMBER_OF_PROCESSES is a string, causing the comparison to fail.
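A minimal sketch of how the value could be converted before the comparison. The function name, the optional environ parameter, and the fallback behaviour (use cpu_count() when the variable is missing or non-numeric, clamp to the range 1..cpu_count() otherwise) are assumptions made for illustration, not the behaviour merged in this PR.

```python
import os
from multiprocessing import cpu_count


def get_number_of_processes(environ=None):
    """Return a usable worker count derived from NUMBER_OF_PROCESSES."""
    environ = os.environ if environ is None else environ
    available = cpu_count()
    raw_value = environ.get("NUMBER_OF_PROCESSES")
    if raw_value is None:
        return available
    try:
        # Environment variables are always strings, so convert before comparing.
        requested = int(raw_value)
    except ValueError:
        # Assumption: a non-numeric value falls back to the default.
        return available
    # Assumption: clamp the request to the range 1..available.
    return max(1, min(requested, available))
```

Passing a plain dict instead of os.environ keeps the function easy to exercise, for example get_number_of_processes({"NUMBER_OF_PROCESSES": "abc"}).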
"""Process a single xml file and write it to the database."""
# If set, the connection url obfuscates the password. We must replace the masked password with the actual password.
if password:
    connection_url = re.sub(r"://[^:]+:\*+@", f"://{password}@", connection_url)
With this regex the pw is not supplied at all.
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5432 failed: fe_sendauth: no password supplied
It deletes the username and uses the pw as username.
Also, this solution does not allow colons in the username.
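The behaviour described above can be reproduced with the standard library alone. The sketch below contrasts the quoted substitution with one possible variant that keeps the username via a capture group. The example URL, the user name, and both function names are hypothetical, and the variant is only one way to address the two points raised, not necessarily the fix that was merged.

```python
import re

# Hypothetical masked URL in the "user:***@host" form seen in the diff.
MASKED_URL = "postgresql://open_mastr:***@localhost:5432/mastr"


def unmask_original(url, password):
    """The substitution quoted in the diff: the username is dropped."""
    return re.sub(r"://[^:]+:\*+@", f"://{password}@", url)


def unmask_keep_username(url, password):
    """Variant that captures the username and re-inserts it before the password."""
    return re.sub(
        r"://(.+):\*+@",
        # A function replacement avoids backslash escapes in the password
        # being interpreted as regex replacement syntax.
        lambda match: f"://{match.group(1)}:{password}@",
        url,
        count=1,
    )
```

With the first function the result is postgresql://secret@localhost:5432/mastr for the password "secret", so the password ends up in the username position. The second function yields postgresql://open_mastr:secret@localhost:5432/mastr and also accepts a username that contains a colon. A password containing characters such as @ or / would additionally need URL escaping, which is omitted here.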
Follow up to the previous PR: