[Feature #546]: Increase parsing speed by 47% by efficiently reading and writing the XML #598
Conversation
With the new approach, we don't attempt to insert the rows again if they already exist.
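For illustration, a minimal sketch of this skip-existing idea, assuming pandas and SQLAlchemy (the function name and the `key` column are hypothetical, not the repository's actual API):

```python
# Illustrative sketch (not the PR's actual code): append only rows whose
# primary keys are not yet present, instead of retrying the whole batch.
import pandas as pd
from sqlalchemy import text

def insert_new_rows_only(df, table_name, key, engine):
    """Append only the rows of df whose `key` values are missing from table_name."""
    with engine.connect() as conn:
        existing = pd.read_sql(text(f"SELECT {key} FROM {table_name}"), conn)
    new_rows = df[~df[key].isin(existing[key])]
    if not new_rows.empty:
        new_rows.to_sql(table_name, engine, if_exists="append", index=False)
```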
Thanks a lot for this improvement @AlexandraImbrisca.
I was thinking about creating a separate pull request to keep the changes easier to review. Unfortunately, I can't assign the reviewer myself (at least the UI doesn't allow me 🙈).
Pytest is failing as usual for PRs from outside (this is a known error: the API credentials saved in this GitHub repo are not revealed to code coming from outside, for security reasons).
I ran the following with the data from 15.01.:

```python
db = Mastr()
db.download(date="existing", data="deleted_market_actors")
```

and got this error:
When using a Postgres engine, this error does not appear. As far as I understand it, the error is caught by the pandas.to_sql function? At least it seems like neither the
By this, the one XML file from deleted_market_actors would be slower to parse, but the
Could you please try again? I previously removed the call to that function. I agree that it would be great to switch to the non-SQL method in case of any other errors, so I added a second call to it.
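For illustration, a minimal sketch of such a bulk-insert-with-fallback, assuming pandas and SQLAlchemy (`write_single_entries` is a hypothetical placeholder for the non-SQL/row-wise path, not the repository's actual function):

```python
import pandas as pd

def write_table(df, table_name, engine):
    """Try the fast bulk insert first; fall back to the slower row-wise path on errors."""
    try:
        # Fast path: one bulk INSERT for the whole DataFrame.
        df.to_sql(table_name, engine, if_exists="append", index=False)
    except Exception:
        # Fallback: hypothetical row-wise helper that can skip rows
        # violating constraints instead of failing the whole batch.
        write_single_entries(df, table_name, engine)
```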
Yes, now it does not crash anymore.
I would suggest the following:
Do you agree, @nesnoj @AlexandraImbrisca?
Thanks for checking, and yes, we can handle it that way. But before the merge I'd like to take a quick look later today.
Sounds good to me! Please let me know if you have any other questions.
Hey, I tested with SQLite and PostgreSQL; both seem to work well (I did not check the data, though).
(I only added a changelog entry.)
I think we can get the PostgreSQL inserts even faster; a sketch of one common approach follows after this comment. If you prefer, we can do this in a separate PR; I'll add some notes to #546.
Thanks a lot for this improvement @AlexandraImbrisca!
PS: Could you please send me the analysis doc via mail? Thx!
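For illustration, one common COPY-based technique for faster PostgreSQL inserts, adapted from the pandas documentation's `method` example for `to_sql` (a sketch of the general approach, not code from this PR):

```python
# COPY-based insert method for pandas.to_sql on PostgreSQL; COPY FROM STDIN
# is typically much faster than individual INSERT statements.
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """pandas.to_sql `method` callable using PostgreSQL COPY FROM STDIN."""
    dbapi_conn = conn.connection  # raw DBAPI (psycopg2) connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join(f'"{k}"' for k in keys)
        table_name = f"{table.schema}.{table.name}" if table.schema else table.name
        cur.copy_expert(f"COPY {table_name} ({columns}) FROM STDIN WITH CSV", buf)

# Usage: df.to_sql("some_table", engine, method=psql_insert_copy, index=False)
```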
Sorry for re-requesting the review! I resolved a merge conflict, which automatically dismissed your most recent approval. @nesnoj Thanks a lot for testing and reviewing the changes! I completely agree that we can make the PostgreSQL inserts faster. I'll do that in a separate PR, and I'll send you the analysis doc via email.
@AlexandraImbrisca this PR is finished, so I can merge it now, right?
Yes, thank you!
Summary of the discussion
Improved the parsing speed by addressing two of the current bottlenecks: reading the XML files and writing the parsed data to the database.
Over the past months I have analysed multiple alternatives and documented the results here (please feel free to request access, or let me know if you prefer to access the document in a different way).
Based on our MaStR dataset and its composition, the fastest approach was selected for each of these two parts. By updating these two parts of the parsing logic, the execution time decreases by 47%. As a next step, I'll implement the parallelized version, which reduces the overall time even more.
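For context, a minimal sketch of one widely used pattern for fast, memory-efficient XML reading; this is a generic illustration (the function and tag name are hypothetical), not necessarily the exact technique adopted in this PR:

```python
# Illustrative sketch: stream a large XML file with iterparse and clear each
# element after processing, so the in-memory tree does not grow unboundedly.
import xml.etree.ElementTree as ET

def iter_records(xml_path, tag):
    """Yield one dict per <tag> element without loading the whole file."""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # free memory held by the processed element
```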
The current changes have been tested locally by using sqldiff to compare the resulting databases. I also added unit tests for all of the new and existing functions from utils_write_to_database.

Fixes #546
Type of change (CHANGELOG.md): Updated
Workflow checklist
PR-Assignee
Reviewer