
[Feature #546]: Increase parsing speed by 47% by efficiently reading and writing the XML #598

Conversation

@AlexandraImbrisca (Contributor) commented Jan 12, 2025

Summary of the discussion

Improved the parsing speed by addressing two of the current bottlenecks:

  • Reading the XML files into DataFrames
  • Writing the DataFrames to the SQLite database

Over the last few months I have analysed multiple alternatives and documented the results here (please feel free to request access, or let me know if you prefer to access the document in a different way).

Based on our MaStR dataset and its composition, the fastest parsing relies on:

  • Using the lxml parser in combination with the pandas.read_xml function (linked analysis)
  • Bulk inserting the data using plain SQL statements (linked analysis)

By updating these two parts of the parsing logic, the execution time decreases by 47%. As a next step, I'll implement a parallelized version that reduces the overall time even further.
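The two optimisations can be illustrated with a minimal, self-contained sketch. The XML snippet, table name, and columns below are invented for illustration and are not the actual MaStR schema or the open-mastr code:

```python
import sqlite3
from io import StringIO

import pandas as pd

# Hypothetical XML snippet standing in for a MaStR export file.
xml = """<units>
  <unit><MastrNummer>SEE1</MastrNummer><Name>Plant A</Name></unit>
  <unit><MastrNummer>SEE2</MastrNummer><Name>Plant B</Name></unit>
</units>"""

# Bottleneck 1: read the XML into a DataFrame using the lxml-backed parser.
df = pd.read_xml(StringIO(xml), parser="lxml")

# Bottleneck 2: bulk-insert the rows with a plain SQL statement
# instead of going through DataFrame.to_sql.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE units (MastrNummer TEXT PRIMARY KEY, Name TEXT)")
columns = ", ".join(df.columns)
placeholders = ", ".join("?" for _ in df.columns)
insert_stmt = f"INSERT INTO units ({columns}) VALUES ({placeholders})"
con.executemany(insert_stmt, df.to_numpy().tolist())
con.commit()
```

Because `executemany` prepares the statement once and binds each row against it, it avoids the per-row overhead that makes `DataFrame.to_sql` slow for large tables.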

The current changes have been tested locally by using sqldiff to compare the resulting databases. I also added unit tests for all of the new and existing functions in utils_write_to_database.

Fixes #546

Type of change (CHANGELOG.md)

Updated

  • Updated the parsing logic (#546)

Workflow checklist

PR-Assignee

Reviewer

  • 🐙 Follow the Reviewer Guidelines
  • 🐙 Provide feedback and show sufficient appreciation for the work done

@nesnoj (Collaborator) commented Jan 12, 2025

Thanks a lot for this improvement @AlexandraImbrisca.
Will the parallelization be part of this PR or will you open a new one? In the latter case please assign a reviewer, thx!

@nesnoj nesnoj mentioned this pull request Jan 12, 2025
@AlexandraImbrisca (Contributor, Author)

I was thinking about creating a separate pull request to keep the changes easier to review. Unfortunately, I can't assign a reviewer myself (at least the UI doesn't allow me to 🙈).

@FlorianK13 (Member)

Pytest is failing, as usual for PRs from outside (this is a known issue: for security reasons, the API credentials saved in this GitHub repo are not exposed to code coming from forks).

@FlorianK13 (Member)

I ran the following with the data from 15.01.

db = Mastr()
db.download(date="existing", data="deleted_market_actors")

and got this error:

line 455, in add_table_to_sqlite_database: con.connection.executemany(insert_stmt, df.to_numpy())
sqlite3.IntegrityError: NOT NULL constraint failed: deleted_market_actors.MastrNummer

When using a Postgres engine, this error does not appear. As far as I understand it, the error is caught by the pandas.to_sql function? At least it seems that neither delete_wrong_xml_entry nor write_single_entries_until_not_unique_comes_up is called. Could you maybe implement the following:

  1. If engine=sqlite, use the fast SQLite method
  2. If an error occurs, fall back to the slower non-SQLite method for this one XML file, even for SQLite databases

That way, the one XML file from deleted_market_actors would be slower to parse, but db.download would not crash. What do you think @AlexandraImbrisca?
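The suggested fast-path-with-fallback could look roughly like this. The function and argument names are hypothetical, not the actual open-mastr implementation:

```python
import sqlite3

import pandas as pd


def add_table_to_database(df, table_name, engine_type, con):
    """Hypothetical sketch: try the fast bulk-SQL path for SQLite and
    fall back to the slower pandas path if the bulk insert fails."""
    if engine_type == "sqlite":
        try:
            columns = ", ".join(df.columns)
            marks = ", ".join("?" for _ in df.columns)
            con.executemany(
                f"INSERT INTO {table_name} ({columns}) VALUES ({marks})",
                df.to_numpy().tolist(),
            )
            con.commit()
            return "fast"
        except sqlite3.Error:
            con.rollback()  # discard the partial bulk insert
    # Slower but more forgiving path, used for non-SQLite engines
    # and as a fallback when the bulk insert raises.
    df.to_sql(table_name, con, if_exists="append", index=False)
    return "slow"
```

With this shape, only the offending file pays the cost of the slow path; every other file keeps the fast bulk insert.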

@AlexandraImbrisca (Contributor, Author)

Could you please try again? I previously removed the call to write_single_entries_until_not_unique_comes_up because I thought that "ON CONFLICT DO NOTHING" should handle this exception completely. I'm curious whether the error is now caught and handled correctly by the SQL method.

I agree that it would be great to switch to the non-SQL method in case of any other errors, so I added a second call to that function.
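For reference, a small sketch of the SQLite upsert clause (table and values invented). One caveat worth noting: ON CONFLICT DO NOTHING only suppresses uniqueness conflicts, while a NOT NULL violation like the one above still raises, which is presumably why the additional fallback is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE units (MastrNummer TEXT PRIMARY KEY NOT NULL, Name TEXT)"
)

rows = [("SEE1", "Plant A"), ("SEE1", "duplicate key"), ("SEE2", "Plant B")]

# The duplicate primary key is silently skipped instead of raising
# sqlite3.IntegrityError (requires SQLite >= 3.24).
con.executemany("INSERT INTO units VALUES (?, ?) ON CONFLICT DO NOTHING", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM units").fetchone()[0])  # → 2
```

Inserting a row with a NULL MastrNummer into this table still fails with "NOT NULL constraint failed", matching the traceback reported above.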

@FlorianK13 (Member)

Yes, now it does not crash anymore.

@FlorianK13 FlorianK13 self-requested a review January 17, 2025 08:42
FlorianK13 previously approved these changes Jan 17, 2025

@FlorianK13 (Member)

I would suggest the following:

  • We can now merge this PR
  • Afterwards we will look at the PR about parallelization
  • When both are merged, I will also try the sqldiff stuff to check that the original package on pypi and the new code produce the same databases

Do you agree @nesnoj @AlexandraImbrisca ?

@nesnoj (Collaborator) commented Jan 17, 2025


Thanks for checking, and yes, we can handle it that way. But before the merge I'd like to take a quick look later today.

@AlexandraImbrisca (Contributor, Author)

Sounds good to me! Please let me know if you have any other questions

nesnoj previously approved these changes Jan 20, 2025

@nesnoj (Collaborator) left a comment

Hey, I tested with SQLite and PostgreSQL; both seem to work well (I did not check the data, though). I only added a changelog entry.

I think we can get the PostgreSQL inserts even faster. If you prefer, we can do this in a separate PR; I'll add some notes to #546.

Thanks a lot for this improvement @AlexandraImbrisca!

PS: Could you please send me the analysis doc via mail? Thx!

@AlexandraImbrisca (Contributor, Author)

Sorry for re-requesting the review! I resolved a merge conflict, which automatically dismissed your most recent approval.

@nesnoj Thanks a lot for testing and reviewing the changes! I completely agree that we can make the PostgreSQL inserts faster. I'll do that in a separate PR, and I'll send you the analysis doc by email.

@FlorianK13 (Member)

@AlexandraImbrisca this PR is finished, so I can merge it now, right?

@AlexandraImbrisca (Contributor, Author)

Yes, thank you!

@FlorianK13 FlorianK13 merged commit 11fb568 into OpenEnergyPlatform:develop Jan 22, 2025
0 of 9 checks passed
@nesnoj nesnoj changed the title [Feature #546]: Decrease parsing speed by 47% by efficiently reading and writing the XML [Feature #546]: Increase parsing speed by 47% by efficiently reading and writing the XML Jan 31, 2025