
[Feature #546]: Increase parsing speed by 47% by efficiently reading and writing the XML #598

Conversation

@AlexandraImbrisca (Contributor) commented Jan 12, 2025

Summary of the discussion

Improved the parsing speed by addressing two of the current bottlenecks:

  • Reading the XML files into DataFrames
  • Writing the DataFrames to the SQLite database

Over the last few months I have analysed multiple alternatives and documented the results here (please feel free to request access, or let me know if you prefer to access the document in a different way).

Based on our MaStR dataset and its composition, the fastest parsing relies on:

  • Using the lxml parser in combination with the pandas.read_xml function (linked analysis)
  • Bulk inserting the data using plain SQL statements (linked analysis)

By updating these two parts of the parsing logic, the execution time decreases by 47%. As a next step, I'll implement a parallelized version that reduces the overall time even further.
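The two optimisations can be illustrated with a minimal, self-contained sketch. The XML snippet, table name, and columns below are invented for illustration and are not the actual MaStR schema or the open-mastr code:

```python
import sqlite3
from io import StringIO

import pandas as pd

# Hypothetical XML snippet standing in for a MaStR export file.
xml = """<units>
  <unit><MastrNummer>SEE1</MastrNummer><Name>Plant A</Name></unit>
  <unit><MastrNummer>SEE2</MastrNummer><Name>Plant B</Name></unit>
</units>"""

# Bottleneck 1: read the XML into a DataFrame using the lxml-backed parser.
df = pd.read_xml(StringIO(xml), parser="lxml")

# Bottleneck 2: bulk-insert the rows with a plain SQL statement
# instead of going through DataFrame.to_sql.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE units (MastrNummer TEXT PRIMARY KEY, Name TEXT)")
columns = ", ".join(df.columns)
placeholders = ", ".join("?" for _ in df.columns)
insert_stmt = f"INSERT INTO units ({columns}) VALUES ({placeholders})"
con.executemany(insert_stmt, df.to_numpy().tolist())
con.commit()
```

Because `executemany` prepares the statement once and binds each row against it, it avoids the per-row overhead that makes `DataFrame.to_sql` slow for large tables.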

The current changes have been tested locally by using sqldiff to compare the resulting databases. I also added unit tests for all of the new and existing functions in utils_write_to_database.

Fixes #546

Type of change (CHANGELOG.md)

Updated

  • Updated the parsing logic (#546)

Workflow checklist

PR-Assignee

Reviewer

  • 🐙 Follow the Reviewer Guidelines
  • 🐙 Provide feedback and show sufficient appreciation for the work done

@nesnoj (Collaborator) commented Jan 12, 2025

Thanks a lot for this improvement @AlexandraImbrisca.
Will the parallelization be part of this PR or will you open a new one? In the latter case please assign a reviewer, thx!

@nesnoj nesnoj mentioned this pull request Jan 12, 2025
@AlexandraImbrisca (Contributor, Author)

I was thinking about creating a separate pull request to keep the changes easier to review. Unfortunately, I can't assign a reviewer myself (at least the UI doesn't allow me to 🙈).

@FlorianK13 (Member)

Pytest is failing, as usual for PRs from outside (this is a known issue: for security reasons, the API credentials saved in this GitHub repo are not exposed to code coming from forks).

@FlorianK13 (Member)

I ran the following with the data from 15.01.

db = Mastr()
db.download(date="existing", data="deleted_market_actors")

and got this error:

line 455, in add_table_to_sqlite_database: con.connection.executemany(insert_stmt, df.to_numpy())
sqlite3.IntegrityError: NOT NULL constraint failed: deleted_market_actors.MastrNummer

When using a Postgres engine, this error does not appear. As far as I understand it, the error is caught by the pandas.to_sql function? At least it seems that neither delete_wrong_xml_entry nor write_single_entries_until_not_unique_comes_up is called. Could you maybe implement the following:

  1. If engine=sqlite, use the fast SQLite method
  2. If an error occurs, fall back to the slower non-SQLite method for this one XML file, even for SQLite databases

That way, the one XML file from deleted_market_actors would be slower to parse, but db.download would not crash. What do you think @AlexandraImbrisca?
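The suggested fast-path-with-fallback could look roughly like this. The function and argument names are hypothetical, not the actual open-mastr implementation:

```python
import sqlite3

import pandas as pd


def add_table_to_database(df, table_name, engine_type, con):
    """Hypothetical sketch: try the fast bulk-SQL path for SQLite and
    fall back to the slower pandas path if the bulk insert fails."""
    if engine_type == "sqlite":
        try:
            columns = ", ".join(df.columns)
            marks = ", ".join("?" for _ in df.columns)
            con.executemany(
                f"INSERT INTO {table_name} ({columns}) VALUES ({marks})",
                df.to_numpy().tolist(),
            )
            con.commit()
            return "fast"
        except sqlite3.Error:
            con.rollback()  # discard the partial bulk insert
    # Slower but more forgiving path, used for non-SQLite engines
    # and as a fallback when the bulk insert raises.
    df.to_sql(table_name, con, if_exists="append", index=False)
    return "slow"
```

With this shape, only the offending file pays the cost of the slow path; every other file keeps the fast bulk insert.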

@AlexandraImbrisca (Contributor, Author)

Could you please try again? I previously removed the call to write_single_entries_until_not_unique_comes_up because I thought that "ON CONFLICT DO NOTHING" should handle this exception completely. I'm curious whether the error is now caught and handled correctly by the SQL method.

I agree that it would be great to switch to the non-SQL method in case of any other errors, so I added a second call to that function.
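For reference, a small sketch of the SQLite upsert clause (table and values invented). One caveat worth noting: ON CONFLICT DO NOTHING only suppresses uniqueness conflicts, while a NOT NULL violation like the one above still raises, which is presumably why the additional fallback is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE units (MastrNummer TEXT PRIMARY KEY NOT NULL, Name TEXT)"
)

rows = [("SEE1", "Plant A"), ("SEE1", "duplicate key"), ("SEE2", "Plant B")]

# The duplicate primary key is silently skipped instead of raising
# sqlite3.IntegrityError (requires SQLite >= 3.24).
con.executemany("INSERT INTO units VALUES (?, ?) ON CONFLICT DO NOTHING", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM units").fetchone()[0])  # → 2
```

Inserting a row with a NULL MastrNummer into this table still fails with "NOT NULL constraint failed", matching the traceback reported above.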

@FlorianK13 (Member)

Yes, now it does not crash anymore.

@FlorianK13 FlorianK13 self-requested a review January 17, 2025 08:42
FlorianK13 previously approved these changes Jan 17, 2025

@FlorianK13 (Member)

I would suggest the following:

  • We can now merge this PR
  • Afterwards we will look at the PR about parallelization
  • When both are merged, I will also try the sqldiff stuff to check that the original package on pypi and the new code produce the same databases

Do you agree @nesnoj @AlexandraImbrisca ?

@nesnoj (Collaborator) commented Jan 17, 2025


Thanks for checking, and yes, we can handle it that way. But before the merge I'd like to take a quick look later today.

@AlexandraImbrisca (Contributor, Author)

Sounds good to me! Please let me know if you have any other questions

nesnoj previously approved these changes Jan 20, 2025

@nesnoj (Collaborator) left a comment

Hey, I tested with SQLite and PostgreSQL; both seem to work well (I did not check the data, though). I only added a changelog entry.

I think we can get the PostgreSQL inserts even faster. If you prefer, we can do this in a separate PR; I'll add some notes to #546.

Thanks a lot for this improvement @AlexandraImbrisca!

PS: Could you please send me the analysis doc via mail? Thx!

@AlexandraImbrisca (Contributor, Author)

Sorry for re-requesting the review! I resolved a merge conflict, which automatically dismissed your most recent approval.

@nesnoj Thanks a lot for testing and reviewing the changes! I completely agree that we can make the PostgreSQL inserts faster. I'll do that in a separate PR, and I'll send you the analysis doc by email.

@FlorianK13 (Member)

@AlexandraImbrisca this PR is finished, so I can merge it now, right?

@AlexandraImbrisca (Contributor, Author)

Yes, thank you!

@FlorianK13 FlorianK13 merged commit 11fb568 into OpenEnergyPlatform:develop Jan 22, 2025
0 of 9 checks passed
@nesnoj nesnoj changed the title [Feature #546]: Decrease parsing speed by 47% by efficiently reading and writing the XML [Feature #546]: Increase parsing speed by 47% by efficiently reading and writing the XML Jan 31, 2025