Fix slow DataFrame creation in CogniteResource.to_pandas #1389

Merged
merged 12 commits into from
Oct 4, 2023

Conversation

gabor-huseb
Contributor

@gabor-huseb gabor-huseb commented Sep 29, 2023

Description

Optimize DataFrame creation by initializing with a list of dictionaries

  • Replace row-by-row addition using .loc with a more efficient bulk addition.
  • The new approach improves speed and reduces memory usage in the to_pandas() method, especially for large DataFrames.
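
A minimal sketch of the two approaches described above, not the SDK's actual code. The `dumped` dictionary and both function names are hypothetical stand-ins for the output of a resource's dump method; the first function grows the frame one row at a time with `.loc`, while the second builds it in a single constructor call from a list of dictionaries.

```python
import pandas as pd

# Hypothetical sample data standing in for a dumped resource.
dumped = {"id": 1, "external_id": "ext-1", "name": "pump"}


def to_frame_row_by_row(dumped):
    # Old pattern: every .loc assignment enlarges the frame by one row,
    # which forces pandas to reallocate on each iteration.
    df = pd.DataFrame(columns=["value"])
    for name, value in dumped.items():
        df.loc[name] = [value]
    return df


def to_frame_bulk(dumped):
    # New pattern: build all rows first, then create the frame once.
    data = [{"value": value} for value in dumped.values()]
    return pd.DataFrame(data, index=list(dumped.keys()))


slow = to_frame_row_by_row(dumped)
fast = to_frame_bulk(dumped)
```

Both frames should carry the same index labels and values; only the construction cost differs, and the gap widens as the number of rows grows.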

Checklist:

  • Tests added/updated.
  • Documentation updated. Documentation is generated from docstrings - these must be updated according to your change.
    If a new method has been added it should be referenced in cognite.rst in order to generate docs based on its docstring.
  • Changelog updated in CHANGELOG.md.
  • Version bumped. If triggering a new release is desired, bump the version number in _version.py and pyproject.toml per semantic versioning.

@gabor-huseb gabor-huseb requested review from a team as code owners September 29, 2023 11:41
@gabor-huseb gabor-huseb reopened this Sep 29, 2023
@codecov

codecov bot commented Sep 29, 2023

Codecov Report

Merging #1389 (d5efe27) into master (72b474c) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1389      +/-   ##
==========================================
- Coverage   90.75%   90.68%   -0.07%     
==========================================
  Files         117      117              
  Lines       14250    14247       -3     
==========================================
- Hits        12932    12920      -12     
- Misses       1318     1327       +9     
Files Coverage Δ
cognite/client/data_classes/_base.py 90.80% <100.00%> (-0.08%) ⬇️

... and 3 files with indirect coverage changes

@gabor-huseb gabor-huseb reopened this Sep 29, 2023
Contributor

@haakonvt haakonvt left a comment

Great start! 😄

df.loc[name] = [value]

data = [{"value": value} for name, value in dumped.items()]
df = pd.DataFrame(data, index=dumped.keys())
Contributor

Is there a way you can feed the dumped dict directly?


result_df = obj.to_pandas()

expected_df = pd.DataFrame({"value": [1]}, index=["id"])
Contributor

If the point of the test was to make sure the ordering of rows stays consistent after your change, you're gonna need more rows 😉
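
A hedged sketch of what a multi-row ordering check could look like. It does not use the SDK; the `dumped` dictionary is hypothetical sample data, and the frame is built with the list-of-dictionaries construction from the snippet under review. `pandas.testing.assert_frame_equal` compares index order as well as values, so a reordered row would make it fail.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical dumped resource with several fields, so row order matters.
dumped = {
    "id": 1,
    "external_id": "ext-1",
    "name": "pump",
    "description": "main pump",
}

result_df = pd.DataFrame(
    [{"value": value} for value in dumped.values()],
    index=list(dumped.keys()),
)

expected_df = pd.DataFrame(
    {"value": [1, "ext-1", "pump", "main pump"]},
    index=["id", "external_id", "name", "description"],
)

# Raises AssertionError if values, labels, or their order differ.
assert_frame_equal(result_df, expected_df)
```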

@gabor-huseb
Contributor Author

@haakonvt made alterations based on your comments, does this seem better?

Contributor

@haakonvt haakonvt left a comment

Almost there!

for name, value in dumped.items():
    df.loc[name] = [value]

df = pd.Series(dumped).to_frame(name="value")
Contributor

Very nice! 🚀 Just a minor change:

return pd.Series(dumped).to_frame(name="value")
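
For illustration, a small sketch of what the suggested one-liner produces, assuming `dumped` is a flat dictionary (the sample below is hypothetical). `pd.Series` uses the dictionary keys as the index in insertion order, and `to_frame(name="value")` turns the Series into a single-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data standing in for a dumped resource.
dumped = {"id": 1, "external_id": "ext-1", "name": "pump"}

# Keys become the index, values become the single "value" column.
df = pd.Series(dumped).to_frame(name="value")
```

This feeds the dictionary to pandas directly, so no intermediate list of per-row dictionaries is needed.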


obj = AssetList([Asset(external_id=f"ext-{i}", name=f"name-{i}") for i in range(5)])

result_df = obj.to_pandas()
Contributor

This does not actually call your code; it will just call dump on each asset. I suggest you create a single test asset with quite a few of the accepted parameters:

1. external_id
2. name
3. parent_id
4. parent_external_id
5. description
6. data_set_id
7. metadata
8. source
9. labels
10. geo_location
11. id
12. created_time
13. last_updated_time
14. root_id
15. aggregates

Apart from that, the test looks great!

@haakonvt haakonvt changed the title Gabor loc fixing Fix slow DataFrame instantiation in CogniteResource.to_pandas Oct 2, 2023
@haakonvt haakonvt changed the title Fix slow DataFrame instantiation in CogniteResource.to_pandas Fix slow DataFrame creation in CogniteResource.to_pandas Oct 2, 2023
@gabor-huseb
Contributor Author

@haakonvt I hope I understood the feedback on the test correctly.

Contributor

@haakonvt haakonvt left a comment

Looks great! 🚀

@gabor-huseb gabor-huseb merged commit dbcf6ca into master Oct 4, 2023
7 checks passed
@gabor-huseb gabor-huseb deleted the gabor-loc-fixing branch October 4, 2023 10:39
2 participants