Fix slow `DataFrame` creation in `CogniteResource.to_pandas` #1389

gabor-huseb · 2023-09-29T11:41:46Z

Description

Optimize DataFrame creation by initializing it from a list of dictionaries.

Replace row-by-row addition using .loc with a single, more efficient bulk construction.
The new approach improves speed and reduces memory usage in the to_pandas() method, especially for large DataFrames.
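
For illustration, here is a minimal sketch of the two approaches. It assumes `dumped` is a plain dict mapping field names to values; the sample values and both helper names are made up for this example and are not taken from the SDK.

```python
import pandas as pd

# Hypothetical stand-in for the dict produced by dumping a resource.
dumped = {"id": 1, "name": "pump", "description": "main pump"}


def build_row_by_row(dumped):
    # Old approach: grow the frame one row at a time with .loc.
    # Each assignment enlarges the frame, which is slow for many rows.
    df = pd.DataFrame(columns=["value"])
    for name, value in dumped.items():
        df.loc[name] = [value]
    return df


def build_bulk(dumped):
    # New approach: collect all rows first, then build the frame once.
    data = [{"value": value} for value in dumped.values()]
    return pd.DataFrame(data, index=list(dumped.keys()))


old_df = build_row_by_row(dumped)
new_df = build_bulk(dumped)
```

Both frames should have the same index and the same single `value` column; only the construction cost differs.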

Checklist:

Tests added/updated.
Documentation updated. Documentation is generated from docstrings - these must be updated according to your change.
If a new method has been added it should be referenced in cognite.rst in order to generate docs based on its docstring.
Changelog updated in CHANGELOG.md.
Version bumped. If triggering a new release is desired, bump the version number in _version.py and pyproject.toml per semantic versioning.

codecov · 2023-09-29T12:12:09Z

Codecov Report

Merging #1389 (d5efe27) into master (72b474c) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1389      +/-   ##
==========================================
- Coverage   90.75%   90.68%   -0.07%     
==========================================
  Files         117      117              
  Lines       14250    14247       -3     
==========================================
- Hits        12932    12920      -12     
- Misses       1318     1327       +9

Files	Coverage Δ
cognite/client/data_classes/_base.py	`90.80% <100.00%> (-0.08%)`	⬇️

... and 3 files with indirect coverage changes

haakonvt

Great start! 😄

haakonvt · 2023-10-01T05:15:30Z

cognite/client/data_classes/_base.py

-            df.loc[name] = [value]
+
+        data = [{"value": value} for name, value in dumped.items()]
+        df = pd.DataFrame(data, index=dumped.keys())


Is there a way you can feed the dumped dict directly?

haakonvt · 2023-10-01T05:16:26Z

tests/tests_unit/test_base.py

+
+        result_df = obj.to_pandas()
+
+        expected_df = pd.DataFrame({"value": [1]}, index=["id"])


If the point of the test was to make sure the ordering of rows stays consistent after your change, you're going to need more rows 😉
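
A sketch of what a multi-row ordering check could look like, assuming the bulk construction from this revision of the diff. The helper `dict_to_frame` and the sample dict are hypothetical; the real test would call `to_pandas()` on an SDK object instead.

```python
import pandas as pd


def dict_to_frame(dumped):
    # Hypothetical helper mirroring the bulk construction under review.
    data = [{"value": value} for value in dumped.values()]
    return pd.DataFrame(data, index=list(dumped.keys()))


# Keys are deliberately not in alphabetical order, so any accidental
# sorting of rows would make the comparison below fail.
dumped = {"name": "pump", "id": 1, "external_id": "ext-1", "description": "main pump"}

result_df = dict_to_frame(dumped)
expected_df = pd.DataFrame(
    {"value": ["pump", 1, "ext-1", "main pump"]},
    index=["name", "id", "external_id", "description"],
)

# Raises AssertionError if values, index order, or dtypes differ.
pd.testing.assert_frame_equal(result_df, expected_df)
```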

gabor-huseb · 2023-10-02T11:03:41Z

@haakonvt I made alterations based on your comments. Does this seem better?

haakonvt

Almost there!

haakonvt · 2023-10-02T19:50:00Z

cognite/client/data_classes/_base.py

-        for name, value in dumped.items():
-            df.loc[name] = [value]
+
+        df = pd.Series(dumped).to_frame(name="value")


Very nice! 🚀 Just a minor change:

return pd.Series(dumped).to_frame(name="value")
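
As a rough sketch of what this one-liner produces, assuming `dumped` is a flat dict (the function name and sample values below are illustrative only):

```python
import pandas as pd


def frame_from_series(dumped):
    # The dict keys become the index and the values fill a single
    # column named "value"; insertion order of the dict is kept.
    return pd.Series(dumped).to_frame(name="value")


dumped = {"external_id": "ext-1", "name": "pump", "id": 1}
df = frame_from_series(dumped)
```

Returning the expression directly avoids the intermediate `df` variable without changing the result.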

haakonvt · 2023-10-02T19:56:22Z

tests/tests_unit/test_base.py

+
+        obj = AssetList([Asset(external_id=f"ext-{i}", name=f"name-{i}") for i in range(5)])
+
+        result_df = obj.to_pandas()


This does not actually call your code; it will just call dump on each asset. I suggest you create a single test asset with quite a few of the accepted parameters:

1. external_id
2. name
3. parent_id
4. parent_external_id
5. description
6. data_set_id
7. metadata
8. source
9. labels
10. geo_location
11. id
12. created_time
13. last_updated_time
14. root_id
15. aggregates

Apart from that, the test looks great!
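
To illustrate the point with a toy model: the classes below are hypothetical stand-ins, not the SDK's `Asset` or `AssetList`, and they only assume what the comment above states, namely that the list-level conversion dumps each item rather than calling the per-resource `to_pandas`.

```python
import pandas as pd


class FakeResource:
    """Hypothetical stand-in for a single resource (not the SDK class)."""

    def __init__(self, **fields):
        self._fields = fields
        self.to_pandas_calls = 0  # counts how often the changed code runs

    def dump(self):
        return dict(self._fields)

    def to_pandas(self):
        self.to_pandas_calls += 1
        return pd.Series(self.dump()).to_frame(name="value")


class FakeResourceList(list):
    """Hypothetical list type whose conversion only dumps each item."""

    def to_pandas(self):
        return pd.DataFrame([item.dump() for item in self])


items = FakeResourceList(
    FakeResource(external_id=f"ext-{i}", name=f"name-{i}") for i in range(5)
)
list_df = items.to_pandas()  # never touches FakeResource.to_pandas

single = FakeResource(
    external_id="ext-0", name="name-0", description="d", source="s", data_set_id=7
)
single_df = single.to_pandas()  # this is the path the change affects
```

In this model, a test built on the list type leaves the per-resource path unexercised, which is why a single resource with many fields is the more useful fixture.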

gabor-huseb · 2023-10-04T07:51:34Z

@haakonvt I hope I understood the feedback on the test correctly.

haakonvt

Looks great! 🚀

gabor-huseb requested review from a team as code owners September 29, 2023 11:41

gabor-huseb closed this Sep 29, 2023

gabor-huseb reopened this Sep 29, 2023

gabor-huseb closed this Sep 29, 2023

gabor-huseb reopened this Sep 29, 2023

haakonvt reviewed Oct 1, 2023


haakonvt reviewed Oct 2, 2023


haakonvt changed the title ~~Gabor loc fixing~~ Fix slow DataFrame instantiation in CogniteResource.to_pandas Oct 2, 2023

haakonvt changed the title ~~Fix slow DataFrame instantiation in CogniteResource.to_pandas~~ Fix slow DataFrame creation in CogniteResource.to_pandas Oct 2, 2023

haakonvt approved these changes Oct 4, 2023


haakonvt force-pushed the gabor-loc-fixing branch from 7101d6f to ea10820 Compare October 4, 2023 09:30

gabor-huseb added 12 commits October 4, 2023 12:11

Replace row-by-row addition using .loc with a more efficient bulk addition.

dc65812

Adding unit test for the new initiation of df in to_pandas()

c307343

Adding self to test_to_pandas_method

f0cc366

Adding @pytest.mark.dsl to test_to_pandas_method

2878ecf

Adding more elements test list and feeding dumped dict directly (to_pandas())

dae6fb7

Adding making the list to AssetList in test_to_pandas_method()

221e39d

Fixing the expected df

f9a8224

returning to the two liner pd population

42d4217

Change DataFrame creation to use pd.DataFrame.from_dict directly

4d32010

Created DataFrame from dictionary items with columns 'Name' and 'Value'.

95fc54a

Created DataFrame going through series

e8dd479

Creating one big asset instead of list of assets. Also remove unnecessary df= line before return

d5efe27

haakonvt force-pushed the gabor-loc-fixing branch from ea10820 to d5efe27 Compare October 4, 2023 10:11

gabor-huseb merged commit dbcf6ca into master Oct 4, 2023
7 checks passed

gabor-huseb deleted the gabor-loc-fixing branch October 4, 2023 10:39
