
Normalize IndexSet.data DB storage #122

Merged
merged 12 commits into main from enh/indexset-db-storage on Nov 22, 2024

Conversation

@glatterf42 (Member) commented Oct 17, 2024

Thinking about our DB design, I have come across this description of data normalization. The way we currently store data from optimization items does not follow it, but it probably should. So this PR starts by normalizing the IndexSet storage, which is likely the easiest one.

First off, this PR renames IndexSet.elements to IndexSet.data to be consistent with the other optimization items, closing #119.

@danielhuppmann confirmed that the .data of any IndexSet will always be of the same type(), which is why we can move the data_type column to the IndexSet table and have a separate table for IndexSetData. This also allows us to drop a few test cases.

data is now always stored as a string because conversion from string to the other data types works best this way. This conversion happens in the hybrid property .data, which unfortunately is not an official column, requiring the adaptation of the tabulate() method in the IndexSetRepository.
If you immediately know how this or the type conversion can be made more efficient, please let me know :)
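
For context, a minimal, hypothetical sketch of what that conversion could look like (the real cast_data_as_type helper and the set of data_type values in this PR may differ):

def cast_data_as_type(value: str, data_type: str) -> float | int | str:
    # Map the stored type name to a constructor; "int"/"float"/"str"
    # are assumed labels, not necessarily the ones used in this PR.
    casters = {"int": int, "float": float, "str": str}
    return casters[data_type](value)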

Other than that, all tests are still passing, nothing in the user-facing API has changed, and I haven't yet run any kind of stress test on this setup. Also, DB migrations are still missing.

@glatterf42 glatterf42 added the enhancement New feature or request label Oct 17, 2024
@glatterf42 glatterf42 self-assigned this Oct 17, 2024
@glatterf42 glatterf42 marked this pull request as ready for review October 17, 2024 13:49
@glatterf42 glatterf42 linked an issue Oct 18, 2024 that may be closed by this pull request
return value
@db.hybrid_property
def data(self) -> list[float | int | str]:
return [cast_data_as_type(data, self.data_type) for data in self._data]
Contributor:

This also looks quite slow, but is probably unavoidable?

Member Author:

Maybe @danielhuppmann can clarify, but I thought we want the indexset.data attribute to accurately portray the data type. So I guess we can either cast types here or in the API and core layers.
But I was also wondering about the function itself: my intuition tells me there should be some efficient built-in that does this kind of casting. Do you know of a better way? Would it be faster to collect all self._data.values in a numpy array or so and cast that appropriately?

Member Author:

According to this guide, loops and numpy are both O(n), but locally, the tests/data test runs 0.1 seconds faster when using numpy as it is used now, so I hope that's better :)
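
As a rough illustration of the two approaches (names and data are made up, not from this PR):

import numpy as np

raw = ["1", "2", "3"]  # values as stored in the DB: always strings

# Loop-based cast: O(n) with per-element Python overhead.
as_int_loop = [int(value) for value in raw]

# numpy-based cast: also O(n), but the inner loop runs in C, so the
# constant factor is smaller for large inputs.
as_int_numpy = np.array(raw).astype(int).tolist()

assert as_int_loop == as_int_numpy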

Contributor:

Well, of course they are both O(n), yes, but that doesn't really say much about the actual cost.

Member Author:

I always thought it did, assuming n can't be made smaller. Looking forward to your workshop on this :)

Contributor:

Alright, a quick one-liner would be: Big O notation only describes the relationship between runtime (or memory use, if we are talking about space complexity) and some characteristic "n" of the input data. To get a real runtime estimate, one would have to insert scaling factors into the notation, and those would have to come from actual measurements...
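
To make the scaling-factor point concrete, a tiny measurement sketch (illustrative only, not code from this PR):

import timeit

values = [str(i) for i in range(100_000)]

# Both casts are O(n); only the constant factor differs, and Big O
# notation hides exactly that factor.
loop_time = timeit.timeit(lambda: [int(v) for v in values], number=10)
map_time = timeit.timeit(lambda: list(map(int, values)), number=10)

print(f"loop: {loop_time:.3f}s, map: {map_time:.3f}s")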

@@ -65,7 +65,7 @@ def _add_column(
self.columns.create(
name=column_name,
constrained_to_indexset=indexset.id,
dtype=pd.Series(indexset.elements).dtype.name,
Contributor:

Also, this looks quite expensive for what it does; no idea how to avoid it right now, though...

Member Author:

For all items like parameters, variables, etc., I'm thinking that tables similar to IndexSetData might relieve us of the whole Column table, especially if we end up translating parameter.data etc. to IndexSetData.ids, since the main purpose of the columns is currently to store which indexset the data columns belong to and which type they have. All of this can probably be taken care of in a better way, eliminating these function calls :)
However, this will only happen in a later PR, I'm afraid.

@meksor (Contributor) commented Oct 29, 2024

Hm, so I'm unsure whether to try to get this PR into better shape or not, @danielhuppmann.
I feel like we are introducing a lot of technical debt right now, but I would need a month or two to weed out some problems, get to know the code Fridolin added in the last year, and so on. It may be best to merge it for now and revisit when integrating the toolkit and reducing the codebase with some big abstractions. I know this is basically going to come down to me cleaning up, which I am not a fan of, but I'm kind of at a loss as to what else to do.

* Make data loading during tabulate() optional
* Use bulk insert for IndexSetData
* Use a normal property for `.data`
@glatterf42 (Member Author) left a comment

Some explanation for my recent changes. In addition, I read about partitioning, lambda caching, yield_per, relationship loading styles, and some other topics which might further improve performance, but I don't want to keep this PR around too long, to avoid scope creep.

)

@property
def data(self) -> list[float | int | str]:
Member Author:

I discussed how best to do this with an SQLAlchemy maintainer here. Since we are not going to use .data in SQL queries, we are better served by a normal Python property.
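
A minimal sketch of the plain-property version, reusing the cast helper and _data relationship from the earlier diff:

@property
def data(self) -> list[float | int | str]:
    # A plain Python property suffices: .data never appears in SQL
    # expressions, so hybrid_property's SQL-side behavior would go unused.
    return [cast_data_as_type(data, self.data_type) for data in self._data]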

]
try:
self.session.execute(
db.insert(IndexSetData).values(indexset__id=indexset_id),
Member Author:

Using an ORM-enabled bulk insert should be faster than creating individual objects (as I did before).
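
A minimal sketch of the 2.0-style ORM-enabled bulk insert, assuming an IndexSetData model with an indexset__id column as in the diff above (the value column name is an assumption):

from sqlalchemy import insert
from sqlalchemy.orm import Session

def add_indexset_data(session: Session, indexset_id: int, values: list[str]) -> None:
    # One executemany-style statement instead of one ORM object per row;
    # IndexSetData is the model from this PR.
    session.execute(
        insert(IndexSetData),
        [{"indexset__id": indexset_id, "value": v} for v in values],
    )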

Contributor:

Yes, this got very fast in SQLAlchemy 2!

@glatterf42 (Member Author)

> I know this is basically going to come down to me cleaning up, which I am not a fan of, but I'm kind of at a loss as to what else to do.

I'm trying to avoid accumulating technical debt, so all that you see is there because I don't know better. I'm happy to clean up after myself, though, if you teach me how to write better code :)

@meksor (Contributor) commented Oct 29, 2024

Alright, the new changes make me feel a lot better about this PR; sorry for the smack talk.

@meksor (Contributor) commented Oct 29, 2024

> I'm trying to avoid accumulating technical debt, so all that you see is there because I don't know better. I'm happy to clean up after myself, though, if you teach me how to write better code :)

Yeah, I understand, no worries... The thing is that I have some high-level intuition about performance hotspots and bad practices, but baking these into code is always the brunt of the work, and that might make my feedback seem vague and unactionable.

And it also means my critiques might at times not be realistic expectations (not only of you specifically, but of anyone) and require push-back. Sorry about that. I also realise that I'm becoming less accurate with every new project I take ownership of, since I am constantly reviewing and writing code in other code bases with other tech.

@glatterf42 (Member Author) commented Oct 29, 2024

Thanks for the review! I understand that being in charge of multiple projects consumes a lot of time and energy. I'm not offended by your comment; I only wish I could train my intuition to move toward where yours is faster. I'm happy to brainstorm ideas on how we could do this :)

My last commits here are entirely about the DB migrations. I forgot to include DB migrations with the PRs that introduced parameters, equations, and variables, so initially two commits in #101 added them back in. Since we'll likely merge this PR first and require a migration here, too, I've cherry-picked them to keep a linear migration history.

@glatterf42 glatterf42 merged commit 60d7e29 into main Nov 22, 2024
12 checks passed
@glatterf42 glatterf42 deleted the enh/indexset-db-storage branch November 22, 2024 09:54
Successfully merging this pull request may close these issues.

Make optimization<item>.data attributes consistent