Normalize IndexSet.data (DB storage) #122
Conversation
* Introduce new IndexSetData table to normalize data
* Rename .elements to .data for consistency
```python
    return value

@db.hybrid_property
def data(self) -> list[float | int | str]:
    return [cast_data_as_type(data, self.data_type) for data in self._data]
```
This also looks quite slow, but is probably unavoidable?
Maybe @danielhuppmann can clarify, but I thought we want the `indexset.data` attribute to accurately portray the data type. So I guess we can either cast types here or in the API and core layers.

As for the function itself, I was also wondering: my intuition tells me there should be some efficient built-in that does this kind of casting. Do you know of a better way? Would it be faster to collect all `self._data.value`s in a numpy array or so and cast that appropriately?
According to this guide, using loops and using numpy are both O(n), but locally, the `tests/data` test runs 0.1 seconds faster with numpy as it is used now, so I hope that's better :)
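As a minimal sketch of what such a vectorized cast could look like (assuming values are stored as strings and `data_type` is one of `"float"`, `"int"`, `"str"`; the helper name is hypothetical, not the PR's actual code):

```python
import numpy as np

# Hypothetical helper, not the PR's implementation.
_PY_TYPES = {"float": float, "int": int, "str": str}


def cast_all(raw_values: list[str], data_type: str) -> list[float | int | str]:
    # numpy parses/casts the whole column at once instead of looping in Python;
    # .tolist() converts the result back to native Python scalars.
    return np.asarray(raw_values, dtype=_PY_TYPES[data_type]).tolist()


# e.g. cast_all(["1", "2", "3"], "int") -> [1, 2, 3]
```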
Well, of course they are both O(n), yes, but that doesn't really say much about the actual cost.
I always thought it did, assuming the `n` can't be made smaller or so. Looking forward to your workshop on this :)
Alright, a quick one-liner would be: Big O notation only describes the relationship between runtime (or memory use, if we are talking about space complexity) and some characteristic "n" of the input data. To get a real runtime estimate, one would have to insert scaling factors into the notation, and those would have to come from actual measurements...
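To illustrate the constant-factor point, a hypothetical micro-benchmark (not from the PR; absolute numbers vary by machine, but both approaches scale linearly):

```python
import timeit

import numpy as np

values = [str(i) for i in range(100_000)]

# Both casts are O(n) in len(values); only the constant factor differs.
loop_time = timeit.timeit(lambda: [int(v) for v in values], number=10)
numpy_time = timeit.timeit(lambda: np.asarray(values, dtype=int).tolist(), number=10)

print(f"python loop: {loop_time:.3f}s, numpy cast: {numpy_time:.3f}s")
```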
```
@@ -65,7 +65,7 @@ def _add_column(
    self.columns.create(
        name=column_name,
        constrained_to_indexset=indexset.id,
        dtype=pd.Series(indexset.elements).dtype.name,
```
Also, this looks quite expensive for what it does; no idea how to avoid it right now, though...
For all items like `parameter`, `variable`, etc., I'm thinking that tables similar to `IndexSetData` might relieve us of the whole `Column` table, especially if we end up translating `parameter.data` etc. to `IndexSetData.id`s, since the main purpose of the columns is currently to store which indexset the data columns belong to and which type they have. All of this can probably be taken care of in a better way, eliminating these function calls :)

However, this will only happen in a later PR, I'm afraid.
Hm, so I'm unsure whether to try to get this PR into better shape or not @danielhuppmann.
* Make data loading during tabulate() optional
* Use bulk insert for IndexSetData
* Use a normal property for `.data`
Some explanation for my recent changes. In addition, I read about partitioning, lambda caching, `yield_per`, relationship loading styles, and some other topics which might further improve performance, but I don't want to keep this PR around for too long, to avoid scope creep.
```python
    )

    @property
    def data(self) -> list[float | int | str]:
```
I discussed how to do this best with an SQLAlchemy maintainer here. Since we are not going to use `.data` in SQL queries, we should be better served with a normal Python property.
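For anyone reading along, the difference in a nutshell (a hypothetical model, not this repo's `IndexSet`): a `hybrid_property` also works as a SQL expression at the class level, while a plain `@property` only works on instances, which is all we need here.

```python
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "item"

    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[int]

    @hybrid_property
    def doubled(self) -> int:
        # usable on instances *and* in queries: select(Item).where(Item.doubled > 4)
        return self.value * 2

    @property
    def doubled_plain(self) -> int:
        # instance-only; sufficient when the attribute never appears in SQL
        return self.value * 2
```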
```python
        ]
        try:
            self.session.execute(
                db.insert(IndexSetData).values(indexset__id=indexset_id),
```
Using ORM-enabled bulk inserting should be faster than creating individual objects (as I did before).
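For reference, the documented SQLAlchemy 2.x ORM bulk-INSERT pattern looks roughly like this (a self-contained sketch; the simplified model and column names are assumptions based on the snippet above, not the PR's exact code):

```python
from sqlalchemy import create_engine, insert
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class IndexSetData(Base):  # simplified stand-in for the PR's table
    __tablename__ = "indexset_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    indexset__id: Mapped[int]
    value: Mapped[str]


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # One INSERT statement executed with a list of parameter dicts ("executemany"),
    # instead of creating and adding one ORM object per value.
    session.execute(
        insert(IndexSetData),
        [{"indexset__id": 1, "value": str(v)} for v in ["foo", "bar", "baz"]],
    )
    session.commit()
```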
Yes, this got very fast in SQLAlchemy 2!
I'm trying to avoid accumulating technical debt, so all that you see is there because I don't know better. I'm happy to clean up after myself, though, if you teach me how to write better code :)
Alright, the new changes make me feel a lot better about this PR, sorry for the smack talk.
Yeah, I understand, no worries... the thing is that I have some high-level intuition about performance hotspots and bad practices, but baking these into code is always the brunt of the work, and that might make my feedback seem vague and unactionable. It also means my critiques might at times not be realistic expectations (not only of you specifically, but of anyone) and require pushback. Sorry about that. I also realise that I'm becoming less accurate with every new project I take ownership of, since I am constantly reviewing and writing code in other code bases with other tech.
Thanks for the review! I understand that being in charge of multiple projects consumes a lot of time and energy. I'm not offended by your comment; I only wish I could train my intuition to move towards where yours is faster. I'm happy to brainstorm ideas for how we could do this :)

My last commits here target the DB migrations entirely. I forgot to include DB migrations with the PRs that introduced parameters, equations, and variables, so initially two commits in #101 added them back in. Since we'll likely merge this PR first and require a migration here, too, I've cherry-picked them to keep a linear migration history.
Thinking about our DB design, I have come across this description of data normalization. This does not apply to the way we currently store data from optimization items, but it probably should. So this PR starts by normalizing the `IndexSet` storage, which is likely the easiest one.

First off, this PR renames `IndexSet.elements` to `IndexSet.data` to be consistent with the other optimization items, closing #119.

@danielhuppmann confirmed that the `.data` of any `IndexSet` will always be of the same `type()`, which is why we can move the `data_type` column to the `IndexSet` table and have a separate table for `IndexSetData`. This also allows us to drop a few test cases. `data` is now always stored as a `string` because conversion from `string` to the other data types works best this way. This conversion happens in the hybrid property `.data`, which is unfortunately not an official column, requiring the adaptation of the `tabulate()` method in the `IndexSetRepository`.

If you immediately know how this or the type conversion can be made more efficient, please let me know :)

Other than that, all tests are still passing, nothing in the user-facing API has changed, and I haven't yet run any kind of stress test on this setup. Also, DB migrations are still missing.
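To make the intended layout concrete, here is a rough sketch of the normalized structure described above (class, table, and column names are approximations for illustration, not the PR's exact code):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class IndexSetData(Base):
    __tablename__ = "indexset_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    indexset__id: Mapped[int] = mapped_column(ForeignKey("indexset.id"))
    value: Mapped[str]  # every value is stored as a string


class IndexSet(Base):
    __tablename__ = "indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    # all values of one IndexSet share a single type, so it lives on the parent row
    data_type: Mapped[str | None]  # "float" | "int" | "str"
    _data: Mapped[list[IndexSetData]] = relationship()

    @property
    def data(self) -> list[float | int | str]:
        # cast the stored strings back to the declared type on access
        caster = {"float": float, "int": int}.get(self.data_type, str)
        return [caster(d.value) for d in self._data]
```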