-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CSVFileWrapper #273
Add CSVFileWrapper #273
Conversation
✅ Result of Pytest Coverage---------- coverage: platform linux, python 3.11.4-final-0 -----------
|
I am a bit worried about the order of PR that we will go through with regards to #270 as this is a change in a rewritten part of the storage component. Most likely your PR will win the race but it would be great if you could already think about the port to C++ as well :) Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR - added my comments and hopefully answered your questions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you.
One note: Right now, we always read the entire file into memory, which we should generally avoid (get_csv_reader
decodes the entire file into memory). This is due to our usage of csv
. For now, this is fine, since we are rewriting storage in C++ anyways. In C++, at one point, we should try to not load everything into memory if possible, but instead only request the part of the file we need (if that is even possible with variable length data, but maybe using new lines or so there is some call). For now, this shall not be an issue.
In this PR, the CSV file wrapper is introduced.
Bytes are decoded to strings, the required rows are extracted and labels are separated from samples. The labels are returned as integers, the samples are reassembled in csv-like format and converted to bytes. Pandas is not needed.
The bytes_parser_function takes care of decoding the bytes, splitting the string to separators and converting/casting the data.
Two open questions: