Add CSVFileWrapper #273

Merged
merged 19 commits into from
Jun 28, 2023

francescodeaglio
Collaborator

In this PR, the CSV file wrapper is introduced.

Bytes are decoded to strings, the required rows are extracted, and labels are separated from samples. The labels are returned as integers; the samples are reassembled in a CSV-like format and converted back to bytes. Pandas is not needed.
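The flow described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function name, its parameters, the default separator and encoding, and the assumption that the file has no header row are all hypothetical.

```python
import csv
import io


def split_samples_and_labels(
    raw: bytes,
    indices: list[int],
    label_index: int = 0,
    separator: str = ",",
    encoding: str = "utf-8",
) -> tuple[list[bytes], list[int]]:
    """Return (samples, labels) for the requested data rows.

    Hypothetical helper: `indices` count rows from 0 and the file is
    assumed to have no header row.
    """
    # Decode the whole payload and parse it with the standard csv module.
    rows = list(csv.reader(io.StringIO(raw.decode(encoding)), delimiter=separator))
    samples: list[bytes] = []
    labels: list[int] = []
    for i in indices:
        row = rows[i]
        # The label column is returned as an integer.
        labels.append(int(row[label_index]))
        # The remaining columns are rejoined in CSV-like form and re-encoded.
        features = row[:label_index] + row[label_index + 1:]
        samples.append(separator.join(features).encode(encoding))
    return samples, labels
```

Note that the plain `join` does not re-quote fields that contain the separator, which is why the result is only "CSV-like".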

The bytes_parser_function takes care of decoding the bytes, splitting the string on the separator, and converting/casting the data.
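A bytes parser for such samples could look roughly like this. The comma separator, UTF-8 encoding, and the choice of `float` as the target type are assumptions made for illustration; a real pipeline would pick whatever matches its data.

```python
def bytes_parser_function(data: bytes) -> list[float]:
    # Decode the sample bytes, split on the separator, and cast each field.
    # Separator (","), encoding (UTF-8) and float casting are assumptions.
    return [float(field) for field in data.decode("utf-8").split(",")]
```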

Two open questions:

  • Is it necessary to validate the content of the file (that it is a CSV, that all rows are the same size, that the 'label' column exists...)? Doing so requires a preliminary read of the file. The method is already written; the only question is whether to call it.
  • Can we assume that when an index list is requested (get_samples_from_indices), it is ordered and without duplicates? If not, selecting the rows becomes a bit more complicated (nothing terrible, but with this assumption the function is much more elegant).
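To make the trade-offs behind both questions concrete, here is a rough sketch. All function names are hypothetical and none of this is taken from the PR; the label column is identified by position here, which is also an assumption.

```python
import csv
import io


def validate_csv(raw: bytes, label_index: int, separator: str = ",") -> bool:
    """Check that all rows have the same width and the label column exists.

    This needs a full pass over the file, which is the cost raised in the
    first question.
    """
    reader = csv.reader(io.StringIO(raw.decode("utf-8")), delimiter=separator)
    width = None
    for row in reader:
        if width is None:
            width = len(row)
        if len(row) != width or not 0 <= label_index < width:
            return False
    return width is not None  # an empty file is treated as invalid


def select_rows_sorted(rows, indices):
    """Single pass; assumes `indices` is strictly increasing (sorted, no duplicates)."""
    wanted = iter(indices)
    target = next(wanted, None)
    selected = []
    for i, row in enumerate(rows):
        if target is None:
            break  # everything requested has been found; stop reading
        if i == target:
            selected.append(row)
            target = next(wanted, None)
    return selected


def select_rows_any_order(rows, indices):
    """Handles unordered or duplicate indices, but materializes every row first."""
    materialized = list(rows)
    return [materialized[i] for i in indices]
```

With sorted, duplicate-free indices the selection can stop as soon as the last requested row is seen; the general version has to keep all rows around so it can jump back and repeat.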

@francescodeaglio changed the title from "CSV file reader" to "CSV file wrapper" on Jun 22, 2023
github-actions bot commented Jun 22, 2023

✅ Result of Pytest Coverage

---------- coverage: platform linux, python 3.11.4-final-0 -----------

Name Stmts Miss Cover
modyn/common/benchmark/stopwatch.py 23 0 100%
modyn/common/ftp/ftp_server.py 31 0 100%
modyn/common/ftp/ftp_utils.py 33 12 64%
modyn/common/trigger_sample/trigger_sample_storage.py 88 3 97%
modyn/database/abstract_database_connection.py 35 0 100%
modyn/database/partition_by_meta.py 33 12 64%
modyn/metadata_database/metadata_base.py 3 0 100%
modyn/metadata_database/metadata_database_connection.py 44 3 93%
modyn/metadata_database/models/pipelines.py 9 1 89%
modyn/metadata_database/models/sample_training_metadata.py 15 0 100%
modyn/metadata_database/models/selector_state_metadata.py 45 10 78%
modyn/metadata_database/models/trained_models.py 14 0 100%
modyn/metadata_database/models/trigger_partitions.py 10 0 100%
modyn/metadata_database/models/trigger_training_metadata.py 14 0 100%
modyn/metadata_database/models/triggers.py 10 0 100%
modyn/metadata_processor/internal/grpc/metadata_processor_grpc_servicer.py 18 0 100%
modyn/metadata_processor/internal/grpc/metadata_processor_server.py 24 0 100%
modyn/metadata_processor/internal/metadata_processor_manager.py 23 4 83%
modyn/metadata_processor/metadata_processor.py 11 0 100%
modyn/metadata_processor/metadata_processor_entrypoint.py 24 1 96%
modyn/metadata_processor/processor_strategies/abstract_processor_strategy.py 29 0 100%
modyn/metadata_processor/processor_strategies/basic_processor_strategy.py 17 2 88%
modyn/metadata_processor/processor_strategies/processor_strategy_type.py 6 1 83%
modyn/model_storage/internal/grpc/grpc_server.py 22 0 100%
modyn/model_storage/internal/grpc/model_storage_grpc_servicer.py 65 0 100%
modyn/model_storage/model_storage.py 24 5 79%
modyn/model_storage/model_storage_entrypoint.py 32 3 91%
modyn/models/dlrm/cuda_ext/dot_based_interact.py 24 13 46%
modyn/models/dlrm/dlrm.py 58 9 84%
modyn/models/dlrm/nn/embeddings.py 123 64 48%
modyn/models/dlrm/nn/factories.py 24 9 62%
modyn/models/dlrm/nn/interactions.py 50 11 78%
modyn/models/dlrm/nn/mlps.py 77 23 70%
modyn/models/dlrm/nn/parts.py 55 4 93%
modyn/models/dlrm/setup.py 5 5 0%
modyn/models/dlrm/utils/install_lib.py 11 7 36%
modyn/models/dlrm/utils/utils.py 28 0 100%
modyn/models/resnet18/resnet18.py 6 2 67%
modyn/selector/internal/grpc/selector_grpc_servicer.py 75 18 76%
modyn/selector/internal/grpc/selector_server.py 26 1 96%
modyn/selector/internal/selector_manager.py 100 33 67%
modyn/selector/internal/selector_strategies/abstract_downsample_strategy.py 30 7 77%
modyn/selector/internal/selector_strategies/abstract_presample_strategy.py 64 4 94%
modyn/selector/internal/selector_strategies/abstract_selection_strategy.py 157 15 90%
modyn/selector/internal/selector_strategies/freshness_sampling_strategy.py 110 8 93%
modyn/selector/internal/selector_strategies/gradnorm_downsampling_strategy.py 4 0 100%
modyn/selector/internal/selector_strategies/loss_downsampling_strategy.py 4 0 100%
modyn/selector/internal/selector_strategies/new_data_strategy.py 90 6 93%
modyn/selector/internal/selector_strategies/random_presampling_strategy.py 7 0 100%
modyn/selector/selector.py 60 8 87%
modyn/selector/selector_entrypoint.py 24 1 96%
modyn/storage/internal/database/models/dataset.py 20 0 100%
modyn/storage/internal/database/models/file.py 17 0 100%
modyn/storage/internal/database/models/sample.py 44 7 84%
modyn/storage/internal/database/storage_base.py 3 0 100%
modyn/storage/internal/database/storage_database_connection.py 53 0 100%
modyn/storage/internal/database/storage_database_utils.py 21 0 100%
modyn/storage/internal/file_watcher/new_file_watcher.py 214 44 79%
modyn/storage/internal/file_watcher/new_file_watcher_watch_dog.py 59 9 85%
modyn/storage/internal/file_wrapper/abstract_file_wrapper.py 23 1 96%
modyn/storage/internal/file_wrapper/binary_file_wrapper.py 49 1 98%
modyn/storage/internal/file_wrapper/csv_file_wrapper.py 94 8 91%
modyn/storage/internal/file_wrapper/file_wrapper_type.py 8 0 100%
modyn/storage/internal/file_wrapper/single_sample_file_wrapper.py 48 2 96%
modyn/storage/internal/filesystem_wrapper/abstract_filesystem_wrapper.py 31 1 97%
modyn/storage/internal/filesystem_wrapper/filesystem_wrapper_type.py 6 0 100%
modyn/storage/internal/filesystem_wrapper/local_filesystem_wrapper.py 52 0 100%
modyn/storage/internal/grpc/grpc_server.py 20 0 100%
modyn/storage/internal/grpc/storage_grpc_servicer.py 123 10 92%
modyn/storage/storage.py 34 1 97%
modyn/storage/storage_entrypoint.py 24 1 96%
modyn/supervisor/entrypoint.py 39 5 87%
modyn/supervisor/internal/grpc_handler.py 190 33 83%
modyn/supervisor/internal/supervisor_counter.py 122 12 90%
modyn/supervisor/internal/trigger.py 6 0 100%
modyn/supervisor/internal/triggers/amounttrigger.py 15 0 100%
modyn/supervisor/internal/triggers/timetrigger.py 27 1 96%
modyn/supervisor/supervisor.py 219 19 91%
modyn/tests/database/test_abstract_database_connection.py 19 0 100%
modyn/tests/metadata_database/models/test_pipelines.py 33 0 100%
modyn/tests/metadata_database/models/test_sample_training_metadata.py 40 0 100%
modyn/tests/metadata_database/models/test_selector_state_metadata.py 46 0 100%
modyn/tests/metadata_database/models/test_trained_models.py 46 0 100%
modyn/tests/metadata_database/models/test_trigger_training_metadata.py 38 0 100%
modyn/tests/metadata_database/models/test_triggers.py 33 0 100%
modyn/tests/metadata_database/test_metadata_database_connection.py 29 0 100%
modyn/tests/metadata_processor/internal/grpc/test_metadata_processor_grpc_servicer.py 26 0 100%
modyn/tests/metadata_processor/internal/grpc/test_metadata_processor_server.py 27 0 100%
modyn/tests/metadata_processor/internal/test_metadata_processor_manager.py 42 3 93%
modyn/tests/metadata_processor/processor_strategies/test_abstract_processor_strategy.py 60 0 100%
modyn/tests/metadata_processor/processor_strategies/test_basic_processor_strategy.py 43 0 100%
modyn/tests/metadata_processor/test_metadata_processor.py 22 3 86%
modyn/tests/metadata_processor/test_metadata_processor_entrypoint.py 22 0 100%
modyn/tests/model_storage/internal/grpc/test_model_storage_grpc_server.py 13 0 100%
modyn/tests/model_storage/internal/grpc/test_model_storage_grpc_servicer.py 78 1 99%
modyn/tests/model_storage/test_model_storage.py 35 5 86%
modyn/tests/model_storage/test_model_storage_entrypoint.py 22 0 100%
modyn/tests/models/test_dlrm.py 19 0 100%
modyn/tests/selector/internal/grpc/test_selector_grpc_servicer.py 145 0 100%
modyn/tests/selector/internal/grpc/test_selector_server.py 42 0 100%
modyn/tests/selector/internal/selector_strategies/test_abstract_downsample_strategy.py 43 1 98%
modyn/tests/selector/internal/selector_strategies/test_abstract_presample_strategy.py 254 0 100%
modyn/tests/selector/internal/selector_strategies/test_abstract_selection_strategy.py 184 0 100%
modyn/tests/selector/internal/selector_strategies/test_freshness_sampling_strategy.py 308 0 100%
modyn/tests/selector/internal/selector_strategies/test_gradnorm_downsample_strategy.py 32 0 100%
modyn/tests/selector/internal/selector_strategies/test_loss_downsample_strategy.py 24 0 100%
modyn/tests/selector/internal/selector_strategies/test_new_data_strategy.py 519 0 100%
modyn/tests/selector/internal/selector_strategies/test_random_presampling_strategy.py 25 0 100%
modyn/tests/selector/internal/test_selector_manager.py 139 3 98%
modyn/tests/selector/internal/trigger_sample/test_trigger_sample_storage.py 176 0 100%
modyn/tests/selector/test_selector.py 84 3 96%
modyn/tests/selector/test_selector_entrypoint.py 22 0 100%
modyn/tests/storage/internal/database/models/test_dataset.py 47 0 100%
modyn/tests/storage/internal/database/models/test_file.py 64 0 100%
modyn/tests/storage/internal/database/models/test_sample.py 73 0 100%
modyn/tests/storage/internal/database/test_database_storage_utils.py 21 2 90%
modyn/tests/storage/internal/database/test_storage_database_connection.py 54 3 94%
modyn/tests/storage/internal/file_watcher/test_new_file_watcher.py 377 13 97%
modyn/tests/storage/internal/file_watcher/test_new_file_watcher_watch_dog.py 95 1 99%
modyn/tests/storage/internal/file_wrapper/test_binary_file_wrapper.py 92 0 100%
modyn/tests/storage/internal/file_wrapper/test_csv_file_wrapper.py 165 1 99%
modyn/tests/storage/internal/file_wrapper/test_file_wrapper_type.py 6 1 83%
modyn/tests/storage/internal/file_wrapper/test_single_sample_file_wrapper.py 90 0 100%
modyn/tests/storage/internal/filesystem_wrapper/test_filesystem_wrapper_type.py 6 1 83%
modyn/tests/storage/internal/filesystem_wrapper/test_local_filesystem_wrapper.py 167 0 100%
modyn/tests/storage/internal/grpc/test_grpc_server.py 11 0 100%
modyn/tests/storage/internal/grpc/test_storage_grpc_servicer.py 239 3 99%
modyn/tests/storage/test_storage.py 42 1 98%
modyn/tests/storage/test_storage_entrypoint.py 21 0 100%
modyn/tests/supervisor/internal/test_grpc_handler.py 241 0 100%
modyn/tests/supervisor/internal/test_status_bar.py 133 0 100%
modyn/tests/supervisor/internal/test_trigger.py 5 0 100%
modyn/tests/supervisor/internal/triggers/test_amounttrigger.py 30 0 100%
modyn/tests/supervisor/internal/triggers/test_timetrigger.py 26 0 100%
modyn/tests/supervisor/test_entrypoint.py 29 0 100%
modyn/tests/supervisor/test_supervisor.py 345 1 99%
modyn/tests/trainer_server/internal/data/key_sources/test_local_key_source.py 91 0 100%
modyn/tests/trainer_server/internal/data/key_sources/test_selector_key_source.py 92 0 100%
modyn/tests/trainer_server/internal/data/test_data_utils.py 22 1 95%
modyn/tests/trainer_server/internal/data/test_local_dataset_writer.py 59 0 100%
modyn/tests/trainer_server/internal/data/test_online_dataset.py 295 3 99%
modyn/tests/trainer_server/internal/grpc/test_trainer_server_grpc_server.py 17 0 100%
modyn/tests/trainer_server/internal/grpc/test_trainer_server_grpc_servicer.py 379 9 98%
modyn/tests/trainer_server/internal/metadata_collector/test_metadata_collector.py 41 0 100%
modyn/tests/trainer_server/internal/trainer/metadata_pytorch_callbacks/test_loss_callback.py 51 1 98%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_abstract_remote_downsampling_strategy.py 12 0 100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_get_tensor_subset.py 56 0 100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_remote_gradnorm_downsample.py 92 0 100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_remote_loss_downsample.py 82 0 100%
modyn/tests/trainer_server/internal/trainer/test_pytorch_trainer.py 363 43 88%
modyn/tests/trainer_server/test_trainer_server.py 34 0 100%
modyn/tests/trainer_server/test_trainer_server_entrypoint.py 22 0 100%
modyn/tests/utils/test_utils.py 68 0 100%
modyn/trainer_server/custom_lr_schedulers/dlrm_lr_scheduler/dlrm_scheduler.py 33 33 0%
modyn/trainer_server/internal/dataset/data_utils.py 12 0 100%
modyn/trainer_server/internal/dataset/key_sources/abstract_key_source.py 21 5 76%
modyn/trainer_server/internal/dataset/key_sources/local_key_source.py 21 1 95%
modyn/trainer_server/internal/dataset/key_sources/selector_key_source.py 54 2 96%
modyn/trainer_server/internal/dataset/local_dataset_writer.py 68 4 94%
modyn/trainer_server/internal/dataset/online_dataset.py 128 4 97%
modyn/trainer_server/internal/grpc/trainer_server_grpc_server.py 22 0 100%
modyn/trainer_server/internal/grpc/trainer_server_grpc_servicer.py 232 33 86%
modyn/trainer_server/internal/metadata_collector/metadata_collector.py 33 0 100%
modyn/trainer_server/internal/mocks/mock_metadata_processor.py 22 2 91%
modyn/trainer_server/internal/trainer/metadata_pytorch_callbacks/base_callback.py 15 1 93%
modyn/trainer_server/internal/trainer/metadata_pytorch_callbacks/loss_callback.py 21 0 100%
modyn/trainer_server/internal/trainer/pytorch_trainer.py 322 94 71%
modyn/trainer_server/internal/trainer/remote_downsamplers/abstract_remote_downsample_strategy.py 29 3 90%
modyn/trainer_server/internal/trainer/remote_downsamplers/remote_gradnorm_downsample.py 37 3 92%
modyn/trainer_server/internal/trainer/remote_downsamplers/remote_loss_downsample.py 29 3 90%
modyn/trainer_server/internal/utils/metric_type.py 3 0 100%
modyn/trainer_server/internal/utils/trainer_messages.py 4 0 100%
modyn/trainer_server/internal/utils/training_info.py 44 1 98%
modyn/trainer_server/internal/utils/training_process_info.py 10 0 100%
modyn/trainer_server/trainer_server.py 19 0 100%
modyn/trainer_server/trainer_server_entrypoint.py 32 3 91%
modyn/utils/utils.py 98 12 88%
TOTAL 11629 767 93%
Coverage HTML written to
================= 506 passed, 22

@francescodeaglio self-assigned this on Jun 22, 2023
@vGsteiger
Collaborator

I am a bit worried about the order in which the PRs will be merged with regard to #270, as this PR changes a part of the storage component that is being rewritten. Most likely your PR will win the race, but it would be great if you could already think about the port to C++ as well :) Thanks!

@MaxiBoether changed the title from "CSV file wrapper" to "ADD CsvFileWrapper" on Jun 27, 2023
@MaxiBoether changed the title from "ADD CsvFileWrapper" to "Add CSVFileWrapper" on Jun 27, 2023
Contributor

@MaxiBoether left a comment

Thank you for the PR - added my comments and hopefully answered your questions :)

Resolved review threads (comment bodies collapsed):
modyn/config/schema/modyn_config_schema.yaml (4 threads, 2 outdated)
modyn/storage/internal/file_wrapper/csv_file_wrapper.py (3 threads, all outdated)
Contributor

@MaxiBoether left a comment

LGTM! Thank you.

One note: right now, we always read the entire file into memory (get_csv_reader decodes the entire file into memory), which we should generally avoid. This is due to our usage of csv. For now, this is fine, since we are rewriting storage in C++ anyway. In C++, at some point, we should try not to load everything into memory if possible, and instead only request the part of the file we need (if that is even possible with variable-length data; maybe there is some call that works on newlines or similar). For now, this should not be an issue.
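For illustration, a line-by-line read that avoids holding the whole file in memory might look like the sketch below. It is written in Python rather than C++, the function name is hypothetical, and it assumes that the requested indices are strictly increasing and that no field contains an embedded newline (so one line equals one row).

```python
import io
from typing import BinaryIO, Iterator


def iter_selected_lines(stream: BinaryIO, indices: list[int]) -> Iterator[bytes]:
    """Yield only the requested lines, reading the stream one line at a time.

    Assumes `indices` is strictly increasing and that rows never contain
    embedded newlines.
    """
    wanted = iter(indices)
    target = next(wanted, None)
    for line_number, line in enumerate(stream):
        if target is None:
            return  # nothing left to fetch; stop reading early
        if line_number == target:
            yield line.rstrip(b"\r\n")
            target = next(wanted, None)
```

This still scans sequentially from the start of the file up to the last requested row; it only avoids materializing the full content, and it does not seek directly to a row.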

@francescodeaglio merged commit 61d66bb into main on Jun 28, 2023