Add CSVFileWrapper #273

francescodeaglio · 2023-06-22T12:27:54Z

This PR introduces the CSV file wrapper.

The wrapper decodes the file's bytes to a string, extracts the requested rows, and separates labels from samples. Labels are returned as integers; samples are reassembled in CSV-like format and converted back to bytes. Pandas is not needed.
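
For illustration, the flow described above might look roughly like this. It is a minimal sketch, not the code in csv_file_wrapper.py: the function name split_samples_and_labels is hypothetical, and it assumes a file without a header row whose label column is given by position.

```python
import csv
import io


def split_samples_and_labels(
    data: bytes, indices: list[int], label_index: int, separator: str = ","
) -> tuple[list[bytes], list[int]]:
    """Return (samples, labels) for the requested rows of a CSV payload.

    Hypothetical helper: assumes no header row and an integer label column.
    """
    # Decode the bytes and parse every row with the standard csv module.
    rows = list(csv.reader(io.StringIO(data.decode("utf-8")), delimiter=separator))
    samples: list[bytes] = []
    labels: list[int] = []
    for i in indices:
        row = rows[i]  # raises IndexError if i is past the last row
        labels.append(int(row[label_index]))
        # Drop the label column and rebuild a CSV-like line from the rest.
        # Note: a plain join does not re-quote fields containing the separator.
        features = row[:label_index] + row[label_index + 1:]
        samples.append(separator.join(features).encode("utf-8"))
    return samples, labels
```

Only the standard library is used, which matches the note that Pandas is not needed.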

The bytes_parser_function is responsible for decoding those bytes, splitting the string on the separator, and casting the fields to the desired types.
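
As a rough example, a parser for purely numeric samples could look like the sketch below. The comma separator and the float cast are assumptions made for illustration; a real pipeline would pick the separator and target type that fit its dataset.

```python
def bytes_parser_function(data: bytes) -> list[float]:
    # Decode the sample bytes, split on the separator (assumed to be a comma),
    # and cast every field to float (assumed to be the desired type).
    return [float(field) for field in data.decode("utf-8").split(",")]
```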

Two open questions:

1. Is it necessary to validate the content of the file (that it is a CSV, that all rows have the same number of columns, that the 'label' column exists, ...)? Validation requires a preliminary read of the file. The method is already written; the only question is whether to call it.
2. Can we assume that the index list passed to get_samples_from_indices is sorted and free of duplicates? If not, the row-selection function becomes a bit more complicated (nothing terrible, but with this assumption it is much more elegant).
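
To make both questions concrete, here is a hedged sketch of what validation and unordered row selection could look like. The names validate_csv and rows_for_indices are hypothetical and this is not the implementation in the PR; it assumes a file without a header row and a label column addressed by position.

```python
import csv
import io


def validate_csv(data: bytes, label_index: int, separator: str = ",") -> bool:
    """Check that all rows have the same width and an integer label.

    This needs one full pass over the file, which is the cost raised above.
    """
    reader = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=separator)
    width = None
    for row in reader:
        if width is None:
            width = len(row)  # the first row defines the expected width
        if len(row) != width or not 0 <= label_index < width:
            return False
        try:
            int(row[label_index])
        except ValueError:
            return False
    return width is not None  # an empty file is treated as invalid


def rows_for_indices(
    data: bytes, indices: list[int], separator: str = ","
) -> list[list[str]]:
    """Return rows in the requested order; unsorted and duplicate indices are fine."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=separator)
    wanted = set(indices)
    # One pass over the file, keeping only the rows that were asked for.
    found = {i: row for i, row in enumerate(reader) if i in wanted}
    missing = wanted.difference(found)
    if missing:
        raise IndexError(f"Indices out of range: {sorted(missing)}")
    # Re-emit the rows in the caller's order, repeating duplicates.
    return [found[i] for i in indices]
```

With sorted, duplicate-free indices the dictionary is unnecessary and a single filtered pass suffices, which is the more elegant variant mentioned in the second question.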

github-actions · 2023-06-22T12:38:57Z

✅ Result of Pytest Coverage

---------- coverage: platform linux, python 3.11.4-final-0 -----------

Name	Stmts	Miss	Cover
modyn/common/benchmark/stopwatch.py	23	0	100%
modyn/common/ftp/ftp_server.py	31	0	100%
modyn/common/ftp/ftp_utils.py	33	12	64%
modyn/common/trigger_sample/trigger_sample_storage.py	88	3	97%
modyn/database/abstract_database_connection.py	35	0	100%
modyn/database/partition_by_meta.py	33	12	64%
modyn/metadata_database/metadata_base.py	3	0	100%
modyn/metadata_database/metadata_database_connection.py	44	3	93%
modyn/metadata_database/models/pipelines.py	9	1	89%
modyn/metadata_database/models/sample_training_metadata.py	15	0	100%
modyn/metadata_database/models/selector_state_metadata.py	45	10	78%
modyn/metadata_database/models/trained_models.py	14	0	100%
modyn/metadata_database/models/trigger_partitions.py	10	0	100%
modyn/metadata_database/models/trigger_training_metadata.py	14	0	100%
modyn/metadata_database/models/triggers.py	10	0	100%
modyn/metadata_processor/internal/grpc/metadata_processor_grpc_servicer.py	18	0	100%
modyn/metadata_processor/internal/grpc/metadata_processor_server.py	24	0	100%
modyn/metadata_processor/internal/metadata_processor_manager.py	23	4	83%
modyn/metadata_processor/metadata_processor.py	11	0	100%
modyn/metadata_processor/metadata_processor_entrypoint.py	24	1	96%
modyn/metadata_processor/processor_strategies/abstract_processor_strategy.py	29	0	100%
modyn/metadata_processor/processor_strategies/basic_processor_strategy.py	17	2	88%
modyn/metadata_processor/processor_strategies/processor_strategy_type.py	6	1	83%
modyn/model_storage/internal/grpc/grpc_server.py	22	0	100%
modyn/model_storage/internal/grpc/model_storage_grpc_servicer.py	65	0	100%
modyn/model_storage/model_storage.py	24	5	79%
modyn/model_storage/model_storage_entrypoint.py	32	3	91%
modyn/models/dlrm/cuda_ext/dot_based_interact.py	24	13	46%
modyn/models/dlrm/dlrm.py	58	9	84%
modyn/models/dlrm/nn/embeddings.py	123	64	48%
modyn/models/dlrm/nn/factories.py	24	9	62%
modyn/models/dlrm/nn/interactions.py	50	11	78%
modyn/models/dlrm/nn/mlps.py	77	23	70%
modyn/models/dlrm/nn/parts.py	55	4	93%
modyn/models/dlrm/setup.py	5	5	0%
modyn/models/dlrm/utils/install_lib.py	11	7	36%
modyn/models/dlrm/utils/utils.py	28	0	100%
modyn/models/resnet18/resnet18.py	6	2	67%
modyn/selector/internal/grpc/selector_grpc_servicer.py	75	18	76%
modyn/selector/internal/grpc/selector_server.py	26	1	96%
modyn/selector/internal/selector_manager.py	100	33	67%
modyn/selector/internal/selector_strategies/abstract_downsample_strategy.py	30	7	77%
modyn/selector/internal/selector_strategies/abstract_presample_strategy.py	64	4	94%
modyn/selector/internal/selector_strategies/abstract_selection_strategy.py	157	15	90%
modyn/selector/internal/selector_strategies/freshness_sampling_strategy.py	110	8	93%
modyn/selector/internal/selector_strategies/gradnorm_downsampling_strategy.py	4	0	100%
modyn/selector/internal/selector_strategies/loss_downsampling_strategy.py	4	0	100%
modyn/selector/internal/selector_strategies/new_data_strategy.py	90	6	93%
modyn/selector/internal/selector_strategies/random_presampling_strategy.py	7	0	100%
modyn/selector/selector.py	60	8	87%
modyn/selector/selector_entrypoint.py	24	1	96%
modyn/storage/internal/database/models/dataset.py	20	0	100%
modyn/storage/internal/database/models/file.py	17	0	100%
modyn/storage/internal/database/models/sample.py	44	7	84%
modyn/storage/internal/database/storage_base.py	3	0	100%
modyn/storage/internal/database/storage_database_connection.py	53	0	100%
modyn/storage/internal/database/storage_database_utils.py	21	0	100%
modyn/storage/internal/file_watcher/new_file_watcher.py	214	44	79%
modyn/storage/internal/file_watcher/new_file_watcher_watch_dog.py	59	9	85%
modyn/storage/internal/file_wrapper/abstract_file_wrapper.py	23	1	96%
modyn/storage/internal/file_wrapper/binary_file_wrapper.py	49	1	98%
modyn/storage/internal/file_wrapper/csv_file_wrapper.py	94	8	91%
modyn/storage/internal/file_wrapper/file_wrapper_type.py	8	0	100%
modyn/storage/internal/file_wrapper/single_sample_file_wrapper.py	48	2	96%
modyn/storage/internal/filesystem_wrapper/abstract_filesystem_wrapper.py	31	1	97%
modyn/storage/internal/filesystem_wrapper/filesystem_wrapper_type.py	6	0	100%
modyn/storage/internal/filesystem_wrapper/local_filesystem_wrapper.py	52	0	100%
modyn/storage/internal/grpc/grpc_server.py	20	0	100%
modyn/storage/internal/grpc/storage_grpc_servicer.py	123	10	92%
modyn/storage/storage.py	34	1	97%
modyn/storage/storage_entrypoint.py	24	1	96%
modyn/supervisor/entrypoint.py	39	5	87%
modyn/supervisor/internal/grpc_handler.py	190	33	83%
modyn/supervisor/internal/supervisor_counter.py	122	12	90%
modyn/supervisor/internal/trigger.py	6	0	100%
modyn/supervisor/internal/triggers/amounttrigger.py	15	0	100%
modyn/supervisor/internal/triggers/timetrigger.py	27	1	96%
modyn/supervisor/supervisor.py	219	19	91%
modyn/tests/database/test_abstract_database_connection.py	19	0	100%
modyn/tests/metadata_database/models/test_pipelines.py	33	0	100%
modyn/tests/metadata_database/models/test_sample_training_metadata.py	40	0	100%
modyn/tests/metadata_database/models/test_selector_state_metadata.py	46	0	100%
modyn/tests/metadata_database/models/test_trained_models.py	46	0	100%
modyn/tests/metadata_database/models/test_trigger_training_metadata.py	38	0	100%
modyn/tests/metadata_database/models/test_triggers.py	33	0	100%
modyn/tests/metadata_database/test_metadata_database_connection.py	29	0	100%
modyn/tests/metadata_processor/internal/grpc/test_metadata_processor_grpc_servicer.py	26	0	100%
modyn/tests/metadata_processor/internal/grpc/test_metadata_processor_server.py	27	0	100%
modyn/tests/metadata_processor/internal/test_metadata_processor_manager.py	42	3	93%
modyn/tests/metadata_processor/processor_strategies/test_abstract_processor_strategy.py	60	0	100%
modyn/tests/metadata_processor/processor_strategies/test_basic_processor_strategy.py	43	0	100%
modyn/tests/metadata_processor/test_metadata_processor.py	22	3	86%
modyn/tests/metadata_processor/test_metadata_processor_entrypoint.py	22	0	100%
modyn/tests/model_storage/internal/grpc/test_model_storage_grpc_server.py	13	0	100%
modyn/tests/model_storage/internal/grpc/test_model_storage_grpc_servicer.py	78	1	99%
modyn/tests/model_storage/test_model_storage.py	35	5	86%
modyn/tests/model_storage/test_model_storage_entrypoint.py	22	0	100%
modyn/tests/models/test_dlrm.py	19	0	100%
modyn/tests/selector/internal/grpc/test_selector_grpc_servicer.py	145	0	100%
modyn/tests/selector/internal/grpc/test_selector_server.py	42	0	100%
modyn/tests/selector/internal/selector_strategies/test_abstract_downsample_strategy.py	43	1	98%
modyn/tests/selector/internal/selector_strategies/test_abstract_presample_strategy.py	254	0	100%
modyn/tests/selector/internal/selector_strategies/test_abstract_selection_strategy.py	184	0	100%
modyn/tests/selector/internal/selector_strategies/test_freshness_sampling_strategy.py	308	0	100%
modyn/tests/selector/internal/selector_strategies/test_gradnorm_downsample_strategy.py	32	0	100%
modyn/tests/selector/internal/selector_strategies/test_loss_downsample_strategy.py	24	0	100%
modyn/tests/selector/internal/selector_strategies/test_new_data_strategy.py	519	0	100%
modyn/tests/selector/internal/selector_strategies/test_random_presampling_strategy.py	25	0	100%
modyn/tests/selector/internal/test_selector_manager.py	139	3	98%
modyn/tests/selector/internal/trigger_sample/test_trigger_sample_storage.py	176	0	100%
modyn/tests/selector/test_selector.py	84	3	96%
modyn/tests/selector/test_selector_entrypoint.py	22	0	100%
modyn/tests/storage/internal/database/models/test_dataset.py	47	0	100%
modyn/tests/storage/internal/database/models/test_file.py	64	0	100%
modyn/tests/storage/internal/database/models/test_sample.py	73	0	100%
modyn/tests/storage/internal/database/test_database_storage_utils.py	21	2	90%
modyn/tests/storage/internal/database/test_storage_database_connection.py	54	3	94%
modyn/tests/storage/internal/file_watcher/test_new_file_watcher.py	377	13	97%
modyn/tests/storage/internal/file_watcher/test_new_file_watcher_watch_dog.py	95	1	99%
modyn/tests/storage/internal/file_wrapper/test_binary_file_wrapper.py	92	0	100%
modyn/tests/storage/internal/file_wrapper/test_csv_file_wrapper.py	165	1	99%
modyn/tests/storage/internal/file_wrapper/test_file_wrapper_type.py	6	1	83%
modyn/tests/storage/internal/file_wrapper/test_single_sample_file_wrapper.py	90	0	100%
modyn/tests/storage/internal/filesystem_wrapper/test_filesystem_wrapper_type.py	6	1	83%
modyn/tests/storage/internal/filesystem_wrapper/test_local_filesystem_wrapper.py	167	0	100%
modyn/tests/storage/internal/grpc/test_grpc_server.py	11	0	100%
modyn/tests/storage/internal/grpc/test_storage_grpc_servicer.py	239	3	99%
modyn/tests/storage/test_storage.py	42	1	98%
modyn/tests/storage/test_storage_entrypoint.py	21	0	100%
modyn/tests/supervisor/internal/test_grpc_handler.py	241	0	100%
modyn/tests/supervisor/internal/test_status_bar.py	133	0	100%
modyn/tests/supervisor/internal/test_trigger.py	5	0	100%
modyn/tests/supervisor/internal/triggers/test_amounttrigger.py	30	0	100%
modyn/tests/supervisor/internal/triggers/test_timetrigger.py	26	0	100%
modyn/tests/supervisor/test_entrypoint.py	29	0	100%
modyn/tests/supervisor/test_supervisor.py	345	1	99%
modyn/tests/trainer_server/internal/data/key_sources/test_local_key_source.py	91	0	100%
modyn/tests/trainer_server/internal/data/key_sources/test_selector_key_source.py	92	0	100%
modyn/tests/trainer_server/internal/data/test_data_utils.py	22	1	95%
modyn/tests/trainer_server/internal/data/test_local_dataset_writer.py	59	0	100%
modyn/tests/trainer_server/internal/data/test_online_dataset.py	295	3	99%
modyn/tests/trainer_server/internal/grpc/test_trainer_server_grpc_server.py	17	0	100%
modyn/tests/trainer_server/internal/grpc/test_trainer_server_grpc_servicer.py	379	9	98%
modyn/tests/trainer_server/internal/metadata_collector/test_metadata_collector.py	41	0	100%
modyn/tests/trainer_server/internal/trainer/metadata_pytorch_callbacks/test_loss_callback.py	51	1	98%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_abstract_remote_downsampling_strategy.py	12	0	100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_get_tensor_subset.py	56	0	100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_remote_gradnorm_downsample.py	92	0	100%
modyn/tests/trainer_server/internal/trainer/remote_downsamplers/test_remote_loss_downsample.py	82	0	100%
modyn/tests/trainer_server/internal/trainer/test_pytorch_trainer.py	363	43	88%
modyn/tests/trainer_server/test_trainer_server.py	34	0	100%
modyn/tests/trainer_server/test_trainer_server_entrypoint.py	22	0	100%
modyn/tests/utils/test_utils.py	68	0	100%
modyn/trainer_server/custom_lr_schedulers/dlrm_lr_scheduler/dlrm_scheduler.py	33	33	0%
modyn/trainer_server/internal/dataset/data_utils.py	12	0	100%
modyn/trainer_server/internal/dataset/key_sources/abstract_key_source.py	21	5	76%
modyn/trainer_server/internal/dataset/key_sources/local_key_source.py	21	1	95%
modyn/trainer_server/internal/dataset/key_sources/selector_key_source.py	54	2	96%
modyn/trainer_server/internal/dataset/local_dataset_writer.py	68	4	94%
modyn/trainer_server/internal/dataset/online_dataset.py	128	4	97%
modyn/trainer_server/internal/grpc/trainer_server_grpc_server.py	22	0	100%
modyn/trainer_server/internal/grpc/trainer_server_grpc_servicer.py	232	33	86%
modyn/trainer_server/internal/metadata_collector/metadata_collector.py	33	0	100%
modyn/trainer_server/internal/mocks/mock_metadata_processor.py	22	2	91%
modyn/trainer_server/internal/trainer/metadata_pytorch_callbacks/base_callback.py	15	1	93%
modyn/trainer_server/internal/trainer/metadata_pytorch_callbacks/loss_callback.py	21	0	100%
modyn/trainer_server/internal/trainer/pytorch_trainer.py	322	94	71%
modyn/trainer_server/internal/trainer/remote_downsamplers/abstract_remote_downsample_strategy.py	29	3	90%
modyn/trainer_server/internal/trainer/remote_downsamplers/remote_gradnorm_downsample.py	37	3	92%
modyn/trainer_server/internal/trainer/remote_downsamplers/remote_loss_downsample.py	29	3	90%
modyn/trainer_server/internal/utils/metric_type.py	3	0	100%
modyn/trainer_server/internal/utils/trainer_messages.py	4	0	100%
modyn/trainer_server/internal/utils/training_info.py	44	1	98%
modyn/trainer_server/internal/utils/training_process_info.py	10	0	100%
modyn/trainer_server/trainer_server.py	19	0	100%
modyn/trainer_server/trainer_server_entrypoint.py	32	3	91%
modyn/utils/utils.py	98	12	88%
TOTAL	11629	767	93%
Coverage HTML written to
================= 506 passed, 22

vGsteiger · 2023-06-25T16:41:11Z

I am a bit worried about the order in which the PRs will be merged with regard to #270, since this PR changes a part of the storage component that is being rewritten. Most likely your PR will win the race, but it would be great if you could already think about the port to C++ as well :) Thanks!

MaxiBoether

Thank you for the PR - added my comments and hopefully answered your questions :)

modyn/tests/storage/internal/file_wrapper/test_csv_file_wrapper.py

modyn/config/schema/modyn_config_schema.yaml

modyn/storage/internal/file_wrapper/csv_file_wrapper.py

MaxiBoether

LGTM! Thank you.

One note: right now we always read the entire file into memory (get_csv_reader decodes the whole file), which we should generally avoid. This is a consequence of how we use csv. For now, this is fine, since we are rewriting storage in C++ anyway. In C++, we should at some point try to request only the part of the file we need instead of loading everything (if that is even possible with variable-length data; maybe there is some call that works on newlines). For now, this is not an issue.
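
As a rough illustration of the idea in Python (not the planned C++ code), rows can be consumed lazily from a binary stream so that only the rows before the largest requested index are ever read. The function name stream_rows is hypothetical, and the sketch assumes the filesystem layer can hand out a file-like binary stream, which the current wrapper does not do.

```python
import csv
import io
from typing import BinaryIO


def stream_rows(
    stream: BinaryIO, indices: list[int], separator: str = ","
) -> dict[int, list[str]]:
    """Read rows lazily and stop once the largest requested index has been seen."""
    wanted = set(indices)
    if not wanted:
        return {}
    last = max(wanted)
    # TextIOWrapper decodes incrementally; newline="" is what the csv module expects.
    text = io.TextIOWrapper(stream, encoding="utf-8", newline="")
    found: dict[int, list[str]] = {}
    for i, row in enumerate(csv.reader(text, delimiter=separator)):
        if i in wanted:
            found[i] = row
        if i >= last:
            break  # nothing after this row is needed
    return found
```

This still scans from the start of the file, because variable-length rows give no way to seek directly to row n without an index of newline offsets; it only avoids holding the whole decoded file in memory.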

francescodeaglio added 10 commits June 22, 2023 10:42

CSV file wrapper introduction

c71c8a8

Get all labels when the label is not present

b89111e

Fake delete_samples

e8f6b1d

Assertions changed to index errors

2261fcb

Fixed typing. Convert labels to integer.

35033a1

Basic tests

3e587a6

Validate file content

10c4da0

Test TSV vs CSV (different separator)

8b1e841

Basic documentation

b5c1555

Added Pipeline Config

487f96c

francescodeaglio changed the title ~~CSV file reader~~ CSV file wrapper Jun 22, 2023

francescodeaglio self-assigned this Jun 22, 2023

MaxiBoether changed the title ~~CSV file wrapper~~ ADD CsvFileWrapper Jun 27, 2023

MaxiBoether changed the title ~~ADD CsvFileWrapper~~ Add CSVFileWrapper Jun 27, 2023

MaxiBoether requested changes Jun 27, 2023

francescodeaglio added 6 commits June 27, 2023 12:03

Merge branch 'main' into feature/francescodeaglio/simple_csv_reader

6d3dd58

Changed default separator to comma

f7665d0

Optional validation

220b93b

Enforced label_index

9e6afa1

Out of order indexes and tests

fabe355

Integration test CSV format

7a37bca

MaxiBoether approved these changes Jun 28, 2023

francescodeaglio added 3 commits June 28, 2023 09:05

CSV reader integration tests

0e0dd53

Merge branch 'main' into feature/francescodeaglio/simple_csv_reader

1ef118f

Compliance check on integration tests

ddbcc6b

francescodeaglio merged commit 61d66bb into main Jun 28, 2023
