Log fsspec caching statistics #82

chuckwondo · 2024-07-03T16:07:42Z

The implementation of #77 added a version of fsspec that captures read cache statistics: file size, total bytes requested, hit count, and miss count.

The current s3fs/fsspec cache parameters were chosen only to maximize speed. Further investigation showed that the "all" cache type in fsspec behaves effectively identically to downloading files, because it reads the entire file in a single request and caches it in memory (rather than writing it to local disk). While this makes the overall algorithm slightly faster than downloading (~10% faster on average), it transfers the same volume of data as downloading.

We would also like to run profiling to determine which combination of cache type and block size produces the least data transfer (i.e., minimizes the total number of bytes requested). Now that fsspec captures this statistic in its read cache implementations, we can record the stats for each granule file.
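As a rough sketch of the comparison we are after: once we have one stats record per granule file, we can total the requested bytes for each (cache type, block size) combination and pick the smallest. The function name `least_transfer` and the sample values are made up for illustration, and the comparison is only fair if every combination is run against the same set of granules.

```python
from collections import defaultdict


def least_transfer(records):
    """Return ((cache_type, blocksize_bytes), total_requested_bytes) for the
    combination with the smallest total number of requested bytes."""
    totals = defaultdict(int)
    for rec in records:
        key = (rec["default_cache_type"], rec["blocksize_bytes"])
        totals[key] += rec["requested_bytes"]
    return min(totals.items(), key=lambda item: item[1])


# Invented sample records (not real measurements), two granules per combination.
sample = [
    {"default_cache_type": "all", "blocksize_bytes": 8388608, "requested_bytes": 1000},
    {"default_cache_type": "all", "blocksize_bytes": 8388608, "requested_bytes": 1200},
    {"default_cache_type": "first", "blocksize_bytes": 1048576, "requested_bytes": 400},
    {"default_cache_type": "first", "blocksize_bytes": 1048576, "requested_bytes": 700},
]

best_combo, best_total = least_transfer(sample)
```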

To make the cache stats easy to analyze, we want to write them to a separate log file so that we do not have to pick them out from the rest of the log messages. Further, we want to write the stats in JSON format so that we do not have to write any logic to parse the stats log file. We should consider using https://colin-b.github.io/logging_json/ to simplify the implementation.
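A minimal sketch of the separate JSON log, using only the standard library `logging` and `json` modules as a stand-in for logging_json (which may well be the simpler choice in practice). The names `JsonLineFormatter` and `make_stats_logger` are hypothetical. The logger does not propagate, so the stats stay out of the regular log output.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format a log record whose message is a dict as one JSON object per line."""

    def format(self, record):
        # Assumes callers pass a dict as the log message, e.g. logger.info({...}).
        return json.dumps(record.msg, sort_keys=True)


def make_stats_logger(path, name="fsspec_stats"):
    """Create a logger that appends JSON lines to `path` and nowhere else."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep stats out of the regular log messages
    if not logger.handlers:  # avoid duplicate lines if called more than once
        handler = logging.FileHandler(path, mode="a", encoding="utf-8")
        handler.setFormatter(JsonLineFormatter())
        logger.addHandler(handler)
    return logger
```

Usage would be something like `make_stats_logger("fsspec-stats.json").info({"hits": 3, "misses": 1})`, once per granule file.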

Acceptance Criteria

Upon successful subsetting of each granule file, the fsspec cache stats are written to a separate log file, in JSON format, one JSON record per line (which should allow us to later easily use duckdb for analysis), perhaps using https://colin-b.github.io/logging_json/ to do so. The file should land in the same directory as the other outputs already produced, and should be named fsspec-stats.json (or similar).
Each record should include:
- file size in bytes ("filesize_bytes")
- total number of bytes requested ("requested_bytes")
- number of cache hits ("hits")
- number of cache misses ("misses")
- block size ("blocksize_bytes")
- number of blocks ("blocks")
- file path ("path")
- file system type ("filesystem", likely easiest to simply use type(fs).__name__ to write the class name of the filesystem)
- number of seconds to subset the individual file ("subset_seconds", the time to run the subset_hdf5 function)
- each element of fsspec_kwargs as a separate entry (e.g., separate entries for "default_cache_type", etc., excluding "default_block_size", which is captured in "blocksize_bytes" above)
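To make the record shape concrete, here is a rough sketch of assembling one record from the fields listed above. The helper name `build_record` and its signature are hypothetical, and how each value is obtained from the fsspec cache object is deliberately left to the caller, since that depends on the fsspec version in use.

```python
def build_record(*, path, fs, filesize_bytes, requested_bytes, hits, misses,
                 blocks, subset_seconds, fsspec_kwargs):
    """Assemble one stats record as a flat dict, ready to serialize as a JSON line."""
    extra = dict(fsspec_kwargs)  # copy so the caller's dict is not modified
    # "default_block_size" is reported as "blocksize_bytes", not as its own entry.
    blocksize_bytes = extra.pop("default_block_size")
    return {
        "path": path,
        "filesystem": type(fs).__name__,  # class name of the filesystem
        "filesize_bytes": filesize_bytes,
        "requested_bytes": requested_bytes,
        "hits": hits,
        "misses": misses,
        "blocksize_bytes": blocksize_bytes,
        "blocks": blocks,
        "subset_seconds": subset_seconds,
        **extra,  # remaining fsspec_kwargs, one entry each
    }
```

Keeping the record flat (rather than nesting fsspec_kwargs) means every field becomes its own column when the file is later loaded for analysis.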
