
Implement a disk-based hash cache #41

Merged
ferd merged 5 commits into main from disk-hash-cache on Jul 6, 2024

Conversation

ferd (Owner) commented on Jul 5, 2024

Inspired by the S3 hash cache, and by slow runtimes on my bad VPS where scanning a moderately sized directory for hashes during file synchronization could take 20-30 seconds even on unchanged data, this PR implements an optional hash cache for disk backends.

I initially started with only last-modified times (as in the first commits), but this showed a consistent failure in the synchronization tests, which run too fast for whole-second modification times to register edits. However, I know that the S3 usage works and that the cache is logically sound; the problem is mostly timestamp granularity at the test level.

So instead the cache keys on overall file info details, which include modification and change times along with the inode and file size (access time is ignored because merely reading a file shouldn't count as a change). The session tests validating the cache also had their sample file sizes modified to exercise that part of the flow, and they now pass.
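To make the mechanism concrete, here is a minimal Erlang sketch of that idea. It is not the module merged in this PR, and its names (disk_hash_cache_sketch, hash_cached/2) are purely illustrative: hashes are cached per path, keyed on a metadata tuple, and a file is only re-read and re-hashed when that tuple changes.

```erlang
%% Illustrative sketch only; not the code merged in this PR.
-module(disk_hash_cache_sketch).
-export([hash_cached/2]).

-include_lib("kernel/include/file.hrl").

%% Cache :: #{Path => {MetaKey, Hash}} where MetaKey captures the fields
%% that signal a change: mtime, ctime, inode, and size (atime is ignored).
hash_cached(Path, Cache) ->
    {ok, I} = file:read_file_info(Path, [{time, posix}]),
    MetaKey = {I#file_info.mtime, I#file_info.ctime,
               I#file_info.inode, I#file_info.size},
    case maps:find(Path, Cache) of
        {ok, {MetaKey, Hash}} ->   % metadata unchanged: reuse the cached hash
            {Hash, Cache};
        _ ->                       % new or modified file: read and re-hash it
            {ok, Bin} = file:read_file(Path),
            Hash = crypto:hash(sha256, Bin),
            {Hash, Cache#{Path => {MetaKey, Hash}}}
    end.
```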

In practice, running the scan operation on my bad VPS improves drastically:

```
# hydrate the cache on the first run ever:
→ time ./_build/prod/bin/revault_cli scan -dirs books
Scanning books: ok
./_build/prod/bin/revault_cli scan -dirs books  0.50s user 0.18s system 2% cpu 24.197 total

# run with a hot cache
→ time ./_build/prod/bin/revault_cli scan -dirs books
Scanning books: ok
./_build/prod/bin/revault_cli scan -dirs books  0.47s user 0.13s system 68% cpu 0.870 total
```

A ~25x improvement on slower disks/file systems seems worth it as an option, particularly when one wants to synchronize much bigger files.
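Since the cache is disk-based, the cache map also has to survive between CLI invocations (the first run above hydrates it, the second reuses it). A hypothetical pair of helpers along those lines, assuming a plain Erlang-term file as the storage format rather than whatever the PR actually uses:

```erlang
%% Hypothetical persistence helpers; the storage path and format are assumptions.
save_cache(CachePath, Cache) ->
    ok = file:write_file(CachePath, term_to_binary(Cache)).

load_cache(CachePath) ->
    case file:read_file(CachePath) of
        {ok, Bin}  -> binary_to_term(Bin);
        {error, _} -> #{}   % first run ever: start with an empty cache
    end.
```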

ferd added 5 commits July 5, 2024 13:04
The whole-second granularity of files' last-modified attributes is just too coarse for this to work right during the synchronization exchanges in tests.

Large directories may want to turn it on as an option, but not within this context; the tests just can't work there due to timing constraints.
By using file info, we're able to detect changes by time and also file modifications that impact size, which drastically increases our sensitivity without much performance cost.

By adjusting the session tests to change file sizes, we can work around the modification-stamp limitations and show that the disk cache works fine, at a lower cost than duplicating all the FSM tests.
ferd merged commit f003f01 into main on Jul 6, 2024
1 check passed
ferd deleted the disk-hash-cache branch on July 6, 2024 at 02:15