
Implement a disk-based hash cache #41

Merged
ferd merged 5 commits into main from disk-hash-cache on Jul 6, 2024

Conversation

ferd (Owner) commented on Jul 5, 2024

Inspired by the S3 hash cache, and by slow runtimes on my bad VPS where scanning a moderately sized directory for hashes during file synchronization could take 20-30 seconds even on unchanged data, this PR implements an optional hash cache for disk backends.

I initially started with only last-modified times (as in the first commits), but this showed a consistent failure in the synchronization tests, which run too fast for whole-second modification times to register edits. However, I know that the S3 usage works and that the cache is logically sound; the problem is mostly timestamp granularity at the test level.

So instead the cache keys on overall file info details, which include modification and change times along with the inode and file size (access time is ignored because merely reading a file shouldn't count as a change). The session tests validating the cache also had their sample file sizes modified to exercise that part of the flow, and they now pass.
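To make the mechanism concrete, here is a minimal Erlang sketch of that idea. It is not the module merged in this PR, and its names (disk_hash_cache_sketch, hash_cached/2) are purely illustrative: hashes are cached per path, keyed on a metadata tuple, and a file is only re-read and re-hashed when that tuple changes.

```erlang
%% Illustrative sketch only; not the code merged in this PR.
-module(disk_hash_cache_sketch).
-export([hash_cached/2]).

-include_lib("kernel/include/file.hrl").

%% Cache :: #{Path => {MetaKey, Hash}} where MetaKey captures the fields
%% that signal a change: mtime, ctime, inode, and size (atime is ignored).
hash_cached(Path, Cache) ->
    {ok, I} = file:read_file_info(Path, [{time, posix}]),
    MetaKey = {I#file_info.mtime, I#file_info.ctime,
               I#file_info.inode, I#file_info.size},
    case maps:find(Path, Cache) of
        {ok, {MetaKey, Hash}} ->   % metadata unchanged: reuse the cached hash
            {Hash, Cache};
        _ ->                       % new or modified file: read and re-hash it
            {ok, Bin} = file:read_file(Path),
            Hash = crypto:hash(sha256, Bin),
            {Hash, Cache#{Path => {MetaKey, Hash}}}
    end.
```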

In practice, running the scan operation on my bad VPS improves drastically:

```
# hydrate the cache on the first run ever:
→ time ./_build/prod/bin/revault_cli scan -dirs books
Scanning books: ok
./_build/prod/bin/revault_cli scan -dirs books  0.50s user 0.18s system 2% cpu 24.197 total

# run with a hot cache
→ time ./_build/prod/bin/revault_cli scan -dirs books
Scanning books: ok
./_build/prod/bin/revault_cli scan -dirs books  0.47s user 0.13s system 68% cpu 0.870 total
```

A ~25x improvement on slower disks/file systems seems worth it as an option, particularly when one wants to synchronize much bigger files.
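Since the cache is disk-based, the cache map also has to survive between CLI invocations (the first run above hydrates it, the second reuses it). A hypothetical pair of helpers along those lines, assuming a plain Erlang-term file as the storage format rather than whatever the PR actually uses:

```erlang
%% Hypothetical persistence helpers; the storage path and format are assumptions.
save_cache(CachePath, Cache) ->
    ok = file:write_file(CachePath, term_to_binary(Cache)).

load_cache(CachePath) ->
    case file:read_file(CachePath) of
        {ok, Bin}  -> binary_to_term(Bin);
        {error, _} -> #{}   % first run ever: start with an empty cache
    end.
```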

ferd added 5 commits July 5, 2024 13:04
The whole-second granularity of files' last-modified attributes is just too coarse for this to work right during the synchronization exchanges in tests.

Large directories may want to turn it on as an option, but not within this context; the tests just can't work there due to timing constraints.
By using file info, we're able to detect changes by time and also file modifications that impact size, which drastically increases our sensitivity without much performance cost.

By adjusting the session tests to change file sizes, we can work around the modification-stamp limitations and show that the disk cache works fine, at a lower cost than duplicating all the FSM tests.
ferd merged commit f003f01 into main on Jul 6, 2024
1 check passed
ferd deleted the disk-hash-cache branch on July 6, 2024 at 02:15