-
-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[2.x] Major performance and data handling improvements to media assets #1918
[2.x] Major performance and data handling improvements to media assets #1918
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## normalize-the-asset-api #1918 +/- ##
=============================================================
- Coverage 100.00% 99.97% -0.03%
- Complexity 1875 1883 +8
=============================================================
Files 193 193
Lines 4952 4971 +19
=============================================================
+ Hits 4952 4970 +18
- Misses 0 1 +1 ☔ View full report in Codecov by Sentry. |
21b86a0
to
d611ad3
Compare
f3093c2
to
96781f9
Compare
f294d4a
to
0907bc5
Compare
778ca37
to
1db1553
Compare
30d36c8
to
4aa11b4
Compare
@@ -119,6 +128,11 @@ public function getMimeType(): string | |||
return 'text/plain'; | |||
} | |||
|
|||
public function getHash(): string | |||
{ | |||
return md5_file($this->getAbsolutePath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about which algoritm we should use. I think MD5
is sufficient. We don't need anything cryptographically secure. MD5 is fast, and it's a short string too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Claude response when asked a neutral question:
For a MediaFile class in a PHP project based on Laravel, using a checksum hash for cache busting is a good approach. When choosing an algorithm for this purpose, you'll want to consider factors like speed, collision resistance, and compatibility. Here are some options, along with my recommendation:
-
MD5: Fast, but not cryptographically secure. Still widely used for non-security critical applications like cache busting.
-
SHA-1: Slightly slower than MD5, but more collision-resistant. Also commonly used for non-security purposes.
-
SHA-256: More secure, but slower than MD5 and SHA-1. Overkill for cache busting.
-
CRC32: Very fast, but has a higher collision rate. Suitable for cache busting if speed is crucial.
For your use case, I'd recommend using MD5. It's a good balance of speed and collision resistance for cache busting purposes. Here's why:
- Speed: MD5 is fast to compute, which is important when dealing with potentially large media files.
- Collision resistance: While not cryptographically secure, it's sufficient for cache busting.
- Widespread support: MD5 is widely supported and understood.
- Compatibility: Laravel and PHP have built-in functions for MD5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some benchmarks on the Claude suggestions, and wow CRC32 is fast!
$this->file('zero', '');
$this->file('tiny', 'Hello, World!');
$this->file('small', str_repeat('a', 1024)); // 1 KB
$this->file('compressed', str_repeat('a', 1024 * 256)); // 256 KB
$this->file('medium', str_repeat('a', 1024 * 1024)); // 1 MB
$this->file('large', str_repeat('a', 1024 * 1024 * 10)); // 10 MB
$algos = ['md5', 'sha1', 'sha256', 'crc32'];
// Warm up the statcache
$files = ['zero', 'tiny', 'small', 'compressed', 'medium', 'large'];
array_map('md5_file', array_map(fn ($file) => Hyde::path($file), $files));
$its = 100;
foreach ($algos as $algo) {
echo "Benchmarking $algo:\n";
$avgBuffer = [];
foreach ($files as $file) {
$result = \Illuminate\Support\Benchmark::measure(function () use ($file, $algo) {
hash_file($algo, Hyde::path($file));
}, $its);
echo " $file: ".$result."ms\n";
$avgBuffer[] = $result;
}
echo " Avg: ".array_sum($avgBuffer) / count($avgBuffer)."ms\n";
}
Benchmarking md5:
zero: 0.083057ms
tiny: 0.131709ms
small: 0.111467ms
compressed: 0.702308ms
medium: 2.670385ms
large: 24.054236ms
Avg: 4.625527ms
Benchmarking sha1:
zero: 0.080178ms
tiny: 0.08563ms
small: 0.085257ms
compressed: 0.734365ms
medium: 2.702944ms
large: 28.270893ms
Avg: 5.3265445ms
Benchmarking sha256:
zero: 0.080278ms
tiny: 0.085282ms
small: 0.090875ms
compressed: 1.401351ms
medium: 5.457427ms
large: 54.346367ms
Avg: 10.243596666667ms
Benchmarking crc32:
zero: 0.081022ms
tiny: 0.088007ms
small: 0.083692ms
compressed: 0.171689ms
medium: 0.440915ms
large: 5.784513ms
Avg: 1.1083063333333ms
Result without the large file:
Benchmarking md5:
zero: 0.08615ms
tiny: 0.087616ms
small: 0.086381ms
compressed: 0.803126ms
medium: 2.554298ms
Avg: 0.7235142ms
Benchmarking sha1:
zero: 0.087621ms
tiny: 0.091061ms
small: 0.093293ms
compressed: 0.788253ms
medium: 2.759281ms
Avg: 0.7639018ms
Benchmarking sha256:
zero: 0.079744ms
tiny: 0.089832ms
small: 0.091324ms
compressed: 1.426645ms
medium: 5.374404ms
Avg: 1.4123898ms
Benchmarking crc32:
zero: 0.082501ms
tiny: 0.087204ms
small: 0.084978ms
compressed: 0.175249ms
medium: 0.43878ms
Avg: 0.1737424ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each iteration is very fast, and doesn't mean much. But since I was thinking about hashing the files on construct time I got curious about how long the total times will be for a real project. So I ran the benchmarks on all the files on the HydePHP.com website, and this does not bode well.
About 65 files totalling 5.25 MB
Benchmarking md5:
Avg: 0.28125801612903ms
Total: 17.437997ms
Benchmarking sha1:
Avg: 0.31133517741935ms
Total: 19.302781ms
Benchmarking sha256:
Avg: 0.53320258064516ms
Total: 33.05856ms
Benchmarking crc32:
Avg: 0.11808622580645ms
Total: 7.321346ms
The CRC32
results are acceptable, MD5
and SHA1
are borderline.
Adding another 80 images of some random photos totalling about 50 MB
:
Benchmarking md5:
Avg: 0.96980565957447ms
Total: 136.742598ms
Benchmarking sha1:
Avg: 1.146947858156ms
Total: 161.719648ms
Benchmarking sha256:
Avg: 2.1824318085106ms
Total: 307.722885ms
Benchmarking crc32:
Avg: 0.27406927659574ms
Total: 38.643768ms
CRC32
is still acceptable here, but MD5
has shot up too much. Granted, this is an edge case, but over 100ms is actually quite a lot.
public function testHashBenchmark()
{
$files = MediaFile::all();
$algos = ['md5', 'sha1', 'sha256', 'crc32'];
// Warm up the statcache
array_map('md5_file', array_map(fn ($file) => Hyde::path($file->path), $files->all()));
$its = 100;
foreach ($algos as $algo) {
echo "Benchmarking $algo:\n";
$avgBuffer = [];
foreach ($files as $file) {
$result = \Illuminate\Support\Benchmark::measure(function () use ($file, $algo) {
hash_file($algo, Hyde::path($file->path));
}, $its);
// echo " $file->path: ".$result."ms\n";
$avgBuffer[] = $result;
}
echo " Avg: ".array_sum($avgBuffer) / count($avgBuffer)."ms\n";
echo " Total: ".array_sum($avgBuffer)."ms\n";
}
die;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, CRC is short! And it's designed for data transmission integrity checks.
# hash($algo, 'foo');
md5: acbd18db4cc2f85cedef654fccc4a4d8
sha1: 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
sha256: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
crc32: a5c4fe49
Now I'm just curious about the hash collision rate, to see if it's a problem in the real world.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we would need a fallback to MD5, but I checked, and CRC32 is available on all supported PHP versions. https://3v4l.org/3L6EJ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: Laravel Mix seems to use MD5 for versioning, though documented example seems to be SHA1. Both are not cryptographically secure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Claude comparing MD5 and CRC32:
When considering md5 vs crc32 for cache busting in a static site generator built with PHP, there are a few key factors to consider:
- Collision resistance:
- MD5: More collision-resistant, but overkill for this use case.
- CRC32: Less collision-resistant, but sufficient for cache busting.
- Performance:
- MD5: Slower to compute.
- CRC32: Faster to compute, which can be beneficial for larger sites.
- Output length:
- MD5: 32 characters (128 bits)
- CRC32: 8 characters (32 bits)
- Availability: Both functions are readily available in PHP.
For cache busting in a static site generator, CRC32 is generally the better choice:
- It's fast to compute, which is important when generating many files.
- The shorter output (8 characters) is sufficient for cache busting and keeps URLs cleaner.
- While it has a higher collision chance than MD5, this is rarely an issue in practice for cache busting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May consider changing the unixsum function to use crc
See lengthy discussion at #1918 (comment)
5ef7285
to
2471dfa
Compare
May want to improve the /**
* Get a collection of all pages, parsed into page models, keyed by the source file identifiers.
*
* @return \Hyde\Foundation\Kernel\PageCollection<static>
*/
public static function all(): PageCollection
{
return Facades\Pages::getPages(static::class);
} |
These are part of the public interface of the class, but the should be immutable since there is no way to change them (in terms of persisting modified states)
31c595a
to
a39ba5b
Compare
0c574da
to
e077ac6
Compare
This is part of #1904 and follows up upon #1917
Main changes:
$path
property normalized to start with the media directory prefix, to better match the documented state in the parent class.CRC32
for media asset hashing (and thus cache busting) instead ofMD5
in dca20a0 and 4cb592ffiles()
returning an array,all()
returning a collectionO(k)
toO(n * m)
(orO(n)
ifm
is constant) - wherek
is the number of media files,m
is the average number of assets included on a page, andn
is the number of pages. 1 This is a significant improvement, especially for large sites. 2Footnotes
https://github.com/hydephp/develop/pull/1918#issuecomment-2255507224 ↩
For sites with a massive amount of media assets, this could lead to a performance decrease when rebuilding a single page / viewing a page in the realtime compiler. If this proves an issue, we can lazy load the discovery in such cases. (~150 files with ~50 MB takes less than 40ms)
For smaller sites the overhead is less than 8ms, so this should not be that big of an issue, see 1918#discussion_r1694291464 ↩