Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[2.x] Major performance and data handling improvements to media assets #1918

Merged

Conversation

caendesilva
Copy link
Member

@caendesilva caendesilva commented Jul 28, 2024

This is part of #1904 and follows up upon #1917

Main changes:

  • MediaFile instances now have their $path property normalized to start with the media directory prefix, to better match the documented state in the parent class.
  • Uses CRC32 for media asset hashing (and thus cache busting) instead of MD5 in dca20a0 and 4cb592f
  • Media files are now validated to exist upon constructing the instances.
  • The API is designed to match HydePages, for example files() returning an array, all() returning a collection
  • All data in the media file instance is computed at construct, improving performance from O(k) to O(n * m) (or O(n) if m is constant) - where k is the number of media files, m is the average number of assets included on a page, and n is the number of pages. 1 This is a significant improvement, especially for large sites. 2

Footnotes

  1. https://github.com/hydephp/develop/pull/1918#issuecomment-2255507224

  2. For sites with a massive amount of media assets, this could lead to a performance decrease when rebuilding a single page / viewing a page in the realtime compiler. If this proves an issue, we can lazy load the discovery in such cases. (~150 files with ~50 MB takes less than 40ms)
    For smaller sites the overhead is less than 8ms, so this should not be that big of an issue, see 1918#discussion_r1694291464

@caendesilva caendesilva mentioned this pull request Jul 28, 2024
21 tasks
@caendesilva caendesilva marked this pull request as draft July 28, 2024 12:46
@caendesilva caendesilva marked this pull request as ready for review July 28, 2024 12:47
Copy link

codecov bot commented Jul 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.97%. Comparing base (74eed1f) to head (c3bcc78).

Additional details and impacted files
@@                      Coverage Diff                      @@
##             normalize-the-asset-api    #1918      +/-   ##
=============================================================
- Coverage                     100.00%   99.97%   -0.03%     
- Complexity                      1875     1883       +8     
=============================================================
  Files                            193      193              
  Lines                           4952     4971      +19     
=============================================================
+ Hits                            4952     4970      +18     
- Misses                             0        1       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from 21b86a0 to d611ad3 Compare July 28, 2024 12:48
@caendesilva caendesilva marked this pull request as draft July 28, 2024 12:54
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from f3093c2 to 96781f9 Compare July 28, 2024 13:55
@caendesilva caendesilva changed the title [2.x] Improve media asset file data [2.x] Improve media asset file data handling Jul 28, 2024
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from f294d4a to 0907bc5 Compare July 28, 2024 14:21
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch 2 times, most recently from 778ca37 to 1db1553 Compare July 28, 2024 15:31
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from 30d36c8 to 4aa11b4 Compare July 28, 2024 15:52
@@ -119,6 +128,11 @@ public function getMimeType(): string
return 'text/plain';
}

public function getHash(): string
{
return md5_file($this->getAbsolutePath());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about which algoritm we should use. I think MD5 is sufficient. We don't need anything cryptographically secure. MD5 is fast, and it's a short string too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude response when asked a neutral question:

For a MediaFile class in a PHP project based on Laravel, using a checksum hash for cache busting is a good approach. When choosing an algorithm for this purpose, you'll want to consider factors like speed, collision resistance, and compatibility. Here are some options, along with my recommendation:

  1. MD5: Fast, but not cryptographically secure. Still widely used for non-security critical applications like cache busting.

  2. SHA-1: Slightly slower than MD5, but more collision-resistant. Also commonly used for non-security purposes.

  3. SHA-256: More secure, but slower than MD5 and SHA-1. Overkill for cache busting.

  4. CRC32: Very fast, but has a higher collision rate. Suitable for cache busting if speed is crucial.

For your use case, I'd recommend using MD5. It's a good balance of speed and collision resistance for cache busting purposes. Here's why:

  1. Speed: MD5 is fast to compute, which is important when dealing with potentially large media files.
  2. Collision resistance: While not cryptographically secure, it's sufficient for cache busting.
  3. Widespread support: MD5 is widely supported and understood.
  4. Compatibility: Laravel and PHP have built-in functions for MD5.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some benchmarks on the Claude suggestions, and wow CRC32 is fast!

$this->file('zero', '');
$this->file('tiny', 'Hello, World!');
$this->file('small', str_repeat('a', 1024)); // 1 KB
$this->file('compressed', str_repeat('a', 1024 * 256)); // 256 KB
$this->file('medium', str_repeat('a', 1024 * 1024)); // 1 MB
$this->file('large', str_repeat('a', 1024 * 1024 * 10)); // 10 MB

$algos = ['md5', 'sha1', 'sha256', 'crc32'];

// Warm up the statcache
$files = ['zero', 'tiny', 'small', 'compressed', 'medium', 'large'];
array_map('md5_file', array_map(fn ($file) => Hyde::path($file), $files));

$its = 100;

foreach ($algos as $algo) {
    echo "Benchmarking $algo:\n";
    $avgBuffer = [];
    foreach ($files as $file) {
        $result = \Illuminate\Support\Benchmark::measure(function () use ($file, $algo) {
            hash_file($algo, Hyde::path($file));
        }, $its);
        echo "  $file: ".$result."ms\n";
        $avgBuffer[] = $result;
    }
    echo "  Avg: ".array_sum($avgBuffer) / count($avgBuffer)."ms\n";
}
Benchmarking md5:
  zero: 0.083057ms
  tiny: 0.131709ms
  small: 0.111467ms
  compressed: 0.702308ms
  medium: 2.670385ms
  large: 24.054236ms
  Avg: 4.625527ms
Benchmarking sha1:
  zero: 0.080178ms
  tiny: 0.08563ms
  small: 0.085257ms
  compressed: 0.734365ms
  medium: 2.702944ms
  large: 28.270893ms
  Avg: 5.3265445ms
Benchmarking sha256:
  zero: 0.080278ms
  tiny: 0.085282ms
  small: 0.090875ms
  compressed: 1.401351ms
  medium: 5.457427ms
  large: 54.346367ms
  Avg: 10.243596666667ms
Benchmarking crc32:
  zero: 0.081022ms
  tiny: 0.088007ms
  small: 0.083692ms
  compressed: 0.171689ms
  medium: 0.440915ms
  large: 5.784513ms
  Avg: 1.1083063333333ms

Result without the large file:

Benchmarking md5:
  zero: 0.08615ms
  tiny: 0.087616ms
  small: 0.086381ms
  compressed: 0.803126ms
  medium: 2.554298ms
  Avg: 0.7235142ms
Benchmarking sha1:
  zero: 0.087621ms
  tiny: 0.091061ms
  small: 0.093293ms
  compressed: 0.788253ms
  medium: 2.759281ms
  Avg: 0.7639018ms
Benchmarking sha256:
  zero: 0.079744ms
  tiny: 0.089832ms
  small: 0.091324ms
  compressed: 1.426645ms
  medium: 5.374404ms
  Avg: 1.4123898ms
Benchmarking crc32:
  zero: 0.082501ms
  tiny: 0.087204ms
  small: 0.084978ms
  compressed: 0.175249ms
  medium: 0.43878ms
  Avg: 0.1737424ms

Copy link
Member Author

@caendesilva caendesilva Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each iteration is very fast, and doesn't mean much. But since I was thinking about hashing the files on construct time I got curious about how long the total times will be for a real project. So I ran the benchmarks on all the files on the HydePHP.com website, and this does not bode well.

About 65 files totalling 5.25 MB

Benchmarking md5:
  Avg: 0.28125801612903ms
  Total: 17.437997ms
Benchmarking sha1:
  Avg: 0.31133517741935ms
  Total: 19.302781ms
Benchmarking sha256:
  Avg: 0.53320258064516ms
  Total: 33.05856ms
Benchmarking crc32:
  Avg: 0.11808622580645ms
  Total: 7.321346ms

The CRC32 results are acceptable, MD5 and SHA1 are borderline.

Adding another 80 images of some random photos totalling about 50 MB:

Benchmarking md5:
  Avg: 0.96980565957447ms
  Total: 136.742598ms
Benchmarking sha1:
  Avg: 1.146947858156ms
  Total: 161.719648ms
Benchmarking sha256:
  Avg: 2.1824318085106ms
  Total: 307.722885ms
Benchmarking crc32:
  Avg: 0.27406927659574ms
  Total: 38.643768ms

CRC32 is still acceptable here, but MD5 has shot up too much. Granted, this is an edge case, but over 100ms is actually quite a lot.

public function testHashBenchmark()
{
    $files = MediaFile::all();

    $algos = ['md5', 'sha1', 'sha256', 'crc32'];

    // Warm up the statcache
    array_map('md5_file', array_map(fn ($file) => Hyde::path($file->path), $files->all()));

    $its = 100;

    foreach ($algos as $algo) {
        echo "Benchmarking $algo:\n";
        $avgBuffer = [];
        foreach ($files as $file) {
            $result = \Illuminate\Support\Benchmark::measure(function () use ($file, $algo) {
                hash_file($algo, Hyde::path($file->path));
            }, $its);
         //   echo "  $file->path: ".$result."ms\n";
            $avgBuffer[] = $result;
        }
        echo "  Avg: ".array_sum($avgBuffer) / count($avgBuffer)."ms\n";
        echo "  Total: ".array_sum($avgBuffer)."ms\n";
    }

    die;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, CRC is short! And it's designed for data transmission integrity checks.

# hash($algo, 'foo');
md5: acbd18db4cc2f85cedef654fccc4a4d8
sha1: 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
sha256: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
crc32: a5c4fe49

Now I'm just curious about the hash collision rate, to see if it's a problem in the real world.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we would need a fallback to MD5, but I checked, and CRC32 is available on all supported PHP versions. https://3v4l.org/3L6EJ

Copy link
Member Author

@caendesilva caendesilva Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context: Laravel Mix seems to use MD5 for versioning, though documented example seems to be SHA1. Both are not cryptographically secure.

Copy link
Member Author

@caendesilva caendesilva Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude comparing MD5 and CRC32:

When considering md5 vs crc32 for cache busting in a static site generator built with PHP, there are a few key factors to consider:

  1. Collision resistance:
  • MD5: More collision-resistant, but overkill for this use case.
  • CRC32: Less collision-resistant, but sufficient for cache busting.
  1. Performance:
  • MD5: Slower to compute.
  • CRC32: Faster to compute, which can be beneficial for larger sites.
  1. Output length:
  • MD5: 32 characters (128 bits)
  • CRC32: 8 characters (32 bits)
  1. Availability: Both functions are readily available in PHP.

For cache busting in a static site generator, CRC32 is generally the better choice:

  1. It's fast to compute, which is important when generating many files.
  2. The shorter output (8 characters) is sufficient for cache busting and keeps URLs cleaner.
  3. While it has a higher collision chance than MD5, this is rarely an issue in practice for cache busting.

Copy link
Member Author

@caendesilva caendesilva Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in dca20a0 and 4cb592f

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May consider changing the unixsum function to use crc

@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from 5ef7285 to 2471dfa Compare August 1, 2024 11:05
@caendesilva
Copy link
Member Author

May want to improve the HydePage documentation to match the now better MediaFile documentation.

/**
 * Get a collection of all pages, parsed into page models, keyed by the source file identifiers.
 *
 * @return \Hyde\Foundation\Kernel\PageCollection<static>
 */
public static function all(): PageCollection
{
    return Facades\Pages::getPages(static::class);
}

These are part of the public interface of the class, but the should be immutable since there is no way to change them (in terms of persisting modified states)
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from 31c595a to a39ba5b Compare August 1, 2024 13:16
@caendesilva caendesilva changed the title [2.x] Improve media asset file data handling [2.x] Major performance and data handling improvements to media assets Aug 1, 2024
@caendesilva caendesilva force-pushed the bring-media-assets-into-the-hyde-kernel branch from 0c574da to e077ac6 Compare August 2, 2024 09:03
@caendesilva caendesilva marked this pull request as ready for review August 3, 2024 10:35
@caendesilva caendesilva merged commit a85d5d3 into normalize-the-asset-api Aug 3, 2024
7 checks passed
@caendesilva caendesilva deleted the bring-media-assets-into-the-hyde-kernel branch August 3, 2024 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant