Potential encoder speedup on non-win32 #585
I have made no attempt at finding the relevant code (since I'm incompetent at it), but the behaviour seems to be introduced with 1.3.1, which is consistent with the changelog entry that reads "I/O buffering improvements on Windows to reduce disk fragmentation when writing files." An observation that might be relevant: after the first write (i.e. the first size update) Windows reports 10 485 760 bytes and a "Size on disk" of 10 485 760 bytes too. But after the second update, when Windows reports a file size of 20 971 520, "Size on disk" is bigger. It is not constant across settings. Also, as the number of writes (that is, the multiple of 10 MiB) increases, the difference between "Size on disk" and "Size" oscillates. But that difference seems to be pinned to zero at the first write. |
Also tried setting a 10MiB+8 buffer in the decoder reader (10mb8r). Same re-encode test, average of 10 runs:
There might be a benefit to a large buffer when reading, but it's close to the margin of error. Reading will also be harder to measure, as OS caching likely obscures things. Not added to the repo. But the decoder writes too, so I tried hacking a 10MiB+8 buffer into ./flac's decode.c writer. Same file as before, decoded to wav. Average of 10 runs:
Less pronounced, but still a definite benefit on my system (writing larger chunks may better hide the drawback of small buffers, which could explain why the benefit on decode is smaller). If Windows also sees a performance benefit from a large buffer this should apply. If someone wants to test potential performance improvements on Windows, they can try compiling the current state of my repo against the current state of the xiph repo and comparing decode when a file is generated.
That tracks. |
I don't know what to make of this. FLAC calls fflush after every fwrite, so large buffers should not make a difference? Even with setvbuf, fflush is honoured as far as I know. Also, I can't really explain the difference between encoding and decoding. They both seem to write the same way (once for every frame). Feel free to explore this further; I probably won't spend much time on this until I clear some bugs found by fuzzing that seem rather hard to debug. |
I don't think fflush is called every frame; fflush is present only when FLAC__VALGRIND_TESTING is defined on encode and decode, presumably only in test and/or debug builds.
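For reference, a minimal sketch of the pattern described above, assuming the guard macro behaves as stated (the surrounding function is illustrative, not FLAC's actual code):

```c
#include <stdio.h>

/* Illustrative write path: the flush only happens when
 * FLAC__VALGRIND_TESTING is defined, so release builds rely
 * entirely on stdio's own buffering between frames. */
static size_t write_frame_bytes(FILE *out, const void *buf, size_t bytes)
{
    size_t written = fwrite(buf, 1, bytes, out);
#ifdef FLAC__VALGRIND_TESTING
    fflush(out); /* only test/debug builds force data out per frame */
#endif
    return written;
}
```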
The only difference I can see is that on decode the data is more consistent and there's more to write. More data should mean more impact if anything but there seems to be less impact.
I'll explore it to a conclusion as much as I can on Linux. The only other hardware I have has a modern fast SSD, which probably won't be of much help, and VMs seem like a bad idea here. Maybe there's a Raspberry Pi or slow USB stick somewhere that might be useful to make I/O issues more obvious. |
Yes indeed, I didn't see that right |
Tested decode on 2193 tracks, mostly 16 bit, against the default build and the vbuf10m8dec (renamed 10m8) build. ~10% of tracks were decoded quicker by the default build, but the sum clearly trends towards the 10m8 build; the laptop was being lightly used during the test, which probably didn't help with variance. One build fully ran the tracks, followed by the other build fully running the tracks, to eliminate caching as a potential factor.
Made a selection of builds with varying buffer sizes and tested decode with the 150MB 24 bit track, 10 runs averaged:
That seems to confirm a benefit, but not the reason behind it. Now most +8 buffer sizes are showing as slightly worse, probably too close to the margin of error to make a judgement; that being the case, the +8 is probably irrelevant. Here are some perf stats from a few decode runs. I only have a rudimentary grasp of tools like perf and gdb, so this'll have to do for now. The default build has more branch misses for some reason. Fewer page faults on the default build makes sense: less memory used, so fewer first-time accesses causing a fault. 10m build
default build
|
strace confirms that the default build does orders of magnitude more write syscalls. 10m:
default:
9417 vs 23 write syscalls spent writing the wav file; the rest are for the progress bar. The above only shows the time taken to make the syscall, not the time spent in the syscall. strace -T outputs the time spent in each syscall; summing that output results in this, which I believe is the time taken to actually write the wav file (possibly with or without the time to make the syscall from the first part of this post):
What's curious is that each frame is written in two write calls for the default build; here's the start of the write calls for the default build:
Which presumably is because the header partially fills the buffer, then each frame in this example file wants to write 49152 bytes. At each frame the rest of the buffer is filled and written, then the remaining multiple of 4096 is written at once, then the buffer is partially filled to be written with the next frame. It looked like a default build that also flushed the header might halve the number of write syscalls when a small buffer aligns neatly (like decoding to a simple format like wav with a typical blocksize of 4096, very common), but it didn't:
Still doing 4096+45056 instead of a single write of 49152. Double-checked and there was nothing else written after the header to misalign the buffer, so it looks like (at least in gcc's and clang's implementations) the buffer is filled even if it starts empty, then the remainder (maybe in full, or the largest multiple of the buffer size) is written in a single call. Unfortunate for the edge case, but the solution that works everywhere is to have a buffer that's at least as big as the maximum expected raw data contained in a frame. The doubling of write syscalls is an inefficiency when the buffer is too small, but it might not be the whole story. It might not even be a major part of why it's slower, but it would be convenient if it were. |
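For illustration, a minimal sketch of the mitigation described here: give the output stream a stdio buffer at least as big as the largest expected frame of raw data, so a frame never straddles two write(2) calls (the helper name and error handling are placeholders, not the branch's actual code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Must be called after fopen but before any other operation on the
 * stream; the buffer has to stay allocated until fclose. */
static int use_big_output_buffer(FILE *out, size_t size)
{
    char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    /* _IOFBF = fully buffered; stdio now flushes in size-sized
     * chunks instead of the default (often 4096-byte) chunks. */
    return setvbuf(out, buf, _IOFBF, size);
}
```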
For _WIN32, other than using an external buffer, the only other difference is that stdout is determined with
What remains to be solved to potentially turn this into a PR:
|
I haven't looked into your changes in depth, but I would consider moving the |
That makes a lot of sense. |
Measuring I/O is apparently hard; the latest results make it questionable when and where the apparent gains are to be had. These are the old results to compare against
After trying some Linux kernel tweaks to disable/flush the cache to get more consistent results, these new results happened (no defrag)
The default is the exact same build for both tests. Not only are the results suspiciously much better and the two builds close to the margin of error, the results are also much less consistent. As mentioned previously, the old test was pretty consistent; this test had wild swings in both directions.

My best guess as to what's happening is that the new measured time is junk. fflush on Linux doesn't guarantee that the data is written to disk, only that the buffer is emptied and that it's up to the kernel to finish things. Whatever stuck from the kernel tweaks I recklessly tried (and can't seem to reverse) seems to have resulted in the kernel returning from writes quicker than it did before, possibly very quickly. The kernel's internal buffer doesn't seem to have enough time to empty across decodes; when it builds up too much it seems to fully empty, at which point whatever test was running gets a severe wall-time penalty. If that's what's actually happening, then the only writes that are reflected in the new data are the ones where the kernel fully dumps its buffer.

The next thing to try is fsync on fclose, which should push the data in the kernel buffer to the HDD and should make the results consistent again. Even fsync doesn't guarantee that the data is fully written, only that the disk fully has it (so SSDs have it at least in their RAM), but this is an old HDD with limited cache, so it should be good enough to test. |
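A minimal sketch of the fsync-on-fclose idea, assuming a POSIX system (a hypothetical helper, not code from the branch):

```c
#include <stdio.h>
#include <unistd.h>

/* Empty stdio's buffer into the kernel, then block until the kernel
 * has pushed its writeback buffer to the device, so wall-clock time
 * includes the actual disk write rather than just buffered copies. */
static int fclose_synced(FILE *f)
{
    if (fflush(f) != 0)
        return EOF;
    if (fsync(fileno(f)) != 0)
        return EOF;
    return fclose(f);
}
```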
Disregard any prior benchmarks or conclusions drawn; I vastly misunderstood how aggressively and when laptops put themselves in different power states, so it's all tainted. The following results try to control as many variables as is feasible (laptop plugged in and fully charged, no UI, no other access to the disk, screen off, no other programs open, cooldown between runs). Fairly confident that these results are as accurate as they can be.
Will eventually test on other hardware and redo the buffer size test to see if there's potentially a more appropriate size. |
Tried the same decode test using a fast M.2 SSD with a Zen2 4700U APU. Had to rebuild everything; these use gcc 12.2 and clang 11.1, while the previous test used gcc 9.4, but the difference is probably minimal.
The disk is fast enough that buffer size is not relevant using a single core. Parallelising with parallel only utilised an estimated ~330% of 8 cores and cut the time to ~210-225 seconds, still not enough for the default build to be noticeably inadequate using these tests. Processing 60GB of flac into 98GB of wav in under 4 minutes is neat. Slow SATA SSDs are common and sit between HDDs and fast SSDs in specs. I'm guessing the default build is adequate there under normal conditions (stronger compression, single core); maybe slow SSDs feel the pressure when fully threaded. |
Tried a thumb drive with the Zen2; not the slowest, with ~15MB/s writes, but not quick. Used roughly the first third of the corpus, as the full corpus doesn't fit on the drive. Fresh format, no fragmentation
So it looks like the benefit measured before is from fragmentation and not an OS difference, the same reason init_file sets the 10m buffer on Windows for encode. Don't fancy trying to defrag an NTFS partition on Linux to find out for sure; sounds like a recipe for data loss. Curiously, it looks like the data is fsynced at the end of the script; I don't know what else explains the script times all aligning but not the sum. Nothing in the script explains the large discrepancy between script time and flac sum; it's just a loop that calls flac for every file. |
The current state of the vbuf branch does as suggested: flac uses init_FILE and all changes are contained to encode.c/decode.c. WIN32 also uses an external buffer; it still needs to be tested on Windows to make sure encode speed remains the same and that decode speed has improved in the presence of fragmentation. I can probably check that win32 works, but not benchmark from a VM, as the VM would translate to Linux calls and we want to benchmark the Windows calls. |
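For context, a hedged sketch of the arrangement described: the flac tool opens the file and installs its own buffer before handing the FILE* to libFLAC via the init_FILE variant (the buffer size and error handling are illustrative; only the FLAC API names are from the library):

```c
#include <stdio.h>
#include <stdlib.h>
#include "FLAC/stream_encoder.h"

#define OUT_BUF_SIZE (10u * 1024 * 1024) /* illustrative 10 MiB */

/* The tool owns the stream and its buffer, so buffering policy lives
 * in encode.c/decode.c rather than inside libFLAC. The buffer must
 * outlive the stream, and setvbuf must precede any I/O on it. */
static int open_encoder_output(FLAC__StreamEncoder *enc, const char *path,
                               char **vbuf_out)
{
    FILE *out = fopen(path, "wb");
    if (out == NULL)
        return -1;
    *vbuf_out = malloc(OUT_BUF_SIZE);
    if (*vbuf_out != NULL)
        setvbuf(out, *vbuf_out, _IOFBF, OUT_BUF_SIZE);
    return FLAC__stream_encoder_init_FILE(enc, out, NULL, NULL)
               == FLAC__STREAM_ENCODER_INIT_STATUS_OK ? 0 : -1;
}
```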
I can test on Windows. Something that might be relevant: what filesystem did you test on? You mentioned NTFS once; are all tests on NTFS? |
Thanks. The laptop HDD is NTFS with the OS on a primary SSD. The fast SSD is ext4 with the OS on the same partition. The USB stick is FAT32. |
So, could you summarize what you're seeing? What hardware were you using for the numbers in the post where you said to disregard all earlier measurements? Was that an HDD? The current state of your vbuf branch, that is 'nofsync' I presume? |
The results from the disregard post were from the HDD on the laptop with the i7-6700HQ. The post after that tested the fast SSD in a miniPC host containing a Ryzen 4700U. The post after that tested the USB stick on the miniPC host. The current state of vbuf is 10m_nofsync, just a 10MiB external output buffer everywhere. tl;dr: I'm seeing a big speedup when writing to a presumably fragmented NTFS HDD; a set of flac files that takes an hour to decode to wav on the default build takes 45 minutes with a 10MiB buffer. No other hardware has shown a significant difference. |
I've tried to replicate. Here are results from running on Ubuntu 22.04, Intel Celeron N5105, Seagate Barracuda 2.5" 5400rpm, ext4 partition.
That gives me the following numbers
Looking at the averages and standard error, I cannot conclude which one is faster, but I think it is safe to say any possible difference gets drowned in the noise anyway. I do have a theory as to why adding a buffer might make encoding slower: with a small buffer, the OS can start writing right away while the program buffers a bit, but when the program itself implements a large buffer, the OS cannot start sending data to the harddisk until that buffer is full. Results on Windows will follow later. |
Thank you for trying to confirm.
That makes sense if the OS is fully writing one flac file before letting the next flac process run, i.e. there is an implicit fsync, so there's time at the start of every file where there is no write in progress. What I'm seeing I've interpreted as the next file being processed while the previous one is still being written by the kernel: the previous file has left the fwrite buffer thanks to the implicit fflush from fclose, but is still in the kernel buffer being written to disk after the process that made it has exited.

There's wildly variable behaviour of the default and 10m_nofsync builds on the HDD; the same results as before but highlighted for clarity: What I've attributed this to is the kernel buffer getting full and the flac process having to wait for the kernel buffer to empty enough during fwrite. My interpretation may be entirely wrong, pure guess. My OS is nothing special AFAIK, Ubuntu 20.04:
Sorry for the mess; no matter what the reason is, it seems to be peculiar behaviour. I'll try running the USB stick on the laptop to see if the behaviour matches the USB result from the miniPC or the HDD result from the laptop. It should match the miniPC result, but if it matches the HDD result then there's definitely something going on with the laptop OS |
The USB stick behaves the same on the Ubuntu 20.04 laptop as on the miniPC, mildly supporting that it's not OS weirdness on slow media causing the HDD result, and that it likely is a result of fragmentation. Reran the HDD test on the laptop using a live image of Fedora 38 and rebuilt with gcc 13. Did a run of default then 10m, rebooted, then a run of 10m then default to rule out caching:
This rules out any weirdness that may have accumulated on the daily-driver Ubuntu 20.04 that's been in service for 3 years, or from old toolchains. Admittedly it's the same hardware, but two OSes have confirmed that the 10m build performs better when using a (very likely) fragmented HDD. |
How much space is left on that device when running the tests? If that is more than 20%, fragmentation should not be too much of an issue. On the other hand, the Linux NTFS handling might not be as optimized as on Windows, let alone as the ext4 handling. |
It's 98.1% full, 18.5GB of space free out of ~1TB. Linux NTFS handling certainly leaves a lot to be desired, or it did at least. My Ubuntu 20.04 uses ntfs-3g, which is the old way of handling NTFS (feature-incomplete but mature, implemented via FUSE, a userspace solution that incurs extra context-switching overhead at the very least compared to a kernel implementation). A much better replacement made its way into the kernel last year (ntfs3), but it looks like it stalled due to world events so is still WIP; Fedora 38 probably hasn't enabled it by default. It's possible that the additional ntfs-3g context switching is the bottleneck; I doubt it, but I'll have to figure out whether the ntfs3 driver is the default on the Fedora live image and, if not, redo the test with it to see if there's a difference. Even ext4 is still seeing improvements ( https://www.phoronix.com/news/Linux-6.3-EXT4 ). Tangentially related, io_uring is an exciting new async I/O interface for Linux, as much as I/O interfaces can be exciting. It's unlikely to be much of a benefit to flac's synchronous access pattern, but there are likely at least minor gains from less system-call overhead. |
Checked the live image and it turns out Fedora 38 does use ntfs3, so the Fedora result above rules out an unoptimised ntfs-3g as an issue, as ntfs3 shows similar gains with the 10m build. Note the HDD results between OSes are not directly comparable; AFAIK there's nothing that can be read from the Ubuntu runs taking longer than the Fedora runs. To control for laptop power-state shenanigans I had the Ubuntu runs wait 10 minutes to get into a stable (not ramping due to UI) lower power state. The Fedora live image didn't seem to have any adaptive power behaviour; presumably the live image doesn't enable any of that for a snappier live impression. There are also probably different schedulers in use and all sorts of different settings. |
I've finally taken some time to test this under Windows, and I must say, under the right conditions the difference is horrific, truly. I've compared three compiles: one with the vbuf tree of your repository, one with that same tree but rolled back a few commits, and one rolled back, with the

I've taken an old 16GB no-name exFAT USB stick and filled it to 98%. Testing was done in the remaining 2%: first copying a WAV file there, then compressing it to a FLAC, then decompressing it to a WAV. These 3 files are then removed, making space for the next round.

Differences are night and day, for encoding as well as decoding. Tested presets are 0, 1, 2, 3 and 4. It seems encoding gets 10x as fast, and decoding gets 4-5x as fast. I think the reason 3 and 4 decode so much faster is the larger blocksize (hence larger writes); 4096 * 4 is probably a much better write size than 1152 * 4. So, clearly, I'd say this is an improvement. Even if it gives a slight slowdown in situations where fragmentation is low (which I can't say for sure), the advantages far outweigh the (possible) disadvantages. I'll do some testing on smaller buffer sizes; perhaps something like 128KiB already improves a lot. I'm wondering though, should the setvbuf be removed from libFLAC, in effect moving it to the flac command line tool? Or should we only add it to the |
Here are some more detailed results. I'd say using a 32KiB buffer already gives a nice performance improvement. I'm not sure whether it makes much sense, but I think 10MiB is a bit wasteful. Then again, that is what's been in FLAC for years now on Windows. I'll do some more testing, trying to get the same result with a heavily fragmented ext4 partition or something like that. |
Some more results. Took a 32GB NTFS partition on a 250GB SSD (several years old) and filled it to 99%, then ran the tests on Windows 10. The X-axis is different here, because a logarithmic scale doesn't show any labels. The difference is small but consistent. I have no clue why 32KiB and 10MiB buffers would be slower here than 128KiB and 512KiB buffers on decoding. |
Even more results. I tried to test something a little more realistic, so took a 1TB HDD, filled it to 95% and ran tests. The results are the opposite of the previous few: adding a buffer slows down encoding by about 20% for some reason, while a buffer of 128KiB or 512KiB speeds up decoding by about 40%. These results don't seem to make sense, but they are reproducible; I've run the tests several times. Also, running the binaries is interleaved: 10 tracks are tested, and each track gets tested with each binary before moving on to the next track. |
Sorry, I just noticed your latest posts.
In a way I'm glad not to be the only one mystified by benchmarking I/O.
I came to the conclusion that at least 64KiB showed a benefit, and that the MiB range did help a bit more, but that was during some unreliable benchmarking so it's inconclusive. Research indicated 8MiB could be beneficial and that there was little benefit to going beyond 16MiB, but I didn't save the link.
IMO control of how a stream is handled should stay with whatever opened the stream, i.e. libFLAC when init_file is used or the external program when init_FILE is used. Quoting a manpage:
We don't know the history of the stream, so we can't guarantee that no operations have been performed. Similarly, that's why I suggested elsewhere to only fclose in libFLAC if libFLAC was the one to fopen (for all we know, the stream could be an archive or a chained ogg file that the external program doesn't necessarily want closed). |
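To make the constraint concrete, a small sketch of why buffering policy belongs with whoever opened the stream: per the C standard, setvbuf is only valid between opening the stream and the first operation on it.

```c
#include <stdio.h>

static char big_buf[1 << 20]; /* illustrative 1 MiB buffer */

int main(void)
{
    FILE *f = fopen("out.wav", "wb");
    if (f == NULL)
        return 1;
    /* Legal: nothing has touched the stream yet, so the opener may
     * still choose its buffering. */
    setvbuf(f, big_buf, _IOFBF, sizeof big_buf);
    fputc('R', f);
    /* From here on, another setvbuf is undefined behaviour; a library
     * handed this FILE* cannot safely change its buffering anymore. */
    return fclose(f) == 0 ? 0 : 1;
}
```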
If you mean a slowdown because a larger buffer takes longer for the initial flush, IMO that's not really relevant and not really a slowdown. When I/O pressure is low (the kernel+hardware buffer is not saturated, i.e. fast hardware and/or small data), it just delays the actual write slightly and shouldn't alter ./flac's execution time. When I/O pressure is high (the kernel+hardware buffer is saturated so an fwrite has to wait during runtime, i.e. slow hardware and/or a lot of data, like processing an entire collection), there are bigger factors at play than a millisecond delay on the first flush. |
I've repeated the last results, but I've added an executable that not only has a vbuf on the output file but also a vbuf on the input file. That seems to alleviate the problem. |
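A sketch of the combined change being tested, with the input and output streams both given large buffers (sizes are illustrative; the actual build's values aren't stated here):

```c
#include <stdio.h>
#include <stdlib.h>

/* Buffer both ends of the pipeline: large reads from the .flac input
 * and large writes to the .wav output. Both calls must precede any
 * I/O on their respective streams. If a malloc fails, passing NULL to
 * setvbuf is still fine: stdio then allocates the buffer itself. */
static void buffer_both_streams(FILE *in, FILE *out, size_t size)
{
    setvbuf(in, malloc(size), _IOFBF, size);
    setvbuf(out, malloc(size), _IOFBF, size);
}
```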
That reminds me of hacks in onion108/cplusplus-output-benchmark#1 |
This ( https://hydrogenaud.io/index.php/topic,123889.msg1024853.html#msg1024853 ) post prompted me to poke around fwrite and hack in setting the fwrite output buffer size for non-win32 targets: https://github.com/chocolate42/flac/tree/vbuf
Tested by re-encoding a ~150MB 24 bit flac file to -0 on the same drive, on a 2.5" HDD and a slowish SSD in a Skylake laptop. Tried a buffer of 10MB to match the win32 path, and a buffer of 10MB+8 based on the recommendations in this post: https://www.enterprisestorageforum.com/hardware/a-trip-down-the-data-path-i-o-and-performance/
These are the average wall times of 10 runs each in seconds on Linux 64 bit:
Didn't make a proper PR directly because there are some open questions