[C++] Provide a Sync() abstraction on writable file abstractions provided in arrow::io #39967
Comments
Let's call it
My understanding of
Either way, maybe you could add the details of what is used under the hood for each implementation?
@pitrou Would
Ok, then
That depends on what your needs are. Since I don't know how SSD-to-GPU works, it's difficult to know if that's what you want. Also, regardless,
@pitrou @NicolasDenoyelle the answer about the directory question is easy: no. Reasons: as @pitrou said, (1) a file handle doesn't (can't) have a reference to the directory it's currently in (imagine how hard that would be in a multi-threaded environment!), and (2) an application that needs the directory entry update might have multiple files to sync; it should sync all the files before it syncs the directory, and it should do that only once. You should also note that
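A minimal POSIX sketch of that ordering, using plain syscalls rather than the proposed Arrow API (`SyncFilesThenDirectory` is a made-up name for illustration): sync every file first, then sync the containing directory exactly once.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <vector>

// Made-up helper illustrating the ordering described above: fsync every file
// first, then fsync the containing directory exactly once so the directory
// entries themselves become durable.
bool SyncFilesThenDirectory(const std::vector<int>& file_fds, const char* dir_path) {
  for (int fd : file_fds) {
    if (::fsync(fd) != 0) return false;  // commit each file's data and metadata
  }
  int dir_fd = ::open(dir_path, O_RDONLY | O_DIRECTORY);
  if (dir_fd < 0) return false;
  bool ok = (::fsync(dir_fd) == 0);      // commit the directory entries once
  ::close(dir_fd);
  return ok;
}
```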
Well, the manpage seems to indicate that
+1 on this idea. And from the description, it seems it would not provide any guarantee on object stores when multipart upload is enabled?
What is the status of this feature request? Has there been any work on this? I was just looking through the code for exactly that feature and was a bit surprised that it does not yet exist. I would be happy to chime in and get this done!
I started but got overwhelmed by all the implementations of the IO interfaces.
@felipecrv do you have a branch somewhere with your changes so far?
Describe the enhancement requested
Operating systems don't immediately commit data provided by userland code to storage devices. This is usually not a problem because (1) the kernel will not take very long to asynchronously commit the data on its own, (2) the kernel mediates access to the filesystem and guarantees all processes see the writes performed to the same file so far [1], and (3) applications can handle missing data due to power loss or kernel crashes (e.g. a corrupted cache file can simply be re-downloaded).
Applications with more stringent durability requirements (e.g. SQLite) will force commit of pending data on the kernel by calling `fsync` on transaction commit. But even databases avoid doing this for every file and opt to `fsync` only a special file storing the Write-Ahead Log [2] containing batched updates. Don't think of `Sync()` as a flushing mechanism that you should always call: `fsync` can add a lot of unnecessary latency (in the many hundreds of milliseconds) and wear down storage devices.
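For local files, such a `Sync()` would most naturally delegate to `fsync` (or its platform equivalent). A minimal sketch of that delegation, assuming a raw file descriptor; `SyncFileDescriptor` is a hypothetical helper, not part of `arrow::io`:

```cpp
// Hypothetical helper, not part of arrow::io: the syscall a local-file Sync()
// could delegate to (fsync on POSIX, _commit on Windows).
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

#if defined(_WIN32)
#include <io.h>
#else
#include <unistd.h>
#endif

// Forces the kernel to commit all buffered data (and metadata) for `fd`
// to the storage device before returning.
void SyncFileDescriptor(int fd) {
#if defined(_WIN32)
  int rc = _commit(fd);
#else
  int rc = fsync(fd);
#endif
  if (rc != 0) {
    throw std::runtime_error(std::string("sync failed: ") + std::strerror(errno));
  }
}
```

In the actual interface this would presumably return an `arrow::Status` rather than throw, to match the rest of `arrow::io`.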
Network and Distributed File Systems

Networked file systems usually provide commands that ensure durability of the pending writes on the server storage media; `Sync` would delegate to these commands in these cases.

Distributed filesystems that rely on data replication might provide operations to ensure writes are propagated before returning (Quorum Writes). Since late 2020 this is not an issue with AWS S3, so `Sync` on S3 [3] files would be a no-op.
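To make the shape of the abstraction concrete, here is a hypothetical sketch; neither `WritableWithSync` nor `S3LikeOutputStream` exists in `arrow::io` today, only `arrow::Status` is the real return convention:

```cpp
#include "arrow/status.h"

// Hypothetical interface extension; not part of arrow::io.
class WritableWithSync {
 public:
  virtual ~WritableWithSync() = default;
  // Block until previously written data is durable on the backing storage.
  virtual arrow::Status Sync() = 0;
};

// A local-file backend would delegate to fsync (see the sketch above).
// An S3-backed stream could make Sync() a no-op, since S3 already provides
// strong consistency and durability once an upload succeeds [3].
class S3LikeOutputStream : public WritableWithSync {
 public:
  arrow::Status Sync() override { return arrow::Status::OK(); }
};
```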
Masking Latency

If you must issue `Sync` calls, one way to mask the latency caused by them is to issue writes as soon as possible, do some other work, and only then call `Sync`.
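A self-contained POSIX sketch of this pattern, using plain syscalls rather than the proposed `arrow::io` interface (the file path and the filler work are made up for illustration):

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstring>
#include <numeric>
#include <vector>

int main() {
  const char* data = "batched updates\n";
  int fd = ::open("/tmp/example.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) return 1;

  // 1. Issue the write as early as possible; the kernel can start writing
  //    the dirty pages back in the background.
  if (::write(fd, data, std::strlen(data)) < 0) return 1;

  // 2. Do other useful work while writeback proceeds asynchronously.
  std::vector<long long> v(1 << 20);
  std::iota(v.begin(), v.end(), 0LL);
  long long checksum = std::accumulate(v.begin(), v.end(), 0LL);
  (void)checksum;

  // 3. Only now pay for durability; much of the data may already be on the
  //    device, so the blocking fsync tends to be shorter.
  if (::fsync(fd) != 0) return 1;
  return ::close(fd) == 0 ? 0 : 1;
}
```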
[1] exceptions to this exist with the use of flags like `O_DIRECT` on Linux's `open` syscall: https://www.man7.org/linux/man-pages/man2/open.2.html

[2] then in the event of a power loss, the database can replay the write-ahead log and complete any missing write to the more complex structures of the database, like indexes
[3] https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
Component(s)
C++