Tar.extract_file: extract content of single file in tarball #95
Conversation
Force-pushed from 1c033cf to 893c1b9.
Codecov Report

| Coverage Diff | master | #95 | +/- |
|---------------|--------|-----|-----|
| Coverage      | 96.37% | 96.46% | +0.08% |
| Files         | 4      | 4   |     |
| Lines         | 634    | 650 | +16 |
| Hits          | 611    | 627 | +16 |
| Misses        | 23     | 23  |     |

Continue to review the full report at Codecov.
Thoughts about API:
If the path is a directory in the tarball, we could extract the contents of all files under that directory. Or we could error.
Yeah, I thought about that too, so I will update it.
Okay. You might risk reading more files than you actually intended if the predicate is not strict enough, though.
True, but if it's a string it will only ever match one file, which seems safe enough. We could return the number of files matched. Another thought I had was invoking a callback for each tarball entry with the IO handle of the data stream. But that may be a bit too low-level an API, even though it is maximally flexible.
You mean that if you give …, would …?
Since the result is written to an io instead of being returned, perhaps the return value could be something like an array with matched files and number of bytes; that could be useful if you want to split the returned stream yourself. Edit: it should just return all the headers that correspond to the file.
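A purely hypothetical illustration of the "matched files and byte counts" return value floated above; the tarball path and file name are made up, and this is not the shape the PR ended up returning (it returns the matching `Header`s instead):

```jl
# Hypothetical sketch only: illustrates the "array of matched files and byte counts"
# idea discussed above, not the actual API of this PR.
out = IOBuffer()
matched = Tar.extract_file(hdr -> hdr.path == "Project.toml", "pkg.tar", out)
# Under the idea floated here, `matched` might look like [("Project.toml", 1234)],
# i.e. each matched path paired with the number of bytes written to `out`.
```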
… that match a predicate, from the tarball archive.
Force-pushed from 893c1b9 to eb0f56b.
Rewrote to write to an IO and to return the header(s) of the matching files.
### `Tar.extract_file`

```jl
extract_file(predicate::Function, tarball, out) -> Vector{Header}
```
Maybe this should be `extract_files(predicate::Function, tarball, out) -> Vector{Header}` (plural) instead of `extract_file(...)`?
To me, giving a single file to extract makes the most sense. If you want to extract multiple files, you just write a loop over the (possibly filtered) files and pass in the same …
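A rough sketch of the loop pattern described above, using the `extract_file` signature from this PR; the file list, the tarball path, and the choice to reuse the same `out` for every call are illustrative assumptions:

```jl
# Sketch: extract several specific files by looping over the paths you want and
# reusing the same output IO. Note that each call scans the tarball from the start.
wanted = ["Registry.toml", "A/ACME/Package.toml"]   # hypothetical paths
out = IOBuffer()
headers = Tar.Header[]
for path in wanted
    append!(headers, Tar.extract_file(hdr -> hdr.path == path, "registry.tar", out))
end
```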
Fair enough. Is there an urgent need for this? I'm on parental leave until at least next month.
No, there is no urgency. Trying this out a bit in Pkg, it allocates 40 GB to read all files in the registry, and most of the time is spent on allocating buffer vectors. There should be a few low-hanging fruits. Edit: lowest-hanging fruit removed; the next one seems to be allocations for all the headers.
Force-pushed from 7255613 to 17797c7.
This is what I came up with for use in Pkg: https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723. Here it is in action in Pkg: JuliaLang/Pkg.jl#2431. I ended up not using anything from this PR.
After trying to use this for a while, I don't think this API works very well. The problem is that to extract a single file you still need to parse through all headers (or at least until you find the file). So I think there needs to be some intermediate object created that does this for you and caches it (basically https://github.com/JuliaLang/Pkg.jl/pull/2431/files#diff-81abdf925a6c6e19622f1607e7777ae3bd3caabc942b73a7b33cbc0b96a9edff).
So my reading of that particular Pkg application is that, rather than having a function to extract a single file at a time (calling that repeatedly makes the whole thing O(n^2)), you really want a way to iterate through the entire tarball and populate an in-memory representation of the tarball contents. I'd also rather avoid exposing special types like …
The reason for the way it is done in Pkg is to avoid having to read data that isn't used. That's why it just reads the headers, remembers position + size, and only reads the data (and caches it) on demand. Regarding the special types, they are not meant to be "public"; it was just a way I found convenient to structure the internals for the Pkg use case.
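A minimal sketch of that "scan the headers once, remember position + size, read data on demand" idea. This is not the Pkg implementation; it assumes an uncompressed, seekable tarball whose entries each use exactly one 512-byte ustar header (no pax/GNU extended headers), so data offsets can be computed from the sizes that `Tar.list` reports:

```jl
using Tar

# Map each file path to (data offset, data size), computed from one pass over the headers.
struct LazyTarIndex
    io::IO                                # seekable handle to the uncompressed tarball
    index::Dict{String,Tuple{Int,Int}}    # path => (offset of data, size of data)
end

function LazyTarIndex(tarball::AbstractString)
    index = Dict{String,Tuple{Int,Int}}()
    offset = 0
    for hdr in Tar.list(tarball)                      # one pass over all headers
        datastart = offset + 512                      # data follows the 512-byte header block
        if hdr.type == :file
            index[hdr.path] = (datastart, hdr.size)   # later duplicate entries overwrite earlier ones
        end
        offset = datastart + 512 * cld(hdr.size, 512) # data is padded out to 512-byte blocks
    end
    return LazyTarIndex(open(tarball), index)
end

# Read a single file's bytes on demand by seeking straight to its data.
function readfile(t::LazyTarIndex, path::AbstractString)
    offset, size = t.index[path]
    seek(t.io, offset)
    return read(t.io, size)
end
```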
When we decompress the tarball, we've already read all the data, so we've paid the cost of doing that already. We might as well keep it around in memory; the only cost of doing that is the size of a decompressed General tarball, which is currently 25 MB, not so big these days. If we use the …
The tarball is decompressed once (when the registry is downloaded), not every time you read the registry.
I was planning to store it compressed to save disk space (yes, we use a ton of disk space, but every byte counts). If we're going to decompress it on disk, then we could use mmap, access it like an in-memory array, and let the kernel decide when to page the contents in/out of memory. That seems easier and more efficient than seeking around the tarball I/O handle. Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command. But when the registry is a single, immutable file, I think it makes more sense to cache the registry data structure more persistently. If the registry changes, it will be a different file entirely. So I think it makes sense to: …

We could use a WeakRef to store the registry cache so that it gets cleaned up only when there is GC pressure. We could either use the size+mtime+ctime+inode of the compressed registry file as a cache key and reload when that changes, or just flush the cache when some Pkg operation in this process has actually updated the registry and not otherwise. Or both.
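A minimal sketch of the mmap idea, assuming the registry has already been decompressed to a plain `.tar` file on disk (the path is hypothetical) and that `Tar.list` is given an in-memory IO view of it:

```jl
using Mmap, Tar

bytes = Mmap.mmap("registry.tar", Vector{UInt8})   # lazily-paged view of the whole tarball
headers = Tar.list(IOBuffer(bytes))                # hand Tar an in-memory view; the kernel
                                                   # pages the data in and out as needed
```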
After JuliaLang/Pkg.jl#2438 the registry retrieved from PkgServer is considered immutable and it is "cached" through Pkg operations. It caches things based on the git-tree-sha of the registry.
That's been done for a while with … It is useful when the registry is a git repo, since we don't cache the registry itself then, but for pkg server registries (with the new cache) it doesn't really do anything, since it will never even attempt to read the same TOML file twice. Regarding decompression/compression, I think I measured it at around 0.1 seconds to decompress, so perhaps it's fine to keep it decompressed.
Here is an example implementation + benchmark of keeping things compressed:

```jl
using Downloads, Tar, Pkg
using Pkg: PlatformEngines.exe7z

reg_url = "https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/28c601505a512da4e5abebc75cd448d48b8dac9c"
reg_file = "registry.tar.gz"
Downloads.download(reg_url, reg_file)

function untar_memory(tar_gz::AbstractString)
    data = Dict{String, Vector{UInt8}}()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    io = IOBuffer()
    open(`$(exe7z()) x $tar_gz -so`) do tar
        Tar.read_tarball(x->true, tar; buf=buf) do hdr, _
            if hdr.type == :file
                Tar.read_data(tar, io; size=hdr.size, buf=buf)
                data[hdr.path] = take!(io)
            end
        end
    end
    return data
end

using BenchmarkTools
@btime untar_memory(reg_file)
```

Takes about 100 ms. Just doing

```jl
@btime read(`$(exe7z()) x $reg_file -so`)
```

takes about 60 ms.
That seems like a good approach then, doesn't it? Storing the data as a …
Starting with Julia 1.7, a registry can also be stored in a depot as a gzipped tarball. This commit unpacks the tarball into a temporary directory. This was the simplest way to support this new feature. This implementation will unpack a fresh copy of the registry on every call to `reachable_registries`. We could instead cache the temporary directory for future calls. There's an open PR for the Tar.jl package that will be of future interest: JuliaIO/Tar.jl#95. This would allow us to instead stream the individual files straight out of the tarball, rather than having to unpack it into a temporary location.
I keep coming back to this seemingly simple feature and getting caught up in all the unfortunate corner cases. What we want this to do is something equivalent to extracting the tarball somewhere temporary and then reading a single file out (or a set of files based on some criterion). Seems simple enough: read through the tarball, skipping over anything that's not the file you're interested in; when you find a header for the file that you want to extract, copy the data for that file to the given IO handle. And that simple strategy is what this PR implements. So what's the problem?

Unfortunately, tar files can have multiple entries for the same output path. If you're extracting the tarball, that's not an issue: you just extract each entry in order, overwriting the previous one with the latest one, and the last one is left at the end. But what happens with this implementation? We end up extracting the contents of every version of the file to the IO handle, not just the last one. That's very much not the desired behavior of sending the contents of the file that would have been left after extraction. In order to make sure you extract only the last version of a file to the IO handle, you have to either be able to seek back and forth in the input tarball or you need to buffer the entire contents of the file you're extracting, which could take a lot of memory. If you want to not tax your RAM, it might be preferable to extract a selected set of paths to a temporary location and then read what you're interested in from that temp location ... which is exactly what …

There's also the symlink issue, which is alluded to in the PR as a TODO note. This entails similar problems: if the symlink comes before the entry you want to extract in the tarball, then you can extract by scanning forward only, but if the symlink comes later in the tarball, then you have to either go back and find the target file or have it buffered. This situation is even worse, though, since the target file could be any of the files in the tarball, so you'd either have to be able to seek or you'd have to buffer everything. Again, in order to avoid potentially blowing out memory, you'd want to extract to a temp location and then read the contents afterwards.

So in the case where your tarball is seekable, you can implement this efficiently by reading to the end, keeping track of where each file's data can be found in the tarball, and then extracting the correct path contents at the end. However, a seekable tarball is atypical, since tarballs are usually compressed, so input is being read either from a command doing decompression or from a decompressor stream object, neither of which is seekable. In the non-seekable case, the best you can do is extract the entire tarball to a temporary directory and stream the contents of the paths you're interested in to the output IO stream.
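For the non-seekable case, the fallback described above can already be expressed with the package's existing API: `Tar.extract` accepts a predicate, so you can extract just the paths you care about to a temporary directory and read from there (the tarball name and path below are illustrative):

```jl
using Tar

# Extract only the entry we want into an empty temp directory, then read it from disk.
# Real extraction handles duplicate entries and symlinks correctly: the last entry wins.
dir = Tar.extract(hdr -> hdr.path == "General/Registry.toml", "registry.tar", mktempdir())
content = read(joinpath(dir, "General/Registry.toml"), String)
```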
Given the complications above, a better API might be …
Any chance that this might be included in Tar.jl soon? Part of my work involves sensitive data on systems with mainly shared file systems [1], so extracting this data to a temporary location on the file system is a no-go. I modified @KristofferC's gist (https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723) to handle compression: https://gist.github.com/JBlaschke/ec0f4ece417bdfe1428a85c78c0804f7 and it works pretty well. I ran some benchmarks where I inflated 187 MB to 1.2 GB, analyzed a single file (~800 MB), and moved on to the next archive. My application's RAM footprint varied from ~3 GB to ~7 GB, which is not great, but not terrible either. I think I can safely say that the workflow: i) extract entire archive to …

[1]: now that I describe it, it sounds way cooler than it actually is.
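In the same spirit as those gists, here is a hedged sketch of reading one file straight out of a gzipped tarball without touching the file system. CodecZlib is an assumption for the decompression (the linked gist may do it differently), the paths are made up, and the pattern of consuming each file's data with `Tar.read_data` mirrors the `untar_memory` example above:

```jl
using Tar, CodecZlib

function file_from_targz(tar_gz::AbstractString, path::AbstractString)
    out = IOBuffer()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    open(tar_gz) do io
        tar = GzipDecompressorStream(io)      # decompress in-process, no temp files
        Tar.read_tarball(x -> true, tar; buf = buf) do hdr, _
            if hdr.type == :file
                if hdr.path == path
                    truncate(out, 0)          # keep only the last entry for this path
                    Tar.read_data(tar, out; size = hdr.size, buf = buf)
                else
                    Tar.read_data(tar, devnull; size = hdr.size, buf = buf)
                end
            end
        end
    end
    return take!(out)
end

data = file_from_targz("registry.tar.gz", "General/Registry.toml")  # illustrative paths
```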
Are you asking about the …?
@StefanKarpinski I am asking for any functionality that will let me extract a given file from a tar archive into an IO buffer. I see that there are quite a few corner cases that might make …
I've just come across this, and do let me know if I'm misinterpreting the situation, but it looks like the two main challenges are: …

I may be missing something, but it seems to me that one potential approach is to have a …
I do like the API of extracting the tarball to a dict, mapping paths that are files (or symlinks to files) to their content, and just not including non-files. That's already functionality that Pkg is using, and it seems generally useful. I'm not sure about extracting an index; as pointed out, since tarballs are usually compressed, you'd have to decompress everything up to the point of the file you want to extract anyway. One option that @staticfloat and I have been kicking around is to have a footer block that contains some magic and an index of where to find each object in the tarball: if the footer exists, then you can use it; if it doesn't, we could generate it once and then use that index, which I guess is roughly the proposed API. With regard to compression, it's actually possible to chunk the compression so that you can jump to the start of a specific item in a tarball without having to decode any of the previous chunks. It's a bit tricky, but we might be able to implement this special format, which has the advantage of allowing efficient random-access file extraction while still remaining readable by the classic …
The footer idea is interesting, and a quick search turns up some prior art: … Of course, this only works when you're dealing with a modified tarball. I'd think a progressive-enhancement approach may be a good fit here: look for such a footer if it exists, and if not, fall back on scanning the tarball, with the option to create the footer as you suggest. For speed, I imagine an analogue of …

This seems straightforward to me with an uncompressed tarball, but as you note, a compressed tarball is harder. Chunked compression would help, but I wonder if it's possible to create "decompression checkpoints" over the course of the scanning that could be jumped to (for the non-chunked case). A bit like an …
I think there should be a (very) strong use case before we start putting non-standard stuff into the tarballs. The number of people in this issue over two years suggests to me that there isn't really such a strong case. In any case, functionality like that in https://github.com/devsnd/tarindexer, where you have an external index, seems preferable to me since you don't mess with the tarball at all then.
For reference, the use case that's brought me to this issue is wanting to support being given a tarball in the form of an …
@KristofferC, the tarball would work completely normally with other tools. You can insert what are effectively comment blocks in tarballs that contain this additional information but don't produce any files. The gzip format also has ways to insert "comments", although they're a little sneaky. Zstd has an official way to do it, which is even better.
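For concreteness, here is a toy sketch of the Zstandard route mentioned above: skippable frames (magic numbers 0x184D2A50 through 0x184D2A5F in the zstd format spec) are ignored by decoders, so an index embedded this way stays invisible to a normal zstd/tar pipeline. This only illustrates the frame layout; it is not anything Tar.jl implements:

```jl
# Write a zstd "skippable frame": 4-byte magic, 4-byte little-endian payload length, payload.
# Decoders skip it, so e.g. a serialized path => offset index could ride along in the file.
function write_skippable_frame(io::IO, payload::Vector{UInt8})
    write(io, htol(UInt32(0x184D2A50)))        # skippable-frame magic number
    write(io, htol(UInt32(length(payload))))   # payload size, little endian
    write(io, payload)
    return nothing
end
```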