-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a wheel management API #697
Comments
Do you have an issue or something that shows what you're proposing as a "wheel public API"? And for reading or writing wheel files, we have purposefully taken a sans-I/O approach, so it would be more like "processing and creating |
There's a long time draft PR here: pypa/wheel#472
But |
It looks like https://github.com/pypa/wheel/pull/472/files#diff-8ca627810a744afb59bbb1d28a71f6ee66c9878506cfbbf2b6b980053a13c2b8 is the key bit. We already deal with wheel file names, so the key bit seems to be the parts that parse/create all the details of a wheel file.
Nope. Closest thing we're working on right now is parsing metadata. |
Ok, so to clarify, does that mean you don't want to have an I/O implementation in this project? I was hoping to reduce the amount of separate projects, and having |
FWIW, an abstracted interface for interacting with wheels was created as part of the |
I am doubtful that we can create a sans-io implementation of the wheel format -- wheels can and do contain files that are GBs in size, which would make operating in-memory with such wheels memory-intensive, if we insist on operating with only complete strings/bytes. If we're doing things incrementally, it's better to rely on existing I/O patterns. Notably, that's basically what we have with https://github.com/pypa/installer/blob/5f281b8e312bac4da00670b643dd4499e1c66a9a/src/installer/sources.py#L139 and even the wheel public API PR ~basically implements that interface (thanks @agronholm for doing that!). |
@pradyunsg if |
I suspect |
Probably |
Correct, and I wasn't proposing that. We actually can't have any dependencies at this point. |
Quoting https://installer.pypa.io/en/stable/concepts/ (about
Do I get a cookie for this not being needless overengineering? 🙃
That's correct. The installation usecase only cares about reading, so that was the scoping there. |
I just realised that this was never answered directly -- yes, I think that's a good idea. If @brettcannon and @dstufft agree (or at least are OK with going down this route), we'd have concensus among that |
Sorry, missed the notification for this comment! My only real concern is starting to have to care about file I/O which we have avoided thus far. If we can manage to keep a sans-I/O approach then I'm fine with it. If we can't then I think we need to discuss what that change in stance means and how we are going to keep it under control (i.e., I don't want to take on any networking, for instance). |
I don't quite get the apprehension about file I/O. Abstracting it out is fine, but what's the problem with providing a implementation that handles the file I/O? I was hoping to get all of that moved out of |
@agronholm not getting in the middle of I/O has been a project policy because people want to use this project in odd places and ways. For instance, if we try go do file I/O we then will have people asking about how to make it work with files stored in some cloud blob storage, some async framework, etc. We have been able to avoid those situations by simply not engaging in any I/O in any way and thus not forcing people into a specific I/O approach. And even abstractions are a bit tenuous. For instance, if we have a creation API for wheels, how do you pass in the files you want to write and their locations? The simple way is to take a directory and just slurp it all into the zip file. But that requires an API to something that can read a directory and then read those contents. We could define our own protocol that mirrors I personally think we could do this in stages. The obvious stuff that requires no I/O we can do without any controversy in the first stage. And then for anything that we think would require file I/O we would have to be very thoughtful about whatever protocols we wanted to support to make it happen and not corner ourselves in some way that isn't appropriately flexible and in a second stage. |
@brettcannon I'm not sure I follow. Why can't we have an The detail of dealing with the filesystem or network is then a problem for the IO object passed in, and if it can't provide a compatible API, then we can't work with that. Taking in an IO object as an argument implies that we don't have to deal with the details of the IO here (leaving that to the object at hand). |
I would like to add that during the 5 years I've maintained |
I'm assuming you mean And we definitely can go with that protocol, we just to have to agree that's the protocol we want to go with and understand the implications. Specifically, that protocol is big and it isn't async. I'm personally fine with all of that, to be clear, but I want us to be explicit that this is how we want to support file I/O in the project going forward, especially if we don't want to put in the effort to specify smaller, stricter protocols (which I'm once again fine not doing for the simplicity of it all). We also have to understand that protocol has no way to walk the file system (e.g. Once again, I'm not saying "no" to trying to make this work at all (I think it makes sense for a wheel API to exist here and it makes sense for it to be able to read and create wheels), I just want us to all agree on where the edges of this project are going to be pushed out to since this will change some things for us. |
Ah, OK. I understand the concern now. 😅 I agree that the choice of having I/O-based API (and the corresponding I/O mechanism) should be an explicit design choice. And, I'd be on-board for having a sans-I/O implementation (with a filesystem-based I/O helper provided out of the box). However, I can't really come up with any sans-I/O design where we can have tooling handle wheels with large files in them, without a bunch of memory overhead in a sans-I/O model. Given that, I'm OK with a non-async I/O model for composing/creating a wheel and processing/reading a wheel. The reading portion of a wheel API is covered by the installer's |
It seems like it should be possible to design a sans I/O approach here? Maybe not with the zipfile module itself, but that's just a limitation of that module then. But in theory, we should be able to define an abstraction that is passed and emits chunks rather than whole files at once. |
Are you thinking of something like: from functools import partial
from shutil import COPY_BUFSIZE
wheel = ...
with open(large_model_filename, "rb") as model_file:
# something before to initialize the file, somehow?
for chunk in iter(partial(model_file.read, COPY_BUFSIZE), b''):
wheel.<something>(chunk, "purelib", "awesome-model.safetensors")
# something after to finalize the file, somehow? If we put it in a context manager... I'd argue we're basically re-creating: with open(large_model_filename, "rb") as model_file:
with wheel.open(destination, "purelib", "wb") as dest_file:
shutil.copyfileobj(model_file, dest_file) ... except without the ability to use shutil. 😉 (and, no, I don't think that's a good API)
Yea, that's part of what I was hinting at -- there's only one-shot mechanisms to write to a zip (in-memory with That leaves us with the only way to do things (without taking an I/O object) as pretending that it's an I/O-ish object (while not using standard I/O APIs) while allowing for potential misuse, or a stateful object when performing the "add a file" operation. I don't think a sans-IO model based on chunks would be better than: with open(large_model_filename, "rb") as model_file:
wheel.<something>(model_file, "purelib", "awesome-model.safetensors") and having a loop in there that's basically https://github.com/pypa/installer/blob/444e5295f2403ab303dcf14160591074396f1d21/src/installer/utils.py#L116. |
I think it's a fine API? Like yea you're kinda doing it weird for shutil, but it also means you can do: from functools import partial
from shutil import COPY_BUFSIZE
async def make_wheel():
wheel = ...
session = aioboto3.Session()
async with session.client("s3") as s3:
s3_ob = await s3.get_object(Bucket=bucket, Key=blob_s3_key)
stream = s3_ob["Body"]
while chunk := stream.read(COPY_BUFSIZE):
wheel.<something>(chunk, "purelib", "awesome-model.safetensors") I think allowing that is worth the slightly worse implementation when you couple yourself to an IO implementation (and let's be honest, all Sans I/O APIs are at least a little bit less streamlined than ones that tie yourself to an IO implementation, because they have to be to actually function). Stepping back a minute though, can we define what it is exactly that this API is trying to do? Looking at the installer code linked, I see a few things:
Reading that, to me none of the interesting bits are actually driving the import zipfile
from shutil import COPY_BUFSIZE
filename = "..."
with zipfile.ZipFile(filename, "r") as zfp:
wheel = SansIOWheelFile(filename, zfp.namelist()) # Parses the filename and assigns attributes to wheel.version etc
wheel.dist_info_dir # Searches the list of filenames it's stored internally and returns the correct one.
wheel.dist_info_filenames # Searches the list of filenames it's stored internally and returns the correct ones.
# Equivalent to installer.sources.WheelFile.read_dist_info, but letting the caller do the I/O
zfp.read(wheel.dist_info_filename("METADATA"))
# Equivalent to installer.sources.WheelFile.validate_record(validate_contents=True) but
# letting the caller do the I/O.
#
# Note: Context manager and for loop only need for the validate_contents = True case
with wheel.record_validator(zfp.read(wheel.dist_info_filename("RECORD"))) as validator:
for file_validator in validator:
with zfp.open(file_validator.filename, "r") as stream:
for chunk in stream.read(COPY_BUFSIZE):
_chunk = file_validator.read(chunk) # Does no I/O, just updates the internal hash and returns the chunk
# Upon validator.__exit__ all of the hashes will get finalized and we'll collect our issues
# Equivalent to installer.sources.WheelFile.get_contents but letting the caller do the I/O
for filename in wheel.get_filenames(zfp.read(wheel.dist_info_filename("RECORD"))):
zfp.read(filename) # ... Or use zfp.open or whatever Like all Sans I/O APIs, it's a little more verbose than one that is married to a specific I/O implementation, but it's not like.. terrible? It seems perfectly fine, and most of these things actually exist in installer already (particularly the reading of chunks to validate the files in records) but by turning the API "inside out" we divorce the logic of what we're doing from the logic of how I/O happens. Making it possible to drive it from an asynchronous I/O case, but also making testing significantly easier since we separate it from the need to do I/O at all. In fact, you can see that the You can tweak the exact API a bit, for instance you could have the various methods that need to access the list of filenames to each pass in the filenames rather than requiring it passed into the constructor... or you could also record the Of course, given a import zipfile
class WheelFile:
def __init__(self, zfp: zipfile.ZipFile):
self._zipfile = zfp
self._wheel = SansIOWheelFile(os.path.basename(zfp.filename), zfp.namelist())
@property
def dist_info_dir(self):
return self._wheel.dist_info_dir
@property
def dist_info_filenames(self):
return self._wheel.dist_info_filenames
def read_dist_info(self, filename):
self._zipfile.read(self._wheel.dist_info_filename(filename))
def validate_record(self):
with wheel.record_validator(zfp.read(wheel.dist_info_filename("RECORD"))) as validator:
for file_validator in validator:
with self._zipfile.open(file_validator.filename, "r") as stream:
for chunk in stream.read(COPY_BUFSIZE):
file_validator.read(chunk)
def get_contents(self):
for filename in wheel.get_filenames(zfp.read(wheel.dist_info_filename("RECORD"))):
yield self._zipfile.open(filename, "r") |
@dstufft , the ones you wrote, plus:
Lastly:
I think this thread started from a comment in the setuptools issue tracker, when @agronholm and I were discussing moving Other part of the discussion that preceeded this thread is that, This is related to some existing functionality in Anyway, this is the final requirement:
|
Sounds good so far! Having a sans-I/O base implementation, plus a helper for actually working with wheels in the file system would work great in my opinion. I'm not opposed to an async helper either if someone wants that.
Just to be clear, the current machine should preferably be a non-factor when creating wheels. I would like the implementation to determine the platform tags from the metadata and files being presented to the API, if possible. |
I assume the inputs for creating a wheel are the files to include + the relevant metadata files, sans the wheel specific ones like import contextlib
import os
import zipfile
from shutil import COPY_BUFSIZE
# Kinda sucks I feel like read/create should be different classes, but both kinda
# want to be called WheelFile
from something import SansIOWheelMaker
# Maybe we can have some defaults for the various tags? Or maybe we should
# require them to be passed in. Probably require them to be passed in? I definitely
# think any serious tag generation should live outside this, and get passed in.
#
# This should also include stuff like the Generator, If root should be purelib,
# etc. Basically all the stuff that goes into WHEEL and/or influences where stuff
# gets placed in the wheel.
wheel = SansIOWheelMaker("myproject", "myversion", python_tag="...")
with contextlib.ExitStack() as stack:
# The library has computed the wheel filename for us, so we can just go
# ahead and open it.
wheelfile = stack.enter_context(zipfile.ZipFile(wheel.filename, "w"))
# Real things will probably do something smarter than just scandir, like
# checking for files committed in VCS, or looking at the package names or
# whatever.
for entry in stack.enter_context(os.scandir("build/")):
# Zip files can contain directories, but the RECORD file doesn't mention
# them, so it's unclear if wheels can/should contain directory records
# or not, and if they do what they should be in RECORD, but if they
# can we can just do the right thing here.
if entry.is_dir():
continue
# We'll need to open the existing file, and also the zip file to add it
with contextlib.ExitStack() as fstack:
# In reality, we'd have to have logic to determine if this is a purelib
# or something else.
file = fstack.enter_context(wheel.add_file(entry.name, "purelib"))
sfp = fstack.enter_context(open(entry.path, "rb"))
zfp = fstack.enter_context(wheelfile.open(file.name, "w"))
while chunk := sfp.read(COPY_BUFSIZE):
# file.write does not hold onto any data, it's just taking chunks
# so that it can run them through a hasher for RECORD, and returns
# the chunk so the caller can do something with (presumably write
# it to a zip file).
zfp.write(file.write(chunk))
# When the `file` context manager ends, the RECORD file hash will
# be finalized. We could make it not a context manager, and just
# finalize whenever we emit the RECORD file, or offer both.
# We also need to pass the metadata in somehow, which we could make a
# different function for it, or we could just use a "metadata" sigil to
# wheel.add_file. Here we pick a different function, because there's no
# promise that sysconfig schemes don't add a `metadata` type in the future.
#
# Another option for METADATA would be to have it take a Metadata class
# instance, or possibly even have the SansIOWheelMaker take it and it can
# use that for the project name / version / etc, then METADATA can be treated
# like WHEEL and RECORD is, as "finalizing" files.
with contextlib.ExitStack() as fstack:
file = fstack.enter_context(wheel.add_metadata_file("METADATA"))
zfp = fstack.enter_context(wheelfile.open(file.name, "w"))
zfp.write(file.write(b"...."))
# After this, calling anything to add any kind of file to the wheel should
# be an error, as should trying to write to any existing file objects, etc.
# This emits the RECORD and WHEEL file (and METDATA, or any other future
# files that may be mandatory).
#
# We could make SansIOWheelFile a context manager and have its __exit__
# error if this method hasn't been called.
for filename, filedata in wheel.finalize():
with wheelfile.open(filename, "w") as zfp:
zfp.write(filedata) I think that API looks perfectly fine? What's interesting is I tried to make a synchronous API like I did for reading, and that actually feels much harder to do in a generic way, because there's a lot of little fiddly bits that are hard to get right generically, because for each file you basically need to know:
However, that becomes hard to do without ending up with API that looks similar to the above or that makes assumptions about the above based on something. I kind of sketched it out the best I could, and came up with this: class WheelMaker:
def add_files(self, path, callback):
with os.scandir(path) as it:
for entry in it:
if entry.is_dir():
continue
if meta := callback(path, entry.name):
scheme, arcname = meta
with contextlib.ExitStack() as fstack:
file = fstack.enter_context(self._wheel.add_file(arcname, scheme))
sfp = fstack.enter_context(open(entry.path, "rb"))
zfp = fstack.enter_context(self._zipfilefile.open(file.name, "w"))
while chunk := sfp.read(COPY_BUFSIZE):
zfp.write(file.write(chunk)) You would have to pass a callback that ws something like: def callback(base, filepath):
if filepath.endswith("*.pyc"): # Check in VCS, or whatever makes sense
return None
# (Scheme, Filepath Relative To Scheme Dir)
return "purelib", filepath The real callback would likely be much more complicated than that of course, and it works? But it assumes that you have the files on disk in a layout that makes sense to be shuffled right into the wheel (which may be a reasonable assumption?) or that you will call Another option is to add (either instead of, or in addition to) a class WheelMaker:
def add_files(self, path, callback):
with os.scandir(path) as it:
for entry in it:
if entry.is_dir():
continue
if meta := callback(path, entry.name):
scheme, arcname = meta
self.add_file(entry.path, arcname, scheme)
def add_file(self, path_or_obj, arcname, scheme):
with contextlib.ExitStack() as fstack:
if isinstance(path_or_obj, str): # PathLike I guess?
sfp = fstack.enter_context(open(path_or_obj, "rb"))
else:
sfp = path_or_obj
file = fstack.enter_context(self._wheel.add_file(arcname, scheme))
zfp = fstack.enter_context(self._zipfilefile.open(file.name, "w"))
while chunk := sfp.read(COPY_BUFSIZE):
zfp.write(file.write(chunk)) Even with all of that, you still need a way to add metadata files... which I guess you could just treat as a special type of scheme? But like earlier there's no guarantee that "metadata" doesn't become a scheme name in the future. We could add an class WheelMaker:
def add_metadata_file(self, filename, path_or_obj):
with contextlib.ExitStack() as fstack:
if isinstance(path_or_obj, str): # PathLike I guess?
sfp = fstack.enter_context(open(path_or_obj, "rb"))
else:
sfp = path_or_obj
file = fstack.enter_context(self._wheel.add_metadata_file(filename))
zfp = fstack.enter_context(self._zipfile.open(file.name, "w"))
while chunk := sfp.read(COPY_BUFSIZE):
zfp.write(file.write(chunk)) Since we can do synchronous I/O, we wouldn't need the Oh, but in the above I was assuming that they could pass us a fp that we could wrap with a Anyways, all together, the sync API use could end up looking something like: import contextlib
import io
from something import WheelMaker
class FileFilter:
def __init__(self, *args, **kwargs):
# Some information stored here like the VCS root, or the package name
# or whatever this caller uses to classify and filter files.
pass
def __call__(self, base, filepath):
# Real filtering/classification Logic
pass
wheel = WheelMaker("myproject", "myversion", python_tag="...")
with contextlib.ExitStack() as stack:
wfp = stack.enter_context(open(wheel.filename, "wb"))
# This returns an object, and we probably move all of the add_* stuff to an
# inner class that is returned from __enter__, unless we want to just return
# self and have all of the add_* methods have to check if we have a file
# we're operating on at the moment.
wheel = stack.enter_context(wheel.open(wfp))
# Assume that we have a build directory that has stuff in the correct place
# since that is our best case, more complicated cases will have to use the
# wheel.add_file() directly.
wheel.add_files("build/", FileFilter())
# This file is generated in memory or whatever, for some reason it's not in
# the build directory
wheel.add_file(io.BytesIO(b"..."), "mypkg/__generated__.py", "purelib")
# Add whatever metadata you want.
wheel.add_metadata_file("METADATA", io.BytesIO(b"..."))
# When the wheel context manager exits, it will write the RECORD and WHEEL and
# any other files to the archive, then close the internal zipfile, the underlying
# file object will still be open, which in the above will be closed by ExitStack
# next. It gets a little awkward due to the circular dependency between the |
I saw #805, and wanted to highlight that there is a Wheel editing API in I would be happy to discuss the api in more detail if there is interest (at the least, I think it is interesting to consider when designing a wheel api from scratch). An example of its use for modifying wheel metadata can be found at https://github.com/wimglenn/setuptools-ext/blob/0.9/setuptools_ext.py#L221-L233. Finally, it is a bit of a surprise to me that we would go from a specialised project ( |
My question is, since it's not a lot of code, why should we want to have so many scattered packaging infrastructure projects? I'm trying to undo at least some of the insanity. You have so far given no rationale for putting this interface to If you have suggestions/criticism about the API itself, feel free to voice your opinions in the PR. |
I'm the maintainer of
wheel
. I'm wondering if you would be interested in receiving an API for reading and writing wheels in this project. I've been wondering for a while if it makes any sense for PyPA to have so many packaging projects.This is what I had in mind:
wheel
setuptools
would adopt this API and ditchwheel
as a dependency for building wheelswheel
would depend onpackaging
at run time (currently it vendors this project) and become nothing but a CLI toolauditwheel
andwheel
would be merged (preferably intowheel
) (optional, to be discussed later)The goal is to get rid of at least one PyPA project.
The text was updated successfully, but these errors were encountered: