feat(gist): fsspec file system for GitHub gists (resolves #888) #1791

Open: wants to merge 3 commits into base: master

@lmmx lmmx commented Feb 16, 2025

This PR introduces a new filesystem backend, GistFileSystem, which allows read-only access to files within a single GitHub Gist (as suggested in #888). I'd find this really useful in combination with Universal Pathlib (also an fsspec project)!

  • Gists are essentially flat collections of files, so there is no subdirectory concept. (Technically they are git repos that can also store directories, but we only need to support them as flat file lists, which is all the website UI shows.)
  • The implementation is closely based on GithubFileSystem but simplified for a single gist.
  • Supports both public and private gists; the latter need a username and token (PAT).

Users can do:

import fsspec

# For a public gist
fs = fsspec.filesystem("gist", gist_id="729837f14264089288178a5f632221ab")
print(fs.ls(""))  # lists files
with fs.open("test1.txt", "rb") as f:
    print(f.read().decode())

For a private gist, do the same but also pass the username and token arguments.
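
As a rough sketch of the private-gist case: this assumes the keyword arguments are named `username` and `token`, as the description suggests, and that they must be supplied together. `gist_kwargs` and `open_gist` are hypothetical helpers for illustration, not part of the PR.

```python
def gist_kwargs(gist_id, username=None, token=None):
    """Build constructor kwargs for the proposed "gist" filesystem.

    Hypothetical helper: username and token are described as going
    together, so reject one without the other before any request is made.
    """
    if (username is None) != (token is None):
        raise ValueError("username and token must be given together")
    kwargs = {"gist_id": gist_id}
    if token is not None:
        kwargs.update(username=username, token=token)
    return kwargs


def open_gist(gist_id, username=None, token=None):
    """Create the filesystem (needs network access and an fsspec with this backend)."""
    import fsspec  # imported lazily so gist_kwargs has no dependencies

    return fsspec.filesystem("gist", **gist_kwargs(gist_id, username, token))
```

With this in place, `open_gist("<gist_id>", username="<user>", token="<PAT>")` would return a filesystem on which `ls("")` and `open(...)` behave as in the public example above.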

  • Implemented the filesystem as read-only (methods: ls, _open, cat, invalidate_cache)
  • Added to registry (alphabetically)
  • Added a test. I changed the gist ID to one owned by martindurant, so your tests don't depend on someone else preserving their gist; it's also fairly small, so it shouldn't be slow to load.
  • Added documentation in docs/source/api.rst.
  • Verified that read-only operations (ls, cat, open) are working with public gists.

Example usage

Below is a short snippet showing how to retrieve files from a public gist:

import fsspec

gist_id = "16bee4256595d3b6814be139ab1bd54e"
print("Gist ID:", gist_id)
fs = fsspec.filesystem("gist", gist_id=gist_id)
file_list = fs.ls("")
print("Files in the Gist (via fsspec):", file_list)
contents = fs.cat(["gistfile1.txt"])
print(contents["gistfile1.txt"].decode()[:120] + "\n...")

Gist ID: 16bee4256595d3b6814be139ab1bd54e
Files in the Gist (via fsspec): ['gistfile1.txt']
import astropy.io.fits._tiled_compression as tiled
from astropy.io import fits
import numcodecs
import fsspec
import zar
...

@martindurant
Member

Thanks for providing! I haven't had a chance to look yet, but I will soon :)

@lmmx
Author

lmmx commented Feb 22, 2025

Most welcome, no worries! 😃

@martindurant
Member

Quick suggestion: it would be good to enable bundling the gist ID with the URL:

with fsspec.open("gist://16bee4256595d3b6814be139ab1bd54e@/test1.txt", "rb") as f:
    data = f.read()

as github: allows. It would require extracting kwargs from the URL.

@martindurant martindurant (Member) left a comment

As I read through, I found that you answered almost all of my questions elsewhere in the code :)

I don't think there's a good way to test this thoroughly, but at least we can reasonably expect gist to be available whenever GHA is running.

import fsspec


@pytest.mark.parametrize("gist_id", ["16bee4256595d3b6814be139ab1bd54e"])
Member

Perhaps a good idea to pin to a specific hash, so that the values can be tested.
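
One way pinning could look, sketched under assumptions: GitHub's REST API has a "get a gist revision" endpoint of the form `/gists/{gist_id}/{sha}`, so a pinned request only differs in its URL. Whether the PR's implementation builds its URLs this way is not shown here; `gist_api_url` and its `sha` parameter are hypothetical names.

```python
GITHUB_API = "https://api.github.com"


def gist_api_url(gist_id, sha=None):
    """Return the REST URL for a gist, optionally pinned to one revision.

    Hypothetical helper: without `sha` the URL points at the latest
    revision; with `sha` it points at that exact revision, so the file
    contents a test compares against cannot change underneath it.
    """
    url = f"{GITHUB_API}/gists/{gist_id}"
    if sha is not None:
        url += f"/{sha}"
    return url
```

A parametrized test could then pass both the gist ID and a fixed revision hash and assert on the exact file contents.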

r.raise_for_status()
return MemoryFile(path, None, r.content)

def cat(self, path, recursive=False, on_error="raise", **kwargs):
Member

I'm surprised this doesn't work already with the parent version in AbstractFileSystem

Parse 'gist://' style URLs into GistFileSystem constructor kwargs.
For example:
gist://:TOKEN@<gist_id>/file.txt
gist://username:TOKEN@<gist_id>/file.txt
Member

Does this already allow for my suggestion that fsspec.open() should work in one go? It looks like it should. One more thing we might want to care about here is the commit hash: do you think it can be added, or does it need to be kwarg-only?
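
For illustration, the two URL forms in the docstring excerpt can be parsed with the standard library alone. In fsspec this kind of logic conventionally lives in a filesystem's `_get_kwargs_from_urls` static method; the standalone function below is only a sketch, not the PR's code, and `kwargs_from_gist_url` is a hypothetical name. It does not handle a commit hash or the `gist://<gist_id>@/file` form suggested earlier.

```python
from urllib.parse import urlsplit


def kwargs_from_gist_url(url):
    """Split a gist:// URL into (constructor kwargs, file path).

    Handles gist://<gist_id>/file, gist://:TOKEN@<gist_id>/file and
    gist://username:TOKEN@<gist_id>/file.
    """
    parts = urlsplit(url)
    if parts.scheme != "gist":
        raise ValueError(f"not a gist URL: {url!r}")
    # Note: urlsplit lowercases the host part, which is harmless for
    # lowercase hexadecimal gist IDs.
    kwargs = {"gist_id": parts.hostname}
    # An empty username (as in gist://:TOKEN@...) is treated as "not given".
    if parts.username:
        kwargs["username"] = parts.username
    if parts.password:
        kwargs["token"] = parts.password
    return kwargs, parts.path.lstrip("/")
```

Returning the kwargs separately from the path is what would let a single `fsspec.open("gist://...")` call construct the filesystem and open the file in one go.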

GitHub username for authentication (required if token is given).
token : str (optional)
GitHub personal access token (required if username is given).
timeout : (float, float) or float, optional
Member

Perhaps more general set of arguments to pass to requests?
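
A minimal sketch of what a more general argument set might look like, assuming the filesystem forwards extra keyword arguments to `requests.get`. The `(60, 60)` default comes from the PR excerpt; `build_request_kwargs` and `request_kw` are hypothetical names.

```python
DEFAULT_TIMEOUT = (60, 60)  # (connect, read) seconds, the default shown in the PR


def build_request_kwargs(timeout=None, **request_kw):
    """Merge the timeout default with any extra keyword arguments.

    The result is meant to be splatted into a call such as
    requests.get(url, **kwargs), so callers could also pass options
    like headers=, proxies= or verify= without new constructor parameters.
    """
    merged = {"timeout": timeout if timeout is not None else DEFAULT_TIMEOUT}
    merged.update(request_kw)
    return merged
```

Keeping `timeout` as a named parameter preserves the existing signature while the catch-all covers everything else.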

self.timeout = timeout if timeout is not None else (60, 60)

# We use a single-level "directory" cache, because a gist is essentially flat
self.dircache[""] = self._fetch_file_list()
Member

It is unusual to fetch file listings on init, but I think it makes sense in this case.

Clear the dircache. If path is given, we could refetch—but for gist,
we typically refetch everything in one shot anyway.
"""
self.dircache.clear()
Member

Interesting way to "refresh". There could be a refresh= kwarg on ls(); but actually I'd be happy to treat the dircache as immutable for the purposes of this implementation.
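
The two refresh styles discussed here can be compared in a toy model. This is a sketch only: `FlatListingCache` is a hypothetical stand-in that takes a `fetch` callable instead of calling the GitHub API, and it is not the PR's `GistFileSystem`.

```python
class FlatListingCache:
    """Toy model of a single-level listing cache for a flat gist."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable returning a list of file names
        # Fetched eagerly on construction, mirroring the PR excerpt.
        self.dircache = {"": fetch()}

    def ls(self, refresh=False):
        # refresh=True forces a refetch; an emptied cache refetches lazily.
        if refresh or "" not in self.dircache:
            self.dircache[""] = self._fetch()
        return list(self.dircache[""])

    def invalidate_cache(self):
        # Clearing the cache makes the next ls() fetch again.
        self.dircache.clear()
```

Treating the cache as immutable would amount to dropping both `refresh` and `invalidate_cache` and always returning the listing fetched on init.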

@martindurant
Member

Please ping me when I should have another look

@lmmx
Author

lmmx commented Feb 28, 2025

Thanks for reviewing, Martin. I got sidetracked in a CI-fixing rabbit hole this week, but I've thankfully emerged and can return to this!

> Please ping me when I should have another look

Will do 🫡

2 participants