Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util.parse should support accessing tags, which would require a less restrictive regex #299

Closed
gaurav opened this issue Dec 17, 2024 · 2 comments · Fixed by #300
Closed

Comments

@gaurav
Copy link

gaurav commented Dec 17, 2024

util.parse() currently assumes that the branch name is "Only letters, digits, underscores and dash, no leading dash."

# First regex reflects the lakeFS repository naming rules:
# only lowercase letters, digits and dash, no leading dash, minimum 3, maximum 63 characters
# https://docs.lakefs.io/understand/model.html#repository
# Second regex is the branch: Only letters, digits, underscores and dash, no leading dash.
path_regex = re.compile(r"(?:lakefs://)?([a-z0-9][a-z0-9\-]{2,62})/(\w[\w\-]*)/(.*)")
results = path_regex.fullmatch(path)
if results is None:
raise ValueError(
f"expected path with structure lakefs://<repo>/<ref>/<resource>, got {path!r}"
)
repo, ref, resource = results.groups()
return repo, ref, resource

However, users might want to use a tag name instead of a branch name, which has a much more complicated syntax. Notably, the current regex doesn't match the two example tags in the LakeFS docs (v2.3 and dev-jane-before-v2.3-merge). It might also be useful to support ref expressions.

I would suggest making the regex less restrictive -- at a minimum (\w[\w\-\.]*), but possibly going all the way to support ref expressions to include [~^]. The error could also be rewritten to be a bit clearer (e.g. because lakefs.ls(...) calls this after calling _strip_protocol(), I get the following error message: ValueError: expected path with structure lakefs://<repo>/<ref>/<resource>, got 'heal-studies/v2.0/' -- which made me think that I had forgotten to add the lakefs:// prefix before realizing that the branch was being rejected for being incorrect).

Alternatively, if I'm misunderstanding how tags and ref expressions should be used in lakefs-spec, please let me know -- and thank you so much for building this great library!

@nicholasjng
Copy link
Collaborator

Thank you for the detailed report!

Since we want to fully implement lakeFS's URI specification, we must also match their support for ref expressions. Your regex suggestion sounds great, would you be willing to send a PR?

I agree that the error message is a bit sparse, and even misleading in the case you presented. We might need to check our callsites for the order of util.parse() and _strip_protocol() - alternatively, we could decide to split the protocol off the template path if the input doesn't have one, so we don't confuse people with the protocol unnecessarily.

@gaurav
Copy link
Author

gaurav commented Dec 22, 2024

This works great now, thanks so much for the quick fix!

gaurav added a commit to helxplatform/dug-data-ingest that referenced this issue Jan 14, 2025
Given a list of LakeFS repositories and LakeFS authentication information, this script will check to see if there are any duplicate study IDs amongst dbGaP XML files anywhere in those repositories. See https://renci.atlassian.net/browse/DUG-374 for details. LakeFS authentication information can be provided in one of the [two ways supposed by the LakeFS library and client](https://lakefs.io/blog/improved-python-experience/#configuring-a-lakefs-client):
1. By creating an ~/.lakectl.yaml file with the authentication information.
2. By setting the LAKECTL_SERVER_ENDPOINT_URL, LAKECTL_CREDENTIALS_ACCESS_KEY_ID and LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY environment variables.

We require at least `lakefs-spec ~= 0.11.1`, because otherwise [`.`s in tags are not supported](aai-institute/lakefs-spec#299) (implemented in [lakefs-spec v0.11.1](https://github.com/aai-institute/lakefs-spec/releases/tag/v0.11.1)).

This PR also removes Avalon from requirements.txt, since we no longer use that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants