GitHub tag matching #103

Draft
wants to merge 15 commits into
base: main

Conversation

16Martin
Collaborator

This PR addresses #99 and introduces code intended to replace the current combination of get_github_info() followed by get_matching_tag() which exists only in capycli.bom.findsources.

This approach first tries to match a tag using the original get_matching_tag(). If all the guessing does not yield any results, the algo implicitly falls back to analyzing each tag with get_matching_tag().

@16Martin
Collaborator Author

16Martin commented Nov 16, 2024

I cannot make the unittest work.

In test_find_golang_url_github() (https://github.com/sw360/capycli/blob/main/tests/test_find_sources.py#L329-L339) we expect the result of find_golang_url() to be 'https://pkg.go.dev/github.com/opencontainers/runc'. I do not understand this.

find_golang_url() has essentially two ways to set source_url, the variable that it ultimately returns:

  1. if len(split_version) == 3, source_url contains /archive/
  2. in any other case, source_url is set to the result of get_matching_tag(), which is mocked to 'https://github.com/opencontainers/runc/archive/refs/tags/v1.0.1.zip' and therefore also contains /archive/

In addition, every non-empty result of get_matching_tag() ends with .zip, so neither path looks like it can produce the expected pkg.go.dev URL.

How and why does this test work?

As a user, I would not feel good about getting https://pkg.go.dev/github.com/opencontainers/runc as a source URL, but maybe the variable names are misleading?

@16Martin
Collaborator Author

Found it. The order in which the unittest sets up its mocks does not match the mock naming.
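
One common way such a mismatch arises is with stacked unittest.mock.patch decorators: they are applied bottom-up, so the first mock argument belongs to the lowest decorator. The sketch below assumes that mechanism; the comment above does not say how the test actually sets up its mocks, and the Finder class, its methods, and the argument names are stand-ins, not capycli code.

```python
from unittest import mock


class Finder:
    """Stand-in class (hypothetical); not the real capycli FindSources."""

    def get_github_info(self):
        return "real-info"

    def get_matching_tag(self):
        return "real-tag"


# Decorators are applied bottom-up, so the FIRST argument is the mock for
# get_matching_tag (lowest decorator), not for get_github_info.
@mock.patch.object(Finder, "get_github_info")
@mock.patch.object(Finder, "get_matching_tag")
def check(mock_first, mock_second):
    mock_first.return_value = "A"
    mock_second.return_value = "B"
    finder = Finder()
    return finder.get_matching_tag(), finder.get_github_info()


print(check())
```

If the arguments were named after the top decorator first, the return values would end up on the wrong functions while the test still runs without any error.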

16Martin force-pushed the martin/fix-github-tag-matching branch from 5c9acdb to 4730fa3 on November 19, 2024 17:15
@16Martin
Collaborator Author

Some functionality has been moved to protected methods in order to split the task into smaller, more focused parts. There is only one new public method: FindSources.version_to_github_tag().

The core of this PR is the lines in https://github.com/sw360/capycli/blob/martin/fix-github-tag-matching/capycli/bom/findsources.py#L278-L290.

While the current approach first fetches all tags that belong to a specific project and then passes the full list to get_matching_tag(), the new approach calls get_matching_tag() for each tag on each page of tags. As a result, if the new tag guessing does not yield any matches, the algo will match (or not match) the same tag the current approach matches.

The most important additions are lines 290 and following. If get_matching_tag() was unable to match the version to the tag, the logic generates eight possible candidates from the version and then checks if any of these candidates exist in the repo, before moving on to the next tag in the list.

The logic that generates the candidates is (to some extent) the inverse of to_semver_string(). The idea is that if any of these candidates exists, get_matching_tag() would yield a positive match.
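
To make the idea of "inverting" a version into tag guesses concrete, here is a purely illustrative sketch. The eight candidates the PR actually generates are not listed in this description, so the prefixes, separators, and the short form used below are assumptions, and gen_tag_candidates is a hypothetical name.

```python
from itertools import product


def gen_tag_candidates(version: str) -> list[str]:
    """Hypothetical: turn a version such as '1.2.0' into plausible tag names."""
    parts = version.split(".")

    # Short form without trailing zero components: '1.2.0' -> '1.2'
    short = list(parts)
    while len(short) > 1 and short[-1] == "0":
        short.pop()

    candidates: list[str] = []
    # Assumed variations: optional 'v' prefix, '.' or '_' separator,
    # full or shortened version.
    for prefix, sep, comps in product(("v", ""), (".", "_"), (parts, short)):
        tag = prefix + sep.join(comps)
        if tag not in candidates:  # keep order, drop duplicates
            candidates.append(tag)
    return candidates


print(gen_tag_candidates("1.2.0"))
```

With these assumed variations, a version with a trailing zero yields eight distinct guesses, while a version such as 1.2.3 yields only four because the short form equals the full form.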

The algo then looks for each candidate in the current result page. If that local lookup does not yield a match, the algo queries the GitHub API and specifically asks whether a tag with the candidate's name exists. If we find a match through either of these two lookups, we use that match and stop the search.
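
The search order described above can be sketched roughly as follows. This is a simplified model and not the code from findsources.py: find_tag and its parameters are hypothetical, the matcher, candidate generator, and API lookup are passed in as plain callables, and pages is any lazy iterable of tag-name lists.

```python
from typing import Callable, Iterable, List, Optional


def find_tag(
    version: str,
    pages: Iterable[List[str]],
    match: Callable[[str, str], bool],        # stands in for get_matching_tag()
    candidates: Callable[[str], List[str]],   # stands in for the tag guessing
    tag_exists: Callable[[str], bool],        # stands in for the GitHub API query
) -> Optional[str]:
    for page in pages:                        # pages are consumed lazily
        for tag in page:
            if match(tag, version):
                return tag
            for cand in candidates(version):
                # Local lookup in the current page first, then the API.
                if cand in page or tag_exists(cand):
                    return cand
    return None


fetched = []


def fake_pages():
    for number, page in enumerate([["release-1", "v1.2"], ["v1.1"]]):
        fetched.append(number)                # record which pages were requested
        yield page


result = find_tag(
    "1.2.0",
    fake_pages(),
    match=lambda tag, version: tag == version,
    candidates=lambda version: ["v1.2"],
    tag_exists=lambda cand: False,
)
print(result, fetched)
```

Because the pages are consumed lazily and the search returns on the first hit, a successful guess on the first page means the second page is never requested.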

With my BOMs, I notice a tremendous speedup. On average, the guessing part finds a positive match immediately on the first results page. If it doesn't, the API query is successful. Using my BOMs, the algo never fetches the second page of tags from GitHub.

* I dubbed the original implementation version_to_github_tag, but on success
  it would actually return a source url, not a GitHub tag.
  => rename in allusion to get_matching_tag() which it aims to replace
* moved tag guessing heuristic to its own method _gen_tags()
* introduced TagCache to avoid throwing the same bad guesses at the GitHub
  API over and over again. It is used transparently in _gen_tags(). This
  means it is perfectly viable for _gen_tags() to return an empty list.
* also, addressed the mypy shenanigans
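
A minimal sketch of how such a cache could work, assuming it simply remembers (project, tag) pairs that are known not to exist. The class body, the method names add and filter, and the two-guess gen_tags helper are hypothetical and only illustrate why an empty result is a valid outcome.

```python
class TagCache:
    """Hypothetical stand-in: remembers guesses known not to exist."""

    def __init__(self) -> None:
        self._missing: set[tuple[str, str]] = set()

    def add(self, project: str, tag: str) -> None:
        """Record that `tag` was not found in `project`."""
        self._missing.add((project, tag))

    def filter(self, project: str, tags: list[str]) -> list[str]:
        """Drop guesses that already failed for this project."""
        return [t for t in tags if (project, t) not in self._missing]


def gen_tags(cache: TagCache, project: str, version: str) -> list[str]:
    # Assumed, simplified guessing: with and without a 'v' prefix.
    guesses = [f"v{version}", version]
    return cache.filter(project, guesses)


cache = TagCache()
first = gen_tags(cache, "org/repo", "1.0")   # nothing cached yet
for guess in first:
    cache.add("org/repo", guess)             # pretend every guess failed
second = gen_tags(cache, "org/repo", "1.0")  # all guesses filtered out
print(first, second)
```

Once every guess for a project has failed, the generator returns an empty list and no further API requests are spent on that version.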
16Martin force-pushed the martin/fix-github-tag-matching branch from 9116a54 to 5dd3cbb on November 20, 2024 12:51