Commits SHAs from p2c not indexed? #9

Open
k----n opened this issue Oct 26, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

Contributor

k----n commented Oct 26, 2020

For example, I run echo "liferay_liferay-portal" | ~/lookup/getValues -f p2c | grep 4ec29d1bdde625673f844e2a44cc7d9095253b35 and get a match, which means that commit 4ec29d1bdde625673f844e2a44cc7d9095253b35 should exist.

This is what happens when I run the following:

> echo "4ec29d1bdde625673f844e2a44cc7d9095253b35" | ~/lookup/getValues c2ta
no 4ec29d1bdde625673f844e2a44cc7d9095253b35 in /data/basemaps/c2taFullS

> echo "4ec29d1bdde625673f844e2a44cc7d9095253b35"  | ~/lookup/showCnt commit
no commit 4ec29d1bdde625673f844e2a44cc7d9095253b35 in 78
Contributor Author

k----n commented Oct 26, 2020

The commit also still exists on github: liferay/liferay-portal@4ec29d1

Collaborator

audrism commented Nov 16, 2020

4ec29d1bdde625673f844e2a44cc7d9095253b35 is a regular commit (not among the bad commits listed in woc.pm) but is indeed missing. The repo is updated regularly (and was updated for version S), but that specific commit was lost in the process, so hopefully it will be extracted successfully during the next collection.

Contributor Author

k----n commented Dec 10, 2020

@audrism I've also come across commits that exist but have no project associated with them.

Are you interested in the list of commits?

Collaborator

audrism commented Dec 11, 2020

Getting a list is easy:
for inPonly:

join -v1 <(zcat c2PFullS0.s|uniq) <(zcat c2datFullS0.s| cut -d\; -f1)

for inConly:

join -v2 <(zcat c2PFullS0.s|uniq) <(zcat c2datFullS0.s| cut -d\; -f1)

What would be helpful is a script or audit process that tries to recover missing commits for inPonly
and recovers projects for orphaned commits in inConly.

While the first is straightforward if the git repo is still online and has not been compacted, the second is trickier:

use ghtorrent/SwHeritage?
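
The two lists can be tried out on toy data first. The following is only a sketch: the file names, SHAs, and field layout are made up, and it assumes both maps are semicolon-delimited with the commit SHA as the first field, so both are reduced to sorted commit keys before `join` compares them.

```shell
# Hypothetical sample data standing in for the real c2P and c2dat maps.
tmp=$(mktemp -d)
printf '%s\n' 'aaa;proj1' 'bbb;proj1' 'bbb;proj2' 'ccc;proj3' > "$tmp/c2P.txt"
printf '%s\n' 'bbb;1600000000;author' 'ddd;1600000001;author' > "$tmp/c2dat.txt"

# Reduce both maps to sorted, unique commit keys (join needs sorted input).
cut -d';' -f1 "$tmp/c2P.txt"   | LC_ALL=C sort -u > "$tmp/inP.keys"
cut -d';' -f1 "$tmp/c2dat.txt" | LC_ALL=C sort -u > "$tmp/inC.keys"

# inPonly: commits that map to a project but have no commit record.
inPonly=$(LC_ALL=C join -v1 "$tmp/inP.keys" "$tmp/inC.keys")
# inConly: commits that have a commit record but map to no project.
inConly=$(LC_ALL=C join -v2 "$tmp/inP.keys" "$tmp/inC.keys")

echo "inPonly: $inPonly"
echo "inConly: $inConly"
rm -rf "$tmp"
```

`-v1` prints the lines found only in the first input and `-v2` the lines found only in the second, so the same pair of key files yields both lists.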

Contributor Author

k----n commented Dec 11, 2020

ghtorrent and SwHeritage might not cover the most recent commits.

There is a way to search for it on github... but API limits:
https://github.com/search?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Commits

Note that the CI bot has already deleted the branch, but the commit still shows up in a PR:

liferay/liferay-portal@4ec29d1

I guess you could also query https://github.com/<project/user name>/<repo>/commit/<sha1> to see if the commit still exists before exhausting API limits.
e.g. https://github.com/liferay/liferay-portal/commit/4ec29d1bdde625673f844e2a44cc7d9095253b35

But that page doesn't carry the metadata for whether or not the commit belongs to the repo, unlike this link returned by search: https://github.com/liferay/liferay-portal/pull/3498/commits/4ec29d1bdde625673f844e2a44cc7d9095253b35


SwHeritage returns no hits: https://archive.softwareheritage.org/browse/search/?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&with_visit=true&with_content=true&search_metadata=true
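
The commit-page check could be scripted roughly as below. This is a sketch, not a tested recovery tool: the helper names are hypothetical, it assumes a World of Code project name such as liferay_liferay-portal maps to the GitHub path liferay/liferay-portal by turning the first underscore into a slash, and it assumes a missing commit page answers with a non-200 status.

```shell
# Assumption: "owner_repo" in p2c corresponds to "owner/repo" on GitHub,
# so only the first "_" is converted to "/".
woc_to_github() {
  printf '%s\n' "$1" | sed 's|_|/|'
}

# Build the commit page URL from a project name and a SHA.
commit_url() {
  printf 'https://github.com/%s/commit/%s\n' "$(woc_to_github "$1")" "$2"
}

# Succeeds when the commit page answers with HTTP 200.
# Assumption: a missing commit yields some other status (for example 404).
commit_page_exists() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "$(commit_url "$1" "$2")")
  [ "$status" = "200" ]
}

# Prints the URL only; no request is made here.
commit_url liferay_liferay-portal 4ec29d1bdde625673f844e2a44cc7d9095253b35
```

Calling `commit_page_exists liferay_liferay-portal <sha>` would then issue one plain page request per commit without touching the API quota, though these pages are presumably rate limited too.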

Collaborator

audrism commented Dec 12, 2020

So the search is affected by API limits? The URL does not appear to invoke the REST/GraphQL API.

Contributor Author

k----n commented Dec 12, 2020

Search has a limit of 30 requests/min with a token (https://docs.github.com/en/free-pro-team@latest/rest/reference/search#rate-limit).

You can also query for when your rate limit expires: https://docs.github.com/en/free-pro-team@latest/rest/reference/rate-limit

I imagine the lookup to be 2 steps:

  1. 9 non-api endpoints are queried for counts
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Users
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Wikis
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Topics
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Marketplace
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=RegistryPackages
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Discussions
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Code
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Repositories

Where https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues returns 2.

  2. Based on the counts you can then use the official API:
curl \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/search/issues?q=4ec29d1bdde625673f844e2a44cc7d9095253b35"

The 30 requests/min is limiting, and the non-API endpoints are also rate limited (although I'm unsure what that limit is exactly).

Your mileage may vary as well with getting useful results (the example works because the commit sha was included somewhere in the pull request body?). e.g.

"body": "Merging the following commit: [2f586e07928e14a424edfbf3b547a3881ca193f9](https://github.com/liferay/com-liferay-poshi-runner/commit/2f586e07928e14a424edfbf3b547a3881ca193f9)"
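
A rough sketch of how the second step could be gated and paced, assuming the count from the first step has already been obtained somehow. The function names are hypothetical; only the api.github.com/search/issues endpoint and the 30 requests/min figure come from the discussion above.

```shell
# Whole seconds to wait between requests to stay under a per-minute limit,
# computed as ceil(60 / limit).
min_interval() {
  echo $(( (60 + $1 - 1) / $1 ))
}

# Build the official search API URL for a SHA.
api_search_url() {
  printf 'https://api.github.com/search/issues?q=%s\n' "$1"
}

# $1 = SHA, $2 = hit count from the non-API count endpoint.
# Skips the API call entirely when the count is zero.
search_sha() {
  [ "$2" -gt 0 ] || return 1
  curl -s -H "Accept: application/vnd.github.v3+json" "$(api_search_url "$1")"
  sleep "$(min_interval 30)"   # pace calls to about 30 per minute
}
```

With a limit of 30, `min_interval` yields a 2 second pause, so a loop over many SHAs spends API requests only on those that the count step flagged.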

audrism added the enhancement label Feb 4, 2021
Contributor Author

k----n commented Sep 16, 2021

It seems like git clone --mirror <repo> also retrieves more commits.

Collaborator

audrism commented Sep 16, 2021

I use --mirror when cloning as it gets all the branches.
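
One way to check whether a mirror clone actually contains a given commit is sketched below. The `has_commit` helper is hypothetical. A likely reason `--mirror` finds more commits is that it copies every ref the remote advertises, which on GitHub is assumed to include the pull request refs under refs/pull/, so a commit whose branch was deleted can still be reachable.

```shell
# Succeeds when the object exists in the repository and resolves to a commit.
# $1 = path to the (mirror) clone, $2 = commit SHA.
has_commit() {
  git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Usage sketch (needs network access, so it is left commented out):
# git clone --mirror https://github.com/liferay/liferay-portal.git portal.git
# has_commit portal.git 4ec29d1bdde625673f844e2a44cc7d9095253b35 && echo found
```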
