Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clickhouse tables #24

Open
sylviesworld opened this issue Feb 8, 2021 · 10 comments
Open

Clickhouse tables #24

sylviesworld opened this issue Feb 8, 2021 · 10 comments
Labels
clickhouse enhancement New feature or request

Comments

@sylviesworld
Copy link
Contributor

@audrism Can you update the Clickhouse Tables?

@audrism
Copy link
Collaborator

audrism commented Feb 8, 2021

@dlomax64 could you remind what they should contain (i.e. schema)?

@sylviesworld
Copy link
Contributor Author

I think that the original b2cPtaPkgR_all table will work for what I want to do with it:
b2cPtaPkgR_all:

name type
blob FixedString(20)
commit FixedString(20)
project String
time UInt32
author String
language String
deps String

If I remember correctly, Chris was wanting to be able to get downstream commits from a dependency so this table should be fine for that + some things I want to add for the website.

@audrism
Copy link
Collaborator

audrism commented Feb 8, 2021

TY. Will let you know once done. My plan is to have
c2PtabllfPkgS_all
where 'llf' part stands for language,language,file (the second language is per determination of ctags while the first is based on file extension

name type
commit FixedString(20)
Project String
time UInt32
author String
language String
language-from-ctags String
filename String
deps String

@sylviesworld
Copy link
Contributor Author

Does the language extension and ctags for one commit often differ?

@noah-22
Copy link

noah-22 commented Feb 12, 2021

In my experience with the c2PtabllfPkgS map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.

For records where both language fields exist, they match a strong majority of the time.

Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.

@noah-22
Copy link

noah-22 commented Feb 18, 2021

@audrism Can this table, or some other lookup program, be optimized for deps queries by somehow splitting deps into their own fields?

Currently, searching for all downstream commits for a pkg requires traversing all lines of each c2PtabllfPkgS file.
A mapping ~/lookup/getValues/pkg2P would also be very useful. I can raise a new issue suggesting this if you agree it is of interest, independent from this clickhouse table.

@audrism
Copy link
Collaborator

audrism commented Feb 18, 2021

  1. Got it, that means making a package a first-class object, but is the original table of use (in that case it is not a CH topic)?
  2. just added to CH commit_all: the schema is (tc it commit time (unix seconds), taz is author timezone and tcz is commiter tz)
    (sha1 FixedString(20), time Int32, tc Int32, tree FixedString(20), parent String, taz String, tcz String, author String, commiter String, project String, comment String

@audrism
Copy link
Collaborator

audrism commented Feb 18, 2021

In my experience with the c2PtabllfPkgS map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.

For records where both language fields exist, they match a strong majority of the time.

Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.

I included ctags classification mostly as a way to see where language types do not match, it may make sense to use linguist tool to identify language in such cases

@audrism audrism added the enhancement New feature or request label Feb 18, 2021
@noah-22
Copy link

noah-22 commented Feb 19, 2021

I do not imagine the "original" c2PtabllfPkgS being useful for Clickhouse since it is a time series oriented database. Most queries on time series (in the work I've done) aim to collect all commits, authors, or blobs from a project in a set period of time. The c2PtabllfPkgS map would then report an inflated count for commits in the given period of time because some commits appear multiple times in the table (assuming records in the table would be identical to the existing files at /da1_data/basemaps/gz/c2PtabllfPkgFullS{0..127}.s).

What would be very useful is a similar table that uses Project as the key, such as p2cbtaPkg to allow for an accurate query of commits/blobs/authors in a given time frame per project.

@audrism
Copy link
Collaborator

audrism commented Mar 3, 2021

Apart from the commit detail (and project) table in the CH noted above, it is not entirely clear what to add
@noah-22 @dlomax64

a) One idea is to have monthly commit/blob counts (see issue #25)
b) it seems APIs need to be firs-class object, so CH may not be a good target: create TCH table?
c) I am thinking blob to first author/commit may be good CH use case since it has time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants