GitHub Stats Demo powered by ClickHouse

[Simple app image]

This repository provides a simple application to show off ClickHouse Cloud query endpoints, allowing users to search for any Python package and retrieve GitHub statistics. The app queries over 7.5 billion rows in ClickHouse, demonstrating real-time analytics on large datasets.

For the accompanying blog post, see Adding Analytics to an Application in under 10 minutes with ClickHouse and Query Endpoints.

If you prefer to watch a video showing the steps to build this application, watch below:

[Video walkthrough]

Prerequisites

  • node >= v20.9.0
  • npm >= v10.1.0

1. Getting started

Users can either sign up for a ClickHouse Cloud account or use the endpoint below to query the public ClickPy service. Note that quotas apply to this public endpoint.

curl -H "Content-Type: application/json" -X 'POST' -s --user 'MdhWYPEpXaqiwGMjbXWT:4b1dKbabyQTvuKUWOnI08oXVbUD4tkaxKKjEwz7ORG' 'https://console-api.clickhouse.cloud/.api/query-endpoints/297797b1-c5b0-4741-9f5b-3d6456a9860d/run?format=JSONEachRow' --data-raw '{"queryVariables":{"package_name":"requests"}}'

If you are using the public endpoint, skip to step (6) and run the app.
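For reference, the same request can be issued from Node v20+ (which provides a global fetch). The sketch below reuses the public credentials and URL from the curl example above; JSONEachRow returns one JSON object per line:

// Minimal sketch: call the public query endpoint from Node v20+ (global fetch).
// Credentials and URL are the public ones from the curl example above.
const url =
  'https://console-api.clickhouse.cloud/.api/query-endpoints/297797b1-c5b0-4741-9f5b-3d6456a9860d/run?format=JSONEachRow';
const auth = Buffer.from(
  'MdhWYPEpXaqiwGMjbXWT:4b1dKbabyQTvuKUWOnI08oXVbUD4tkaxKKjEwz7ORG'
).toString('base64');

const response = await fetch(url, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', Authorization: `Basic ${auth}` },
  body: JSON.stringify({ queryVariables: { package_name: 'requests' } }),
});

// JSONEachRow emits one JSON object per line.
const rows = (await response.text())
  .trim()
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line));
console.log(rows);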

2. Sign up to ClickHouse Cloud

Reproducing this demo with your own cluster requires a ClickHouse Cloud account. Users can sign up here and receive $300 of free credits - more than sufficient for this demo.

3. Load the dataset

Create the supporting databases and tables for this demo:

CREATE DATABASE pypi;

CREATE TABLE pypi.projects
(
    `metadata_version` String,
    `name` String,
    `version` String,
    `summary` String,
    `description` String,
    `description_content_type` String,
    `author` String,
    `author_email` String,
    `maintainer` String,
    `maintainer_email` String,
    `license` String,
    `keywords` String,
    `classifiers` Array(String),
    `platform` Array(String),
    `home_page` String,
    `download_url` String,
    `requires_python` String,
    `requires` Array(String),
    `provides` Array(String),
    `obsoletes` Array(String),
    `requires_dist` Array(String),
    `provides_dist` Array(String),
    `obsoletes_dist` Array(String),
    `requires_external` Array(String),
    `project_urls` Array(String),
    `uploaded_via` String,
    `upload_time` DateTime64(3),
    `filename` String,
    `size` Int64,
    `path` String,
    `python_version` String,
    `packagetype` String,
    `comment_text` String,
    `has_signature` Bool,
    `md5_digest` String,
    `sha256_digest` String,
    `blake2_256_digest` String
)
ENGINE = MergeTree
ORDER BY name

CREATE DATABASE github;

CREATE TABLE github.github_events
(
    `event_type` Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
    `actor_login` LowCardinality(String),
    `repo_name` LowCardinality(String),
    `repo_id` LowCardinality(String),
    `created_at` DateTime,
    `updated_at` DateTime,
    `action` Enum8('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
    `number` UInt32 -- issue/PR number; referenced by the stats query in step (4)
)
ENGINE = MergeTree
ORDER BY (repo_id, event_type, created_at)

Load the dataset for each table. This data is up to date as of 06/09/2024:

-- pypi projects data
INSERT INTO pypi.projects SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/projects/projects.parquet')

-- github events data
INSERT INTO github.github_events SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github_issues/subset/github_events_*.parquet')

The GitHub events dataset is quite large at almost 100 GB. Users can target subsets by year, e.g.

INSERT INTO github.github_events
SELECT *
FROM s3Cluster('default', 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/github_issues/subset/github_events_2024.parquet')
SETTINGS parallel_distributed_insert_select = 2

0 rows in set. Elapsed: 296.479 sec. Processed 1.23 billion rows, 13.38 GB (4.15 million rows/s., 45.14 MB/s.)

Alternatively, users can use ClickPipes to load this dataset.

4. Create a query endpoint

Save the following query in the ClickHouse Cloud SQL console and create an endpoint with a Query Endpoints API token. Ensure you apply the readonly role to your endpoint.

Note down the credentials for the token and the resulting HTTP endpoint.

WITH
(
      SELECT version
      FROM pypi.projects
      WHERE name = {package_name:String}
      ORDER BY arrayMap(x -> toUInt8OrDefault(x, 0), splitByChar('.', version)) DESC
      LIMIT 1
) AS max_version,
project_details AS (
  SELECT
      name,
      max_version,
      summary,
      author,
      author_email,
      license,
      home_page,
      trim(TRAILING '/' FROM regexpExtract(arrayFilter(l -> (l LIKE '%https://github.com/%'), arrayConcat(project_urls, [home_page]))[1], '.*https://github.com/(.*)')) AS github
  FROM pypi.projects
  WHERE (name = {package_name:String})
  ORDER BY upload_time DESC
  LIMIT 1
),
id AS (
    SELECT repo_id
    FROM github.github_events
    WHERE repo_name IN (SELECT github FROM project_details) LIMIT 1
),
stats AS (
  SELECT
      uniqExactIf(actor_login, (event_type = 'WatchEvent') AND (action = 'started')) AS stars,
      uniqExactIf(number, event_type = 'IssuesEvent') AS issues,
      uniqExactIf(actor_login, event_type = 'ForkEvent') AS forks,
      uniqExactIf(number, event_type = 'PullRequestEvent') AS prs
  FROM github.github_events
  WHERE (repo_id IN id)
)
SELECT * FROM project_details, stats
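The endpoint returns a single row combining the project details and stats. In TypeScript terms, the row's approximate shape is sketched below (note that ClickHouse may serialize 64-bit integers as strings in JSON output, depending on the output_format_json_quote_64bit_integers setting):

// Approximate shape of the single row returned by the query above.
interface PackageStats {
  name: string;
  max_version: string;
  summary: string;
  author: string;
  author_email: string;
  license: string;
  home_page: string;
  github: string; // "owner/repo" extracted from the project URLs
  stars: number | string; // 64-bit counts may arrive as strings
  issues: number | string;
  forks: number | string;
  prs: number | string;
}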

[Create endpoint]

5. Update local credentials

Update the .env file in the root folder, populating it with the endpoint and API credentials recorded above, e.g.

API_KEY_ID=MdhWYPEpXaqiwGMjbXWT
API_KEY_SECRET=4b1dKbabyQTvuKUWOnI08oXVbUD4tkaxKKjEwz7ORG
GITHUB_STATS_API=https://console-api.clickhouse.cloud/.api/query-endpoints/0b39f27a-795d-4c3c-a837-cade98e1d51f/run
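The app reads these values from the environment at request time. As an illustrative sketch only (the helper name fetchGithubStats is hypothetical, not the app's actual code), a call using these variables might look like:

// Illustrative sketch: how the env vars above could drive an endpoint call.
// fetchGithubStats is a hypothetical helper, not the app's actual implementation.
async function fetchGithubStats(packageName: string) {
  const { API_KEY_ID, API_KEY_SECRET, GITHUB_STATS_API } = process.env;
  const auth = Buffer.from(`${API_KEY_ID}:${API_KEY_SECRET}`).toString('base64');
  const response = await fetch(`${GITHUB_STATS_API}?format=JSONEachRow`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Basic ${auth}` },
    body: JSON.stringify({ queryVariables: { package_name: packageName } }),
  });
  // The query returns at most one row of project details and stats.
  const text = (await response.text()).trim();
  return text ? JSON.parse(text.split('\n')[0]) : null;
}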

6. Running the app

Install dependencies:

npm install
# or
yarn install
# or
bun install

Run the development server:

npm run dev
# or
yarn dev
# or
bun dev

Open http://localhost:3000 and search for a Python package.

Deploy on Vercel

The easiest way to deploy this app is to use the Vercel Platform from the creators of Next.js.

Check out our Next.js deployment documentation for more details.