Improve statistics for downloads #4642

jonatas · 2024-04-24T18:16:13Z

Is your feature request related to a problem?

I had a meeting with @simi to follow up on and continue the draft
@segiddins started in #3560. Let's break down the problem here.

Problem: The current DownloadGem does not offer granularity or insights to the team creating the gem. The idea is to improve this by giving more granularity and detail about user behavior while installing gems.

Describe the solution you'd like

Introduce new, granular tracking of downloads, allowing users to know more about when gems are installed, and publicly expose more statistics about gems being downloaded.

The gem page can present daily, weekly, and monthly totals. The public view can also show hourly downloads for "Today".

The ideal scenario would also include the location the downloads come from, but I haven't investigated enough to know whether we have that level of granularity available.

Describe alternatives you've considered

I haven't checked alternatives, as PostgreSQL is already in the stack and TimescaleDB was already the suggestion.

Additional context

I'm very glad to work on and support RubyGems. I've been a Rubyist for almost 2 decades, and for the last 3 years I've been working as a Developer Advocate at Timescale, the company behind the TimescaleDB extension. I also created the timescaledb gem. So, my plan is to break it down into a few PRs:

Introduce TimescaleDB to the stack, setting up tests and creating the new Downloads hypertable.
Track downloaded gems and introduce a clone of the Fastly job that just stores the data in TimescaleDB.
Introduce continuous aggregates for storing download totals in daily, monthly, and yearly timeframes.
Backfill data from all S3 buckets.
Migrate frontend statistics to use the continuous aggregates.
Clean up old statistics and counters
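
To make the roll-up idea behind the plan concrete, here is a minimal plain-Ruby sketch of the kind of per-hour and per-day totals a continuous aggregate would keep precomputed. It does not use TimescaleDB or the timescaledb gem; the `Download` struct, the `bucket` and `rollup` helpers, and the sample events are all hypothetical and only illustrate the bucketing logic.

```ruby
# Hypothetical event shape: one row per download, as a hypertable might store it.
Download = Struct.new(:gem_name, :version, :downloaded_at)

# Truncate a timestamp to the start of its hour or day (UTC).
def bucket(time, granularity)
  t = time.getutc
  case granularity
  when :hour then Time.utc(t.year, t.month, t.day, t.hour)
  when :day  then Time.utc(t.year, t.month, t.day)
  else raise ArgumentError, "unknown granularity: #{granularity}"
  end
end

# Count downloads per (gem, bucket). A continuous aggregate would maintain
# totals like these incrementally instead of recomputing them on each read.
def rollup(downloads, granularity)
  downloads
    .group_by { |d| [d.gem_name, bucket(d.downloaded_at, granularity)] }
    .transform_values(&:count)
end

# Made-up sample events, for illustration only.
events = [
  Download.new("rails", "7.1.3", Time.utc(2024, 4, 24, 10, 5)),
  Download.new("rails", "7.1.3", Time.utc(2024, 4, 24, 10, 40)),
  Download.new("rails", "7.1.3", Time.utc(2024, 4, 24, 11, 15)),
  Download.new("rake", "13.2.1", Time.utc(2024, 4, 25, 9, 0)),
]

hourly = rollup(events, :hour)
daily  = rollup(events, :day)
```

Because the daily totals can be derived from the hourly ones, the same shape extends to monthly and yearly timeframes by adding more granularities to `bucket`.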

segiddins · 2024-04-24T19:10:43Z

See also https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/ for how PyPI handles this.

colby-swandale · 2024-04-26T04:56:26Z

👋🏻 heyo, I'm Colby, I'm maintaining the infrastructure for rubygems.org and wanted to jump in to help get this done. I wanted to ask some questions to better understand what changes introducing TimescaleDB will bring.

I appreciate Timescale putting their hand up to help us here, it's super appreciated by everyone here. My big takeaway from this proposal is that it introduces a runtime dependency to rubygems.org. We already have some, e.g. Fastly, but we look to limit them where possible. What is the benefit of running a Timescale Cloud instance, versus our use case being simple enough that the Timescale Postgres extension could handle it relatively easily? I also heard of a potential Timescale DB instance inside AWS being in active development, is this far away?

Our download logs only go as far back as 2015, when we moved to Fastly, so you'll probably need to add a step to backfill gem versions created before this date. You could probably backfill only up to 365.days.ago to reduce the number of logs that need to be parsed and inserted.
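
As a rough illustration of limiting the backfill window, the sketch below keeps only log files dated within the last 365 days. It is plain Ruby (so it uses `Date` arithmetic rather than ActiveSupport's `365.days.ago`), and the key layout with an embedded `YYYY-MM-DD` date, the `log_date` and `backfill_candidates` helpers, and the sample keys are all assumptions for the example, not the real bucket layout.

```ruby
require "date"

# Hypothetical key layout: the date each log covers is embedded as YYYY-MM-DD.
# Returns nil when no date can be found in the key.
def log_date(key)
  match = key.match(/(\d{4})-(\d{2})-(\d{2})/)
  match && Date.new(match[1].to_i, match[2].to_i, match[3].to_i)
end

# Keep only logs dated on or after the cutoff (today minus `days`).
# Keys without a parseable date are skipped rather than guessed at.
def backfill_candidates(keys, today:, days: 365)
  cutoff = today - days
  keys.select do |key|
    date = log_date(key)
    date && date >= cutoff
  end
end

# Made-up keys, for illustration only.
keys = [
  "logs/2015-06-01-a.log.gz",
  "logs/2023-12-31-b.log.gz",
  "logs/2024-04-20-c.log.gz",
  "logs/unparseable.log.gz",
]

recent = backfill_candidates(keys, today: Date.new(2024, 4, 26))
```

Passing `today:` explicitly keeps the cutoff deterministic, which makes a long-running backfill easier to resume and test.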

simi · 2024-04-26T08:21:29Z

Our download logs only go as far back as 2015, when we moved to Fastly, so you'll probably need to add a step to backfill gem versions created before this date. You could probably backfill only up to 365.days.ago to reduce the number of logs that need to be parsed and inserted.

@colby-swandale What data could be used to backfill pre-Fastly gems? In case there is none, we can just mark those versions as incomplete statistics-wise.

jonatas · 2024-04-26T18:28:52Z

Hello Colby! Thanks for reaching out!

What is the benefit of running a Timescale Cloud instance, versus our use case being simple enough that the Timescale Postgres extension could handle it relatively easily?

A cloud instance gives you elastic compute and storage, high availability, replicas, etc. It would also be great marketing for our product, but the open source version just works.

I also heard of a potential Timescale DB instance inside AWS being in active development, is this far away?

I don't have enough details to share any estimates, but I will check with the team.

My big takeaway from this proposal is that it introduces a runtime dependency to rubygems.org. We already have some, e.g. Fastly, but we look to limit them where possible.

I totally agree, and I was even thinking about how these statistics could be a separate service, like rubygems-analytics, because the only things we need are the same files from S3 and maybe some rubygems metadata like rubygem_id and version_id; the rest would be totally isolated.

So I'm also happy to move it into an independent process to isolate the entire scenario. If you agree, I can first bring a POC that runs totally independently.

simi · 2024-04-26T18:35:36Z

I totally agree, and I was even thinking about how these statistics could be a separate service, like rubygems-analytics, because the only things we need are the same files from S3 and maybe some rubygems metadata like rubygem_id and version_id; the rest would be totally isolated.

@colby-swandale on the other hand, a new isolated app will add maintenance burden. 🤔 @jonatas do you have any idea/estimate of what kind of response time we can get for the most complex queries planned?

jonatas · 2024-04-26T18:41:02Z

I don't think we'll have anything over a second. Everything will be pre-processed, so I imagine the avg query will be under 300ms.

jonatas · 2024-05-08T17:23:17Z

Hi folks, I just created this POC with the basic code to allow us to collect hourly statistics from the raw data.

We can process all available logs and pre-load the data into some instance, but I don't have access to run it yet.

@simi brought up the point of making it an isolated service versus running it on the current infrastructure, and I'd love if we could

I see a lot of positive impact in building an isolated server that just tracks downloads. I don't think this type of feature needs to be part of the main server, and having the extra database layer would add a new layer of complexity on top of ActiveRecord, as it uses different connections.

On an isolated server we'd need to mimic LogTickets or just have access to the S3 API to list and consume all the files:

We'll need a listener that subscribes to messages about newly generated logs to process.
Create an endpoint for statistics that can be consumed by the official website.
Drop the old counters from rubygems and replace the source with service calls.

I'm very open to going either way. I can also integrate with the point @segiddins reached before. I just explored this as a POC and am looking for more feedback before we proceed to a production implementation. I think as an isolated server we'd have more chance to develop other types of analysis and even detect patterns.

simi · 2024-05-08T18:01:43Z

@simi brought up the point of making it an isolated service versus running it on the current infrastructure, and I'd love if we could

This was raised by @colby-swandale actually. We need to ensure Timescale service health is not going to affect the health of the rest of the service. I thought we did something special for OpenSearch, but it seems we don't. 🤔 @colby-swandale would you mind deciding whether it is OK to start with a built-in API with some reasonable timeouts, or whether we should start with an isolated service?
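
For the "built-in API with reasonable timeouts" option, a minimal sketch of the idea could look like the following. The `with_stats_timeout` helper and the `:unavailable` fallback are hypothetical names for illustration; it uses Ruby's standard `Timeout` module, whereas a real integration would more likely rely on a database-level limit such as PostgreSQL's `statement_timeout`.

```ruby
require "timeout"

# Run a stats lookup with a hard time budget. If the stats backend is slow
# or raises, return the fallback so the rest of the page keeps working.
# Timeout::Error is a StandardError, so one rescue covers both cases.
def with_stats_timeout(seconds, fallback: nil)
  Timeout.timeout(seconds) { yield }
rescue StandardError
  fallback
end

# Simulated lookups: a fast one, one that exceeds the budget, and one that fails.
fast   = with_stats_timeout(0.5, fallback: :unavailable) { 42 }
slow   = with_stats_timeout(0.05, fallback: :unavailable) { sleep 1; 42 }
broken = with_stats_timeout(0.5, fallback: :unavailable) { raise IOError, "connection refused" }
```

With this shape, the gem page could render without download charts when the statistics store is unhealthy instead of failing the whole request.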

jonatas added the feature label Apr 24, 2024

jonatas mentioned this issue May 15, 2024

Add timescaledb to infrastructure #4716

Merged

jonatas mentioned this issue Jul 11, 2024

Dump hierarchical continuous aggregates in the right order timescale/timescaledb-ruby#70

Closed

jonatas mentioned this issue Aug 23, 2024

Setup downloads on timescaledb #4979

Draft

3 tasks
