Improve per-collection metrics and statistics #224

Open
5 tasks
ividito opened this issue Sep 12, 2023 · 0 comments

Related: https://github.com/NASA-IMPACT/veda-architecture/issues/267

Problem

Currently, veda-backend allows users to register searches for items in a STAC catalog. A search can select items either through the collections field or through a filter argument that includes a predicate on collection. Each search has a usecount, but it is difficult to aggregate search usage by the collections or assets being accessed.

Example using collection:

{
    "collections": [
        "geoglam"
    ],
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {
                "op": ">=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T00:00:00.000Z"
                ]
            },
            {
                "op": "<=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T23:59:59.999Z"
                ]
            }
        ]
    }
}

Example using filter (this is how the dashboard uses this endpoint):

{
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {
                "op": ">=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T00:00:00.000Z"
                ]
            },
            {
                "op": "<=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T23:59:59.999Z"
                ]
            },
            {
                "op": "eq",
                "args": [
                    {
                        "property": "collection"
                    },
                    "geoglam"
                ]
            }
        ]
    }
}

Proposed Solutions

To improve observability and tracking, we can make the following enhancements:

  1. Associating requests with targeted collections: Inject a metrics dependency into tile requests to associate each request with the targeted asset or collections. By exporting usage metrics to CloudWatch, we would be able to use a dashboard to track requests against different collections or assets. If we choose to collect these metrics at the asset level, we could also use the data to prioritize caching of frequently accessed files.

  2. Tracking searches using collections and filter: For searches using the collections field, we can already generate simple statistics based on the number of pgstac queries made against each search (using the usecount column associated with each search). However, tracking searches that use the filter argument instead is challenging. Ideally, we would be able to aggregate comparable searches. To do this, I think it would help to refactor the register endpoint to convert filter statements on collection into equivalent collections field values. A side effect of this change would be the consolidation of searches, which could increase the effectiveness of caching mechanisms.
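The conversion described in item 2 could look roughly like the sketch below. It is only an illustration, not the register endpoint's actual code: `normalize_search` and `_collection_ids` are hypothetical names, and the sketch assumes the collection predicate sits at the top level of the filter (or directly under a top-level `and`), uses `eq`, `=`, or `in`, and that the search body does not already carry a `collections` field. Anything else is returned unchanged.

```python
from copy import deepcopy
from typing import Any, Dict, List, Optional


def _collection_ids(node: Any) -> Optional[List[str]]:
    """Return the ids if node is `collection = x` or `collection in [...]`, else None."""
    if not isinstance(node, dict):
        return None
    args = node.get("args")
    if not isinstance(args, list) or len(args) != 2:
        return None
    if args[0] != {"property": "collection"}:
        return None
    op, value = node.get("op"), args[1]
    if op in ("eq", "=") and isinstance(value, str):
        return [value]
    if op == "in" and isinstance(value, list) and all(isinstance(v, str) for v in value):
        return list(value)
    return None


def normalize_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """Move a single top-level collection predicate from `filter` to `collections`."""
    out = deepcopy(body)
    flt = out.get("filter")
    # Be conservative: leave bodies that already name collections untouched.
    if not isinstance(flt, dict) or "collections" in out:
        return out

    conjuncts = flt.get("args", []) if flt.get("op") == "and" else [flt]
    found = [ids for ids in map(_collection_ids, conjuncts) if ids is not None]
    remaining = [c for c in conjuncts if _collection_ids(c) is None]
    if len(found) != 1:
        # Zero predicates: nothing to do. Several: intersection semantics, skip.
        return out

    out["collections"] = sorted(set(found[0]))
    if not remaining:
        out.pop("filter")
        out.pop("filter-lang", None)
    elif len(remaining) == 1:
        out["filter"] = remaining[0]
    else:
        out["filter"] = {"op": "and", "args": remaining}
    return out
```

Applied to the dashboard-style request shown earlier, this yields the same body as the first example, so both forms would register as one search. Whether this stays cheap enough for an endpoint that is called on every layer render would still need to be measured.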

Acceptance Criteria

  • Select one of the following solutions:

Either:

  • Extend one of the existing dependencies in titiler to export metrics to CloudWatch, associated with collection ids or assets
  • Create a SQL query that can aggregate the usecount of searches over the same collection (failed attempt included below)
  • Restructure the register endpoint so that searches filtering on collection are converted to use the collections field

Finally:

  • Ensure that the chosen solution doesn't result in performance issues (particularly the /register endpoint changes, where that endpoint is called every time a layer is rendered in the dashboard)
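For the first option, one possible way to get per-collection counts into CloudWatch without an extra API call per request is the CloudWatch Embedded Metric Format (EMF), where a structured JSON log line is turned into a metric. The sketch below assumes the tiler runs somewhere EMF log lines are picked up from stdout (for example AWS Lambda); the namespace `veda-backend/raster`, the metric name `TileRequests`, and both function names are made up for illustration. How a titiler dependency would resolve the collection id from a request is deliberately left out.

```python
import json
import time
from typing import Any, Dict, Optional


def collection_metric_record(
    collection_id: str,
    namespace: str = "veda-backend/raster",  # hypothetical namespace
    metric_name: str = "TileRequests",       # hypothetical metric name
    timestamp_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build one EMF record that counts a single request against a collection."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, keyed by the "collection" member below.
                    "Dimensions": [["collection"]],
                    "Metrics": [{"Name": metric_name, "Unit": "Count"}],
                }
            ],
        },
        "collection": collection_id,
        metric_name: 1,
    }


def emit_collection_metric(collection_id: str) -> None:
    """Write the record as a single JSON log line on stdout."""
    print(json.dumps(collection_metric_record(collection_id)), flush=True)
```

A request dependency could call `emit_collection_metric("geoglam")` once the collection is known. Because this is only a log write, the added latency should be small, but that would still need to be confirmed against the performance criterion above.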

Additional Information

  • Simple statistics for searches using collection:
SELECT usecount, search->'collections', jsonb_array_length(search->'collections')
FROM pgstac.searches
WHERE jsonb_array_length(search->'collections') > 0
ORDER BY jsonb_array_length(search->'collections') DESC
  • Attempt at including searches using filter:
SELECT *
FROM (
  SELECT usecount,
         search,
         search->'collections' AS collections,
         jsonb_array_length(search->'collections') AS collections_length
  FROM pgstac.searches
  WHERE jsonb_path_exists(search, '$.collections')
  UNION ALL
  -- Note: for searches shaped like the examples above, search->'filter'->'args'
  -- is a JSON array, so the second 'args' key in the path below does not match
  -- an array element and the extracted value is NULL.
  SELECT usecount,
         search,
         jsonb_agg(jsonb_extract_path(search, 'filter', 'args', 'args'))
           FILTER (WHERE jsonb_typeof(jsonb_extract_path(search, 'filter', 'args', 'args')) = 'array') AS collections,
         jsonb_array_length(jsonb_extract_path(search, 'filter', 'args', 'args')) AS collections_length
  FROM pgstac.searches
  WHERE NOT jsonb_path_exists(search, '$.collections')
  GROUP BY search, usecount
) AS subquery
GROUP BY search, collections, collections_length, usecount
ORDER BY collections_length
LIMIT 1000;
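Until a pure SQL version works, the aggregation could be prototyped outside the database. The sketch below assumes rows of `(usecount, search)` fetched with something like `SELECT usecount, search FROM pgstac.searches`, with `search` already decoded to a dict; `iter_collections` and `usecount_by_collection` are hypothetical helper names. It attributes a search to every collection named in the `collections` field or in an `eq`, `=`, or `in` predicate anywhere in the filter, which is a rough approximation: predicates nested under `or` or `not` are counted the same way.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_collections(search: Dict[str, Any]) -> Iterator[str]:
    """Yield collection ids named by `collections` or by filter predicates."""
    for cid in search.get("collections") or []:
        yield cid

    # Walk the filter tree iteratively; non-dict nodes are literals and ignored.
    stack = [search.get("filter")]
    while stack:
        node = stack.pop()
        if not isinstance(node, dict):
            continue
        args = node.get("args")
        if not isinstance(args, list):
            continue
        is_collection_predicate = (
            len(args) == 2
            and args[0] == {"property": "collection"}
            and node.get("op") in ("eq", "=", "in")
        )
        if is_collection_predicate:
            value = args[1]
            if isinstance(value, str):
                yield value
            elif isinstance(value, list):
                yield from (v for v in value if isinstance(v, str))
        else:
            stack.extend(args)


def usecount_by_collection(rows: Iterable[Tuple[int, Dict[str, Any]]]) -> Counter:
    """Sum usecount per collection; each search counts once per collection."""
    totals: Counter = Counter()
    for usecount, search in rows:
        for cid in set(iter_collections(search)):
            totals[cid] += usecount or 0
    return totals
```

Calling `usecount_by_collection(rows).most_common()` then gives collections ordered by total usecount, covering both request styles shown above.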