Improve per-collection metrics and statistics #224

ividito · 2023-09-12T13:13:56Z

Related: https://github.com/NASA-IMPACT/veda-architecture/issues/267

Problem

Currently, veda-backend allows users to register searches for items in a STAC catalog. A search can target a collection either through the collections field or through a filter argument that includes a condition on collection. Each search has a usecount, but it is difficult to aggregate search usage by the collections or assets being accessed.

Example using the collections field:

{
    "collections": [
        "geoglam"
    ],
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {
                "op": ">=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T00:00:00.000Z"
                ]
            },
            {
                "op": "<=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T23:59:59.999Z"
                ]
            }
        ]
    }
}

Example using filter (this is how the dashboard uses this endpoint):

{
    "filter-lang": "cql2-json",
    "filter": {
        "op": "and",
        "args": [
            {
                "op": ">=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T00:00:00.000Z"
                ]
            },
            {
                "op": "<=",
                "args": [
                    {
                        "property": "datetime"
                    },
                    "2020-01-01T23:59:59.999Z"
                ]
            },
            {
                "op": "eq",
                "args": [
                    {
                        "property": "collection"
                    },
                    "geoglam"
                ]
            }
        ]
    }
}

Proposed Solutions

To improve observability and tracking, we can make the following enhancements:

Associating requests with targeted collections: Inject a metrics dependency into tile requests to associate each request with the targeted asset or collection. By exporting usage metrics to CloudWatch, we would be able to use a dashboard to track requests against different collections or assets. If we choose to collect these metrics at the asset level, we could also use the data to prioritize caching of frequently accessed files.
Tracking searches using collection and filter: For searches using the collections field, we can already generate simple statistics based on the number of pgstac queries made against each search (using the usecount column associated with each search). Searches that use the filter argument instead are harder to track, because the collection is buried inside the filter expression. Ideally, we would be able to aggregate comparable searches. To do this, I think it could be helpful to refactor the register endpoint to convert filter conditions on collection into equivalent collections field values. A side effect of this change would be the consolidation of equivalent searches, which could increase the effectiveness of caching mechanisms.
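
A minimal sketch of the conversion described in the second item, assuming it only needs to handle the shapes shown in the examples above: a filter that is itself a condition on collection, or a top-level "and" that contains one. The function names are hypothetical and this is not existing veda-backend code. Both "eq" (as the dashboard sends it) and "=" are accepted, plus "in" with a non-empty list; anything more complex is returned unchanged.

```python
from copy import deepcopy
from typing import Any, Dict, List, Optional


def _collection_ids(clause: Any) -> Optional[List[str]]:
    """Return collection ids if `clause` is a simple test on `collection`, else None."""
    if not isinstance(clause, dict):
        return None
    args = clause.get("args")
    if not isinstance(args, list) or len(args) != 2:
        return None
    if args[0] != {"property": "collection"}:
        return None
    op, value = clause.get("op"), args[1]
    if op in ("eq", "=") and isinstance(value, str):
        return [value]
    if op == "in" and isinstance(value, list) and value and all(isinstance(v, str) for v in value):
        return list(value)
    return None


def normalize_search(search: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `search` with a collection filter moved to `collections`."""
    result = deepcopy(search)  # never mutate the caller's request body
    flt = result.get("filter")
    if result.get("collections") or not isinstance(flt, dict):
        return result  # already normalized, or nothing to inspect

    # Treat a top-level "and" as a list of clauses; otherwise a single clause.
    is_and = flt.get("op") == "and" and isinstance(flt.get("args"), list)
    clauses = flt["args"] if is_and else [flt]

    found: Optional[List[str]] = None
    remaining = []
    for clause in clauses:
        ids = _collection_ids(clause)
        if ids is not None and found is None:
            found = ids  # move only the first collection clause
        else:
            remaining.append(clause)
    if found is None:
        return result

    result["collections"] = found
    if not remaining:
        # The filter only constrained the collection, so drop it entirely.
        result.pop("filter")
        result.pop("filter-lang", None)
    elif len(remaining) == 1:
        result["filter"] = remaining[0]
    else:
        result["filter"] = {"op": "and", "args": remaining}
    return result
```

Applied to the dashboard-style request above, this should produce the same body as the first example, so both forms would register as the same search. Collection conditions nested under "or" or "not" are deliberately left alone, since moving them would change the meaning of the search.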

Acceptance Criteria

Select one of the following solutions:

Either:

Extend one of the existing dependencies in titiler to export metrics to CloudWatch, associated with collection IDs or assets
Create a SQL query that can aggregate the usecount of searches over the same collection (failed attempt included below)
Restructure the register endpoint to convert searches that filter on collection to use the collections field

Finally:

Ensure that the chosen solution doesn't result in performance issues (particularly for the /register endpoint changes, since that endpoint is called every time a layer is rendered in the dashboard)
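
On the performance point for the metrics option: one approach that avoids a synchronous CloudWatch API call per tile request is the CloudWatch embedded metric format, where the service writes a structured JSON log line and CloudWatch extracts the metric from the logs. The sketch below only builds that log line. It assumes the service's stdout is shipped to CloudWatch Logs (as it is on Lambda); the namespace "veda-backend", the metric name "TileRequests", and the function name are placeholders, not existing names.

```python
import json
import time
from typing import Optional


def tile_request_metric(
    collection_id: str,
    asset: Optional[str] = None,
    namespace: str = "veda-backend",  # hypothetical namespace
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one embedded-metric-format log line counting a tile request."""
    # Dimension values live at the root of the record, keyed by dimension name.
    dimensions = {"collection": collection_id}
    if asset is not None:
        dimensions["asset"] = asset

    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set listing the dimension names used below.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": "TileRequests", "Unit": "Count"}],
                }
            ],
        },
        "TileRequests": 1,  # hypothetical metric name
        **dimensions,
    }
    return json.dumps(record)
```

A metrics dependency could call print(tile_request_metric("geoglam")) once per request. Whether asset-level dimensions are affordable would need checking, since each distinct dimension combination becomes its own CloudWatch metric.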

Additional Information

Simple statistics for searches using collection:

SELECT usecount, search->'collections', jsonb_array_length(search->'collections')
FROM pgstac.searches
WHERE jsonb_array_length(search->'collections') > 0
ORDER BY jsonb_array_length(search->'collections') DESC

Failed attempt at also including searches that use filter:

SELECT *
FROM (
  SELECT usecount,
         search,
         search->'collections' AS collections,
         jsonb_array_length(search->'collections') AS collections_length
  FROM pgstac.searches
  WHERE jsonb_path_exists(search, '$.collections')
  UNION ALL
  SELECT usecount,
         search,
         jsonb_agg(jsonb_extract_path(search, 'filter', 'args', 'args'))
           FILTER (WHERE jsonb_typeof(jsonb_extract_path(search, 'filter', 'args', 'args')) = 'array') AS collections,
         jsonb_array_length(jsonb_extract_path(search, 'filter', 'args', 'args')) AS collections_length
  FROM pgstac.searches
  WHERE NOT jsonb_path_exists(search, '$.collections')
  GROUP BY search, usecount
) AS subquery
GROUP BY search, collections, collections_length, usecount
ORDER BY collections_length
LIMIT 1000;
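
Until a pure-SQL version works, the aggregation could be done client-side. The sketch below assumes rows of (usecount, search) as returned by SELECT usecount, search FROM pgstac.searches, with the search column already decoded into a Python dict. The function names are hypothetical. It walks the filter at any depth looking for "eq", "=", or "in" conditions on collection, so it ignores negation and may over-attribute searches that use "not".

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_collection_ids(node: Any) -> Iterator[str]:
    """Yield collection ids referenced by conditions on `collection` in a filter."""
    if isinstance(node, dict):
        args = node.get("args")
        if isinstance(args, list):
            if len(args) == 2 and args[0] == {"property": "collection"}:
                op, value = node.get("op"), args[1]
                if op in ("eq", "=") and isinstance(value, str):
                    yield value
                elif op == "in" and isinstance(value, list):
                    yield from (v for v in value if isinstance(v, str))
            # Recurse so conditions nested under and/or are found as well.
            for arg in args:
                yield from iter_collection_ids(arg)
    elif isinstance(node, list):
        for item in node:
            yield from iter_collection_ids(item)


def usecount_by_collection(rows: Iterable[Tuple[int, Dict[str, Any]]]) -> Counter:
    """Sum usecount per collection across both styles of search."""
    totals: Counter = Counter()
    for usecount, search in rows:
        # A set avoids double counting a collection named in both places.
        ids = set(search.get("collections") or [])
        ids.update(iter_collection_ids(search.get("filter")))
        for collection_id in ids:
            totals[collection_id] += usecount
    return totals
```

A search that targets several collections contributes its full usecount to each of them, so the per-collection totals can add up to more than the overall number of search uses.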
