Cache-specific Prometheus metrics #1854

h2zh · 2024-12-23T20:59:37Z

Pelican Service:

Currently, origin and cache servers expose the same metrics in Prometheus monitoring. In addition to these, we want to add cache-specific metrics that collect more detailed information. Here are the candidates.

Cache Performance Metric

We'd be interested in the response time of a cache server for each request it serves. By collecting xrootd_cache_latency_seconds, we will be able to analyze cache performance by request status, object size, the client's geographic location, etc.
Labels:
type: ["hit", "miss"]
path
size - object size
ip - this records the TCP connection metadata from client to XRootD server, including client's IP address and subnet
proj - client’s User-Agent header when requesting the file
Derived metrics:
xrootd_cache_hit_count (count by type == "hit")
xrootd_cache_miss_count (count by type == "miss")
Weighted xrootd_cache_hit/miss_count by object size (suggested by @Saartank)
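
As a rough illustration of how the derived metrics above could be computed from per-request samples, here is a minimal Python sketch. It is not a Prometheus client and not Pelican code: the sample records, the `latency` field, and the `derive_counts` helper are all hypothetical, and the values are made up.

```python
from collections import Counter

# Hypothetical per-request samples shaped like the proposed labels;
# the paths, sizes, and latencies are invented for illustration.
requests = [
    {"type": "hit", "path": "/ns/a", "size": 100, "latency": 0.02},
    {"type": "miss", "path": "/ns/b", "size": 300, "latency": 0.50},
    {"type": "hit", "path": "/ns/a", "size": 100, "latency": 0.01},
]

def derive_counts(samples):
    """Return (request count by type, bytes served by type)."""
    counts = Counter()
    weighted = Counter()
    for s in samples:
        counts[s["type"]] += 1          # plain hit/miss count
        weighted[s["type"]] += s["size"]  # hit/miss count weighted by object size
    return counts, weighted

counts, weighted = derive_counts(requests)
hit_ratio = counts["hit"] / sum(counts.values())
byte_hit_ratio = weighted["hit"] / sum(weighted.values())
```

With these samples the plain hit ratio and the size-weighted hit ratio differ, which is the reason the weighted variant may be worth exposing separately.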

Cache Space Management Metrics

Objects come and go in a cache. We want to track the entire lifecycle of each object in a cache server, starting from when it is fetched or prefetched from the origin. The proposed metric is xrootd_cache_object.
Labels:
path
ns
status: ["cached", "evicted"]
size
creation_time
access_count
eviction_reason - null if still cached, reason if evicted
Derived metrics:
age of files (current_time - creation_time)
eviction counts (count by eviction_reason)
space usage (sum of size)
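
A minimal Python sketch of the three derived metrics above, assuming object records that carry the proposed labels. The records, the `summarize` helper, and the "lru" eviction reason are hypothetical (the proposal does not list eviction reasons), and counting only objects with status "cached" toward space usage is my assumption.

```python
from collections import Counter

# Hypothetical object records using the proposed labels; times are Unix seconds.
objects = [
    {"path": "/ns1/a", "ns": "/ns1", "status": "cached", "size": 100,
     "creation_time": 1000, "access_count": 3, "eviction_reason": None},
    {"path": "/ns1/b", "ns": "/ns1", "status": "evicted", "size": 400,
     "creation_time": 500, "access_count": 1, "eviction_reason": "lru"},
    {"path": "/ns2/c", "ns": "/ns2", "status": "cached", "size": 250,
     "creation_time": 1500, "access_count": 7, "eviction_reason": None},
]

def summarize(records, now):
    """Compute object ages, eviction counts by reason, and space usage."""
    cached = [r for r in records if r["status"] == "cached"]
    evicted = [r for r in records if r["status"] == "evicted"]
    # age of files: current_time - creation_time, for objects still cached
    ages = {r["path"]: now - r["creation_time"] for r in cached}
    # eviction counts: count by eviction_reason
    evictions = Counter(r["eviction_reason"] for r in evicted)
    # space usage: sum of size over cached objects (assumption: evicted ones excluded)
    space_used = sum(r["size"] for r in cached)
    return ages, evictions, space_used

ages, evictions, space_used = summarize(objects, now=2000)
```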

Cache Health Metrics

When something goes wrong in a cache, we want to get an alert. xrootd_cache_corruption_count tracks cache data integrity issues. I'm not very familiar with how Pelican wraps XRootD, so these are only suggested labels.
Labels:
path
ns
type: ["checksum", "incomplete", "storage"]

checksum: cached file's checksum doesn't match original
incomplete: interrupted transfers from origin
storage: file system error, permission problem...

action: ["removed", "refetched", none]
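
To make the label set concrete, here is a small Python sketch of a counter keyed by the suggested labels. It is only an in-memory stand-in, not a Prometheus client and not how Pelican or XRootD report errors; `record_corruption` and `total_by_type` are hypothetical helpers, and representing the "none" action as a string is my assumption.

```python
from collections import Counter

# Allowed label values copied from the proposal above;
# "none" is written as a string here by assumption.
VALID_TYPES = {"checksum", "incomplete", "storage"}
VALID_ACTIONS = {"removed", "refetched", "none"}

corruption_count = Counter()

def record_corruption(path, ns, kind, action="none"):
    """Increment the counter for one integrity event, rejecting unknown label values."""
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown corruption type: {kind}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    corruption_count[(path, ns, kind, action)] += 1

def total_by_type(kind):
    """Sum events of one type across all paths, namespaces, and actions."""
    return sum(n for (_, _, k, _), n in corruption_count.items() if k == kind)

# Made-up events for illustration.
record_corruption("/ns1/a", "/ns1", "checksum", "refetched")
record_corruption("/ns1/a", "/ns1", "checksum", "refetched")
record_corruption("/ns2/b", "/ns2", "incomplete", "removed")
```

An alert rule could then trigger on any increase of the per-type total, for example when `total_by_type("checksum")` grows between scrapes.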


CannonLock · 2024-12-23T21:05:58Z

This might be a bit harder but I would be curious to see the

# bytes per project
IP address of requesting Client

h2zh · 2024-12-23T23:12:41Z

This might be a bit harder but I would be curious to see the

# bytes per project

IP address of requesting Client

Thanks! Just updated.

h2zh added the enhancement and cache labels Dec 23, 2024

bbockelm assigned patrickbrophy Jan 16, 2025

bbockelm added the monitoring label Jan 16, 2025
