Currently, origin and cache have the same metrics in Prometheus monitoring. Besides them, we want to include more specific metrics for cache servers to collect additional information. Here are the candidates for cache-specific metrics.
Cache Performance Metric
We are interested in the response time of a cache server for each request it serves. By collecting xrootd_cache_latency_seconds, we will be able to analyze cache performance by request status, object size, the client's geographic location, etc.
Labels:
type: ["hit", "miss"]
path
size - object size
ip - the TCP connection metadata from the client to the XRootD server, including the client's IP address and subnet
proj - the client's User-Agent header when requesting the file
Derived metrics:
xrootd_cache_hit_count (count by type == "hit")
xrootd_cache_miss_count (count by type == "miss")
xrootd_cache_hit_count and xrootd_cache_miss_count weighted by object size (suggested by @Saartank)
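To make the derived metrics concrete, here is a minimal sketch of how the hit/miss counts and their size-weighted variants could be computed from individual observations. It is plain Python rather than Prometheus client code, and the `RequestSample` record, the `derive_counts` helper, and the sample values are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One hypothetical observation of xrootd_cache_latency_seconds."""
    type: str       # "hit" or "miss"
    path: str
    size: int       # object size in bytes
    latency: float  # response time in seconds


def derive_counts(samples):
    """Return (request counts, bytes served) keyed by "hit" / "miss"."""
    counts = {"hit": 0, "miss": 0}
    bytes_by_type = {"hit": 0, "miss": 0}
    for s in samples:
        counts[s.type] += 1
        bytes_by_type[s.type] += s.size
    return counts, bytes_by_type


# Made-up sample data: two small hits and one large miss.
samples = [
    RequestSample("hit", "/ns/a", 100, 0.01),
    RequestSample("miss", "/ns/b", 900, 0.50),
    RequestSample("hit", "/ns/a", 100, 0.02),
]

counts, bytes_by_type = derive_counts(samples)
total_bytes = sum(bytes_by_type.values())
# Size-weighted hit ratio: fraction of bytes served from cache.
byte_hit_ratio = bytes_by_type["hit"] / total_bytes if total_bytes else 0.0
```

In this made-up data two of three requests are hits, but hits account for only 200 of 1100 bytes, which is the kind of difference the size-weighted variant is meant to expose.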
Cache Space Management Metrics
Objects come and go in a cache. We want to track the entire lifecycle of each object held by a cache server, from the moment it is prefetched or fetched from the origin. The candidate metric is xrootd_cache_object.
Labels:
path
ns
status: ["cached", "evicted"]
size
creation_time
access_count
eviction_reason - null if still cached, reason if evicted
Derived metrics:
age of files (current_time - creation_time)
eviction counts (count by eviction_reason)
space usage (sum of size)
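The three derived metrics above could be computed from per-object records roughly as follows. This is a sketch in plain Python, not Pelican or XRootD code: the `CacheObject` record mirrors the proposed labels, while `summarize`, the `"lru"` eviction reason, and the choice to sum sizes over currently cached objects only are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheObject:
    """Hypothetical record mirroring the proposed xrootd_cache_object labels."""
    path: str
    ns: str
    status: str            # "cached" or "evicted"
    size: int              # bytes
    creation_time: float   # Unix seconds
    access_count: int
    eviction_reason: Optional[str] = None  # None while still cached


def summarize(objects, now):
    """Derive file ages, eviction counts, and space usage."""
    cached = [o for o in objects if o.status == "cached"]
    # Age of files: current_time - creation_time, for objects still cached.
    ages = {o.path: now - o.creation_time for o in cached}
    # Eviction counts: count by eviction_reason.
    evictions = Counter(
        o.eviction_reason for o in objects if o.status == "evicted"
    )
    # Space usage: sum of size (assumed here to cover cached objects only).
    space_used = sum(o.size for o in cached)
    return ages, evictions, space_used


# Made-up sample data; "lru" is an illustrative eviction reason.
objects = [
    CacheObject("/ns1/a", "/ns1", "cached", 100, 900.0, 3),
    CacheObject("/ns1/b", "/ns1", "evicted", 400, 500.0, 1, "lru"),
    CacheObject("/ns2/c", "/ns2", "cached", 250, 950.0, 7),
    CacheObject("/ns2/d", "/ns2", "evicted", 50, 100.0, 0, "lru"),
]
ages, evictions, space_used = summarize(objects, now=1000.0)
```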
Cache Health Metrics
When something bad happens in a cache, we want to get an alert. xrootd_cache_corruption_count tracks cache data integrity issues. I'm not very familiar with how Pelican wraps XRootD, so these are only my suggested labels.
Labels:
path
ns
type: ["checksum", "incomplete", "storage"]
checksum: cached file's checksum doesn't match original
incomplete: interrupted transfers from origin
storage: file system error, permission problem...
action: ["removed", "refetched", none]
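As a rough illustration of how these labels might combine, the sketch below counts integrity events per label set and rolls them up by type, which is the number an alert would most likely watch. It is plain Python, not Prometheus or Pelican code; `CorruptionCounter` and its methods are hypothetical, and representing the "no action" case as the string `"none"` is an assumption.

```python
from collections import Counter

# Label values taken from the proposal; "none" as a string is an assumption.
VALID_TYPES = {"checksum", "incomplete", "storage"}
VALID_ACTIONS = {"removed", "refetched", "none"}


class CorruptionCounter:
    """Hypothetical stand-in for xrootd_cache_corruption_count."""

    def __init__(self):
        self._counts = Counter()

    def record(self, path, ns, type_, action="none"):
        """Count one integrity issue for the given label set."""
        if type_ not in VALID_TYPES:
            raise ValueError(f"unknown type: {type_!r}")
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        self._counts[(path, ns, type_, action)] += 1

    def total_by_type(self):
        """Sum counts over path, ns, and action for each type."""
        totals = Counter()
        for (_path, _ns, type_, _action), n in self._counts.items():
            totals[type_] += n
        return totals


# Made-up events for illustration.
corruption = CorruptionCounter()
corruption.record("/ns1/a", "/ns1", "checksum", "refetched")
corruption.record("/ns1/a", "/ns1", "checksum", "removed")
corruption.record("/ns2/c", "/ns2", "storage")
totals = corruption.total_by_type()
```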