Cache-specific Prometheus metrics #1854

Open
1 of 7 tasks
h2zh opened this issue Dec 23, 2024 · 2 comments
Labels: cache (Issue relating to the cache component), enhancement (New feature or request), monitoring

Comments

@h2zh
Collaborator

h2zh commented Dec 23, 2024

Pelican Service:

  • Client
  • Plugin
  • Registry
  • Director
  • Origin
  • Cache
  • Other (please give the detail)

Currently, origins and caches expose the same set of metrics in Prometheus monitoring. In addition to these, we want metrics specific to cache servers so we can collect information that the shared metrics do not cover. The candidates for cache-specific metrics are listed below.

Cache Performance Metric

We're interested in the response time of a cache server for each request it serves. By collecting xrootd_cache_latency_seconds, we will be able to analyze cache performance by request status (hit or miss), object size, the client's geographic location, etc.
Labels:
  • type: ["hit", "miss"]
  • path
  • size - object size
  • ip - TCP connection metadata from the client to the XRootD server, including the client's IP address and subnet
  • proj - client’s User-Agent header when requesting the file
Derived metrics:
  • xrootd_cache_hit_count (count by type == "hit")
  • xrootd_cache_miss_count (count by type == "miss")
  • xrootd_cache_hit_count and xrootd_cache_miss_count weighted by object size (suggested by @Saartank)
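
To make the derived metrics concrete, here is a minimal Python sketch of how the hit/miss counts and their size-weighted variants could be computed from per-request samples. It is only an illustration of the arithmetic, not how Pelican or XRootD would implement it: the `RequestSample` class, the `derive_counts` and `hit_ratio` helpers, and the sample values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One observation of the proposed xrootd_cache_latency_seconds metric (hypothetical shape)."""
    type: str               # "hit" or "miss"
    path: str
    size: int               # object size in bytes
    ip: str                 # client IP address
    proj: str               # client's User-Agent header
    latency_seconds: float


def derive_counts(samples):
    """Return (request counts, byte totals), each keyed by "hit" / "miss"."""
    counts = Counter()
    weighted = Counter()
    for s in samples:
        counts[s.type] += 1
        weighted[s.type] += s.size
    return counts, weighted


def hit_ratio(by_type):
    """Fraction of hits among hits plus misses; 0.0 when there is no data."""
    total = by_type["hit"] + by_type["miss"]
    return by_type["hit"] / total if total else 0.0


# Made-up sample data, for illustration only.
samples = [
    RequestSample("hit", "/ns/a", 100, "192.0.2.1", "proj-x", 0.01),
    RequestSample("hit", "/ns/b", 100, "192.0.2.2", "proj-y", 0.02),
    RequestSample("miss", "/ns/c", 800, "192.0.2.1", "proj-x", 1.50),
]
counts, weighted = derive_counts(samples)
print(hit_ratio(counts))    # ratio by number of requests
print(hit_ratio(weighted))  # ratio by number of bytes
```

In this made-up data two of three requests are hits, but the single miss carries most of the bytes, which is the case the size-weighted variant is meant to expose.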

Cache Space Management Metrics

Objects come and go in a cache. We want to track the entire lifecycle of each object in a cache server, from the moment it is fetched or prefetched from the origin. The proposed metric is xrootd_cache_object.
Labels:
  • path
  • ns
  • status: ["cached", "evicted"]
  • size
  • creation_time
  • access_count
  • eviction_reason - null if still cached, reason if evicted
Derived metrics:
  • age of files (current_time - creation_time)
  • eviction counts (count by eviction_reason)
  • space usage (sum of size)
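
As a rough illustration of the three derived metrics, the Python sketch below computes them from a list of object records. Everything here is an assumption made for the example: the `CacheObject` class, the helper names, the `"space_pressure"` eviction reason, and the choice to compute age and space usage over objects whose status is "cached" only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheObject:
    """One xrootd_cache_object record (hypothetical shape mirroring the labels above)."""
    path: str
    ns: str
    status: str                     # "cached" or "evicted"
    size: int                       # bytes
    creation_time: float            # Unix timestamp
    access_count: int
    eviction_reason: Optional[str]  # None while still cached


def object_ages(objects, current_time):
    """Age in seconds (current_time - creation_time) of each object still cached."""
    return {o.path: current_time - o.creation_time
            for o in objects if o.status == "cached"}


def eviction_counts(objects):
    """Number of evicted objects per eviction_reason."""
    return Counter(o.eviction_reason for o in objects if o.status == "evicted")


def space_usage(objects):
    """Total size in bytes of objects still cached (assumption: evicted objects use no space)."""
    return sum(o.size for o in objects if o.status == "cached")


# Made-up records, for illustration only.
objects = [
    CacheObject("/ns/a", "/ns", "cached", 100, 1000.0, 3, None),
    CacheObject("/ns/b", "/ns", "cached", 250, 1500.0, 1, None),
    CacheObject("/ns/c", "/ns", "evicted", 400, 500.0, 0, "space_pressure"),
]
ages = object_ages(objects, current_time=2000.0)
evictions = eviction_counts(objects)
used = space_usage(objects)
```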

Cache Health Metrics

When something goes wrong in a cache, we want to get an alert. xrootd_cache_corruption_count tracks cache data integrity issues. I'm not very familiar with how Pelican wraps XRootD, so the labels below are only suggestions.
Labels:
  • path
  • ns
  • type: ["checksum", "incomplete", "storage"]
      • checksum: cached file's checksum doesn't match the original
      • incomplete: interrupted transfers from origin
      • storage: file system error, permission problem...
  • action: ["removed", "refetched", "none"]
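
To show how these labels could feed an alert, here is a small Python sketch that counts corruption events per namespace and type and flags namespaces above a threshold. It is a sketch under stated assumptions: the `CorruptionEvent` class, both helper functions, and the threshold value are hypothetical, and a real deployment would more likely express the alert as a Prometheus alerting rule.

```python
from collections import Counter
from dataclasses import dataclass

VALID_TYPES = {"checksum", "incomplete", "storage"}
VALID_ACTIONS = {"removed", "refetched", "none"}


@dataclass(frozen=True)
class CorruptionEvent:
    """One increment of xrootd_cache_corruption_count (hypothetical shape)."""
    path: str
    ns: str
    type: str
    action: str

    def __post_init__(self):
        # Reject label values outside the suggested sets.
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")
        if self.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")


def corruption_counts(events):
    """Count events per (ns, type) pair."""
    return Counter((e.ns, e.type) for e in events)


def namespaces_to_alert(events, threshold):
    """Namespaces with at least `threshold` events; the threshold is an assumed tuning knob."""
    per_ns = Counter(e.ns for e in events)
    return sorted(ns for ns, n in per_ns.items() if n >= threshold)


# Made-up events, for illustration only.
events = [
    CorruptionEvent("/ns1/a", "/ns1", "checksum", "refetched"),
    CorruptionEvent("/ns1/b", "/ns1", "incomplete", "removed"),
    CorruptionEvent("/ns2/c", "/ns2", "storage", "none"),
]
by_ns_type = corruption_counts(events)
alerts = namespaces_to_alert(events, threshold=2)
```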

h2zh added the enhancement (New feature or request) and cache (Issue relating to the cache component) labels on Dec 23, 2024
@CannonLock
Contributor

CannonLock commented Dec 23, 2024

This might be a bit harder but I would be curious to see the

  • # bytes per project
  • IP address of requesting Client

@h2zh
Collaborator Author

h2zh commented Dec 23, 2024

This might be a bit harder but I would be curious to see the

  • # bytes per project
  • IP address of requesting Client

Thanks! Just updated.
