Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for persistent buffering in receiver/gateway #1342

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,12 @@ extensions:
endpoint: {{ include "splunk-otel-collector.o11yApiUrl" . }}
{{- end }}

{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
file_storage/persistent_queue_gateway:
directory: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway
timeout: 0
{{- end }}


zpages:

Expand Down Expand Up @@ -142,6 +148,9 @@ service:
{{- if (eq (include "splunk-otel-collector.splunkO11yEnabled" .) "true") }}
- http_forwarder
{{- end }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- file_storage/persistent_queue_gateway
{{- end }}

# The default pipelines should not need to be changed. You can add any custom pipeline instead.
# In order to disable a default pipeline just set it to `null` in gateway.config overrides.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@ The values can be overridden in .Values.clusterReceiver.config
extensions:
health_check:


{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
# k8s_observer w/ pod and node detection for eks/fargate deployment
k8s_observer:
Expand All @@ -16,6 +15,12 @@ extensions:
observe_nodes: true
{{- end }}

{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
file_storage/persistent_queue_receiver:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
file_storage/persistent_queue_receiver:
file_storage/persistent_queue_cluster_receiver:

directory: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver
timeout: 0
{{- end }}

receivers:
# Prometheus receiver scraping metrics from the pod itself
{{- include "splunk-otel-collector.prometheusInternalMetrics" "k8s-cluster-receiver" | nindent 2}}
Expand Down Expand Up @@ -205,11 +210,14 @@ service:
telemetry:
metrics:
address: 0.0.0.0:8889
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
extensions: [health_check, k8s_observer]
{{- else }}
extensions: [health_check]
{{- end }}
extensions:
- health_check
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
- k8s_observer
{{- end }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- file_storage/persistent_queue_receiver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
- file_storage/persistent_queue_receiver
- file_storage/persistent_queue_cluster_receiver

{{- end }}
pipelines:
{{- if or (eq (include "splunk-otel-collector.o11yMetricsEnabled" $) "true") (eq (include "splunk-otel-collector.platformMetricsEnabled" $) "true") }}
# k8s metrics pipeline
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,31 @@ spec:
mountPath: /splunk-messages
- mountPath: /conf
name: collector-configmap
- name: patch-log-dirs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We might be outgrowing "log" functionality here for patch-log-dirs. If it works for now I'm fine with it. It might be time to refactor the names and images a little to make them more abstract to better address these new use cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, something like patch-dirs would work for all configurations I guess?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd go with patch-dirs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does persistent buffering require the .Values.logsCollection.checkpointPath directory for checkpoints? I don't think it does. I recommend removing the code that sets up the logsCollection checkpoint directory for the cluster receiver and gateway to avoid potential race conditions. Since the agent is deployed to all nodes and the cluster receiver and gateway are only deployed to some, they may attempt to set permissions on the same directory simultaneously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To move forward with this

  • I would just remove lines 111-113 for now so the agent and cluster receiver are not sharing on disk resources.
  • Make sure to only include any of this new cluster receiver code if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled=true

image: {{ template "splunk-otel-collector.image.initPatchLogDirs" . }}
imagePullPolicy: {{ .Values.image.initPatchLogDirs.pullPolicy }}
command: ['sh', '-c', '
mkdir -p {{ .Values.logsCollection.checkpointPath }};
chown -Rv {{ $clusterReceiver.securityContext.runAsUser | default 999 }}:{{ $clusterReceiver.securityContext.runAsGroup | default 999 }} {{ .Values.logsCollection.checkpointPath }};
chmod -v g+rwxs {{ .Values.logsCollection.checkpointPath }};

{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
mkdir -p {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver;
chown -Rv {{ $clusterReceiver.securityContext.runAsUser | default 999 }}:{{ $clusterReceiver.securityContext.runAsGroup | default 999 }} {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver;
chmod -v g+rwxs {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver;
setfacl -n -Rm d:m::rx,m::rx,d:g:{{ $clusterReceiver.securityContext.runAsGroup | default 999 }}:rx,g:{{ $clusterReceiver.securityContext.runAsGroup | default 999 }}:rx {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver;
{{- end }}']
securityContext:
runAsUser: 0
Copy link
Contributor

@jvoravong jvoravong Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We probably need to document this runAsUser 0 and above used permissions in https://github.com/signalfx/splunk-otel-collector-chart/blob/main/docs/advanced-configuration.md#data-persistence as well as customer documentation. This would be done for the agent, cluster receiver, and gateway. A number of customers request information like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding documentation primarily in the Advanced Configuration: Data Persistence section. Then, include a brief note with a hyperlink in the Running the Container in Non-Root User Mode section that points to the data persistence section.

resources:
{{- toYaml $clusterReceiver.resources | nindent 12 }}
volumeMounts:
- name: checkpoint
mountPath: {{ .Values.logsCollection.checkpointPath }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
mountPath: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver
{{- end }}
{{- end }}
containers:
- name: otel-collector
Expand Down Expand Up @@ -189,6 +214,10 @@ spec:
mountPath: /otel/etc
readOnly: true
{{- end }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
mountPath: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver
{{- end }}
- mountPath: {{ .Values.isWindows | ternary "C:\\conf" "/conf" }}
name: collector-configmap
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
Expand Down Expand Up @@ -217,6 +246,12 @@ spec:
secret:
secretName: {{ template "splunk-otel-collector.secret" . }}
{{- end }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
hostPath:
path: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/clusterReceiver
type: DirectoryOrCreate
{{- end }}
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
- name: init-eks-fargate-cluster-receiver-script
configMap:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,24 @@ spec:
securityContext:
{{- include "splunk-otel-collector.securityContext" (dict "isWindows" .Values.isWindows "securityContext" $gateway.securityContext) | nindent 8 }}
{{- end }}
initContainers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ill only want to include the initContainers if persistent queue is enabled

      {{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
      initContainers:
...
      {{- end }}

- name: patch-log-dirs
image: {{ template "splunk-otel-collector.image.initPatchLogDirs" . }}
imagePullPolicy: {{ .Values.image.initPatchLogDirs.pullPolicy }}
command: ['sh', '-c', '
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
mkdir -p {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway;
chown -Rv {{ $gateway.securityContext.runAsUser | default 999 }}:{{ $gateway.securityContext.runAsGroup | default 999 }} {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway;
chmod -v g+rwxs {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway;
setfacl -n -Rm d:m::rx,m::rx,d:g:{{ $gateway.securityContext.runAsGroup | default 999 }}:rx,g:{{ $gateway.securityContext.runAsGroup | default 999 }}:rx {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway;
{{- end }}']
securityContext:
runAsUser: 0
jvoravong marked this conversation as resolved.
Show resolved Hide resolved
volumeMounts:
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
mountPath: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway
{{- end }}
containers:
- name: otel-collector
command:
Expand Down Expand Up @@ -151,6 +169,10 @@ spec:
{{- end }}
- mountPath: {{ .Values.isWindows | ternary "C:\\conf" "/conf" }}
name: collector-configmap
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
mountPath: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway
{{- end }}
{{- if $gateway.extraVolumeMounts }}
{{- toYaml $gateway.extraVolumeMounts | nindent 8 }}
{{- end }}
Expand All @@ -167,6 +189,12 @@ spec:
secret:
secretName: {{ template "splunk-otel-collector.secret" . }}
{{- end }}
{{- if .Values.splunkPlatform.sendingQueue.persistentQueue.enabled }}
- name: persistent-queue
hostPath:
path: {{ .Values.splunkPlatform.sendingQueue.persistentQueue.storagePath }}/gateway
type: DirectoryOrCreate
{{- end }}
{{- if $gateway.extraVolumes }}
{{- toYaml $gateway.extraVolumes | nindent 6 }}
{{- end }}
Expand Down
Loading