-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Templating for Configmaps, Default Values file #647
base: main
Are you sure you want to change the base?
Conversation
secretKeyRef: | ||
name: {{ .Values.redis.authSecret }} | ||
key: auth_token | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need refactor/replace above, this should come from secret. Also need to verify this is actually used. Vaguely remember seeing that it was unnecessary (because we wind up putting the redis auth token in the fully specified URL in AWS secrets manager instead)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes let's apply this change
@@ -20,6 +20,9 @@ data: | |||
[profile {{ $profileName }}] | |||
role_arn = {{ index $annotations "eks.amazonaws.com/role-arn" }} | |||
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token | |||
[profile {{ $.Values.serviceAccount.sqsProfileName }}] | |||
role_arn = {{ index $annotations "eks.amazonaws.com/role-arn" }} | |||
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally consolidate this to one profile- need to verify that the SQS profile is assumable / check whether it has to match $profileName
labels: | ||
product: common | ||
team: infra | ||
annotations: | ||
"helm.sh/hook": pre-install | ||
"helm.sh/hook": pre-install,pre-upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this you can't update the vllm tags on upgrades
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, internally, we have a separate upgrade process for vllm that doesn't require install helm charts. If you're interested, we can socialize that a bit better
vllm_batch_v2: "latest" | ||
vllm: "{{ .Values.vllmTag }}" | ||
vllm_batch: "{{ .Values.vllmTag }}" | ||
vllm_batch_v2: "{{ .Values.vllmTag }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice to specify / could probably default to latest if not set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are different images so we would want vllmTag, vllmBatchTag, vllmBatchV2Tag; and yes let's default to latest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
{{- range .Values.virtualservice.gateways }} | ||
- {{ . | quote }} | ||
{{- end }} | ||
- {{ $.Values.global.networking.internalGateway }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im open to scrapping this- this was part of a DNS refactor we did to make specification of virtual services simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious to see what the rest of the dns refactor changes look like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i made the rest of the refactor- it helps when model engine is a subchart to prevent duplication of changing the core domain a bunch of places
env: {{ .Values.context | quote }} | ||
{{- with .Values.config.values.infra }} | ||
{{- range $key, $value := . }} | ||
{{ $key }}: {{ $value | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This broke for some reason, I think related to how helm / go templating was handling certain types. I also don't like it because it doesn't clarify what actually needs to be in the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo we shouldn't hardcode the values that go into launch_service_config and infra_service_config; doing so introduces a coupling between the helm chart and the model engine code, e.g. if the values that need to go in either of the two configs changes. Probably good to see what ended up breaking though and fixing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer a single source of truth for configuring the deployment of a new instance- manually changing the yaml files is also a coupling between config and code, and similarly error prone-
i think there's probably a bigger refactor necessary either way to reduce the complexity of interacting with those configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can pull your changes to service/infra configs to a separate file for now for documentation purpose as we figure out a proper way to do this
"infra.scale.com/team": "${team}", | ||
"infra.scale.com/contact": "{{ .Values.contactEmail }}", | ||
"infra.scale.com/customer": "AllCustomers", | ||
"infra.scale.com/financialOwner": "{{ .Values.contactEmail}}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to template "infra.scale.com" to something or allow these as additional tags set in the values.yaml
# {{- if empty $node_selector }} | ||
# nodeSelector: | ||
# {{- end }} | ||
# k8s.amazonaws.com/accelerator: ${GPU_TYPE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will re-add this, just serves as a note that need to add documentation that this tag must be set on nodegroups
tolerations: | ||
- key: "nvidia.com/gpu" | ||
operator: "Exists" | ||
effect: "NoSchedule" | ||
- key: "gpu_a100_multi" | ||
operator: "Exists" | ||
effect: "NoSchedule" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hack, need to delete
@@ -522,6 +525,7 @@ data: | |||
loadBalancer: | |||
simple: LEAST_REQUEST | |||
{{- end }} | |||
{{- if and (.Capabilities.APIVersions.Has "autoscaling.k8s.io/v1") (.Values.autoscaling.vertical.enabled) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is important if you don't have autoscaling
charts/model-engine/values.yaml
Outdated
node-lifecycle: normal | ||
nodeSelector: | ||
node-lifecycle: normal | ||
# balloonNodeSelector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note- need to build out a good default values.yaml with comments specifying the purpose of each configuration option.
@@ -1,25 +0,0 @@ | |||
{{- if and (.Values.serviceTemplate) (.Values.serviceTemplate.createServiceAccount) (.Values.serviceTemplate.serviceAccountAnnotations) (.Values.serviceTemplate.serviceAccountName) (.Values.config.values.launch.endpoint_namespace)}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we're removing this? seems like we could just add some more conditions to the if to achieve the same effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i'm fixing this- didn't realize this "createServiceAccount" was just for the inference account, which i was getting a naming conflict with so i removed for a quick fix
just updating with explicit naming, since we create another model-engine service account by default
@@ -1,24 +1,20 @@ | |||
{{- if .Values.virtualservice.enabled -}} | |||
{{- if .values.virtualService.enabled -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was Values => values intentional?
port: 80 | ||
|
||
# Creates istio virtual services for the Model Engine using the global domain nome and gateway specified below | ||
virtualService: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is a backwards incompatible change; can we leave this as virtualservice
enabled: false | ||
redis: | ||
auth: | ||
# If specified, will override the name of the deployed services |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comprehensive documentation here. It's very much appreciated.
Pull Request Summary
Add templating for configmaps so that they don't have to be built separately. Aiming to comment a default
values.yaml
file so we can clearly show what needs to be configured for working deployment.Test Plan and Usage Guide
Need to test/deploy from scratch prior to merge.