-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Templating for Configmaps, Default Values file #647
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,9 @@ data: | |
[profile {{ $profileName }}] | ||
role_arn = {{ index $annotations "eks.amazonaws.com/role-arn" }} | ||
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token | ||
[profile {{ $.Values.serviceAccount.sqsProfileName }}] | ||
role_arn = {{ index $annotations "eks.amazonaws.com/role-arn" }} | ||
web_identity_token_file = /var/run/secrets/eks.amazonaws.com/serviceaccount/token | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally consolidate this to one profile- need to verify that the SQS profile is assumable / check whether it has to match $profileName |
||
--- | ||
{{- end }} | ||
{{- end }} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,17 +2,18 @@ apiVersion: v1 | |
kind: ConfigMap | ||
metadata: | ||
name: {{ include "modelEngine.fullname" . }}-inference-framework-latest-config | ||
namespace: {{ .Release.Namespace }} | ||
labels: | ||
product: common | ||
team: infra | ||
annotations: | ||
"helm.sh/hook": pre-install | ||
"helm.sh/hook": pre-install,pre-upgrade | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. without this you can't update the vllm tags on upgrades There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, internally, we have a separate upgrade process for vllm that doesn't require install helm charts. If you're interested, we can socialize that a bit better There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh rip i'll wrap this in a config then, it's easiest for use to change with upgrades but curious about your process |
||
"helm.sh/hook-weight": "-2" | ||
data: | ||
deepspeed: "latest" | ||
text_generation_inference: "latest" | ||
vllm: "latest" | ||
vllm_batch: "latest" | ||
vllm_batch_v2: "latest" | ||
vllm: "{{ .Values.vLLM.primaryTag }}" | ||
vllm_batch: "{{ .Values.vLLM.batchTag }}" | ||
vllm_batch_v2: "{{ .Values.vLLM.batchV2Tag }}" | ||
lightllm: "latest" | ||
tensorrt_llm: "latest" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,24 +1,20 @@ | ||
{{- if .Values.virtualservice.enabled -}} | ||
{{- if .values.virtualService.enabled -}} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. was Values => values intentional? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no thanks for the catch |
||
{{- $fullName := include "modelEngine.fullname" . -}} | ||
apiVersion: networking.istio.io/v1alpha3 | ||
kind: VirtualService | ||
metadata: | ||
name: {{ $fullName }} | ||
labels: | ||
{{- include "modelEngine.labels" . | nindent 4 }} | ||
{{- with .Values.virtualservice.annotations }} | ||
{{- with .values.virtualService.annotations }} | ||
annotations: | ||
{{- toYaml . | nindent 4 }} | ||
{{- end }} | ||
spec: | ||
hosts: | ||
{{- range .Values.virtualservice.hostDomains }} | ||
- "{{ $fullName }}.{{ . }}" | ||
{{- end }} | ||
- model-engine.{{ $.Values.global.networking.internalDomain }} | ||
gateways: | ||
{{- range .Values.virtualservice.gateways }} | ||
- {{ . | quote }} | ||
{{- end }} | ||
- {{ $.Values.global.networking.internalGateway }} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. im open to scrapping this- this was part of a DNS refactor we did to make specification of virtual services simpler There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. curious to see what the rest of the dns refactor changes look like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i made the rest of the refactor- it helps when model engine is a subchart to prevent duplication of changing the core domain a bunch of places |
||
http: | ||
- route: | ||
- destination: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
{{- if and (.Values.serviceTemplate) (.Values.serviceTemplate.createServiceAccount) (.Values.serviceTemplate.serviceAccountAnnotations) (.Values.serviceTemplate.serviceAccountName) (.Values.config.values.launch.endpoint_namespace)}} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is there a reason we're removing this? seems like we could just add some more conditions to the if to achieve the same effect There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yea i'm fixing this- didn't realize this "createServiceAccount" was just for the inference account, which i was getting a naming conflict with so i removed for a quick fix just updating with explicit naming, since we create another model-engine service account by default |
||
{{- if and (.Values.serviceTemplate) (.Values.serviceTemplate.createInferenceServiceAccount) (.Values.serviceTemplate.serviceAccountAnnotations) (.Values.serviceTemplate.serviceAccountName) (.Values.config.values.launch.endpoint_namespace)}} | ||
{{- $annotations := .Values.serviceTemplate.serviceAccountAnnotations }} | ||
{{- $inferenceServiceAccountName := .Values.serviceTemplate.serviceAccountName }} | ||
{{- $inferenceServiceAccountNamespace := .Values.config.values.launch.endpoint_namespace }} | ||
|
@@ -22,4 +22,4 @@ imagePullSecrets: | |
- name: egp-ecr-regcred | ||
{{- end }} | ||
--- | ||
{{- end }} | ||
{{- end }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,6 +3,7 @@ apiVersion: v1 | |
kind: ConfigMap | ||
metadata: | ||
name: {{ include "modelEngine.fullname" . }}-service-config | ||
namespace: {{ .Release.Namespace }} | ||
labels: | ||
{{- include "modelEngine.labels" . | nindent 4 }} | ||
annotations: | ||
|
@@ -11,46 +12,110 @@ metadata: | |
data: | ||
launch_service_config: |- | ||
dd_trace_enabled: {{ .Values.dd_trace_enabled | default false | quote }} | ||
|
||
# Config to know where model-engine is running | ||
gateway_namespace: {{ .Release.Namespace | quote }} | ||
{{- with .Values.config.values.launch }} | ||
{{- range $key, $value := . }} | ||
{{ $key }}: {{ $value | quote }} | ||
{{- end }} | ||
{{- end }} | ||
infra_service_config: |- | ||
env: {{ .Values.context | quote }} | ||
{{- with .Values.config.values.infra }} | ||
{{- range $key, $value := . }} | ||
{{ $key }}: {{ $value | quote }} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This broke for some reason, I think related to how helm / go templating was handling certain types. I also don't like it because it doesn't clarify what actually needs to be in the config There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. imo we shouldn't hardcode the values that go into launch_service_config and infra_service_config; doing so introduces a coupling between the helm chart and the model engine code, e.g. if the values that need to go in either of the two configs changes. Probably good to see what ended up breaking though and fixing that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i would prefer a single source of truth for configuring the deployment of a new instance- manually changing the yaml files is also a coupling between config and code, and similarly error prone- i think there's probably a bigger refactor necessary either way to reduce the complexity of interacting with those configs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can pull your changes to service/infra configs to a separate file for now for documentation purpose as we figure out a proper way to do this |
||
{{- end }} | ||
{{- end }} | ||
|
||
--- | ||
# Config for scale-hosted Hosted Model Inference in the prod cluster, plus a bunch of other config-ish notes | ||
# NOTE: If you add/change values inside this file that need to apply to all clusters, please make changes in | ||
# all service_config_{env}.yaml files as well. | ||
|
||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: {{ include "modelEngine.fullname" . }}-service-config | ||
namespace: {{ .Values.config.values.launch.endpoint_namespace }} | ||
labels: | ||
{{- include "modelEngine.labels" . | nindent 4 }} | ||
annotations: | ||
"helm.sh/hook": pre-install,pre-upgrade | ||
"helm.sh/hook-weight": "-2" | ||
data: | ||
launch_service_config: |- | ||
dd_trace_enabled: {{ .Values.dd_trace_enabled | default false | quote }} | ||
gateway_namespace: {{ .Release.Namespace | quote }} | ||
{{- with .Values.config.values.launch }} | ||
{{- range $key, $value := . }} | ||
{{ $key }}: {{ $value | quote }} | ||
{{- end }} | ||
{{- end }} | ||
# Config for scale-hosted Hosted Model Inference in the prod cluster, see `service_config` for more details | ||
model_primitive_host: model-server.{{ .Release.Namespace }}.svc.cluster.local | ||
|
||
# # Endpoint config | ||
# K8s namespace the endpoints will be created in | ||
endpoint_namespace: {{ .Release.Namespace | quote }} | ||
|
||
# Asynchronous endpoints | ||
sqs_profile: {{ $.Values.serviceAccount.sqsProfileName }} | ||
sqs_queue_policy_template: |- | ||
{ | ||
"Version": "2012-10-17", | ||
"Id": "__default_policy_ID", | ||
"Statement": [ | ||
{ | ||
"Sid": "__owner_statement", | ||
"Effect": "Allow", | ||
"Principal": { | ||
"AWS": "arn:{{ .Values.aws.partition }}:iam::{{ .Values.aws.accountId }}:root" | ||
}, | ||
"Action": "sqs:*", | ||
"Resource": "arn:{{ .Values.aws.partition }}:sqs:{{ .Values.aws.region }}:{{ .Values.aws.accountId }}:${queue_name}" | ||
}, | ||
{ | ||
"Effect": "Allow", | ||
"Principal": { | ||
"AWS": "arn:{{ .Values.aws.partition }}:iam::{{ .Values.aws.accountId }}:role/{{ $.Values.serviceAccount.sqsProfileName }}" | ||
}, | ||
"Action": "sqs:*", | ||
"Resource": "arn:{{ .Values.aws.partition }}:sqs:{{ .Values.aws.region }}:{{ .Values.aws.accountId }}:${queue_name}" | ||
}, | ||
{ | ||
"Effect": "Allow", | ||
"Principal": { | ||
"AWS": "arn:{{ .Values.aws.partition }}:iam::{{ .Values.aws.accountId }}:role/ml_hosted_model_inference" | ||
}, | ||
"Action": "sqs:*", | ||
"Resource": "arn:{{ .Values.aws.partition }}:sqs:{{ .Values.aws.region }}:{{ .Values.aws.accountId }}:${queue_name}" | ||
} | ||
] | ||
} | ||
|
||
sqs_queue_tag_template: |- | ||
{ | ||
"{{ .Values.tagging.organization }}/product": "{{ .Values.tagging.productTag }}", | ||
"{{ .Values.tagging.organization }}/team": "${team}", | ||
"{{ .Values.tagging.organization }}/contact": "{{ .Values.tagging.contactEmail }}", | ||
"{{ .Values.tagging.organization }}/customer": "AllCustomers", | ||
"{{ .Values.tagging.organization }}/financialOwner": "{{ .Values.tagging.contactEmail }}", | ||
"Launch-Endpoint-Id": "${endpoint_id}", | ||
"Launch-Endpoint-Name": "${endpoint_name}", | ||
"Launch-Endpoint-Created-By": "${endpoint_created_by}" | ||
} | ||
|
||
# Billing | ||
billing_queue_arn: arn:aws:events:{{ .Values.aws.region }}:{{ .Values.aws.accountId }}:event-bus/money | ||
|
||
# The below redis URL would not work if we needed auth, which we do, so we have to pull cache_url from the cache_redis_aws_secret_name | ||
cache_redis_aws_secret_name: "{{ .Values.secrets.redisAwsSecretName }}" | ||
|
||
cloud_file_llm_fine_tune_repository: "s3://{{ .Values.aws.s3Bucket }}/hosted-model-inference/llm-ft-job-repository/prod" | ||
|
||
dd_trace_enabled: true | ||
istio_enabled: true | ||
sensitive_log_mode: true | ||
tgi_repository: "text-generation-inference" | ||
vllm_repository: "vllm" | ||
lightllm_repository: "lightllm" | ||
tensorrt_llm_repository: "tensorrt-llm" | ||
batch_inference_vllm_repository: "llm-engine/batch-infer-vllm" | ||
user_inference_base_repository: "launch/inference" | ||
user_inference_pytorch_repository: "hosted-model-inference/async-pytorch" | ||
user_inference_tensorflow_repository: "hosted-model-inference/async-tensorflow-cpu" | ||
docker_image_layer_cache_repository: "kaniko-cache" | ||
|
||
# S3 access | ||
hf_user_fine_tuned_weights_prefix: "s3://{{ .Values.aws.s3Bucket }}/hosted-model-inference/fine_tuned_weights" | ||
infra_service_config: |- | ||
env: {{ .Values.context | quote }} | ||
{{- with .Values.config.values.infra }} | ||
{{- range $key, $value := . }} | ||
{{ $key }}: {{ $value | quote }} | ||
{{- end }} | ||
{{- end }} | ||
cloud_provider: "aws" | ||
env: "prod" | ||
k8s_cluster_name: "{{ .Values.clusterName }}" | ||
dns_host_domain: "model-engine.{{ $.Values.global.networking.internalDomain }}" | ||
default_region: "{{ .Values.aws.region }}" | ||
ml_account_id: "{{ .Values.aws.accountId }}" | ||
docker_repo_prefix: "{{ .Values.aws.accountId }}.dkr.ecr.{{ .Values.aws.region }}.amazonaws.com" | ||
redis_host: "{{ .Values.redis.hostname }}" | ||
s3_bucket: "{{ .Values.aws.s3Bucket }}" | ||
profile_ml_worker: "ml-worker" | ||
profile_ml_inference_worker: "ml-worker" | ||
identity_service_url: "{{ .Values.identityServiceUrl }}" | ||
firehose_role_arn: "arn:{{ .Values.aws.partition }}:iam::{{ .Values.aws.accountId }}:role/firehose-stream-logging-role" | ||
firehose_stream_name: "{{ .Values.firehoseStreamName }}" | ||
db_engine_pool_size: 20 | ||
db_engine_max_overflow: 10 | ||
db_engine_echo: false | ||
db_engine_echo_pool: true | ||
db_engine_disconnect_strategy: "pessimistic" | ||
{{- end }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -486,7 +486,7 @@ data: | |
protocol: TCP | ||
name: http | ||
${NODE_PORT_DICT} | ||
{{- if .Values.virtualservice.enabled }} | ||
{{- if .values.virtualService.enabled }} | ||
virtual-service.yaml: |- | ||
apiVersion: networking.istio.io/v1alpha3 | ||
kind: VirtualService | ||
|
@@ -522,6 +522,7 @@ data: | |
loadBalancer: | ||
simple: LEAST_REQUEST | ||
{{- end }} | ||
{{- if and (.Capabilities.APIVersions.Has "autoscaling.k8s.io/v1") (.Values.autoscaling.vertical.enabled) }} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is important if you don't have autoscaling |
||
vertical-pod-autoscaler.yaml: |- | ||
apiVersion: "autoscaling.k8s.io/v1" | ||
kind: VerticalPodAutoscaler | ||
|
@@ -548,6 +549,7 @@ data: | |
cpu: ${CPUS} | ||
memory: ${MEMORY} | ||
controlledResources: ["cpu", "memory"] | ||
{{- end }} | ||
pod-disruption-budget.yaml: |- | ||
apiVersion: policy/v1 | ||
kind: PodDisruptionBudget | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need refactor/replace above, this should come from secret. Also need to verify this is actually used. Vaguely remember seeing that it was unnecessary (because we wind up putting the redis auth token in the fully specified URL in AWS secrets manager instead)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes let's apply this change