Move CI (prod) to run on K8s #36

xtreme-sameer-vohra · 2019-11-06T18:48:17Z

We would like to migrate the CI (prod) deployment to run on K8s

TODOs v1

TODOS v2

The text was updated successfully, but these errors were encountered:

deniseyu · 2019-11-07T15:57:11Z

Deployed a new Vault instance in the gke-hush-house-generic-<some stuff>-2k5g node and it is the only pod in the vault namespace. All prod secrets (at least, the ones we grabbed around 3pm yesterday) have been restored onto it!!

deniseyu · 2019-11-07T16:21:26Z

If anything goes wrong with this restore and we have to do it again, here's updated operating docs on backing up and restoring Vault from BOSH prod:

https://github.com/pivotal/concourse-ops/wiki/Operating-Vault#backing-up-vault-secrets

xtreme-sameer-vohra · 2019-11-08T19:38:05Z

Concourse has been deployed using https://github.com/concourse/hush-house/tree/master/deployments/with-creds/ci and is available at https://nci.concourse-ci.org/

xtreme-sameer-vohra · 2019-11-08T19:38:24Z

Plan is to move it over to ci.concourse-ci.org when we're ready for the switch over

cirocosta · 2019-11-08T20:30:51Z

Hey, I noticed that vault is not reserving & limiting resources (https://github.com/concourse/hush-house/blob/master/deployments/with-creds/vault/values.yaml#L3) - it'd be good to add that to the list 😁

something similar to how we do for web

    resources:
      limits:   { cpu: 500m, memory: 256Mi }
      requests: { cpu: 500m, memory: 256Mi  }

https://github.com/concourse/hush-house/blob/9d4c53fd7c3da2f5477be594492c38cda1b05ddf/deployments/with-creds/ci/values.yaml#L55-L57

cirocosta · 2019-11-08T20:32:26Z

Add the papertrail

currently, we use stackdriver logs for hush-house 🤔

it might we worth considering whether we have discounts for that (in which case, getting out of papertrail would then be considered a $ pro 😁 )

at some point, we'd need to get the main pipeline continuously redeploying this environment - it might be a thing for another issue (as there are details of like, which resource-type to use), but it'll definitely be something to think about at some point

cirocosta · 2019-11-08T20:37:31Z

Move metrics for CI over to K8s

that's (in theory) all automatically set up 😁 if you go to metrics-hush-house and change the values in the dropdown to point to the desired namespace, you'll see the metrics for those installations.

naturally, that's just "in theory" hahah, we never really exercised those dashboards with more than 1 deployment (hush-house)

deniseyu · 2019-11-08T20:39:52Z

I think the callback URL for login is misconfigured - I know it's not ready yet but I was chomping at the bit to try to get 5.7.1 kicked off 😂 and tried to do GitHub login, but got an error and was redirected to Hush House.

vito · 2019-11-08T20:43:05Z

I've updated the client ID and secret in LastPass (hush-house-values-ci). Once the chart is re-deployed with that the redirect should be unborked. (I made a new OAuth application for it.)

xtreme-sameer-vohra · 2019-11-08T21:31:09Z

#Update

The new github client is configured, so folks can login and it redirects properly
Going to look at the Concourse + Vault integration now

@cirocosta @kcmannem was mentioning that stackdriver is super slow and we'd prefer papertrail, something to dicuss

xtreme-sameer-vohra · 2019-11-08T22:53:51Z

Is there a good checklist we can use to ensure vault is configured securely

Of the top of my head, but not substantive by any means;

vault is not reachable from outside of the cluster
vault is not reachable by any workers in the ci and hush-house and any other deployments that may be added to the k8s cluster
vault is configured with tls

xtreme-sameer-vohra · 2019-11-08T23:38:25Z

Another thought, should we even use vault on K8s rather than K8s secrets ?

cirocosta · 2019-11-09T00:06:11Z

vault is not reachable from outside of the cluster
vault is not reachable by any workers in the ci [...]

when it comes to reachability, I'd also add:

vault is reachable only by web pods
vault (the pod as a whole) cannot reach any services (all eggress blocked)¹,
otherwise we're pretty much allowing ingress indirectly (as TCP conns are
bidirectional once established)

(to enable these ^ we'd need to enable the use of net policies in the cluster though)

however, should we really care? if we're already assuming that we face
intra-network threats and should protect ourselves against it, that's pretty
much already distrusting the network completly, to a point where we're protected
enough by using techniques such as mtls. Thus, should we care where in the
network we are?

vito · 2019-11-09T00:09:43Z

Another thought, should we even use vault on K8s rather than K8s secrets ?

I'd like to keep using Vault for dogfooding purposes mainly.

zoetian · 2019-11-11T21:50:04Z

Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error:

$ vault login -method=cert
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/cert/login
Code: 400. Errors:

* tls connection required

(and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out what the fields of the k8s secret required for the vault server's TLS configuration needed to be. If anyone can suggest a template for this secret, that would be very helpful. (@cirocosta?) We will pick this work back up tomorrow in the late morning.

In general, this work is pretty significantly slowed down by other interruptions.

pivotal-bin-ju · 2019-11-11T22:02:04Z

Would `prod.concourse-ci.org` make more sense than `nci.concourse-ci.org`? We already have `ci` in the domain name.

…

On Mon, Nov 11, 2019 at 4:50 PM Zoe Tian ***@***.***> wrote: Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error: $ vault login -method=cert Error authenticating: Error making API request. URL: PUT http://127.0.0.1:8200/v1/auth/cert/login Code: 400. Errors: * tls connection required (and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out what the fields of the k8s secret required for the vault server's TLS configuration <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.vaultproject.io_docs_platform_k8s_helm.html-23standalone-2Dserver-2Dwith-2Dtls&d=DwMFaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=x2xYZB6vYNrWd2g7NJBsBg&m=EpEjenpVSboLCYK4Pn61z2A7VTMV3vAq3ONo5iAFt80&s=hinz-p1BmvTnK0BHyC5C_hwoCUxQx3uhqCnxdyWiobs&e=> needed to be. If anyone can suggest a template for this secret, that would be very helpful. ***@***.*** <https://github.com/cirocosta>?) We will pick this work back up tomorrow in the late morning. In general, this work is pretty significantly slowed down by other interruptions. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#36?email_source=notifications&email_token=AEVDUM2M5BNOTZM7ABBEPHDQTHHQ3A5CNFSM4JJ2RYO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDYIAKQ#issuecomment-552632362>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEVDUM33EGAREXL2AJ5EVH3QTHHQ3ANCNFSM4JJ2RYOQ> .

-- Bin Ju 鞠国滨 +1(647)835 6177

jamieklassen · 2019-11-12T22:11:44Z

learnings from yesterday:

the test pipeline referred to a secret under /concourse/resources so we had to move it into a resources team
we didn't pinpoint the exact circumstances when this was required, but a couple times we got a deploy of the ci chart working by incrementing web.annotations.rollingUpdate.

learnings from today:

when deploying the vault chart, we had to set HELM_FLAGS=--recreate-pods in some cases, otherwise our changes to the chart did not get honored.
we had to make sure the self-signed cert we generated for use by the vault server had a DNS SAN that matched the host in CONCOURSE_VAULT_URL and an IP SAN of 127.0.0.1. Using the openssl CLI didn't seem to support specifying SANs very easily, and using bosh int resulted in certs with no SANs too. https://certificatetools.com was eventually what we used.
any vault CLI commands run while execed into the vault pod need the -ca-cert=/vault/userconfig/vault-server-tls/vault.ca flag in order to work.

next steps:

put the new secrets (vault server ca/cert/key, concourse client cert/key) into lpass
commit configuration changes
parameterize vault/templates/vault-tls-secret.yml so that the values can be populated from a .values.yaml file
update the wiki entry appropriately - describe how we generated vault's certs (since that cert will expire in 1 year), mention the -ca-cert flag in the existing instructions.

zoetian · 2019-11-13T22:10:26Z

today we committed our changes to the chart and documented the process of rotating the vault TLS cert.
tomorrow, we will work on updating https://github.com/concourse/ci/blob/master/pipelines/reconfigure.yml to include a new job, reconfigure-resource-pipelines, which will have, for each base resource type:

a task step to render the jsonnet template for that resource type's pipeline
a put step to set the pipeline for that resource type

cirocosta · 2019-11-20T15:36:06Z

Hey,

Aside from those tasks above, there's a set of small fixes that we need to apply
to make the pipeline move to further steps:

k8s-smoke currently fails due to chart path lookup
- perhaps Migrate to the new chart repo ci#200?
k8s-check-helm-params
- perhaps Migrate to the new chart repo ci#200?

Error: found some variables supported by the Concourse binary that are missing from the helm packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_REQUEST_TIMEOUT
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE
CONCOURSE_NEWRELIC_BATCH_DISABLE_COMPRESSION
CONCOURSE_NEWRELIC_BATCH_DURATION
CONCOURSE_NEWRELIC_BATCH_SIZE

bosh-check-props


Error: found some variables in the bosh packaging that might not be supported by the Concourse binary:

CONCOURSE_MAIN_TEAM_CONFIG
CONCOURSE_MAIN_TEAM_MICROSOFT_GROUP
CONCOURSE_MAIN_TEAM_MICROSOFT_USER
CONCOURSE_MICROSOFT_CLIENT_ID
CONCOURSE_MICROSOFT_CLIENT_SECRET
CONCOURSE_MICROSOFT_GROUPS
CONCOURSE_MICROSOFT_ONLY_SECURITY_GROUPS
CONCOURSE_MICROSOFT_TENANT
Error: found some variables supported by the Concourse binary that are missing from the bosh packaging:
CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE

thanks!

cirocosta · 2019-11-20T19:01:57Z

update: we got most of those reduced to a fewer number of flags, but concourse/ci#200 is still not merged yet, so we stopped going forward w/ the helm-related changes

vito · 2019-11-22T14:17:10Z

The Vault node got bounced last night and became sealed, so I manually went in and unsealed it using the credentials in LastPass. We should probably find a way to auto-unseal or something so this isn't a constant burden. 🤔

deniseyu · 2019-11-22T14:35:59Z

https://learn.hashicorp.com/vault/operations/autounseal-gcp-kms

jamieklassen · 2019-11-22T15:07:14Z

@vito @deniseyu I edited @xtreme-sameer-vohra's top comment with some useful links

kcmannem · 2019-11-22T16:21:27Z

I've added a task to migrate example pipelines used in https://concourse-ci.org/examples.html to the new cluster.

though I'm not sure where the configs for these lie.

kcmannem · 2019-11-25T15:32:37Z

Took a look at the diffs between ci-house and prod configs. Here's the diff:

MISSING FROM CI-HOUSE

For untrusted workers we have to setup deny networks

garden:
    deny_networks:
    - 10.0.0.0/16

We deny our host network on the pr workers as we don't want to expose communication to workloads coming externally. I don't know the host network pool in gke we use but we already deny a 169.x.x.x subnet so this might already be taken care of.

On the ATC:

default_task_cpu_limit: 1024
default_task_memory_limit: 5GB

x_frame_options: ""

Idk if we wanna continue using these limits.

On the worker:

volume_sweeper_max_in_flight: 3

I'm going to choose to skip this, by default this value is set to 3. It was set manually because btrfs use to be unstable when we hit the driver too hard.

cirocosta · 2019-11-25T15:38:48Z

Hey,

I don't know the host network pool in gke

module "vpc" {
  source = "./vpc"

  name   = "${var.name}"
  region = "${var.region}"

  vms-cidr      = "10.10.0.0/16"
  pods-cidr     = "10.11.0.0/16"
  services-cidr = "10.12.0.0/16"
}

https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/terraform/cluster/main.tf#L7-L9

we use but we already deny a 169.x.x.x subnet so this might already be taken care of.

the block to 169.254.169.254/32 is only to avoid queries to GCP's metadata server.

(https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/deployments/with-creds/ci-pr/values.yaml#L33-L34)

in concourse/hush-house#75 we tackled most of the issues w/ regards to reaching out to other workloads in the cluster, but I don't think a block on all 10.0.0.0/8 might hurt with the current configuration (DNS would go to a 10.x.x.x address - kube-dns - but that'd originate from from the concourse process through the dns forwarding that we perform).

(on concourse/hush-house#80 I describe how we could & should protect that a bit more)

cirocosta · 2019-11-25T15:40:32Z

Idk if we wanna continue using these limits.

as long as we're using COS (IIRC, we are for nci), we'll be able to set those values

	onGke(func() {
		containerLimitsWork(COS, TaskCPULimit, TaskMemoryLimit)
		containerLimitsFail(UBUNTU, TaskCPULimit, TaskMemoryLimit)
	})

(from https://github.com/concourse/concourse/blob/a8e001f8e655b442f34ebe8909267747f897469b/topgun/k8s/container_limits_test.go#L26-L29)

but yeah, I'd personally not set them - @vito might have opinions on it? I don't think I was around when we put that on ci

kcmannem · 2019-11-25T15:55:00Z

@cirocosta thanks!

kcmannem · 2019-11-25T15:56:52Z

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

Here's a link on how to set it up, if we still want to use papertrail:
https://help.papertrailapp.com/kb/configuration/configuring-centralized-logging-from-kubernetes/

cirocosta · 2019-11-25T18:14:04Z

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

@kcmannem , my biggest reason for going w/ stackdriver would be to leverage logs-based metrics (which we already do for hush-house, but not for ci yet - concourse/hush-house#78). I'm not sure if there'd be a way of doing container logs -> log aggregation service -> grafana graphing in a simple way for papertrail (if we went full datadog, that'd be possible using datadog's log parsing, etc).

I found that at least for hush-house having the logs graphed made me pretty much never have to go search for log messages - when I did, I knew exactly what to search for (and under which range - as the dashboard would already show)

jamieklassen mentioned this issue Nov 14, 2019

track base resource pipelines in reconfigure.yml concourse/ci#207

Merged

cirocosta mentioned this issue Nov 22, 2019

add ci-pr deployment concourse/hush-house#75

Merged

cirocosta mentioned this issue Nov 27, 2019

release pipelines use cluster-1, rather than topgun gke cluster concourse/ci#195

Closed

cirocosta mentioned this issue Feb 11, 2020

Investigate kubernetes-native continuous deployment tooling to use to automate our environment deployments concourse/concourse#5163

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move CI (prod) to run on K8s #36

Move CI (prod) to run on K8s #36

xtreme-sameer-vohra commented Nov 6, 2019 •

edited by cirocosta

Loading

deniseyu commented Nov 7, 2019 •

edited

Loading

deniseyu commented Nov 7, 2019

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

cirocosta commented Nov 8, 2019 •

edited

Loading

cirocosta commented Nov 8, 2019 •

edited

Loading

cirocosta commented Nov 8, 2019

deniseyu commented Nov 8, 2019

vito commented Nov 8, 2019 •

edited

Loading

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

cirocosta commented Nov 9, 2019 •

edited

Loading

vito commented Nov 9, 2019

zoetian commented Nov 11, 2019

pivotal-bin-ju commented Nov 11, 2019 via email

jamieklassen commented Nov 12, 2019

zoetian commented Nov 13, 2019

cirocosta commented Nov 20, 2019 •

edited

Loading

cirocosta commented Nov 20, 2019

vito commented Nov 22, 2019

deniseyu commented Nov 22, 2019

jamieklassen commented Nov 22, 2019

kcmannem commented Nov 22, 2019 •

edited

Loading

kcmannem commented Nov 25, 2019 •

edited

Loading

cirocosta commented Nov 25, 2019 •

edited

Loading

cirocosta commented Nov 25, 2019 •

edited

Loading

kcmannem commented Nov 25, 2019

kcmannem commented Nov 25, 2019

cirocosta commented Nov 25, 2019 •

edited

Loading

Move CI (prod) to run on K8s #36

Move CI (prod) to run on K8s #36

Comments

xtreme-sameer-vohra commented Nov 6, 2019 • edited by cirocosta Loading

deniseyu commented Nov 7, 2019 • edited Loading

deniseyu commented Nov 7, 2019

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

cirocosta commented Nov 8, 2019 • edited Loading

cirocosta commented Nov 8, 2019 • edited Loading

cirocosta commented Nov 8, 2019

deniseyu commented Nov 8, 2019

vito commented Nov 8, 2019 • edited Loading

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

xtreme-sameer-vohra commented Nov 8, 2019

cirocosta commented Nov 9, 2019 • edited Loading

vito commented Nov 9, 2019

zoetian commented Nov 11, 2019

pivotal-bin-ju commented Nov 11, 2019 via email

jamieklassen commented Nov 12, 2019

zoetian commented Nov 13, 2019

cirocosta commented Nov 20, 2019 • edited Loading

cirocosta commented Nov 20, 2019

vito commented Nov 22, 2019

deniseyu commented Nov 22, 2019

jamieklassen commented Nov 22, 2019

kcmannem commented Nov 22, 2019 • edited Loading

kcmannem commented Nov 25, 2019 • edited Loading

cirocosta commented Nov 25, 2019 • edited Loading

cirocosta commented Nov 25, 2019 • edited Loading

kcmannem commented Nov 25, 2019

kcmannem commented Nov 25, 2019

cirocosta commented Nov 25, 2019 • edited Loading

xtreme-sameer-vohra commented Nov 6, 2019 •

edited by cirocosta

Loading

deniseyu commented Nov 7, 2019 •

edited

Loading

cirocosta commented Nov 8, 2019 •

edited

Loading

cirocosta commented Nov 8, 2019 •

edited

Loading

vito commented Nov 8, 2019 •

edited

Loading

cirocosta commented Nov 9, 2019 •

edited

Loading

cirocosta commented Nov 20, 2019 •

edited

Loading

kcmannem commented Nov 22, 2019 •

edited

Loading

kcmannem commented Nov 25, 2019 •

edited

Loading

cirocosta commented Nov 25, 2019 •

edited

Loading

cirocosta commented Nov 25, 2019 •

edited

Loading

cirocosta commented Nov 25, 2019 •

edited

Loading