
Move CI (prod) to run on K8s #36

Open · 20 of 28 tasks
xtreme-sameer-vohra opened this issue Nov 6, 2019 · 30 comments

@xtreme-sameer-vohra
Contributor

xtreme-sameer-vohra commented Nov 6, 2019

We would like to migrate the CI (prod) deployment to run on K8s

TODOs v1

TODOs v2

@deniseyu
Contributor

deniseyu commented Nov 7, 2019

Deployed a new Vault instance on the gke-hush-house-generic-<some stuff>-2k5g node; it is the only pod in the vault namespace. All prod secrets (at least, the ones we grabbed around 3pm yesterday) have been restored onto it!!

@deniseyu
Contributor

deniseyu commented Nov 7, 2019

If anything goes wrong with this restore and we have to do it again, here are the updated operating docs on backing up and restoring Vault from BOSH prod:

https://github.com/pivotal/concourse-ops/wiki/Operating-Vault#backing-up-vault-secrets

@xtreme-sameer-vohra
Contributor Author

Concourse has been deployed using https://github.com/concourse/hush-house/tree/master/deployments/with-creds/ci and is available at https://nci.concourse-ci.org/

@xtreme-sameer-vohra
Contributor Author

Plan is to move it over to ci.concourse-ci.org when we're ready for the switch over

@cirocosta
Member

cirocosta commented Nov 8, 2019

Hey, I noticed that vault is not reserving & limiting resources (https://github.com/concourse/hush-house/blob/master/deployments/with-creds/vault/values.yaml#L3) - it'd be good to add that to the list 😁

something similar to how we do for web

    resources:
      limits:   { cpu: 500m, memory: 256Mi }
      requests: { cpu: 500m, memory: 256Mi  }

https://github.com/concourse/hush-house/blob/9d4c53fd7c3da2f5477be594492c38cda1b05ddf/deployments/with-creds/ci/values.yaml#L55-L57

@cirocosta
Member

cirocosta commented Nov 8, 2019

Add the papertrail

currently, we use stackdriver logs for hush-house 🤔

it might be worth considering whether we have discounts for that (in which case, getting out of papertrail would then be considered a $ pro 😁 )


at some point, we'll need the main pipeline to continuously redeploy this environment - it might be a thing for another issue (there are details to sort out, like which resource type to use), but it's definitely something to think about

@cirocosta
Member

Move metrics for CI over to K8s

that's (in theory) all automatically set up 😁 if you go to metrics-hush-house and change the values in the dropdown to point to the desired namespace, you'll see the metrics for those installations.

naturally, that's just "in theory" hahah, we never really exercised those dashboards with more than 1 deployment (hush-house)

@deniseyu
Contributor

deniseyu commented Nov 8, 2019

I think the callback URL for login is misconfigured - I know it's not ready yet but I was chomping at the bit to try to get 5.7.1 kicked off 😂 and tried to do GitHub login, but got an error and was redirected to Hush House.

@vito
Member

vito commented Nov 8, 2019

I've updated the client ID and secret in LastPass (hush-house-values-ci). Once the chart is re-deployed with that the redirect should be unborked. (I made a new OAuth application for it.)

@xtreme-sameer-vohra
Contributor Author

Update:

  • The new github client is configured, so folks can log in and it redirects properly
  • Going to look at the Concourse + Vault integration now

@cirocosta - @kcmannem was mentioning that stackdriver is super slow and that we'd prefer papertrail; something to discuss

@xtreme-sameer-vohra
Contributor Author

Is there a good checklist we can use to ensure vault is configured securely?

Off the top of my head, and by no means exhaustive:

  • vault is not reachable from outside of the cluster
  • vault is not reachable by any workers in ci, hush-house, or any other deployments that may be added to the k8s cluster
  • vault is configured with TLS

@xtreme-sameer-vohra
Contributor Author

Another thought: should we even use Vault on K8s, rather than K8s secrets?
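For context, a rough sketch of what switching to K8s secrets could look like in the chart values - the key names below are my best recollection of the Concourse chart and should be treated as assumptions, not verified config:

    # sketch only - chart key names are assumptions
    concourse:
      web:
        kubernetes:
          enabled: true           # use k8s secrets as the credential manager
          namespacePrefix: ci-    # e.g. team "main" would read secrets from namespace "ci-main"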

@cirocosta
Member

cirocosta commented Nov 9, 2019

vault is not reachable from outside of the cluster
vault is not reachable by any workers in the ci [...]

when it comes to reachability, I'd also add:

  • vault is reachable only by web pods
  • vault (the pod as a whole) cannot reach any services (all egress blocked),
    otherwise we're pretty much allowing ingress indirectly (as TCP conns are
    bidirectional once established)

(to enable these ^ we'd need to enable the use of net policies in the cluster though)
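A minimal sketch of what such a policy could look like, just to illustrate - the pod and namespace labels below are assumptions, not what the charts actually set:

    # sketch only - label selectors and namespace names are assumptions
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: vault-web-only
      namespace: vault
    spec:
      podSelector:
        matchLabels:
          app: vault              # assumed label on the vault pods
      policyTypes: [Ingress, Egress]
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              name: ci            # assumed label on the ci namespace
          podSelector:
            matchLabels:
              app: ci-web         # assumed label on the web pods
        ports:
        - port: 8200
      egress: []                  # empty list = block all egress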

however, should we really care? if we're already assuming that we face
intra-network threats and should protect ourselves against them, that's pretty
much already distrusting the network completely, to the point where we're protected
enough by using techniques such as mTLS. Thus, should we care where in the
network we are?

@vito
Member

vito commented Nov 9, 2019

Another thought: should we even use Vault on K8s, rather than K8s secrets?

I'd like to keep using Vault for dogfooding purposes mainly.

@zoetian
Member

zoetian commented Nov 11, 2019

Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error:

$ vault login -method=cert
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/cert/login
Code: 400. Errors:

* tls connection required

(and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled on the Vault server in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out which fields the k8s secret for the vault server's TLS configuration needs to contain. If anyone can suggest a template for this secret, that would be very helpful. (@cirocosta?) We will pick this work back up tomorrow in the late morning.
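For reference, a rough sketch of the shape of secret we think we need - the secret name and key names below are assumptions:

    # sketch only - secret name and key names are assumptions
    apiVersion: v1
    kind: Secret
    metadata:
      name: vault-server-tls
      namespace: vault
    type: Opaque
    data:
      vault.ca: <base64-encoded CA certificate>
      vault.crt: <base64-encoded server certificate>
      vault.key: <base64-encoded server private key>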

In general, this work is pretty significantly slowed down by other interruptions.

@pivotal-bin-ju
Contributor

pivotal-bin-ju commented Nov 11, 2019 via email

@jamieklassen
Member

learnings from yesterday:

  • the test pipeline referred to a secret under /concourse/resources so we had to move it into a resources team
  • we didn't pinpoint the exact circumstances when this was required, but a couple times we got a deploy of the ci chart working by incrementing web.annotations.rollingUpdate.

learnings from today:

  • when deploying the vault chart, we had to set HELM_FLAGS=--recreate-pods in some cases, otherwise our changes to the chart did not get honored.
  • we had to make sure the self-signed cert we generated for use by the vault server had a DNS SAN that matched the host in CONCOURSE_VAULT_URL and an IP SAN of 127.0.0.1. The openssl CLI didn't make it easy to specify SANs, and using bosh int resulted in certs with no SANs too. https://certificatetools.com was eventually what we used.
  • any vault CLI commands run while execed into the vault pod need the -ca-cert=/vault/userconfig/vault-server-tls/vault.ca flag in order to work.

next steps:

  • put the new secrets (vault server ca/cert/key, concourse client cert/key) into lpass
  • commit configuration changes
  • parameterize vault/templates/vault-tls-secret.yml so that the values can be populated from a values.yaml file (rough sketch after this list)
  • update the wiki entry appropriately - describe how we generated vault's certs (since that cert will expire in 1 year), mention the -ca-cert flag in the existing instructions.
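A rough sketch of what the parameterized template could look like - the .Values paths below are placeholders, not the actual keys we'll use:

    # vault/templates/vault-tls-secret.yml - sketch only; value paths are placeholders
    apiVersion: v1
    kind: Secret
    metadata:
      name: vault-server-tls
    type: Opaque
    data:
      vault.ca: {{ .Values.tls.ca | b64enc }}
      vault.crt: {{ .Values.tls.crt | b64enc }}
      vault.key: {{ .Values.tls.key | b64enc }}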

@zoetian
Member

zoetian commented Nov 13, 2019

Today we committed our changes to the chart and documented the process of rotating the vault TLS cert.
Tomorrow, we will work on updating https://github.com/concourse/ci/blob/master/pipelines/reconfigure.yml to include a new job, reconfigure-resource-pipelines, which will have, for each base resource type (rough sketch after the list):

  • a task step to render the jsonnet template for that resource type's pipeline
  • a put step to set the pipeline for that resource type
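A rough sketch of the job, using git as the example resource type - the resource names, task file, and paths here are placeholders we still need to work out:

    # sketch only - resource names, task files, and paths are placeholders
    jobs:
    - name: reconfigure-resource-pipelines
      plan:
      - get: ci
        trigger: true
      - task: render-git-pipeline           # one task per base resource type
        file: ci/tasks/render-jsonnet.yml   # placeholder task that runs jsonnet
        params: {RESOURCE_TYPE: git}
      - put: resource-pipelines             # placeholder concourse-pipeline resource
        params:
          pipelines:
          - name: resource-git
            team: main                      # placeholder team
            config_file: rendered-pipeline/pipeline.yml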

@cirocosta
Member

cirocosta commented Nov 20, 2019

Hey,

Aside from the tasks above, there's a set of small fixes we need to apply
to let the pipeline move on to further steps:

Error: found some variables supported by the Concourse binary that are missing from the helm packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_REQUEST_TIMEOUT
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE
CONCOURSE_NEWRELIC_BATCH_DISABLE_COMPRESSION
CONCOURSE_NEWRELIC_BATCH_DURATION
CONCOURSE_NEWRELIC_BATCH_SIZE

  • bosh-check-props

Error: found some variables in the bosh packaging that might not be supported by the Concourse binary:

CONCOURSE_MAIN_TEAM_CONFIG
CONCOURSE_MAIN_TEAM_MICROSOFT_GROUP
CONCOURSE_MAIN_TEAM_MICROSOFT_USER
CONCOURSE_MICROSOFT_CLIENT_ID
CONCOURSE_MICROSOFT_CLIENT_SECRET
CONCOURSE_MICROSOFT_GROUPS
CONCOURSE_MICROSOFT_ONLY_SECURITY_GROUPS
CONCOURSE_MICROSOFT_TENANT

Error: found some variables supported by the Concourse binary that are missing from the bosh packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE

thanks!

@cirocosta
Member

Update: we got most of those down to a smaller number of flags, but concourse/ci#200 is still not merged, so we stopped going forward w/ the helm-related changes

@vito
Member

vito commented Nov 22, 2019

The Vault node got bounced last night and became sealed, so I manually went in and unsealed it using the credentials in LastPass. We should probably find a way to auto-unseal or something so this isn't a constant burden. 🤔
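One option, since we're already on GKE, would be auto-unsealing with Cloud KMS. The seal stanza below is standard Vault gcpckms config; how it would get wired into our chart (the server.extraConfig key, the key ring and key names) is an assumption:

    # sketch only - chart key and KMS resource names are assumptions
    server:
      extraConfig: |
        seal "gcpckms" {
          project    = "<gcp project>"
          region     = "global"
          key_ring   = "vault-unseal"       # hypothetical key ring
          crypto_key = "vault-unseal-key"   # hypothetical crypto key
        }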

@deniseyu
Contributor

@jamieklassen
Member

@vito @deniseyu I edited @xtreme-sameer-vohra's top comment with some useful links

@kcmannem
Member

kcmannem commented Nov 22, 2019

I've added a task to migrate example pipelines used in https://concourse-ci.org/examples.html to the new cluster.

Though I'm not sure where the configs for these live.

@kcmannem
Member

kcmannem commented Nov 25, 2019

Took a look at the differences between the ci-house and prod configs. Here's the diff:

MISSING FROM CI-HOUSE

For untrusted workers we have to set up deny networks:

garden:
    deny_networks:
    - 10.0.0.0/16

We deny our host network on the PR workers because we don't want externally submitted workloads to be able to reach internal services. I don't know which host network pool we use in GKE, but we already deny a 169.x.x.x subnet, so this might already be taken care of.

On the ATC:

default_task_cpu_limit: 1024
default_task_memory_limit: 5GB

x_frame_options: ""

Idk if we wanna continue using these limits.

On the worker:

volume_sweeper_max_in_flight: 3

I'm going to skip this; the default value is already 3. It was set manually because btrfs used to be unstable when we hit the driver too hard.
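For reference, a rough sketch of how the missing settings above could be carried over on the chart side - the env var names follow the usual CONCOURSE_* mapping of the corresponding flags, but the chart keys (web.env / worker.env) and the exact values are assumptions:

    # sketch only - chart keys and values are assumptions
    worker:
      env:
      - name: CONCOURSE_GARDEN_DENY_NETWORK        # forwarded to gdn as a deny-network flag
        value: "10.0.0.0/16"
    web:
      env:
      - name: CONCOURSE_DEFAULT_TASK_CPU_LIMIT
        value: "1024"
      - name: CONCOURSE_DEFAULT_TASK_MEMORY_LIMIT
        value: "5GB"
      - name: CONCOURSE_X_FRAME_OPTIONS
        value: ""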

@cirocosta
Member

cirocosta commented Nov 25, 2019

Hey,

I don't know the host network pool in gke

module "vpc" {
  source = "./vpc"

  name   = "${var.name}"
  region = "${var.region}"

  vms-cidr      = "10.10.0.0/16"
  pods-cidr     = "10.11.0.0/16"
  services-cidr = "10.12.0.0/16"
}

https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/terraform/cluster/main.tf#L7-L9


we already deny a 169.x.x.x subnet, so this might already be taken care of.

the block to 169.254.169.254/32 is only to avoid queries to GCP's metadata server.

(https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/deployments/with-creds/ci-pr/values.yaml#L33-L34)

in concourse/hush-house#75 we tackled most of the issues w/ regards to reaching out to other workloads in the cluster, but I don't think a block on all of 10.0.0.0/8 would hurt with the current configuration (DNS would go to a 10.x.x.x address - kube-dns - but that'd originate from the concourse process through the DNS forwarding that we perform).

(on concourse/hush-house#80 I describe how we could & should protect that a bit more)

@cirocosta
Member

cirocosta commented Nov 25, 2019

Idk if we wanna continue using these limits.

as long as we're using COS (IIRC, we are for nci), we'll be able to set those values

	onGke(func() {
		containerLimitsWork(COS, TaskCPULimit, TaskMemoryLimit)
		containerLimitsFail(UBUNTU, TaskCPULimit, TaskMemoryLimit)
	})

(from https://github.com/concourse/concourse/blob/a8e001f8e655b442f34ebe8909267747f897469b/topgun/k8s/container_limits_test.go#L26-L29)

but yeah, I'd personally not set them - @vito might have opinions on it? I don't think I was around when we put that on ci

@kcmannem
Member

@cirocosta thanks!

@kcmannem
Member

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

Here's a link on how to set it up, if we still want to use papertrail:
https://help.papertrailapp.com/kb/configuration/configuring-centralized-logging-from-kubernetes/

@cirocosta
Member

cirocosta commented Nov 25, 2019

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

@kcmannem , my biggest reason for going w/ stackdriver would be to leverage logs-based metrics (which we already do for hush-house, but not for ci yet - concourse/hush-house#78). I'm not sure if there'd be a way of doing container logs -> log aggregation service -> grafana graphing in a simple way for papertrail (if we went full datadog, that'd be possible using datadog's log parsing, etc).

I found that, at least for hush-house, having the logs graphed meant I pretty much never had to go searching for log messages - and when I did, I knew exactly what to search for (and over which time range, as the dashboard would already show it)
