Move CI (prod) to run on K8s #36
Deployed a new Vault instance in the |
If anything goes wrong with this restore and we have to do it again, here's updated operating docs on backing up and restoring Vault from BOSH prod: https://github.com/pivotal/concourse-ops/wiki/Operating-Vault#backing-up-vault-secrets |
Concourse has been deployed using |
Plan is to move it over to ci.concourse-ci.org when we're ready for the switch over |
Hey, I noticed that something similar to how we do for resources:

```yaml
limits:   { cpu: 500m, memory: 256Mi }
requests: { cpu: 500m, memory: 256Mi }
```
|
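(For reference, a rough sketch of where requests/limits like the ones above could live in the chart values; the `web.resources` / `worker.resources` keys and the numbers are assumptions to illustrate the idea, not what's deployed:)

```yaml
# hypothetical values.yaml snippet - verify key names against the Concourse chart
web:
  resources:
    requests: { cpu: 500m, memory: 256Mi }
    limits:   { cpu: 500m, memory: 256Mi }
worker:
  resources:
    requests: { cpu: 500m, memory: 256Mi }
    limits:   { cpu: 500m, memory: 256Mi }
```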
currently, we use it might be worth considering whether we have discounts for that (in which case, getting out of papertrail would then be considered a $ pro 😁). at some point, we'd need to get the main pipeline continuously redeploying this environment - it might be a thing for another issue (as there are details of, like, which |
that's (in theory) all automatically set up 😁 if you go to naturally, that's just "in theory" hahah - we never really exercised those dashboards with more than 1 deployment (hush-house) |
I think the callback URL for login is misconfigured - I know it's not ready yet but I was chomping at the bit to try to get 5.7.1 kicked off 😂 and tried to do GitHub login, but got an error and was redirected to Hush House. |
I've updated the client ID and secret in LastPass ( |
# Update
@cirocosta @kcmannem was mentioning that stackdriver is super slow and we'd prefer papertrail, something to discuss |
Is there a good checklist we can use to ensure vault is configured securely? Off the top of my head, but not substantive by any means:
|
Another thought: should we even use vault on K8s rather than K8s secrets? |
when it comes to reachability, I'd also add:
(to enable these ^ we'd need to enable the use of net policies in the cluster though) however, should we really care? if we're already assuming that we face |
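(For illustration, a minimal sketch of the kind of NetworkPolicy that enabling net policies would let us write; the namespace, labels, and CIDRs here are assumptions, and a real policy would also need explicit allows for DNS and the web/TSA:)

```yaml
# hypothetical policy: limit egress from untrusted worker pods so they can't
# reach other in-cluster workloads or the GCP metadata server
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-untrusted-worker-egress
  namespace: ci                      # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: concourse-worker          # assumed label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8           # assumed in-cluster ranges (vms/pods/services)
              - 169.254.169.254/32   # GCP metadata server
```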
I'd like to keep using Vault for dogfooding purposes mainly. |
Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error:

```
$ vault login -method=cert
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/cert/login
Code: 400. Errors:

* tls connection required
```

(and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out what the fields of the k8s secret required for the vault server's TLS configuration (https://www.vaultproject.io/docs/platform/k8s/helm.html#standalone-server-with-tls) needed to be. If anyone can suggest a template for this secret, that would be very helpful (@cirocosta?). We will pick this work back up tomorrow in the late morning.

In general, this work is pretty significantly slowed down by other interruptions. |
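(Regarding the secret template question above, a minimal sketch of one plausible shape, assuming the chart just mounts the secret as files and the Vault TLS listener config references those filenames via tls_cert_file / tls_key_file / tls_client_ca_file; the names here are illustrative, not verified against the chart:)

```yaml
# hypothetical secret holding the self-signed server cert, key, and CA;
# the key names become filenames when mounted, so the Vault TLS listener
# config has to point at the same names
apiVersion: v1
kind: Secret
metadata:
  name: vault-server-tls
  namespace: vault
type: Opaque
stringData:
  vault.crt: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
  vault.key: |
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----
  vault.ca: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```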
Would `prod.concourse-ci.org` make more sense than `nci.concourse-ci.org`?
We already have `ci` in the domain name.
|
learnings from yesterday:
learnings from today:
next steps:
|
today we committed our changes to the chart and documented the process of rotating the vault TLS cert.
|
Hey, aside from those tasks above, there's a set of small fixes that we need to apply:
thanks! |
update: we got most of those reduced to fewer flags, but concourse/ci#200 is still not merged yet, so we stopped going forward w/ the helm-related changes |
The Vault node got bounced last night and became sealed, so I manually went in and unsealed it using the credentials in LastPass. We should probably find a way to auto-unseal or something so this isn't a constant burden. 🤔 |
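(If we go the auto-unseal route, a rough sketch of what a Google Cloud KMS seal stanza could look like in the Vault server config; the project/key names are placeholders and the exact chart key for injecting this config should be double-checked:)

```yaml
# hypothetical vault-helm values snippet; chart key names assumed
server:
  standalone:
    config: |
      storage "file" {
        path = "/vault/data"
      }
      # auto-unseal via Google Cloud KMS instead of manual unseal keys;
      # credentials come from the node service account or GOOGLE_APPLICATION_CREDENTIALS
      seal "gcpkms" {
        project    = "<gcp-project>"
        region     = "global"
        key_ring   = "<key-ring-name>"
        crypto_key = "<crypto-key-name>"
      }
```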
@vito @deniseyu I edited @xtreme-sameer-vohra's top comment with some useful links |
I've added a task to migrate example pipelines used in https://concourse-ci.org/examples.html to the new cluster, though I'm not sure where the configs for these lie. |
Took a look at the diffs between ci-house and prod configs. Here's the diff:

MISSING FROM CI-HOUSE

For untrusted workers we have to setup deny networks. We deny our host network on the PR workers as we don't want to expose communication to workloads coming externally. I don't know the host network pool in GKE we use, but we already deny a 169.x.x.x subnet so this might already be taken care of.

On the ATC: Idk if we wanna continue using these limits.

On the worker: I'm going to choose to skip this; by default this value is set to 3. It was set manually because btrfs used to be unstable when we hit the driver too hard. |
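(On the deny-networks point, a sketch of how that might carry over to the Helm-deployed untrusted workers, assuming the Garden deny-network setting is still passed through a CONCOURSE_GARDEN_* env var as it was on BOSH; the exact variable name and chart key need verifying:)

```yaml
# hypothetical values.yaml snippet for the PR/untrusted worker pool
worker:
  env:
    - name: CONCOURSE_GARDEN_DENY_NETWORK   # assumed env var name
      value: "169.254.169.254/32"           # GCP metadata server, as denied today
```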
Hey,
module "vpc" {
source = "./vpc"
name = "${var.name}"
region = "${var.region}"
vms-cidr = "10.10.0.0/16"
pods-cidr = "10.11.0.0/16"
services-cidr = "10.12.0.0/16"
}
the block to 169.254.169.254/32 is only to avoid queries to GCP's metadata server. in concourse/hush-house#75 we tackled most of the issues w/ regards to reaching out to other workloads in the cluster, but I don't think a block on all (on concourse/hush-house#80 I describe how we could & should protect that a bit more) |
as long as we're using COS (IIRC, we are for

```go
onGke(func() {
	containerLimitsWork(COS, TaskCPULimit, TaskMemoryLimit)
	containerLimitsFail(UBUNTU, TaskCPULimit, TaskMemoryLimit)
})
```

but yeah, I'd personally not set them - @vito might have opinions on it? I don't think I was around when we put that on |
@cirocosta thanks! |
idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search. Here's a link on how to set it up, if we still want to use papertrail: |
@kcmannem , my biggest reason for going w/ stackdriver would be to leverage logs-based metrics (which we already do for I found that at least for |
We would like to migrate the CI (prod) deployment to run on K8s
TODOs v1
- examples
- team pipelines to new CI #38
- ci.concourse-ci.org #39
- baggageclaim pipeline setup
- baggageclaim to reconfigure-pipelines ci#221

TODOs v2
- Point workers to the new CI env
- Move untrusted workers into K8s: move PR pipeline and checks over to new CI ci#211
- Move metrics for CI over to K8s
- @clarafu @vito check that there's a backup before upgrading to v6
- Scale down BOSH deployed prod/ci env to 1 or 0 ?? scale down old prod #46
- switch deployment to use the concourse-rc
- Setup CI to auto deploy CI: make a job that will deploy the same version of Concourse to prod and all its worker pools #45
- add well-supported storage backend for vault #44: pick a scalable storage backend for vault (such that backups, snapshots are much easier to manage), auto-unseal with Google Cloud Key Management Service (extraEnvVars in values.yaml, auto-unseal settings in vault config file, sample terraform config for gcpkms, terraform docs for gcpkms)
- switch to datadawg if grafana gets hard to manage