
Remove NATing from BOSH networks #35

Open
cirocosta opened this issue Nov 4, 2019 · 2 comments

Comments

@cirocosta
Member

Hey,

We've recently been receiving complaints that resources like docker-image and
registry-image have been failing with "429 Too Many Requests".

While we did introduce retries at the resource-type level for registry-image (see
concourse/registry-image-resource#69), those using docker-image (or trying to
reach Docker Hub directly) would still suffer from the limit being placed on our IP.

My hypothesis is that by removing the NAT machine that we have in the bosh
network (which ends up making every request from any of the 40+ machines we
have go out through that single IP), we can get rid of the problems we're
currently facing with regard to limits on the number of requests (aside from
removing one hop and a single point of failure).

Last week, I naively tried just removing the routes that we have set at the network level:

prod/iaas/bosh.tf (lines 135 to 153 in 92cf177):

resource "google_compute_route" "internal_nat" {
name = "internal-nat-route"
dest_range = "0.0.0.0/0"
network = "${google_compute_network.bosh.name}"
next_hop_instance = "${google_compute_instance.nat.name}"
next_hop_instance_zone = "${google_compute_instance.nat.zone}"
priority = 800
tags = ["internal"]
}
resource "google_compute_route" "vault_nat" {
name = "vault-nat-route"
dest_range = "0.0.0.0/0"
network = "${google_compute_network.bosh.name}"
next_hop_instance = "${google_compute_instance.nat.name}"
next_hop_instance_zone = "${google_compute_instance.nat.zone}"
priority = 800
tags = ["vault"]
}

but that didn't really work as expected, as the machines that we create in the
bosh network are not assigned ephemeral external IPs:

"The instance must have an external IP address. An external IP can be assigned
to an instance when it is created or after it has been created."

(from https://cloud.google.com/vpc/docs/vpc#internet_access_reqs)

- name: private
  type: dynamic
  subnets:
  - azs: [z1, z2]
    cloud_properties:
      network_name: bosh
      subnetwork_name: internal
      tags: [internal]

Given that we're on GCP, we can overcome that by using the ephemeral_external_ip
property - see https://bosh.io/docs/google-cpi/#networks.
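For illustration, a minimal sketch (my assumption of how it would look, going off the
google CPI network docs linked above) of our private network definition with that
property set:

- name: private
  type: dynamic
  subnets:
  - azs: [z1, z2]
    cloud_properties:
      network_name: bosh
      subnetwork_name: internal
      ephemeral_external_ip: true   # assumption: each VM gets its own ephemeral external IP
      tags: [internal]

With that in place, outbound traffic from each VM would leave via its own IP instead of
being funnelled through the single NAT instance.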

Should we do that? I think so - if we don't have a hard requirement of keeping those
machines completely unreachable from the outside (which we don't, really), I think we
should just drop the NAT.

Thanks!

@xtreme-sameer-vohra
Contributor

We could use firewall rules & tags to ensure only outbound requests are allowed from the workers.

However, we don't have any way of enforcing that those remain in place. For example, someone could remove those rules or inadvertently change the tags/network name etc., and we wouldn't know about it.
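To make that concrete, a rough sketch (the names and tags here are my assumption, not
anything we have today) of what such an egress-only posture could look like in Terraform:

# Hypothetical example: allow all egress from VMs tagged "internal",
# and explicitly deny all ingress to them from outside the VPC.
resource "google_compute_firewall" "workers_egress" {
  name      = "workers-allow-egress"
  network   = "${google_compute_network.bosh.name}"
  direction = "EGRESS"

  allow {
    protocol = "all"
  }

  target_tags        = ["internal"]
  destination_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "workers_no_ingress" {
  name      = "workers-deny-ingress"
  network   = "${google_compute_network.bosh.name}"
  direction = "INGRESS"

  deny {
    protocol = "all"
  }

  target_tags   = ["internal"]
  source_ranges = ["0.0.0.0/0"]
}

As noted above though, nothing enforces that rules like these stay in place or that the
tags don't drift, so they complement rather than replace hardening the endpoints themselves.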

@cirocosta
Member Author

cirocosta commented Nov 4, 2019

However, we don't have any way of enforcing that those remain in place

Yeah, while I do agree that that's true and easy to misconfigure, I think a move towards "protect the endpoints as if you were already compromised" is inevitable for us, and this can be a motivator for getting better at it (with, e.g., issues like concourse/concourse#2415 and not exposing endpoints without auth in general) 🤔

(my point being that by forcing ourselves to rely less on a "perimeter of protection", we can be even more motivated to get our infra protected against any scenario)
