Produce post-mortem for HE.net outage in Amsterdam - 15 December 2024 #1196

Open
Firefishy opened this issue Dec 18, 2024 · 2 comments


@Firefishy
Member

Firefishy commented Dec 18, 2024

We had a full service outage from HE.net in Amsterdam from approximately 03:51 on 15 December 2024 until 00:29 on 18 December 2024.

Our equipment continued running. Fibre links remained up, but upstream HE.net equipment failed. We lost all internet connectivity.

  • 15 December 2024 03:53 - HE.net outage starts.
  • 15 December 2024 04:18 - OSM Ops send initial email to HE.net NOC, asking whether this is unplanned maintenance.
  • 15 December 2024 04:24 - Email response from HE.net NOC: not maintenance, investigating the issue. (Investigate why receipt of this email at the osmfoundation.org address was delayed.)
  • 15 December 2024 04:43 - OSM Ops phone HE.net for a status update. Response paraphrase: "Outage in Amsterdam, we are investigating. No estimated time to recover. Equinix remote hands en route."
  • 15 December 2024 07:24 - OSM Ops ask for an estimated time to recover.
  • 15 December 2024 11:12 - OSM Ops phone HE.net for a status update. Response paraphrase: "No estimated time to recover."
  • ...
  • 17 December 2024 12:21 - New Equinix Internet connection up and running. OpenStreetMap.org services restored.
  • 18 December 2024 00:29 - HE.net confirm restoration of service.
    TBC.
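For the post-mortem, the outage durations can be derived directly from the timeline. The following is a minimal Python sketch that does this arithmetic. It assumes all timeline timestamps are in the same time zone (the issue does not say which; most likely UTC) and uses 03:53, the timeline's start entry, rather than the approximate 03:51 in the summary. The constant and function names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the timeline above.
# ASSUMPTION: all entries share one time zone (not stated in the issue).
OUTAGE_START = datetime(2024, 12, 15, 3, 53)        # HE.net outage starts
SERVICES_RESTORED = datetime(2024, 12, 17, 12, 21)  # new Equinix Internet up
HE_RESTORED = datetime(2024, 12, 18, 0, 29)         # HE.net confirm restoration


def hours_minutes(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the elapsed time from start to end as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    h, m = hours_minutes(OUTAGE_START, SERVICES_RESTORED)
    print(f"OpenStreetMap.org services down: {h}h {m:02d}m")  # 56h 28m
    h, m = hours_minutes(OUTAGE_START, HE_RESTORED)
    print(f"HE.net service down: {h}h {m:02d}m")  # 68h 36m
```

Under these assumptions, OpenStreetMap.org services were unavailable for about 56.5 hours, while the HE.net service itself was down for about 68.5 hours.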
Firefishy changed the title from "Produce post-mortem for HE.net outage - 15 December 2024" to "Produce post-mortem for HE.net outage in Amsterdam - 15 December 2024" on Dec 18, 2024.
@drolbr

drolbr commented Dec 20, 2024

FYI: Last known good connection at 03:53:15, first failed connection at 03:53:24:

2024-12-15 03:53:15 URL:https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/planet/replication/minute/state.txt [86/86] -> "/srv/overpass/diffs/state.txt"

The server actually requested is planet.openstreetmap.org; the request is then usually redirected (here to the S3 URL shown above).

@flohoff

flohoff commented Dec 20, 2024

Just as a data point (I have no idea what HE said the cause was or what you found yourselves): at 2024-12-16 08:01 (CET) I looked at an HE BGP looking glass for the route to planet.openstreetmap.org, which at that time resolved to 184.104.179.145. I have a screenshot of the output from core2.ams1.he.net showing 6 paths with an age of 11 days.
At the same time I ran an mtr to the same address; the traffic was dropped at the first HE hop on my path, in Düsseldorf, which I found astonishing. It may be that HE uses MPLS for forwarding with TTL propagation turned off. Otherwise, my guess would have been that some filtering for 184.104.176.0/21 went bad on HE's border routers.
