Failed memory in ysera #962

Closed
pnorman opened this issue Sep 25, 2023 · 11 comments
Labels
hardware:ram-failure
location:ucl (the data centre sponsored by UCL)
service:tiles (the raster map on tile.openstreetmap.org)

Comments

@pnorman
Collaborator

pnorman commented Sep 25, 2023

The hardware site identifies ysera as having 220GB RAM in the form

7 x 32GB 2666 MT/s DDR4 DIMM
1 x 2666 MT/s DDR4 DIMM
4 x Spare DIMM slot

It should have 256GB, identical to odin. Running free over SSH also shows only 220GB. Based on rerenders, this is causing about a 10% reduction in performance.
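The check above can be sketched in a few lines of Python. This is only an illustration, not what was actually run: it assumes the machine should hold 8 x 32GB DIMMs (256GB, as for odin), reads `MemTotal` from `/proc/meminfo`, and uses hypothetical helper names `mem_total_gib` and `missing_dimms`. The kernel reserves some memory for itself, so the reported total normally sits a little below the installed amount, which is why the count is rounded up to whole DIMMs.

```python
import math


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like "MemTotal:       230686720 kB" (value is in KiB).
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def missing_dimms(total_gib: float, expected_gib: int = 256, dimm_gib: int = 32) -> int:
    """Estimate how many DIMMs are not contributing memory.

    Assumes all DIMMs are the same size (32GB here, an assumption based on
    the hardware listing) and rounds the visible total up to whole DIMMs.
    """
    visible = math.ceil(total_gib / dimm_gib)
    return expected_gib // dimm_gib - visible


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            total = mem_total_gib(f.read())
        print(f"MemTotal: {total:.1f} GiB, missing DIMMs: {missing_dimms(total)}")
    except FileNotFoundError:
        print("/proc/meminfo not available on this system")
```

With roughly 220GB visible, this would report one DIMM missing, which matches the seven sized DIMMs in the hardware listing.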

@pnorman added the location:ucl and hardware:ram-failure labels Sep 25, 2023
@pnorman
Collaborator Author

pnorman commented Sep 25, 2023

Looking at munin, it's been this way since at least July 2022.

@pnorman
Collaborator Author

pnorman commented Sep 25, 2023

Server was purchased in March 2019 by Sentral. Or at least odin was, and I think they were purchased at the same time. This puts it well out of warranty.

@pnorman added the service:tiles label Sep 26, 2023
@Firefishy
Member

I have ordered replacement memory.

@Firefishy
Member

ysera has started crashing due to uncorrected ECC errors. I will try to visit in the next week.

@Firefishy
Member

Uncorrectable ECC / other uncorrectable memory error @P2-DIMMA1(CPU2) - Assertion
Uncorrectable ECC / other uncorrectable memory error @P2-DIMMB1(CPU2) - Assertion
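For anyone tallying which slots are affected from a longer event log, a minimal sketch follows. It assumes each event line has the same shape as the two above (`@P<n>-DIMM<letter><n>(CPU<n>)`); the regular expression and the `failing_slots` helper are illustrative and not part of any BMC tooling.

```python
import re
from collections import Counter

# Matches the slot label and CPU number in lines shaped like the events above.
EVENT_RE = re.compile(
    r"Uncorrectable ECC.*@(?P<slot>P\d+-DIMM[A-Z]\d+)\(CPU(?P<cpu>\d+)\)"
)


def failing_slots(log_text: str) -> Counter:
    """Count uncorrectable-ECC events per (slot, cpu) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            counts[(match["slot"], int(match["cpu"]))] += 1
    return counts


if __name__ == "__main__":
    log = (
        "Uncorrectable ECC / other uncorrectable memory error @P2-DIMMA1(CPU2) - Assertion\n"
        "Uncorrectable ECC / other uncorrectable memory error @P2-DIMMB1(CPU2) - Assertion\n"
    )
    for (slot, cpu), n in sorted(failing_slots(log).items()):
        print(f"{slot} (CPU{cpu}): {n} event(s)")
```

Both events here point at CPU2, in slots P2-DIMMA1 and P2-DIMMB1.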

@Firefishy
Member

Waiting on restored access to UCL - Slough. Currently I do not have access.

@Firefishy self-assigned this Apr 22, 2024
@grischard
Collaborator

Blocked by #1060

@Firefishy
Member

Firefishy commented May 8, 2024

Two weeks ago I enabled Adaptive Double DRAM Device Correction (ADDDC) in the BIOS and downclocked the RAM. The machine has been stable since then, which is an improvement.
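One way to keep an eye on whether errors are still occurring during a stable period is to read the Linux EDAC counters. The sketch below is an assumption about how this could be monitored, not a description of what was done on ysera: it presumes an EDAC driver is loaded so that `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` exist, and `edac_counts` is a hypothetical helper name.

```python
from pathlib import Path


def edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return corrected/uncorrected error counts per memory controller.

    Returns an empty dict if no EDAC memory controllers are exposed
    under `root` (for example, when no EDAC driver is loaded).
    """
    totals = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        corrected = int((mc / "ce_count").read_text())
        uncorrected = int((mc / "ue_count").read_text())
        totals[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return totals


if __name__ == "__main__":
    counts = edac_counts()
    if not counts:
        print("No EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```

A rising corrected count with no uncorrected errors would suggest the correction is absorbing faults from the bad DIMMs rather than the faults having gone away.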

@Firefishy removed the blocked label May 28, 2024
@grischard
Collaborator

Probably needs a BIOS update too.

@Firefishy
Member

I have updated the BIOS and BMC. The BIOS requires a reboot to complete the update, which I will do shortly once on-site.
I have also updated snap-02 and eddie in the same way.

@Firefishy
Member

All upgraded. Faulty RAM replaced and extra RAM installed.

@Firefishy removed their assignment Sep 22, 2024