Karm - network firmware - Warning firmware error detected FWSM: 0x8118801B / 0x8118801F #997

Closed
Firefishy opened this issue Nov 20, 2023 · 22 comments

Comments

@Firefishy
Member

There is an undiagnosed issue with the AOC-2UR6N4-i4XT network card riser in karm. The kernel log is being flooded with the following error. The issue is not new.

Nov 20 21:45:11 karm kernel: [994560.467844] ixgbe 0000:02:00.1: Warning firmware error detected FWSM: 0x8118801B
Nov 20 21:45:12 karm kernel: [994560.915876] ixgbe 0000:02:00.0: Warning firmware error detected FWSM: 0x8118801B
Nov 20 21:45:12 karm kernel: [994561.139871] ixgbe 0000:01:00.1: Warning firmware error detected FWSM: 0x8118801F
Nov 20 21:45:12 karm kernel: [994561.655867] ixgbe 0000:01:00.0: Warning firmware error detected FWSM: 0x8118801F
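
To see which ports report which FWSM value, the flood can be summarised per device. The following is a minimal sketch, not something from the original report: the regular expression assumes the exact message format shown in the excerpt above, and `summarize_fwsm` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes the ixgbe warning format shown in the log excerpt above.
FWSM_RE = re.compile(
    r"ixgbe (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Warning firmware error detected FWSM: (?P<fwsm>0x[0-9A-Fa-f]+)"
)


def summarize_fwsm(lines):
    """Count firmware warnings per (PCI address, FWSM value) pair."""
    counts = Counter()
    for line in lines:
        match = FWSM_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("fwsm"))] += 1
    return counts


# The four lines from the excerpt above, used as sample input.
sample = [
    "Nov 20 21:45:11 karm kernel: [994560.467844] ixgbe 0000:02:00.1: Warning firmware error detected FWSM: 0x8118801B",
    "Nov 20 21:45:12 karm kernel: [994560.915876] ixgbe 0000:02:00.0: Warning firmware error detected FWSM: 0x8118801B",
    "Nov 20 21:45:12 karm kernel: [994561.139871] ixgbe 0000:01:00.1: Warning firmware error detected FWSM: 0x8118801F",
    "Nov 20 21:45:12 karm kernel: [994561.655867] ixgbe 0000:01:00.0: Warning firmware error detected FWSM: 0x8118801F",
]

for (pci, fwsm), n in sorted(summarize_fwsm(sample).items()):
    print(f"{pci}  {fwsm}  {n}")
```

In practice the input would be the lines of the kernel log file or the output of `dmesg` rather than the hard-coded sample.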

We have previously tried to get a firmware update from Supermicro to fix the issue, but both of the updates they supplied would not load and returned errors.

I have reached out to Supermicro support again.

@Firefishy
Member Author

Firefishy commented Nov 20, 2023

Supermicro support previously supplied:

  • UR6N4X3C1_NUP.zip: error "Error: OROM image is not allowed for device 'X550 NCSI'."
  • PXE1681.zip: error "Enabling Boot Rom on port 1 - Error: Unsupported feature. Option ROM area in the flash is not supported for this device on port 1"

@Firefishy
Member Author

Supermicro have come back and recommended RMA'ing the AOC-2UR6N4-i4XT riser / network card. "NIC PM informed to RMA the AOC. The FW in your AOC is the old PR FW that disables OPROM."

@Firefishy added the "hardware" and "location:amsterdam" (Equinix AM6 data centre) labels on Nov 30, 2023
@Firefishy
Member Author

A replacement riser is NOT cheap at ~£500, and there is no guarantee it is running the corrected firmware: https://www.ebay.co.uk/itm/225350050919

@Firefishy
Member Author

USA is cheaper, but no shipping to UK / EU. https://www.ebay.com/itm/363404625954

@pnorman
Collaborator

pnorman commented Dec 2, 2023

I mean, we could ship via someone in the US, but that's a lot of effort for something that might not even fix the problem. Can we cross-ship with supermicro?

If we can't reasonably fix it, can we get a separate PCIe network card and either remove or disable the AOC network card?

@Firefishy
Member Author

Firefishy commented Dec 2, 2023

> I mean, we could ship via someone in the US, but that's a lot of effort for something that might not even fix the problem. Can we cross-ship with supermicro?

For now I have asked if I can sign their magic NDA for them to release the firmware update tool to me. If that is a no-go I will find out what RMA options there are.

> If we can't reasonably fix it, can we get a separate PCIe network card and either remove or disable the AOC network card?

Yes, this is an option.

@Firefishy
Member Author

Supermicro now report that the firmware we have is not field upgradeable and have offered an advance swap-out RMA (receive before send).

@Firefishy
Member Author

Ops to decide whether to RMA now or wait until the next site visit. No site visits are planned at the moment.

@Firefishy
Member Author

Proceeding with the RMA now. Will use remote hands for the swap-out work.

@Firefishy
Member Author

RMA submitted. Waiting for approval.

@Firefishy
Member Author

Firefishy commented Jan 18, 2024

Supermicro Europe RMA team are refusing to process the RMA. They say to get the reseller to process the RMA. The system was sold to us by Sentral Systems (an authorised Supermicro reseller) but Sentral Systems is no longer trading.

Rock and a hard place.

[Image: Screenshot 2024-01-18 at 12 42 46]

@Firefishy
Member Author

Supermicro approved the RMA. Non-advance.

@Firefishy
Member Author

I have booked smart hands for karm and arranged DHL collection on Monday.

@Firefishy
Member Author

Karm has been powered down in prep.

@Firefishy
Member Author

Remote hands have removed the card and boxed it for collection on Monday by DHL.

@grischard
Collaborator

grischard commented Feb 12, 2024

The card has been shipped. Marking as blocked until we get the card returned.

@Firefishy
Member Author

The card has arrived at the RMA centre and is being processed.

@Firefishy
Member Author

Supermicro have confirmed receipt of the RMA. They will update the firmware and confirm when ready for return.

@Firefishy
Member Author

Supermicro have repaired the riser. The card should be returned in the next few days. I have created a combined inbound Equinix / smart-hands ticket.

@Firefishy
Member Author

Server is back online with the updated riser/NIC. All good, no more kernel errors.

@WarmWelcome

Sorry to resurrect this but I was wondering if they disclosed what they did to the riser to bring it back to working order. I have seen many servers with this riser come through and fill dmesg, but am unaware of any fix. Here is another user experiencing what appears to be the same in the FAQ: https://www.supermicro.com/support/faqs/faq.cfm?faq=38678
This does not disclose the fix, however.

@Firefishy
Member Author

> Sorry to resurrect this but I was wondering if they disclosed what they did to the riser to bring it back to working order. I have seen many servers with this riser come through and fill dmesg, but am unaware of any fix. Here is another user experiencing what appears to be the same in the FAQ: https://www.supermicro.com/support/faqs/faq.cfm?faq=38678 This does not disclose the fix, however.

@WarmWelcome Prior to the RMA they sent me a variety of firmware updates. None worked, regardless of install method (Linux, DOS, UEFI). They said something to the effect that the NIC updates were locked as per the Support FAQ you linked.

For the RMA, I believe they must have used an external device programmer to update the firmware chip.


4 participants