HARDWARE ISSUE: Failures with SOQuartz & TuringPi2 #40

Closed
acelinkio opened this issue May 31, 2023 · 8 comments

Comments

@acelinkio

acelinkio commented May 31, 2023

EDIT: PROBLEM WAS ISOLATED TO ONE COMPUTE MODULE. IDENTIFIED TO BE HARDWARE ISSUE.

Hey!

I recently got started with Plebian on SOQuartz compute modules hosted in a TuringPi2. Everything appears to work except when using slot 3 of the TuringPi2, where the compute module crashes and also brings down the rest of the networking on the TuringPi2.

Able to reproduce with the following:

  • Install a SOQuartz module (Plebian v2023-04-30-1) into slot 3 of the TuringPi2
  • Install curl, open-iscsi, and then k3s
  • Install Longhorn to the Kubernetes cluster
  • Wait 60-120 minutes

The compute module crashes and all devices on the TuringPi2 become unreachable. Everything recovers when the module in slot 3 is powered off. Powering the compute module in slot 3 back on causes another crash within ~15 minutes.

I have no issues with SOQuartz in slots 1, 2, and 4. The major difference between those and slot 3 is that slot 3 exposes two SATA ports (see https://help.turingpi.com/hc/en-us/articles/8685766680477-Specifications-and-I-O-Ports). M.2 is not exposed in any slot because there is only one PCIe lane on SOQuartz. The SOQuartz modules are connected using the CM4 adapter: https://turingpi.com/product/cm4-adapter/

Unsure if it is related, but there is a warning during the open-iscsi installation.

W: Possible missing firmware /lib/firmware/rockchip/dptx.bin for module rockchipdrm
update-initramfs: Generating /boot/initrd.img-6.1.0-7-arm64
@CounterPillow
Collaborator

CounterPillow commented May 31, 2023

Do they have schematics available somewhere? The way the CM4 image works might not be compatible with this carrier board, and I'd need to take a closer look at how things are wired up to figure out what's going on.

Though I assume the SATA is done through a PCIe SATA controller, which might make this related to the PCIe ranges bug which I should finally upstream the fix for.

@acelinkio
Author

Reaching out to folks to see if they can provide schematics.

Also, I forgot to mention an issue that seems related: wenyi0421/turing-pi#13, although this image appears to use u-boot and is able to load successfully.

@daniel-kukiela

daniel-kukiela commented May 31, 2023

Hi!
To start, I'm not part of the Turing Machines team, but I know a lot about the Turing Pi 2 board and can answer some questions. I also talked about this problem with @acelinkio over the Turing Pi Discord server to narrow down the possible cause.

Node 3 indeed has a SATA controller hooked up over PCIe. The chip used is the ASM1061. It works fine (and out of the box) with CM4 + Raspberry Pi OS and CM4 + DietPi. Raspberry Pi + Ubuntu needs an additional package (linux-modules-extra-raspi). Nvidia Jetson modules also work out of the box with this controller.
In case you wonder how CM4 (and CM4-compatible) modules can be mixed with Jetson modules: the TPI2 board has Jetson-compatible SODIMM connectors, and to insert a CM4 module you use a so-called adapter board.

The schematics are not available publicly but do not hesitate to ask any follow-up questions if you have any.

@CounterPillow
Collaborator

Okay, I'll make a patched Plebian devicetree package with the PCIe ranges fix in the coming days and have you try that out. It's a shot in the dark, but it might be related. Basically, right now, the memory ranges set for PCIe are a bit scuffed in the mainline kernel, which wreaks havoc with some PCIe devices.

@CounterPillow
Collaborator

Okay, here's a devicetree deb for you to try out with fixed PCIe ranges: https://overviewer.org/~pillow/up/75bea78e59/devicetrees-plebian-quartz64-20230601130309-arm64.deb

Install with sudo dpkg -i devicetrees-plebian-quartz64-20230601130309-arm64.deb and then reboot.

Let me know if this improves things in any way.

@acelinkio
Author

Reimaged each of the SOQuartz modules and applied that package. Kubernetes and Longhorn have been running stably for the last 2 hours.

Will follow up tomorrow and let you know if anything comes up.

@acelinkio
Author

Had a couple of errors this morning. Decided to swap some of the modules around and noticed the problem following one specific module. Installed it into slot 1 which has HDMI output and captured this.

[Image: 20230602_095331]

From there I also grabbed some of the logs from journalctl -p 3 -x

Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.pine64,soquartz-cm4io.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: firmware Patch file not found, tried:
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM4345C0.pine64,soquartz-cm4io.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM4345C0.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM.pine64,soquartz-cm4io.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM.hcd'
Jun 02 05:26:06 soquartz3 kernel: brcmfmac: brcmf_sdio_htclk: HT Avail timeout (1000000): clkctl 0x50
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init vcp plugin
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init mcp plugin
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init bap plugin
Jun 02 05:26:07 soquartz3 bluetoothd[523]: profiles/sap/server.c:sap_server_register() Sap driver initialization failed.
Jun 02 05:26:07 soquartz3 bluetoothd[523]: sap-server: Operation not permitted (1)
Jun 02 05:26:07 soquartz3 systemctl[547]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 02 06:24:05 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015ec028
Jun 02 06:24:05 soquartz3 kernel: Mem abort info:
Jun 02 06:24:05 soquartz3 kernel:   ESR = 0x0000000086000004
Jun 02 06:24:05 soquartz3 kernel:   EC = 0x21: IABT (current EL), IL = 32 bits
Jun 02 06:24:05 soquartz3 kernel:   SET = 0, FnV = 0
Jun 02 06:24:05 soquartz3 kernel:   EA = 0, S1PTW = 0
Jun 02 06:24:05 soquartz3 kernel:   FSC = 0x04: level 0 translation fault
Jun 02 06:24:05 soquartz3 kernel: swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003c69000
Jun 02 06:24:05 soquartz3 kernel: [ffff8000015ec028] pgd=10000000effff003, p4d=10000000effff003, pud=10000000efffe003, pmd=1000000004ca7003, pte=004000000dd02783
Jun 02 06:24:05 soquartz3 kernel: Internal error: Oops: 0000000086000004 [#1] SMP
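
For reference, the ESR value in the oops above can be decoded by hand. This is a minimal sketch, assuming the standard AArch64 ESR_ELx layout for an instruction abort (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with the fault status code in the low six bits of the ISS). The function name `decode_esr` is made up for illustration and is not part of any kernel tooling.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields the kernel prints."""
    iss = esr & 0x1FFFFFF           # bits 24:0, instruction specific syndrome
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits 31:26
        "il": (esr >> 25) & 0x1,    # instruction length bit: 1 means 32-bit
        "set": (iss >> 11) & 0x3,   # synchronous error type, bits 12:11
        "fnv": (iss >> 10) & 0x1,   # FAR not valid, bit 10
        "ea": (iss >> 9) & 0x1,     # external abort type, bit 9
        "s1ptw": (iss >> 7) & 0x1,  # stage 1 page table walk fault, bit 7
        "fsc": iss & 0x3F,          # fault status code, bits 5:0
    }


fields = decode_esr(0x86000004)
print({name: hex(value) for name, value in fields.items()})
```

The decoded EC of 0x21 and FSC of 0x04 agree with the `IABT (current EL)` and `level 0 translation fault` lines in the log: the kernel tried to fetch an instruction from an address with no valid translation.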

and another snippet

-- Boot 8ba7fa9e3d84472b9d57ca56501398f2 --
Feb 28 11:15:47 soquartz3 kernel: arm-scmi firmware:scmi: Failed. SCMI protocol 22 not active.
Feb 28 11:15:47 soquartz3 kernel: arm-scmi firmware:scmi: Failed. SCMI protocol 17 not active.
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rtc-pcf85063 1-0051: RTC chip is not present
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.bin (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.txt (-2)
Jun 02 08:41:24 soquartz3 kernel: of_dma_request_slave_channel: dma-names property of node '/serial@fe650000' missing or empty
Jun 02 08:41:24 soquartz3 kernel: brcmfmac: brcmf_sdio_htclk: HT Avail timeout (1000000): clkctl 0x50
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init vcp plugin
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init mcp plugin
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init bap plugin
Jun 02 08:41:25 soquartz3 systemctl[567]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 02 08:41:26 soquartz3 kernel: Bluetooth: hci0: command 0x0c03 tx timeout
Jun 02 08:41:34 soquartz3 kernel: Bluetooth: hci0: BCM: Reset failed (-110)
Jun 02 14:15:47 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015d8028
Jun 02 14:15:47 soquartz3 kernel: Mem abort info:
Jun 02 14:15:47 soquartz3 kernel:   ESR = 0x0000000086000004
Jun 02 14:15:47 soquartz3 kernel:   EC = 0x21: IABT (current EL), IL = 32 bits
Jun 02 14:15:47 soquartz3 kernel:   SET = 0, FnV = 0
Jun 02 14:15:47 soquartz3 kernel:   EA = 0, S1PTW = 0
Jun 02 14:15:47 soquartz3 kernel:   FSC = 0x04: level 0 translation fault
Jun 02 14:15:47 soquartz3 kernel: swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003c69000
Jun 02 14:15:47 soquartz3 kernel: [ffff8000015d8028] pgd=10000000effff003, p4d=10000000effff003, pud=10000000efffe003, pmd=1000000006080003, pte=0040000004bd6783
Jun 02 14:15:47 soquartz3 kernel: Internal error: Oops: 0000000086000004 [#1] SMP
-- Boot c3d9b0fdc1c84c34a853c6f364c89122 --
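
The paging-request oopses in the two snippets above can be pulled out of a saved journal dump with a short script. This is only a sketch: it assumes the short timestamp format shown in these logs, and the names `OOPS_RE` and `find_paging_faults` are hypothetical helpers, not part of journalctl.

```python
import re

# Matches lines like:
# Jun 02 06:24:05 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015ec028
OOPS_RE = re.compile(
    r"^(?P<when>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"Unable to handle kernel paging request at virtual address (?P<addr>[0-9a-f]+)$"
)


def find_paging_faults(journal_text: str) -> list[tuple[str, str, int]]:
    """Return (timestamp, host, faulting address) for each paging-request oops."""
    faults = []
    for line in journal_text.splitlines():
        match = OOPS_RE.match(line.strip())
        if match:
            faults.append((match["when"], match["host"], int(match["addr"], 16)))
    return faults


# Sample lines copied from the logs in this comment.
sample = """\
Jun 02 06:24:05 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015ec028
Jun 02 06:24:05 soquartz3 kernel: Mem abort info:
Jun 02 14:15:47 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015d8028
"""
for when, host, addr in find_paging_faults(sample):
    print(when, host, hex(addr))
```

Listing the faults side by side makes it easier to see how often a node oopses and whether the faulting addresses repeat across boots.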

Currently running memtester to test the memory.

@acelinkio
Author

Closing ticket. The problem is a hardware failure on one SOQuartz module. The same module fails no matter which slot it is in. Running memtester, I was able to make the node hang/crash.

Rotated the other 2 modules I have through slot 3 in the TuringPi and each one ran 3+ hours without any issue.
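
The swap test above boils down to checking whether failures follow a module or a slot. A rough sketch of that reasoning follows. The module labels "A", "B", and "C" and the function `failure_follows` are hypothetical, and the run list is a simplified illustration of the observations described in this thread, not a recorded test log.

```python
from collections import defaultdict


def failure_follows(observations):
    """Classify whether failures track a module, a slot, or neither.

    observations: iterable of (module, slot, failed) tuples.
    """
    by_module = defaultdict(set)
    by_slot = defaultdict(set)
    for module, slot, failed in observations:
        by_module[module].add(failed)
        by_slot[slot].add(failed)

    # A module (or slot) is a suspect only if every run involving it failed.
    bad_modules = sorted(m for m, results in by_module.items() if results == {True})
    bad_slots = sorted(s for s, results in by_slot.items() if results == {True})
    return {"modules": bad_modules, "slots": bad_slots}


# Hypothetical labels: "A" stands for the faulty module, "B" and "C" for the others.
runs = [
    ("A", 3, True),   # original failure in slot 3
    ("A", 1, True),   # same module fails after moving to slot 1
    ("B", 1, False),  # another module had no issues in slot 1
    ("B", 3, False),  # other modules ran 3+ hours in slot 3
    ("C", 3, False),
]
print(failure_follows(runs))
```

With these runs, only module "A" fails in every run it takes part in, while no slot fails consistently, which points at the module rather than the carrier board.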

Appreciate your help troubleshooting this issue!

@acelinkio acelinkio changed the title from "Failures with SOQuartz & TuringPi2" to "HARDWARE ISSUE: Failures with SOQuartz & TuringPi2" on Jun 3, 2023