[Provider Audit] provider.akash.colnetwork.com #751

Open
jars101 opened this issue Dec 5, 2024 · 7 comments

@jars101

jars101 commented Dec 5, 2024

Prerequisite Steps:

  1. Make sure your provider has community provider attributes and your contact details (email, website):
    Done

  2. Make sure your provider *.ingress resolves to your provider IP (ideally worker node IP)
    root@akash-1:/home/user# host aaa.ingress.akash.colnetwork.com
    aaa.ingress.akash.colnetwork.com is an alias for akash.colnetwork.com.
    akash.colnetwork.com has address 161.10.254.22

  3. Please make sure your Akash provider doesn't block any Akash specific ports.
    Confirmed
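
The wildcard ingress check from step 2 can also be scripted. The following is a minimal sketch, assuming Python's standard `socket` module; the function name `ingress_resolves_to` and its injectable `resolver` parameter are hypothetical and exist only for illustration. The hostname and IP address in the usage note come from the `host` output above.

```python
import socket
from typing import Callable, List, Optional


def ingress_resolves_to(
    hostname: str,
    expected_ip: str,
    resolver: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Return True if hostname resolves (IPv4) to expected_ip."""
    if resolver is None:
        # gethostbyname_ex follows CNAME aliases and returns
        # (canonical_name, alias_list, ip_address_list).
        def resolver(name: str) -> List[str]:
            return socket.gethostbyname_ex(name)[2]

    try:
        addresses = resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError: the lookup failed.
        return False
    return expected_ip in addresses
```

Calling `ingress_resolves_to("aaa.ingress.akash.colnetwork.com", "161.10.254.22")` performs a live DNS lookup; any label under `*.ingress` should give the same answer if the wildcard record is in place.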

Leave contact information (optional)
Name: Jairo Rizzo
Discord handle or Telegram handle: Colnet in Discord
Contact email address: [email protected]

@jars101
Author

jars101 commented Dec 5, 2024

@andy108369 @shimpa1 I've been highlighting the lack of support on the Discord channel, to the point of getting blocked. I've been waiting for an audit to review storage persistence and resolve issues with nginx (k3s): nginx/nginx#226 (comment) and chainzero/provider-build-scripts#20. Additionally, I reported a critical error with an NVIDIA GPU worker node that still remains unanswered:

[Thu Dec 5 10:02:01 2024] WARNING: kernel stack frame pointer at 00000000ba47703c in akash:9329 has bad value 00000000f55ca2b7
[Thu Dec 5 10:02:01 2024] unwind stack type:0 next_sp:0000000000000000 mask:0x2 graph_idx:0
[Thu Dec 5 10:02:01 2024] 000000004576e520: ffffa95daf083b70 (0xffffa95daf083b70)
[Thu Dec 5 10:02:01 2024] 00000000a3c50e5b: ffffffffc114b98a (os_release_spinlock+0x1a/0x30 [nvidia])
[Thu Dec 5 10:02:01 2024] 00000000ba47703c: ffff8a860547aed0 (0xffff8a860547aed0)
[Thu Dec 5 10:02:01 2024] 00000000b3f6145a: ffffffffc116c362 (_nv014133rm+0x4e2/0xb90 [nvidia])

Despite reaching out, I'm not seeing any progress. As a provider, this is not only frustrating but also causing me financial losses. Notably, my experience with Filecoin has been seamless, with responsive support and profit. It's concerning to see such a disparity in support between networks. Can I reasonably expect support from the Akash Network?

@vigneshv-ocl

Did a quick check: everything is OK apart from the below.

  - key: host
    value: akash

In case you would like to set up persistent storage, please follow this guide: https://akash.network/docs/providers/build-a-cloud-provider/helm-based-provider-persistent-storage-enablement/

@jars101
Author

jars101 commented Dec 7, 2024

Thank you for the update. Regarding the "key-host value: akash," could you please share your recommendations for resolving this inconsistency? Additionally, regarding persistent storage, I deployed it approximately a month ago:

root@akash-1:/home/user# kubectl -n rook-ceph get cephclusters
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
rook-ceph   /var/lib/rook     3          29d   Ready   Cluster created successfully   HEALTH_OK              27e45654-8fbf-46fb-8d66-dcdf09ed5697

Are there any additional considerations to help ensure a successful audit outcome? Thanks again

@jars101
Author

jars101 commented Dec 7, 2024

I'm getting blocked again for 24h on the akash > provider uptime channel for posting the following:

The segfaults I'm getting are indeed caused by the 6.11 kernel. All of my original control-plane nodes are running Ubuntu 24.10 (6.11.0-9-generic), so I added a fourth node running Ubuntu 24.04.1 LTS (6.8.0-49-generic) and tested again via curl. See the results below:

6.11 node:
`root@akash-1:~# curl https://127.0.0.1 -kv
* Trying 127.0.0.1:443...
* Connected to 127.0.0.1 (127.0.0.1) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: O=Acme Co; CN=Kubernetes Ingress Controller Fake Certificate
*  start date: Dec 7 16:02:40 2024 GMT
*  expire date: Dec 7 16:02:40 2025 GMT
*  issuer: O=Acme Co; CN=Kubernetes Ingress Controller Fake Certificate
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
*  Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://127.0.0.1/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: 127.0.0.1]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.9.1]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: 127.0.0.1
> User-Agent: curl/8.9.1
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Request completely sent off
* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 0
* Failed receiving HTTP2 data: 56(Failure when receiving data from the peer)
* Connection #0 to host 127.0.0.1 left intact
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 0
`

6.8.0 node:
`root@akash-12:~# curl https://127.0.0.1 -kv
* Trying 127.0.0.1:443...
* Connected to 127.0.0.1 (127.0.0.1) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / X25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: O=Acme Co; CN=Kubernetes Ingress Controller Fake Certificate
*  start date: Dec 7 19:24:19 2024 GMT
*  expire date: Dec 7 19:24:19 2025 GMT
*  issuer: O=Acme Co; CN=Kubernetes Ingress Controller Fake Certificate
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
*  Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://127.0.0.1/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: 127.0.0.1]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.5.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: 127.0.0.1
> User-Agent: curl/8.5.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/2 404
< date: Sat, 07 Dec 2024 21:38:47 GMT
< content-type: text/html
< content-length: 146
< strict-transport-security: max-age=31536000; includeSubDomains
<
<title>404 Not Found</title>
404 Not Found
nginx
* Connection #0 to host 127.0.0.1 left intact
`

As you can see, the 6.11 control-plane node generates the segfaults:

[Sat Dec 7 14:06:05 2024] nginx[18336]: segfault at 15f ip 000000000000015f sp 00007ffcef6341c0 error 14 likely on CPU 2 (core 2, socket 0)
[Sat Dec 7 14:06:05 2024] Code: Unable to access opcode bytes at 0x135.
[Sat Dec 7 16:12:31 2024] nginx[117010]: segfault at 40 ip 0000000000000040 sp 00007ffcef634170 error 14 likely on CPU 0 (core 0, socket 0)
[Sat Dec 7 16:12:31 2024] Code: Unable to access opcode bytes at 0x16.
[Sat Dec 7 16:29:59 2024] nginx[19210]: segfault at 40 ip 0000000000000040 sp 00007ffcef634170 error 14 likely on CPU 2 (core 2, socket 0)
[Sat Dec 7 16:29:59 2024] Code: Unable to access opcode bytes at 0x16.
[Sat Dec 7 16:38:14 2024] nginx[194410]: segfault at 40 ip 0000000000000040 sp 00007ffcef634170 error 14 likely on CPU 2 (core 2, socket 0)
[Sat Dec 7 16:38:14 2024] Code: Unable to access opcode bytes at 0x16.
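
To count how often each process segfaults on a node, kernel log lines like the ones above can be parsed with a regular expression. This is a minimal sketch, assuming the log text has been captured from `dmesg -T` or similar; the names `SEGFAULT_RE`, `parse_segfaults`, and `count_by_process` are hypothetical, and the pattern is based only on the line format shown in this thread.

```python
import re
from collections import Counter
from typing import Dict, List

# Matches e.g. "nginx[18336]: segfault at 15f ip 000000000000015f sp 00007ffcef6341c0 error 14"
SEGFAULT_RE = re.compile(
    r"(?P<comm>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def parse_segfaults(log_text: str) -> List[Dict[str, str]]:
    """Return one dict per segfault entry found anywhere in log_text."""
    # finditer scans the whole text, so it works whether entries are
    # on separate lines or pasted together on a single line.
    return [m.groupdict() for m in SEGFAULT_RE.finditer(log_text)]


def count_by_process(log_text: str) -> Counter:
    """Count segfault entries per process name."""
    return Counter(entry["comm"] for entry in parse_segfaults(log_text))
```

Running `count_by_process` on the output of each node makes it easier to compare the 6.11 and 6.8 kernels side by side.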

@jars101
Author

jars101 commented Dec 7, 2024

`
root@akash-12:~# kubectl get nodes -o wide
NAME      STATUS  ROLES                      AGE   VERSION       INTERNAL-IP  EXTERNAL-IP  OS-IMAGE            KERNEL-VERSION    CONTAINER-RUNTIME
akash-1   Ready   control-plane,etcd,master  24d   v1.31.2+k3s1  10.240.0.62  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-10  Ready   control-plane,etcd,master  36d   v1.31.2+k3s1  10.240.0.60  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-11  Ready   <none>                     29d   v1.31.2+k3s1  10.240.0.61  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-12  Ready   etcd                       147m  v1.31.2+k3s1  10.240.0.33  <none>       Ubuntu 24.04.1 LTS  6.8.0-49-generic  containerd://1.7.22-k3s1
akash-2   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.52  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-3   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.53  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-4   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.54  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-5   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.55  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-6   Ready   <none>                     45h   v1.31.2+k3s1  10.240.0.56  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-7   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.57  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-8   Ready   <none>                     36d   v1.31.2+k3s1  10.240.0.58  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
akash-9   Ready   control-plane,etcd,master  36d   v1.31.2+k3s1  10.240.0.59  <none>       Ubuntu 24.10        6.11.0-9-generic  containerd://1.7.22-k3s1
`

As you can verify, all nodes run the same k3s version (v1.31.2+k3s1) and differ only in OS. The hardware specs of each node are identical.
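
To see at a glance which nodes are still on which kernel, the JSON form of the node list can be grouped by kernel version. This is a minimal sketch, assuming the input is the output of `kubectl get nodes -o json`; it reads the `status.nodeInfo.kernelVersion` field of each Node object, and the function names `nodes_by_kernel` and `nodes_by_kernel_from_json` are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def nodes_by_kernel(node_list: dict) -> Dict[str, List[str]]:
    """Group node names by kernel version from a parsed NodeList."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in node_list.get("items", []):
        name = item["metadata"]["name"]
        kernel = item["status"]["nodeInfo"]["kernelVersion"]
        groups[kernel].append(name)
    # Sort names so the result is stable regardless of API ordering.
    return {kernel: sorted(names) for kernel, names in groups.items()}


def nodes_by_kernel_from_json(text: str) -> Dict[str, List[str]]:
    """Same as nodes_by_kernel, but takes the raw JSON text."""
    return nodes_by_kernel(json.loads(text))
```

One way to use it is to save the output with `kubectl get nodes -o json > nodes.json` and pass the file contents to `nodes_by_kernel_from_json`.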

@jars101
Author

jars101 commented Dec 9, 2024

I've replaced my 3 control-plane nodes with nodes running kernel 6.8.0-49, moving away from the previous kernel 6.11.0-9. This ensures compatibility with incoming SSL traffic and eliminates the SSL errors.

@andy108369
Contributor

@jars101 your provider seems to be offline:

user@laptop:~$ curl -s -k https://provider.akash.colnetwork.com:8443/status
user@laptop:~$ 

Are you planning on bringing it back up again?
