OIDC filter randomly failing in v1.2 due to missing information in token request #4718

Open
jaynis opened this issue Nov 14, 2024 · 23 comments

@jaynis
Contributor

jaynis commented Nov 14, 2024

Hi. With Envoy Gateway version 1.1.2 (Envoy version 1.31.2), all my OIDC filters were working fine. I haven't touched any of my SecurityPolicies, but after updating to 1.2.1 (Envoy version 1.32.1) I get an "OAuth flow failed" error in the browser, with oauth.missing_credentials in the Envoy logs. When I enable debug logging, I additionally see my IDP (Microsoft Entra ID, formerly AAD) complaining about a missing client_id in the POST body of the OIDC token request: AADSTS900144: The request body must contain the following parameter: 'client_id'.

I troubleshot this further by doing a fresh Envoy Gateway installation of version 1.2.1 on a separate cluster and gradually applying configuration to it. While doing this, I noticed that OIDC works fine in the beginning and then fails at some point. Furthermore, this seems to happen quite randomly: sometimes a certain OIDC filter is working, then after some time of usage or after applying some unrelated Envoy resources to the cluster it stops working, and after some time it might even recover and work again. Generally, I have the feeling that the more distinct OIDC filters are in place on a cluster, the more likely the issue is to occur.

This issue might be related to #4625 which describes similar behavior with a different IDP, but suspects the nonce as the cause.

#4706, which is about a general instability of OIDC (in conjunction with JWT authorization), could be related as well.
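
For readers comparing their own debug logs against the two failure modes in this thread (a missing client_id versus an empty client_secret), the check can be sketched in a few lines of Python. This is only an illustration: the parameter list follows the authorization-code token request in RFC 6749, section 4.1.3, and it assumes the client sends its credentials in the form-encoded POST body, which is what the AADSTS900144 message implies. The function name and the sample body are hypothetical.

```python
from typing import List
from urllib.parse import parse_qs, urlencode

# Form fields of an authorization-code token request (RFC 6749, section 4.1.3),
# plus client_secret. Assumption: the client authenticates via the POST body,
# which is what the AADSTS900144 message in this thread implies.
REQUIRED_PARAMS = ("grant_type", "code", "redirect_uri", "client_id", "client_secret")


def missing_token_params(body: str) -> List[str]:
    """Return the required parameters that are absent or empty in a form-encoded body."""
    # keep_blank_values=True keeps "client_secret=" in the result as an empty string,
    # so an empty value is reported the same way as a missing key.
    fields = parse_qs(body, keep_blank_values=True)
    return [name for name in REQUIRED_PARAMS if not any(fields.get(name, []))]


if __name__ == "__main__":
    # Hypothetical body resembling the failing request: client_id is missing.
    body = urlencode({
        "grant_type": "authorization_code",
        "code": "example-code",
        "redirect_uri": "https://dashboard.example.com/oidc/callback",
        "client_secret": "example-secret",
    })
    print(missing_token_params(body))
```

Pasting the request body from a debug log into `missing_token_params` shows at a glance which of the two variants is being hit.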

@zhaohuabing
Member

@jaynis Could you please share your SecurityPolicy (sensitive info can be redacted)? Do you use OIDC with JWT?

@jaynis
Contributor Author

jaynis commented Nov 15, 2024

@zhaohuabing I use the following configuration, which I don't consider special and which has been working fine in previous Envoy Gateway versions. Also, as I mentioned before, this configuration initially works even with version 1.2.1 and then stops working at some point. I haven't figured out yet when exactly, but adding further SecurityPolicies and HTTPRoutes might cause this.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: dashboard-security-policy 
  namespace: tools
spec:
  oidc:
    clientID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    clientSecret:
      kind: Secret
      name: dashboard-security-policy-oidc-secret
    provider:
      issuer: https://sts.windows.net/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    redirectURL: https://dashboard.example.com/oidc/callback
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: dashboard-route 
---
apiVersion: v1
data:
  client-secret: cmVkYWN0ZWQK 
kind: Secret
metadata:
  name: dashboard-security-policy-oidc-secret 
  namespace: tools
type: Opaque
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: dashboard-route
  namespace: tools
spec:
  hostnames:
  - dashboard.example.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: envoy-gateway-https
    namespace: default
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: dashboard-service
      port: 8080
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /

@zhaohuabing
Member

zhaohuabing commented Nov 15, 2024

@jaynis Thanks for providing the SecurityPolicy. Also, if you can use egctl config envoy-proxy all to get the Envoy configuration from the time when OIDC is failing, it would help in finding the root cause.

@zetaab
Contributor

zetaab commented Nov 18, 2024

We are facing the same issue (OAuth flow failed) in multiple clusters. From the debug logs I can see that Envoy tries to fetch the token from the token endpoint using credentials that do not contain client_secret at all; client_id is there. So I was wondering why the secret is not there.

With egctl I can see the following:

          "dynamicWarmingSecrets": [
            {
              "lastUpdated": "2024-11-18T07:55:08.679Z",
              "name": "oauth2/client_secret/securitypolicy/demo/demo-auth",
              "secret": {
                "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
                "name": "oauth2/client_secret/securitypolicy/demo/demo-auth"
              },
              "versionInfo": "uninitialized"
            },
            {
              "lastUpdated": "2024-11-18T07:55:08.679Z",
              "name": "oauth2/hmac_secret/securitypolicy/demo/demo-auth",
              "secret": {
                "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
                "name": "oauth2/hmac_secret/securitypolicy/demo/demo-auth"
              },
              "versionInfo": "uninitialized"
            }
          ]

Then, in the Envoy proxy logs:

[2024-11-18 08:00:30.500][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret
[2024-11-18 08:00:30.500][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret

Rolling back Envoy Gateway makes everything work again.
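
The check described above (secrets stuck under dynamicWarmingSecrets with versionInfo "uninitialized") can be automated for larger dumps. The following Python sketch assumes the output of egctl config envoy-proxy all has been saved as JSON; it searches the whole document for the key rather than relying on a fixed nesting, since the exact layout of the dump is not shown in this thread. The function names are hypothetical.

```python
import json
from typing import Any, Iterator, List


def iter_warming_secrets(node: Any) -> Iterator[dict]:
    """Yield every entry found under any "dynamicWarmingSecrets" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "dynamicWarmingSecrets" and isinstance(value, list):
                for entry in value:
                    if isinstance(entry, dict):
                        yield entry
            else:
                yield from iter_warming_secrets(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_warming_secrets(item)


def uninitialized_secret_names(dump: Any) -> List[str]:
    """Return the sorted names of warming secrets whose versionInfo is "uninitialized"."""
    return sorted(
        entry.get("name", "<unnamed>")
        for entry in iter_warming_secrets(dump)
        if entry.get("versionInfo") == "uninitialized"
    )


if __name__ == "__main__":
    # Trimmed-down sample modelled on the snippet in this comment.
    sample = json.loads("""
    {"dynamicWarmingSecrets": [
      {"name": "oauth2/client_secret/securitypolicy/demo/demo-auth",
       "versionInfo": "uninitialized"},
      {"name": "oauth2/hmac_secret/securitypolicy/demo/demo-auth",
       "versionInfo": "uninitialized"}
    ]}
    """)
    for name in uninitialized_secret_names(sample):
        print(name)
```

For a real dump, load the saved file with `json.load` and pass the result to `uninitialized_secret_names`; an empty list means no secret is stuck in the warming state.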

@zetaab
Contributor

zetaab commented Nov 18, 2024

@jaynis could you check whether you see similar behaviour in your installation?

@zetaab
Contributor

zetaab commented Nov 18, 2024

@plnordquist since you filed that other ticket, can you check your setup as well? Do you see something similar in the Envoy proxy logs and with egctl?

@zhaohuabing
Member

zhaohuabing commented Nov 18, 2024

@zetaab This looks different from the one reported by @jaynis and seems to be caused by a failure to initialize the xDS secrets. Are there any errors/warnings in the EG logs when this happens?

This could have the same cause as #4706.

@zetaab
Contributor

zetaab commented Nov 18, 2024

@zhaohuabing I went through the commit list in release/v1.2 and tracked down the breaking commit, which is #4227.

Before that everything works fine; with that PR's commit I start to see errors in the logs and OIDC does not work. I could not find any later commit that works.

@jaynis
Contributor Author

jaynis commented Nov 18, 2024

Hi @zetaab. I can see the same things with egctl and in the logs. However, as I described before, in my case the identity provider complains about a missing client_id and not client_secret. What is the error message you get from your identity provider (and which one do you use)?

@zhaohuabing I have a config dump created with egctl config envoy-proxy all. Is there anything you are particularly interested in? If I need to provide the entire file, I have to go through 4700 lines of JSON and redact sensitive information.
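
Redacting a dump of that size by hand is tedious, so a small script may help. The sketch below walks a parsed JSON document and replaces the value of every key in a configurable set. The key names are assumptions chosen for illustration and are not a complete list of sensitive fields in an Envoy config dump; review the result before sharing it.

```python
import json
from typing import Any, FrozenSet

# Illustrative key names only (assumption); extend this set for your own dump.
SENSITIVE_KEYS: FrozenSet[str] = frozenset(
    {"inlineString", "inlineBytes", "clientId", "client_id"}
)


def redact(node: Any,
           sensitive_keys: FrozenSet[str] = SENSITIVE_KEYS,
           placeholder: str = "[redacted]") -> Any:
    """Return a copy of node with the value of every sensitive key replaced."""
    if isinstance(node, dict):
        return {
            key: placeholder if key in sensitive_keys
            else redact(value, sensitive_keys, placeholder)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [redact(item, sensitive_keys, placeholder) for item in node]
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return node


if __name__ == "__main__":
    # Hypothetical fragment; real dumps are far larger and nested differently.
    sample = {"credentials": {"clientId": "xxxxxxxx", "tokenSecret": {"name": "s"}}}
    print(json.dumps(redact(sample), indent=2))
```

Because the function returns a new structure instead of modifying its input, the original dump stays available for local debugging.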

@zetaab
Contributor

zetaab commented Nov 18, 2024

but yes, with 1.2.1 I can see

% kubectl logs envoy-gateway-59849c8649-rwr7c|grep "xds cluster exists"
2024-11-18T09:03:30.919Z        ERROR   xds-translator  runner/runner.go:85     failed to translate xds ir      {"runner": "xds-translator", "error": "xds cluster exists"}
2024-11-18T09:03:30.920Z        ERROR   watchable       message/watchutil.go:56 observed an error       {"runner": "xds-translator", "error": "xds cluster exists"}

with last working commit 5375cf0

% kubectl logs envoy-gateway-79dfddcf7b-v8czx|grep "xds cluster exists"|wc -l
       0

breaking commit a351c4b (PR #4227)

% kubectl logs envoy-gateway-8688459cbc-ptkn5|grep "xds cluster exists"|wc -l
       2
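
When bisecting across several builds as above, it can be handy to count these errors per runner instead of per grep invocation. The following Python sketch assumes the log format shown in this comment, where each line ends in a JSON object with "runner" and "error" fields; the function name is hypothetical, and lines that do not match the format are skipped.

```python
import json
from collections import Counter
from typing import Iterable


def count_errors(lines: Iterable[str], needle: str = "xds cluster exists") -> Counter:
    """Count log lines whose trailing JSON object has error == needle, keyed by runner."""
    counts: Counter = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no structured fields on this line
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # the brace did not start a valid JSON object
        if isinstance(fields, dict) and fields.get("error") == needle:
            counts[fields.get("runner", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # The two log lines quoted in this comment.
    sample = [
        '2024-11-18T09:03:30.919Z        ERROR   xds-translator  runner/runner.go:85     '
        'failed to translate xds ir      {"runner": "xds-translator", "error": "xds cluster exists"}',
        '2024-11-18T09:03:30.920Z        ERROR   watchable       message/watchutil.go:56 '
        'observed an error       {"runner": "xds-translator", "error": "xds cluster exists"}',
    ]
    print(count_errors(sample))
```

Feeding the function the lines of a saved kubectl logs output gives the same total as the grep pipeline, broken down by runner.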

@zetaab
Contributor

zetaab commented Nov 18, 2024

@jaynis I use Cognito and get back an error saying that my client_id is not working. However, in the debug log I can see that client_id is there and client_secret is empty.

@zhaohuabing
Member

zhaohuabing commented Nov 18, 2024

@zetaab Thanks for digging into this. #4707

Thanks for testing. This is a known issue and should be fixed by #4707.

@zetaab
Contributor

zetaab commented Nov 18, 2024

I can confirm that #4707 fixes this issue, at least for me.

@zhaohuabing
Member

zhaohuabing commented Nov 18, 2024

@jaynis Sharing the output of egctl config envoy-proxy listener should be good if the entire configuration is too big.

@missBerg
Contributor

The missing client ID is the same issue I faced as well @zhaohuabing

@zhaohuabing
Member

Yeah, somehow Entra asks for a client ID in the request body. I'm still trying to figure it out.

@jaynis
Contributor Author

jaynis commented Nov 18, 2024

@zhaohuabing I have attached the egctl config envoy-proxy listener output here. This dump was created with v1.2.1 while the issue was present.

#4707 seems to fix the issue for me as well. But as described initially, there is some randomness involved, so it is hard to say for certain.

Is there any ETA for merging #4707 and for releasing v1.2.2?

@zhaohuabing
Member

zhaohuabing commented Nov 18, 2024

We'll try to release v1.2.2 by the end of this week.

@zhaohuabing
Member

zhaohuabing commented Dec 11, 2024

@jaynis @zetaab @missBerg Could you please confirm if the latest EG image resolves your issue?

The state encoding issue has been fixed by envoyproxy/envoy#37473.

@zetaab
Contributor

zetaab commented Dec 11, 2024

@zhaohuabing yeah, we are using 1.2.3 without issues. However, there is/was another issue which prevented us from running multiple controllers. It's maybe fixed already; I cannot find it now.

@zhaohuabing
Member

zhaohuabing commented Dec 11, 2024

I think it's this one, and it should be fixed by #4767.

@jaynis
Contributor Author

jaynis commented Dec 11, 2024

I can confirm that OIDC filters are working again with v1.2.3.
