Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ets table not being cleaned up, causes application crash #40

Open
dantswain opened this issue Feb 23, 2023 · 14 comments
Open

ets table not being cleaned up, causes application crash #40

dantswain opened this issue Feb 23, 2023 · 14 comments

Comments

@dantswain
Copy link

Hi, I've noticed that my application is crashing periodically. I've tracked down this library as at least contributing to the crash. For whatever reason my supervisor is attempting to restart my JWK strategy but it cannot because the ETS table that it wants to create already exists:

** (exit) an exception was raised:
    ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: table name already exists

        (stdlib 4.2) :ets.new(MyApp.Auth.JwkStrategy.EtsCache, [:set, :public, :named_table, {:read_concurrency, true}, {:write_concurrency, true}])
        (my_app 0.1.0) lib/my_app/auth/jwk_strategy.ex:11: MyApp.Auth.JwkStrategy.EtsCache.new/0
        ( my_app 0.1.0) lib/my_app/auth/jwk_strategy.ex:11: MyApp.Auth.JwkStrategy.start_link/1

I'm not sure why it's being restarted in the first place (I think it is happening after a recompile in dev), but the supervisor is unable to start the JWK strategy and it reaches the retry policy limits which in turn crashes the parent supervisor and ultimately the whole application.

Looking at the implementation I see that the ets table is being created in the start_link function of the strategy: https://github.com/joken-elixir/joken_jwks/blob/master/lib/joken_jwks/default_strategy_template.ex#L190

Since that start_link is being called by my supervisor, it means that the supervisor owns the table. Therefore whenever the strategy itself restarts under the supervisor, it tries to re-create the ets table and fails. I think the ets table should probably be started by the actual worker process here.

I think a workaround for me is to check if the table exists in an init_opts callback and delete it if it already exists. I would be glad to submit a PR to fix the underlying problem, but looking at the actual code I'm a bit curious why it's not just using genserver? That would make this a bit simpler to solve since the table could be created in the genserver's init callback.

@victorolinasc
Copy link
Collaborator

Ouch... this really is a bad strategy... I will probably try to address this on next version (2.0).

The proper fix here might be to handle initialization asynchronously. We could use handle_continue for that.

There is some churn in this repo right now and I am gathering some feedback about breaking changes here so that I can release a 2.0 version of this library. This will, certainly, get into that list but I can't promise a schedule right now...

If you want to try handling this on a PR with an async initialization it will be most welcome.

@bryannaegele
Copy link

Something to note: this also happens where the table doesn't exist. Doing a quick view through the code, this likely related to the initialization of the table being done in a task and creating a race condition on restarts, as well. I can't find the exact reason for the table not existing as this is currently happening in local dev and it just randomly occurs with no logs generated. I don't think making initialization async (task or handle_continue) for any of this is feasible as it introduces another race condition. It can be presumed that users will not want their app to start if user access verification isn't available, so it should always be a blocking operation by default.

 * 1st argument: the table identifier does not refer to an existing ETS table

        (stdlib 4.2) :ets.lookup(***.Jwks.Strategy.EtsCache, :signers)
        (policies 0.1.0) lib/***/auth.ex:23: ***.Jwks.Strategy.EtsCache.get_signers/0
        (policies 0.1.0) lib/***/auth.ex:23: ***Jwks.Strategy.match_signer_for_kid/2
        (joken_jwks 1.6.0) lib/joken_jwks.ex:54: JokenJwks.before_verify/2
        (joken 2.6.0) lib/joken/hooks.ex:199: anonymous fn/3 in Joken.Hooks.run_before_hook/3

@Sgoettschkes
Copy link

Doing a quick view through the code, this is likely related to the initialization of the table being done in a task and creating a race condition on restarts, as well.

From the code, it seems like the :ets.new is called directly in start_link/1 through the call to EtsCache.new(): https://github.com/joken-elixir/joken_jwks/blob/v1.6.0/lib/joken_jwks/default_strategy_template.ex#L190

I don't think a race condition can exist there.

The reason I looked this up is because we have the problem of our Elixir Phoenix app not starting (sometimes) and the error we get is that put_signers/1 (https://github.com/joken-elixir/joken_jwks/blob/v1.6.0/lib/joken_jwks/default_strategy_template.ex#L165) fails because the table does not exist, which seems odd given that it was called a few lines above. Not sure if it's related at all 🤔

@victorolinasc
Copy link
Collaborator

@Sgoettschkes said:

We use joken and joken_jwks in our application and it works very well.

Our usage is very basic (I believe). We have the following Token implementation:

defmodule App.Accounts.Tokens.Google do
@moduledoc false
use Joken.Config, default_signer: nil

@isS "https://accounts.google.com"

defp aud, do: Application.fetch_env!(:elixir_auth_google, :client_id)

add_hook(JokenJwks, strategy: App.Accounts.Tokens.Strategy)

@impl Joken.Config
def token_config do
default_claims()
|> add_claim("iss", nil, &(&1 == @isS))
|> add_claim("aud", nil, &(&1 == aud()))
end
end

And the following strategy:

defmodule App.Accounts.Tokens.Strategy do
@moduledoc false
use JokenJwks.DefaultStrategyTemplate

def init_opts(opts) do
url = "https://www.googleapis.com/oauth2/v3/certs"
Keyword.merge(opts, jwks_url: url)
end
end

We face some issues with our app refusing to start (in staging, mostly) with the following error message:

[error] Task #PID<0.845.0> started from App.Supervisor terminating
** (ArgumentError) errors were found at the given arguments:

  • 1st argument: the table identifier does not refer to an existing ETS table

    (stdlib 5.0) :ets.insert(App.Accounts.Tokens.Strategy.EtsCache, {:signers, %{PRUNED}})
    (app 0.1.0) lib/app/accounts/tokens/strategy.ex:3: App.Accounts.Tokens.Strategy.EtsCache.put_signers/1
    (app 0.1.0) lib/app/accounts/tokens/strategy.ex:3: App.Accounts.Tokens.Strategy.fetch_signers/2
    (elixir 1.15.4) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
    Function: #Function<1.47344484/0 in App.Accounts.Tokens.Strategy.start_fetch_signers/2>
    Args: []

It seemed like a race condition because it worked most of the time. We were not able to figure out what went wrong and diving into the joken_jwks code, it seems strange that the ETS table would not be created because we'd expect the process to fail earlier when setting the counter to 0.

After some more investigating, the problem seems to be related to the compilation step, not the runtime. The reason we suspect this is that a build that throws the error above does it every time it's started and a build that does not throw on the first start does not throw at all. Locally it's the same, as soon as the app throws the error at startup each mix phx.server will throw the error. Recompiling the app will solve this, but changing unrelated files (which only trigger a partial recompile) does not.

@victorolinasc
Copy link
Collaborator

So, I've used the base work from @lovebes and refactored it a bit. I believe that what we have currently in #48 should solve issues here. The only way it would STILL show this exception is by registering two process by doing something like:

      Supervisor.child_spec(MyStrategy, id: :id_1),
      Supervisor.child_spec(MyStrategy, id: :id_2)

This would still make it impossible to initialize the cache because it we are using the module name to do that and on both the module is the same. It would still break in an awkward way with 1st argument: table name already exists but I believe this is the only way to reproduce this now. All initialization work is on the init/1 GenServer callback which is run on the new process that is tied to the table lifecycle (if it dies, the table dies).

Can anyone here try it locally before I release a new version? You can do that with a github dep passing the PR branch like:

  # in mix deps
  {:joken_jwks, github: "joken-elixir/joken_jwks", branch: "feat/better_process_integration"}

P.S.: thanks once again to everyone's trying to figure this one out! OSS for the win!

@bryannaegele
Copy link

I have only seen this in local dev with code reloading on, so this makes sense. 👍🏻

@alolis
Copy link

alolis commented Jan 6, 2024

I am experiencing the same issue in a Phoenix application. For the time being I am initializing the strategy with first_fetch_sync: true which seems to work but I am not sure how this affects the rest of my application.

I also want to mention that I am seeing this in my local dev with code reloading enabled.

@victorolinasc
Copy link
Collaborator

Hi @alolis ! Can you try the branch I've mentioned? Just to check if it works :)

THanks!

@alolis
Copy link

alolis commented Jan 7, 2024

Hi @victorolinasc! Just gave it a try and I am getting (UndefinedFunctionError) function :hackney.request/5 is undefined (module :hackney is not available).

Am I supposed to install something else in order for the branch to work?

@victorolinasc
Copy link
Collaborator

@alolis this is strange as we haven't touched adapter configuration.

By default we use a Tesla request configuration and it uses :hackney as the client if nothing else is configured. Were you passing adapter options? This was supposed to work just the same as before...

@alolis
Copy link

alolis commented Jan 12, 2024

@victorolinasc , I have completely removed my deps and did a fresh mix deps.get but unfortunately the same issue. It seems that hackney is not getting installed if I use the branch but I can see it in my deps if I use {:joken_jwks, "~> 1.6.0"} in my mix.es.

If I keep both dependencies (probably is wrong) in my mix.es and run it then I get a different error when I start my phoenix app with mix phx.server:

** (Mix) Could not start application my_app: MyApp.Application.start(:normal, []) returned an error: shutdown: failed to start child: JokenJwksStrategy
    ** (EXIT) an exception was raised:
        ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

The refered JokenJwksStrategy in the error above is my custom module:

defmodule JokenJwksStrategy do
  use JokenJwks.DefaultStrategyTemplate

  def init_opts(opts) do
    url = "MY URL HERE"
    Keyword.merge(opts, jwks_url: url)
  end
end

Let me know if I am doing something really really wrong :)

@wbyrnetx
Copy link

wbyrnetx commented Aug 8, 2024

Any update on the fix for this? I'm having to restart mix phx.server every few minutes

[error] Child MyAppWeb.FirebaseJwks of Supervisor MyApp.Supervisor terminated
** (exit) killed
Pid: #PID<0.592.0>
Start Call: MyAppWeb.FirebaseJwks.start_link([])
Restart: :transient
Shutdown: 5000
Type: :worker
[error] Child MyAppWeb.FirebaseJwks of Supervisor MyApp.Supervisor failed to start
** (exit) an exception was raised:
    ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: table name already exists

        (stdlib 5.2.3) :ets.new(MyAppWeb.FirebaseJwks.EtsCache, [:set, :public, :named_table, {:read_concurrency, true}, {:write_concurrency, true}])
        (my_app 0.1.0) lib/my_app_web/tokens/firebase_jwks.ex:6: MyAppWeb.FirebaseJwks.EtsCache.new/0
        (my_app 0.1.0) lib/my_app_web/tokens/firebase_jwks.ex:6: MyAppWeb.FirebaseJwks.start_link/1
        (stdlib 5.2.3) supervisor.erl:420: :supervisor.do_start_child_i/3
        (stdlib 5.2.3) supervisor.erl:406: :supervisor.do_start_child/2
        (stdlib 5.2.3) supervisor.erl:845: :supervisor.restart/3
        (stdlib 5.2.3) supervisor.erl:800: :supervisor.restart/2
        (stdlib 5.2.3) supervisor.erl:615: :supervisor.handle_info/2
        (stdlib 5.2.3) gen_server.erl:1095: :gen_server.try_handle_info/3
        (stdlib 5.2.3) gen_server.erl:1183: :gen_server.handle_msg/6
        (stdlib 5.2.3) proc_lib.erl:251: :proc_lib.wake_up/3

@lovebes
Copy link
Contributor

lovebes commented Aug 9, 2024

I created a fix PR: #68 to prevent table creation if ETS table already exists. That will at least not trigger the table name already exists error. FYI @victorolinasc @wbyrnetx

@victorolinasc
Copy link
Collaborator

All the process work is now published as 1.7.0-rc.0.

I f you people can test it It'd be very helpful! A small changelog is:

  • No more custom logging (use Logger facilities)

If you need to disable specific logging from the library, you can use default Logger facilities like Logger.put_application_level/2 (which accepts :none) and so on. There is no need to use custom logging here anymore.

  • No more custom telemetry (use Teslas built-in)

Tesla has its own middleware for telemetry events. We should just use that :)

  • Different process strategy (thanks @lovebes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants