ets table not being cleaned up, causes application crash #40

dantswain · 2023-02-23T19:51:53Z

Hi, I've noticed that my application is crashing periodically. I've tracked down this library as at least contributing to the crash. For whatever reason my supervisor is attempting to restart my JWK strategy but it cannot because the ETS table that it wants to create already exists:

** (exit) an exception was raised:
    ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: table name already exists

        (stdlib 4.2) :ets.new(MyApp.Auth.JwkStrategy.EtsCache, [:set, :public, :named_table, {:read_concurrency, true}, {:write_concurrency, true}])
        (my_app 0.1.0) lib/my_app/auth/jwk_strategy.ex:11: MyApp.Auth.JwkStrategy.EtsCache.new/0
        ( my_app 0.1.0) lib/my_app/auth/jwk_strategy.ex:11: MyApp.Auth.JwkStrategy.start_link/1

I'm not sure why it's being restarted in the first place (I think it is happening after a recompile in dev), but the supervisor is unable to start the JWK strategy and it reaches the retry policy limits which in turn crashes the parent supervisor and ultimately the whole application.

Looking at the implementation I see that the ets table is being created in the start_link function of the strategy: https://github.com/joken-elixir/joken_jwks/blob/master/lib/joken_jwks/default_strategy_template.ex#L190

Since that start_link is being called by my supervisor, it means that the supervisor owns the table. Therefore whenever the strategy itself restarts under the supervisor, it tries to re-create the ets table and fails. I think the ets table should probably be started by the actual worker process here.

I think a workaround for me is to check if the table exists in an init_opts callback and delete it if it already exists. I would be glad to submit a PR to fix the underlying problem, but looking at the actual code I'm a bit curious why it's not just using genserver? That would make this a bit simpler to solve since the table could be created in the genserver's init callback.

The text was updated successfully, but these errors were encountered:

victorolinasc · 2023-03-31T00:42:25Z

Ouch... this really is a bad strategy... I will probably try to address this on next version (2.0).

The proper fix here might be to handle initialization asynchronously. We could use handle_continue for that.

There is some churn in this repo right now and I am gathering some feedback about breaking changes here so that I can release a 2.0 version of this library. This will, certainly, get into that list but I can't promise a schedule right now...

If you want to try handling this on a PR with an async initialization it will be most welcome.

bryannaegele · 2023-06-06T17:18:10Z

Something to note: this also happens where the table doesn't exist. Doing a quick view through the code, this likely related to the initialization of the table being done in a task and creating a race condition on restarts, as well. I can't find the exact reason for the table not existing as this is currently happening in local dev and it just randomly occurs with no logs generated. I don't think making initialization async (task or handle_continue) for any of this is feasible as it introduces another race condition. It can be presumed that users will not want their app to start if user access verification isn't available, so it should always be a blocking operation by default.

 * 1st argument: the table identifier does not refer to an existing ETS table

        (stdlib 4.2) :ets.lookup(***.Jwks.Strategy.EtsCache, :signers)
        (policies 0.1.0) lib/***/auth.ex:23: ***.Jwks.Strategy.EtsCache.get_signers/0
        (policies 0.1.0) lib/***/auth.ex:23: ***Jwks.Strategy.match_signer_for_kid/2
        (joken_jwks 1.6.0) lib/joken_jwks.ex:54: JokenJwks.before_verify/2
        (joken 2.6.0) lib/joken/hooks.ex:199: anonymous fn/3 in Joken.Hooks.run_before_hook/3

Sgoettschkes · 2023-08-29T18:44:35Z

Doing a quick view through the code, this is likely related to the initialization of the table being done in a task and creating a race condition on restarts, as well.

From the code, it seems like the :ets.new is called directly in start_link/1 through the call to EtsCache.new(): https://github.com/joken-elixir/joken_jwks/blob/v1.6.0/lib/joken_jwks/default_strategy_template.ex#L190

I don't think a race condition can exist there.

The reason I looked this up is because we have the problem of our Elixir Phoenix app not starting (sometimes) and the error we get is that put_signers/1 (https://github.com/joken-elixir/joken_jwks/blob/v1.6.0/lib/joken_jwks/default_strategy_template.ex#L165) fails because the table does not exist, which seems odd given that it was called a few lines above. Not sure if it's related at all 🤔

victorolinasc · 2023-09-15T15:32:53Z

@Sgoettschkes said:

We use joken and joken_jwks in our application and it works very well.

Our usage is very basic (I believe). We have the following Token implementation:

defmodule App.Accounts.Tokens.Google do
@moduledoc false
use Joken.Config, default_signer: nil

@isS "https://accounts.google.com"

defp aud, do: Application.fetch_env!(:elixir_auth_google, :client_id)

add_hook(JokenJwks, strategy: App.Accounts.Tokens.Strategy)

@impl Joken.Config
def token_config do
default_claims()
|> add_claim("iss", nil, &(&1 == @isS))
|> add_claim("aud", nil, &(&1 == aud()))
end
end

And the following strategy:

defmodule App.Accounts.Tokens.Strategy do
@moduledoc false
use JokenJwks.DefaultStrategyTemplate

def init_opts(opts) do
url = "https://www.googleapis.com/oauth2/v3/certs"
Keyword.merge(opts, jwks_url: url)
end
end

We face some issues with our app refusing to start (in staging, mostly) with the following error message:

[error] Task #PID<0.845.0> started from App.Supervisor terminating
** (ArgumentError) errors were found at the given arguments:

1st argument: the table identifier does not refer to an existing ETS table

(stdlib 5.0) :ets.insert(App.Accounts.Tokens.Strategy.EtsCache, {:signers, %{PRUNED}})
(app 0.1.0) lib/app/accounts/tokens/strategy.ex:3: App.Accounts.Tokens.Strategy.EtsCache.put_signers/1
(app 0.1.0) lib/app/accounts/tokens/strategy.ex:3: App.Accounts.Tokens.Strategy.fetch_signers/2
(elixir 1.15.4) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
Function: #Function<1.47344484/0 in App.Accounts.Tokens.Strategy.start_fetch_signers/2>
Args: []

It seemed like a race condition because it worked most of the time. We were not able to figure out what went wrong and diving into the joken_jwks code, it seems strange that the ETS table would not be created because we'd expect the process to fail earlier when setting the counter to 0.

After some more investigating, the problem seems to be related to the compilation step, not the runtime. The reason we suspect this is that a build that throws the error above does it every time it's started and a build that does not throw on the first start does not throw at all. Locally it's the same, as soon as the app throws the error at startup each mix phx.server will throw the error. Recompiling the app will solve this, but changing unrelated files (which only trigger a partial recompile) does not.

victorolinasc · 2023-09-15T21:14:55Z

So, I've used the base work from @lovebes and refactored it a bit. I believe that what we have currently in #48 should solve issues here. The only way it would STILL show this exception is by registering two process by doing something like:

      Supervisor.child_spec(MyStrategy, id: :id_1),
      Supervisor.child_spec(MyStrategy, id: :id_2)

This would still make it impossible to initialize the cache because it we are using the module name to do that and on both the module is the same. It would still break in an awkward way with 1st argument: table name already exists but I believe this is the only way to reproduce this now. All initialization work is on the init/1 GenServer callback which is run on the new process that is tied to the table lifecycle (if it dies, the table dies).

Can anyone here try it locally before I release a new version? You can do that with a github dep passing the PR branch like:

  # in mix deps
  {:joken_jwks, github: "joken-elixir/joken_jwks", branch: "feat/better_process_integration"}

P.S.: thanks once again to everyone's trying to figure this one out! OSS for the win!

bryannaegele · 2023-09-25T19:04:14Z

I have only seen this in local dev with code reloading on, so this makes sense. 👍🏻

alolis · 2024-01-06T20:15:46Z

I am experiencing the same issue in a Phoenix application. For the time being I am initializing the strategy with first_fetch_sync: true which seems to work but I am not sure how this affects the rest of my application.

I also want to mention that I am seeing this in my local dev with code reloading enabled.

victorolinasc · 2024-01-07T16:34:33Z

Hi @alolis ! Can you try the branch I've mentioned? Just to check if it works :)

THanks!

alolis · 2024-01-07T17:05:47Z

Hi @victorolinasc! Just gave it a try and I am getting (UndefinedFunctionError) function :hackney.request/5 is undefined (module :hackney is not available).

Am I supposed to install something else in order for the branch to work?

victorolinasc · 2024-01-09T15:41:22Z

@alolis this is strange as we haven't touched adapter configuration.

By default we use a Tesla request configuration and it uses :hackney as the client if nothing else is configured. Were you passing adapter options? This was supposed to work just the same as before...

alolis · 2024-01-12T07:30:09Z

@victorolinasc , I have completely removed my deps and did a fresh mix deps.get but unfortunately the same issue. It seems that hackney is not getting installed if I use the branch but I can see it in my deps if I use {:joken_jwks, "~> 1.6.0"} in my mix.es.

If I keep both dependencies (probably is wrong) in my mix.es and run it then I get a different error when I start my phoenix app with mix phx.server:

** (Mix) Could not start application my_app: MyApp.Application.start(:normal, []) returned an error: shutdown: failed to start child: JokenJwksStrategy
    ** (EXIT) an exception was raised:
        ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

The refered JokenJwksStrategy in the error above is my custom module:

defmodule JokenJwksStrategy do
  use JokenJwks.DefaultStrategyTemplate

  def init_opts(opts) do
    url = "MY URL HERE"
    Keyword.merge(opts, jwks_url: url)
  end
end

Let me know if I am doing something really really wrong :)

wbyrnetx · 2024-08-08T18:58:12Z

Any update on the fix for this? I'm having to restart mix phx.server every few minutes

[error] Child MyAppWeb.FirebaseJwks of Supervisor MyApp.Supervisor terminated
** (exit) killed
Pid: #PID<0.592.0>
Start Call: MyAppWeb.FirebaseJwks.start_link([])
Restart: :transient
Shutdown: 5000
Type: :worker
[error] Child MyAppWeb.FirebaseJwks of Supervisor MyApp.Supervisor failed to start
** (exit) an exception was raised:
    ** (ArgumentError) errors were found at the given arguments:

  * 1st argument: table name already exists

        (stdlib 5.2.3) :ets.new(MyAppWeb.FirebaseJwks.EtsCache, [:set, :public, :named_table, {:read_concurrency, true}, {:write_concurrency, true}])
        (my_app 0.1.0) lib/my_app_web/tokens/firebase_jwks.ex:6: MyAppWeb.FirebaseJwks.EtsCache.new/0
        (my_app 0.1.0) lib/my_app_web/tokens/firebase_jwks.ex:6: MyAppWeb.FirebaseJwks.start_link/1
        (stdlib 5.2.3) supervisor.erl:420: :supervisor.do_start_child_i/3
        (stdlib 5.2.3) supervisor.erl:406: :supervisor.do_start_child/2
        (stdlib 5.2.3) supervisor.erl:845: :supervisor.restart/3
        (stdlib 5.2.3) supervisor.erl:800: :supervisor.restart/2
        (stdlib 5.2.3) supervisor.erl:615: :supervisor.handle_info/2
        (stdlib 5.2.3) gen_server.erl:1095: :gen_server.try_handle_info/3
        (stdlib 5.2.3) gen_server.erl:1183: :gen_server.handle_msg/6
        (stdlib 5.2.3) proc_lib.erl:251: :proc_lib.wake_up/3

lovebes · 2024-08-09T01:09:07Z

I created a fix PR: #68 to prevent table creation if ETS table already exists. That will at least not trigger the table name already exists error. FYI @victorolinasc @wbyrnetx

victorolinasc · 2024-08-10T21:12:21Z

All the process work is now published as 1.7.0-rc.0.

I f you people can test it It'd be very helpful! A small changelog is:

No more custom logging (use Logger facilities)

If you need to disable specific logging from the library, you can use default Logger facilities like Logger.put_application_level/2 (which accepts :none) and so on. There is no need to use custom logging here anymore.

No more custom telemetry (use Teslas built-in)

Tesla has its own middleware for telemetry events. We should just use that :)

Different process strategy (thanks @lovebes)

dantswain mentioned this issue Feb 23, 2023

Plug.Cowboy.Drainer shut down abnormally elixir-plug/plug_cowboy#92

Open

victorolinasc mentioned this issue Mar 31, 2023

Joken JWKS version 2.0 #43

Open

victorolinasc mentioned this issue Sep 15, 2023

ETS table does not exist (sometimes) #46

Closed

lovebes mentioned this issue Aug 9, 2024

UPDATE: EtsCache.new() to create only if table doesn't exist #68

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ets table not being cleaned up, causes application crash #40

ets table not being cleaned up, causes application crash #40

dantswain commented Feb 23, 2023

victorolinasc commented Mar 31, 2023

bryannaegele commented Jun 6, 2023

Sgoettschkes commented Aug 29, 2023

victorolinasc commented Sep 15, 2023

victorolinasc commented Sep 15, 2023

bryannaegele commented Sep 25, 2023

alolis commented Jan 6, 2024 •

edited

Loading

victorolinasc commented Jan 7, 2024

alolis commented Jan 7, 2024 •

edited

Loading

victorolinasc commented Jan 9, 2024

alolis commented Jan 12, 2024 •

edited

Loading

wbyrnetx commented Aug 8, 2024 •

edited

Loading

lovebes commented Aug 9, 2024 •

edited

Loading

victorolinasc commented Aug 10, 2024

ets table not being cleaned up, causes application crash #40

ets table not being cleaned up, causes application crash #40

Comments

dantswain commented Feb 23, 2023

victorolinasc commented Mar 31, 2023

bryannaegele commented Jun 6, 2023

Sgoettschkes commented Aug 29, 2023

victorolinasc commented Sep 15, 2023

victorolinasc commented Sep 15, 2023

bryannaegele commented Sep 25, 2023

alolis commented Jan 6, 2024 • edited Loading

victorolinasc commented Jan 7, 2024

alolis commented Jan 7, 2024 • edited Loading

victorolinasc commented Jan 9, 2024

alolis commented Jan 12, 2024 • edited Loading

wbyrnetx commented Aug 8, 2024 • edited Loading

lovebes commented Aug 9, 2024 • edited Loading

victorolinasc commented Aug 10, 2024

alolis commented Jan 6, 2024 •

edited

Loading

alolis commented Jan 7, 2024 •

edited

Loading

alolis commented Jan 12, 2024 •

edited

Loading

wbyrnetx commented Aug 8, 2024 •

edited

Loading

lovebes commented Aug 9, 2024 •

edited

Loading