Fatal error: unable to initialize the JIT #155
Comments
Pretty light on details here. What else is happening before this? |
In particular, what OS are we talking about here? FWIW I don't find the error message string "unable to initialize the JIT" anywhere in either PostgreSQL or PL/R sources. But in any case if the OS is Linux, perhaps you are getting hit by the OOM killer? Look for references to "signal 9" in the postgres logs and/or "sigkill" in dmesg output (I seem to recall dmesg uses the name rather than the number for the signal, but I think it also explicitly says OOM kill or something similar too). Also this link points to Flask and R interaction: https://stackoverflow.com/questions/62928973/rpy2-in-a-flask-app-fatal-error-unable-to-initialize-the-jit Hope this helps |
That Stack Overflow answer seems to indicate "path to libR.so not known to the linker (e.g., not in ldconf or in LD_LIBRARY_PATH)". In that case the solution might be to create "/etc/ld.so.conf.d/libR.conf" listing the directory that contains libR.so. Again, this assumes your OS is Linux, of course. |
Hi, sorry, this is Debian Linux in a container on Kubernetes. I don't see the containers being restarted with the usual OOM code. It looks like it's exit code 2.
As for the LD_LIBRARY_PATH, I have had a look. When I run R in the container
Whereas in PL/R, if I run the following
I have tried setting the LD_LIBRARY_PATH environment variable manually to what R itself reports (I am not sure where it gets it from), but we are still seeing the same problem. |
Sounds like you need to take this up with whoever is providing the container. This is not a PL/R issue as far as I can tell; it is more of a configuration problem. With a container there is not much you can do unless you have the ability to modify the container itself. FWIW, you can use plr_environ() to see the environment from the Postgres point of view. The OOM stuff I mentioned is in reference to "we have a problem in which we call pg.spi.exec and if the data returned is over a certain size postgres crashes." You need to check the logs to confirm that, though. I don't see any evidence in your reply that you have looked at the actual postgres logs. The "exit 2" is not coming from postgres and seems to be a client exit code. Is this happening every time, i.e. PL/R never runs, or is it sporadic, as in PL/R usually works but sometimes you get a crash with this error? If the latter, it still makes me suspect OOM. Do you know what cgroup version is being used (v1 or v2)? |
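For illustration, the plr_environ() check mentioned above could look roughly like the query below. This is only a sketch: the three variable names in the filter are an assumption based on what has come up in this thread, so adjust the list to whatever you want to compare against the output of R run directly in the container.

```sql
-- Environment as seen by the Postgres backend that loads PL/R.
-- plr_environ() returns (name, value) rows; the names filtered on
-- here are illustrative assumptions, not a required set.
SELECT name, value
FROM plr_environ()
WHERE name IN ('LD_LIBRARY_PATH', 'R_HOME', 'PATH')
ORDER BY name;
```

Dropping the WHERE clause shows the full environment, which may be easier when it is unclear which variable differs.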
Here is a
The function and query I am running look like this
If I run the query outside of PL/R it runs fine. I can also run the query in plpgsql and plpython without issue. |
The error occurs inside postgres when the parallel worker is started. We have never tried to make PL/R parallel safe -- is the PL/R function explicitly created with "PARALLEL SAFE"? The default (as in not specified) would be "PARALLEL UNSAFE", but it isn't clear to me what happens to the query being executed via SPI. Another thing to try would be to add "SET max_parallel_workers_per_gather = 0" to the PL/R function definition and see if the problem goes away. Any chance you can capture a core file from the process that gets the abort (signal 6)? BTW, I assume you built postgres with "--enable-cassert" and turned on debug logging specifically for the troubleshooting, correct? You would not ordinarily want either of those for a prod system. |
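A minimal sketch of the two checks suggested above, assuming a PL/R function called my_plr_func(integer); that name and signature are hypothetical placeholders, since the real function is not shown here.

```sql
-- How is the function currently marked?
-- proparallel is 'u' (unsafe, the default), 'r' (restricted) or 's' (safe).
-- proconfig lists any per-function SET clauses already attached.
SELECT proname, proparallel, proconfig
FROM pg_proc
WHERE proname = 'my_plr_func';          -- hypothetical function name

-- Turn off parallel query for everything executed while the function runs,
-- including statements issued through pg.spi.exec.
ALTER FUNCTION my_plr_func(integer)     -- hypothetical signature
    SET max_parallel_workers_per_gather = 0;
```

The same SET clause can be written directly in CREATE FUNCTION. The setting applies only for the duration of each call and reverts when the function returns, so other queries in the session can still use parallel workers.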
This postgres is from this Docker image: https://github.com/cloudnative-pg/postgres-containers/blob/main/Debian/16/Dockerfile which in turn is from Which I assume is just the official postgres install from apt. I will attempt to extract a core file. Thanks for the PARALLEL SAFE and max_parallel_workers_per_gather pointers, I will implement that and see if that helps with our issues. |
The functions were not defined with PARALLEL SAFE. I tested: PARALLEL UNSAFE: Crash still happens. I will try deploying with the workers setting and see if our intermittent issues are resolved. I've been trying to get a core dump, but unfortunately the root filesystem is set to read-only and I think that it's preventing the core dumps from being written. |
Ok based on the above I am guessing that even though the PL/R function is PARALLEL UNSAFE (the default), the SPI query execution is done with parallel query, and somehow that tickles a bug. What is not clear to me is whether the bug is in PL/R or in Postgres itself. If you cannot get a core file, can you create a self contained example (table def, sample data -- perhaps generated with generate_series -- and PL/R function) that I can use to duplicate the crash? |
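Something along the following lines might serve as a starting skeleton for such a self-contained example. Everything in it is hypothetical (table name, column names, row count, the query, and the function name), and it is not known to reproduce the crash; whether the planner actually picks a parallel plan for the inner query depends on the table size and the server's parallel settings.

```sql
-- Hypothetical table filled with generated data.
CREATE TABLE plr_repro AS
SELECT g AS id, md5(g::text) AS payload
FROM generate_series(1, 1000000) AS g;
ANALYZE plr_repro;

-- Hypothetical PL/R function that runs a query through SPI
-- and returns the number of rows that came back.
CREATE OR REPLACE FUNCTION plr_repro_rows() RETURNS integer AS $$
  res <- pg.spi.exec("SELECT id, payload FROM plr_repro WHERE payload LIKE '%abc%'")
  nrow(res)
$$ LANGUAGE plr;

SELECT plr_repro_rows();
```

Running EXPLAIN on the inner SELECT by itself should show whether a Gather node appears, which indicates that parallel workers would be involved.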
I'm still trying to create a self contained example. However, I did deploy the fixes for pg.spi.exec, but we are still seeing the https://github.com/wch/r-source/blob/trunk/src/main/main.c#L1193 I am now trying our postgres deployment with the environment var |
Hmmm, is JIT support in R new? I don't ever remember R doing JIT before. What version of R are you using? Perhaps there is something PL/R ought to be doing to play nice with a new R feature? (looks) Seems like R has had the ability to do JIT since 2010 so not exactly new but perhaps no one typically enables it? I wonder if maybe R was compiled to use a different LLVM version than Postgres and that is causing the issue. Anyway, I think I would need a way to reproduce the issue in order to get to the bottom of it. |
We are using
I believe the default is JIT enabled. After disabling the JIT we are no longer seeing an "unable to initialize the JIT" error; it has now been replaced by We have just deployed a new way to hopefully get a core dump, so we are just waiting on a crash now. |
I faced the same problem, but it was caused by the parallelism. Once I disabled the I am using PG 16
postgres=# SELECT plr_version();
plr_version
-------------
8.4
(1 row)
postgres=# SELECT r_version();
r_version
----------------------------------------------------------------
(platform,x86_64-pc-linux-gnu)
(arch,x86_64)
(os,linux-gnu)
(system,"x86_64, linux-gnu")
(status,Patched)
(major,4)
(minor,2.2)
(year,2022)
(month,11)
(day,10)
("svn rev",83330)
(language,R)
(version.string,"R version 4.2.2 Patched (2022-11-10 r83330)")
(nickname,"Innocent and Trusting")
(14 rows)
postgres=# select version();
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------
version | PostgreSQL 16.3 (Debian 16.3-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit |
We are trying to debug some issues in our PG16 environment. They manifest themselves as a crash of postgres, and they all appear to be related to PL/R. Often the crash is so abrupt that we get no logging from it at all; sometimes we see log
Sometimes we see a log with
*** stack smashing detected ***
This appears to be from PL/R or R. Other projects' issues have mentioned that perhaps it can't find libR.so; I am unsure. Can you point me in any direction to try and figure out what is causing this?
As an aside, we have a problem in which we call pg.spi.exec and, if the data returned is over a certain size, postgres crashes. Any sort of way to diagnose this would be fantastic.
Thanks