postgres process killed with kill -9 command (looks like a bug) #396
Comments
This is a very urgent issue, as we are testing a new PostgreSQL solution. |
Hello @jatindersandhu, have you figured out anything since reporting this? We're seeing this too after moving to a new OpenShift cluster and I'm quite clueless. |
No, I basically shifted away from an operator solution. It was something between the Crunchy operator and Red Hat. |
@jatindersandhu Thanks for posting this issue/bug. Can you help me understand how to debug and/or simulate the error? |
It's difficult to say what's causing this.
It doesn't matter whether we started with an empty database or imported existing data. |
Sorry for the confusion regarding OpenShift and the Crunchy operator. It seems to be a resource issue. A kill -9 can only be issued by Kubernetes, since Postgres itself does not use kill -9 to kill its processes, because that can force a restart of the whole database. I think we need to check whether any liveness or readiness probe is configured. Sometimes OpenShift has a bug where the probe times out and Kubernetes kills Postgres. The solution for this is to increase the probe timeout, e.g. to 5 minutes. |
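For illustration, bumping the probe timeouts on the database Deployment/DeploymentConfig could look roughly like the sketch below; the container name and the pg_isready probe command are assumptions, not taken from this issue:

```yaml
# Sketch only: container name and probe commands are illustrative assumptions.
spec:
  template:
    spec:
      containers:
        - name: postgresql
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 120
            periodSeconds: 300
            timeoutSeconds: 300   # allow up to 5 minutes, as suggested above
            failureThreshold: 3
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 300
            timeoutSeconds: 300
```

Note that only a failing liveness probe makes the kubelet kill and restart the container; a failing readiness probe merely takes the pod out of the service endpoints.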
Experimenting with the probes was the first thing I did and from what I remember I was seeing the problem even without any probe, so the probe shouldn't be a problem per se. |
I wonder if we should increase logging and analyze those logs; it could reveal how much memory Postgres tries to acquire and why. Pity this one isn't merged yet: #366 |
The reporter of this issue did that; see the first comment. |
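As a rough sketch of what increased logging could look like in practice (not the approach from #366), extra postgresql.conf settings could be shipped in a ConfigMap such as the hypothetical one below; how it gets wired into the container depends on the image's configuration-extension mechanism:

```yaml
# Hypothetical ConfigMap holding extra PostgreSQL settings for debugging;
# mount it wherever the image picks up additional *.conf snippets.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-debug-cfg
data:
  debug.conf: |
    log_min_messages = debug1          # more verbose server messages
    log_error_verbosity = verbose      # include SQLSTATE and source location
    log_min_duration_statement = 0     # log every statement with its duration
```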
FTR, the service file gets this:
Is the kernel OOM killer enabled in the OpenShift environment? |
Good point Pavel, thanks. The Managing Kernel Resources chapter says:
and yes, the postgresql.service follows that with
In K8s/OpenShift the OOM score adjustment can't be turned off completely (i.e. set to -1000), but it can be set to a minimum if the pod has Guaranteed QoS, i.e. when the memory limit and memory request sizes are set the same.

I tried that and gave the pod a 2Gi memory request & limit. Also, when I gave the pod a really small amount of memory, like 60Mi (requests & limits), because I explicitly wanted to see the OOM killer in action, it was running just fine for quite some time before the killer showed up.

I more and more tend to think that it's the memory overcommit: Kubernetes actively sets vm.overcommit_memory=1 (always overcommit), while the PostgreSQL community recommends vm.overcommit_memory=2 (don't overcommit). Also, this article summarizes it very well.

The result is that I still don't have any clue how to make it work reliably :( |
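For reference, a minimal sketch of the resources stanza that yields Guaranteed QoS (requests equal to limits for both CPU and memory, mirroring the 2Gi experiment above); the container name is an assumption:

```yaml
# Sketch only: requests == limits for all resources in all containers gives
# the pod Guaranteed QoS, which maps to a very low oom_score_adj.
spec:
  template:
    spec:
      containers:
        - name: postgresql
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi
```

The vm.overcommit_memory sysctl, on the other hand, is a node-level, non-namespaced setting, so it cannot be changed from inside a pod; it has to be tuned on the nodes themselves.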
Looks like it really was a readiness probe (related) issue after all. After I, once more, tried removing the probe, the problem disappeared for good. I have no idea why it hadn't worked the first time I tried it. Maybe it's the combination of all the changes I made, which is: |
We are running PostgreSQL 12.1 inside a container (we use this image: https://catalog.redhat.com/software/containers/rhel8/postgresql-12/5db133bd5a13461646df330b). We are facing a strange situation in which some Postgres processes are killed with the kill -9 command. I enabled debug logging in PostgreSQL. At first I thought it was an OOM kill, so I raised the container's resources. It is not related to resources, as memory and CPU utilization are way under the allocated resources (e.g. 8 CPUs and 100GB of memory allocated, with only 1 CPU and 37.8GB of memory used). The issue appears randomly (sometimes the DB is not doing anything at all). One more strange thing: the process ID killed by kill -9 never appears earlier in the logs, as if the process had never been forked. Because Postgres runs inside a container (OpenShift environment), we cannot get the kernel dmesg.
Here is a snippet of the log; the full log is also attached:
:28:06 EST,session-id=600ccd26.50f5cLOCATION: SharedInvalBackendInit, sinvaladt.c:324
2021-01-23 20:28:07.648 EST [46]=[33197-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:08 EST,session-id=6009272c.2eDEBUG: 00000: recycled write-ahead log file "000000010000014900000021"
2021-01-23 20:28:07.648 EST [46]=[33198-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:08 EST,session-id=6009272c.2eLOCATION: RemoveXlogFile, xlog.c:4106
2021-01-23 20:28:07.805 EST [331612]=[13-1]user=postgres,db=postgres,app=[unknown],client=[local],session-start=2021-01-23 20:28:06 EST,session-id=600ccd26.50f5cDEBUG: 00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2021-01-23 20:28:07.805 EST [331612]=[14-1]user=postgres,db=postgres,app=[unknown],client=[local],session-start=2021-01-23 20:28:06 EST,session-id=600ccd26.50f5cLOCATION: ShowTransactionStateRec, xact.c:5327
2021-01-23 20:28:08.142 EST [1]=[292190-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1DEBUG: 00000: reaping dead processes
2021-01-23 20:28:08.142 EST [1]=[292191-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: reaper, postmaster.c:2888
2021-01-23 20:28:08.142 EST [1]=[292192-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1DEBUG: 00000: server process (PID 331611) was terminated by signal 9: Killed
2021-01-23 20:28:08.142 EST [1]=[292193-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: LogChildExit, postmaster.c:3686
2021-01-23 20:28:08.142 EST [1]=[292194-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOG: 00000: server process (PID 331611) was terminated by signal 9: Killed
2021-01-23 20:28:08.142 EST [1]=[292195-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: LogChildExit, postmaster.c:3686
2021-01-23 20:28:08.142 EST [1]=[292196-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOG: 00000: terminating any other active server processes