
G1 crash in mark_in_next_bitmap #867

Closed
Adam- opened this issue Aug 11, 2023 · 16 comments
Labels
bug (Something isn't working) · jbs:reported (Someone from our org has reported it to OpenJDK) · Waiting on OP

Comments


Adam- commented Aug 11, 2023

Please provide a brief summary of the bug

We observe rare G1 crashes in G1ConcurrentMark::mark_in_next_bitmap in at least AdoptOpenJDK/Temurin 11.0.4, 11.0.8, 11.0.16, 11.0.16.1, 11.0.18, and 11.0.19. We don't have a way to reproduce the issue, and it seemingly happens at random based on the reports sent to us by users. We have observed this specific crash 200 times on 144 different machines in the last 3 weeks.

I have included the full crash report of one of these crashes here; they are all nearly identical and share the same native frame stack.

They look like this:

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

Current thread (0x000001e522b25000):  ConcurrentGCThread "G1 Conc#0" [stack: 0x0000003ed5500000,0x0000003ed5600000] [id=2332]

Stack: [0x0000003ed5500000,0x0000003ed5600000],  sp=0x0000003ed55ffb90,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x3041e5]
V  [jvm.dll+0x3040c2]
V  [jvm.dll+0x2fbdcd]
V  [jvm.dll+0x2fb491]
V  [jvm.dll+0x30ce70]
V  [jvm.dll+0x30c764]
V  [jvm.dll+0x310c36]
V  [jvm.dll+0x812840]
V  [jvm.dll+0x79d6e4]
V  [jvm.dll+0x64e915]
C  [ucrtbase.dll+0x29363]
C  [KERNEL32.DLL+0x126ad]
C  [ntdll.dll+0x5aa68]


siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160

Mapping the dll offsets to symbols via the pdb files Adoptium provides yields this stack:

bool G1ConcurrentMark::mark_in_next_bitmap(unsigned int,class oopDesc * __ptr64 const,unsigned __int64) __ptr64
bool G1CMTask::make_reference_grey(class oopDesc * __ptr64) __ptr64
static void OopOopIterateDispatch<class G1CMOopClosure>::Table::oop_oop_iterate<class ObjArrayKlass,unsigned int>(class G1CMOopClosure * __ptr64,class oopDesc * __ptr64,class Klass * __ptr64)
int oopDesc::oop_iterate_size<class G1CMOopClosure>(class G1CMOopClosure * __ptr64) __ptr64
void G1CMTask::drain_local_queue(bool) __ptr64
void G1CMTask::do_marking_step(double,bool,bool) __ptr64
virtual void G1CMConcurrentMarkingTask::work(unsigned int) __ptr64
virtual void GangWorker::loop(void) __ptr64
void Thread::call_run(void) __ptr64
static __int64 os::thread_cpu_time(class Thread * __ptr64,bool)

For reference, the code surrounding the crash is:

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
  HeapRegion* const hr = _g1h->heap_region_containing(obj);
  return mark_in_next_bitmap(worker_id, hr, obj, obj_size);
}

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, HeapRegion* const hr, oop const obj, size_t const obj_size) {
  assert(hr != NULL, "just checking");
  assert(hr->is_in_reserved(obj), "Attempting to mark object at " PTR_FORMAT " that is not contained in the given region %u", p2i(obj), hr->hrm_index());

  if (hr->obj_allocated_since_next_marking(obj)) {
    return false;
  }
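
For context, obj_allocated_since_next_marking is the first thing the overload does after the asserts, and its read of _next_top_at_mark_start is what turns out to matter. In the JDK 11 sources it is roughly the following (paraphrased from memory, so treat the exact spelling as approximate):

// heapRegion.hpp (JDK 11, paraphrased): objects at or above the region's
// next top-at-mark-start (TAMS) were allocated after marking began, are
// implicitly live, and need no bitmap mark.
bool HeapRegion::obj_allocated_since_next_marking(oop obj) const {
  return (HeapWord*) obj >= next_top_at_mark_start();  // loads _next_top_at_mark_start
}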

I have disassembled jvm.dll to determine what is happening.

       1803041be 48 8b 41 08            MOV        RAX,qword ptr [RCX + this->_g1h]       RAX=this->_g1h
       1803041c2 4c 8b f9               MOV        R15,this
       1803041c5 4d 8b d0               MOV        R10,param_2                            R10=param_2
       1803041c8 44 8b e2               MOV        R12D,param_1
       1803041cb 49 8b e9               MOV        RBP,param_3
       1803041ce 49 8b f8               MOV        RDI,param_2
       1803041d1 8b 88 c0 02 00 00      MOV        this,dword ptr [RAX + 0x2c0]           RCX=_regions._shift_by
       1803041d7 48 8b 80 b0 02 00 00   MOV        RAX,qword ptr [RAX + 0x2b0]            RAX=_regions._biased_base
       1803041de 49 d3 ea               SHR        R10,this                               R10=param_2 >> RCX
       1803041e1 4e 8b 14 d0            MOV        R10,qword ptr [RAX + R10*0x8]          R10=biased_base[R10*8]
       1803041e5 4d 3b 82 60 01 00 00   CMP        param_2,qword ptr [R10 + 0x160]        crash here; cmp param_2 and ->_next_top_at_mark_start

The compiled code is somewhat dense because the compiler inlines the calls to heap_region_containing, addr_to_region, get_by_address, shift_by, and biased_base, as well as the overloaded call to mark_in_next_bitmap. (The disassembler renders the RCX register as this throughout.)

Inlining them in source form would look something like this:

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
  if (obj >= _g1h->_hrm._regions._biased_base[obj >> _g1h->_hrm._regions._shift_by]->_next_top_at_mark_start) {
    return false;
  }

  ...
}
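
For background, the _biased_base trick means a shifted heap address can index the region table directly, without first subtracting the heap's start address. A simplified sketch of the idea (the names and layout here are my own, modeled on G1BiasedMappedArray, not the exact JDK code):

#include <cstdint>

struct HeapRegion;  // opaque here

// Sketch of G1's biased region table: biased_base is precomputed as
// base - (heap_start >> shift), so any heap address can be shifted
// and used as an index directly.
struct RegionTable {
  HeapRegion** biased_base;  // == base - (heap_start >> shift)
  unsigned     shift;        // log2 of the region size in bytes

  HeapRegion* region_for(uintptr_t addr) const {
    // The fast path does no bounds or null check; a zero entry is
    // returned as-is, which is consistent with hr being null below.
    return biased_base[addr >> shift];
  }
};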

Note that the CMP instruction accesses qword ptr [R10 + 0x160], and the crash log shows EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160. As far as I can tell, this means the value loaded from the _biased_base array is 0x0, i.e. hr is null, and the crash is a null pointer dereference when reading _next_top_at_mark_start at offset 0x160.
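
A fault address equal to a small field offset is the classic signature of a null base pointer. A minimal standalone sketch of the effect (the struct is hypothetical, padded only to reproduce the 0x160 offset):

#include <cstddef>

// Hypothetical stand-in for HeapRegion, sized so the field of interest
// sits at offset 0x160, matching the faulting CMP above.
struct FakeHeapRegion {
  char  pad[0x160];
  void* next_top_at_mark_start;
};
static_assert(offsetof(FakeHeapRegion, next_top_at_mark_start) == 0x160, "layout");

int main() {
  FakeHeapRegion* hr = nullptr;  // the zero entry loaded from _biased_base
  // Dereferencing a null pointer at a member offset faults at
  // 0x0 + 0x160, i.e. "reading address 0x0000000000000160".
  return hr->next_top_at_mark_start != nullptr;  // deliberately crashes
}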

I have almost no understanding of the G1 GC or most of the JDK, so I don't know where to go from here.

Please provide steps to reproduce where possible

No response

Expected Results

No crash

Actual Results

Crash

What Java Version are you using?

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

What is your operating system and platform?

No response

How did you install Java?

No response

Did it work before?

No response

Did you test with the latest update version?

No response

Did you test with other Java versions?

No response

Relevant log output

No response

@Adam- Adam- added the bug Something isn't working label Aug 11, 2023
karianna (Contributor) commented:

https://bugs.openjdk.org/browse/JDK-8210557 is a possible culprit (and is only fixed in 12+). There's nothing related in the 11.0.20 release notes, so it's unlikely that upgrading to that point release will fix this. Are you able to run with 17.0.8?


Adam- commented Aug 16, 2023

> https://bugs.openjdk.org/browse/JDK-8210557 is a possible culprit (and is only fixed in 12+). There's nothing related in the 11.0.20 release notes, so it's unlikely that upgrading to that point release will fix this. Are you able to run with 17.0.8?

I think https://bugs.openjdk.org/browse/JDK-8210557 cannot be the culprit, because it only removes an assert() (https://hg.openjdk.org/jdk/jdk/rev/b177af763b82), and asserts are not compiled into the official release builds from Adoptium. As my software is deployed to many end-user systems rather than to servers I own, I cannot easily change the Java version en masse.
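
For context on why that matters: HotSpot's assert() is its own macro, not <cassert>, and it is compiled out of product builds entirely. Simplified from memory of share/utilities/debug.hpp, so treat the exact spelling as approximate:

// Sketch of HotSpot's assert (simplified; real definition in debug.hpp).
// ASSERT is defined only in debug/fastdebug builds, so in the product
// (release) builds Adoptium ships, the whole check disappears.
#ifdef ASSERT
#define assert(p, ...)                                              \
  do {                                                              \
    if (!(p)) {                                                     \
      report_vm_error(__FILE__, __LINE__, "assert(" #p ") failed",  \
                      __VA_ARGS__);                                 \
    }                                                               \
  } while (0)
#else
#define assert(p, ...)  // expands to nothing
#endif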

@karianna karianna added jbs:needs-report Waiting for someone from our org to report to OpenJDK and removed Waiting on OP labels Aug 16, 2023

Adam- commented Aug 17, 2023

Here is a similar crash in 11.0.20; however, its stack is different:

bool G1ConcurrentMark::mark_in_next_bitmap(unsigned int,class oopDesc * __ptr64 const,unsigned __int64) __ptr64
bool G1CMTask::make_reference_grey(class oopDesc * __ptr64) __ptr64
static void OopOopIterateDispatch<class G1CMOopClosure>::Table::oop_oop_iterate<class InstanceKlass,unsigned int>(class G1CMOopClosure * __ptr64,class oopDesc * __ptr64,class Klass * __ptr64)
int oopDesc::oop_iterate_size<class G1CMOopClosure>(class G1CMOopClosure * __ptr64) __ptr64
void G1CMTask::scan_task_entry(class G1TaskQueueEntry) __ptr64
bool G1CMBitMapClosure::do_addr(class HeapWord * __ptr64 const) __ptr64
bool G1CMBitMap::iterate(class G1CMBitMapClosure * __ptr64,class MemRegion) __ptr64
void G1CMTask::do_marking_step(double,bool,bool) __ptr64
virtual void G1CMConcurrentMarkingTask::work(unsigned int) __ptr64
virtual void GangWorker::loop(void) __ptr64
void Thread::call_run(void) __ptr64
static __int64 os::thread_cpu_time(class Thread * __ptr64,bool)

g1_crash_11020.txt

@karianna karianna added jbs:reported Someone from our org has reported it to OpenJDK and removed jbs:needs-report Waiting for someone from our org to report to OpenJDK labels Aug 20, 2023
karianna (Contributor) commented:

@Adam- Can you try running with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC and send in the results after a crash?
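
(Note: VerifyBeforeGC and VerifyAfterGC are diagnostic flags in HotSpot, so as far as I know they also need to be unlocked, along these lines:)

java -XX:+UnlockDiagnosticVMOptions -XX:+VerifyBeforeGC -XX:+VerifyAfterGC ...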


Adam- commented Aug 21, 2023

> @Adam- Can you try running with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC and send in the results after a crash?

The performance degradation caused by these options is not something we can deploy widely. I will try to get some individual users who are experiencing this issue to run with them.

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label Nov 20, 2023
karianna (Contributor) commented:

@Adam- Any luck debugging this? It might also be worth trying 11.0.21 and seeing if it was fixed some other way.

@karianna karianna removed the stale label Nov 20, 2023

Adam- commented Nov 20, 2023

> @Adam- Any luck debugging this? It might also be worth trying 11.0.21 and seeing if it was fixed some other way.

I have not made any further progress debugging this. However, we have linked AMD CPB (Core Performance Boost) to some of the lower-volume crashes we receive (not this crash specifically; I only filed a bug for this one issue because it is observed on many machines with various processors, including many Intel ones). I still see this crash commonly, with 84 crashes in the last 2 weeks. I only just began deploying 11.0.21 yesterday, so only about 3% of my VMs have it, but I have not observed this crash on it yet.


Adam- commented Nov 25, 2023

I still see this crash; here is one from a macOS machine: g1-11.0.21.txt. The stack is slightly different, but it still crashes in mark_in_next_bitmap:

V  [libjvm.dylib+0x3113a2]  G1ConcurrentMark::mark_in_next_bitmap(unsigned int, HeapRegion*, oopDesc*, unsigned long)+0xa
V  [libjvm.dylib+0x31c0f7]  G1CMTask::make_reference_grey(oopDesc*)+0x43
V  [libjvm.dylib+0x312833]  void OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, unsigned int>(G1CMOopClosure*, oopDesc*, Klass*)+0x7b
V  [libjvm.dylib+0x31b194]  void G1CMTask::process_grey_task_entry<true>(G1TaskQueueEntry)+0x158
V  [libjvm.dylib+0x315b57]  G1CMBitMapClosure::do_addr(HeapWord*)+0x27
V  [libjvm.dylib+0x31a8d0]  G1CMBitMap::iterate(G1CMBitMapClosure*, MemRegion)+0x1fe
V  [libjvm.dylib+0x31a1d4]  G1CMTask::do_marking_step(double, bool, bool)+0x1da
V  [libjvm.dylib+0x31b5ce]  G1CMConcurrentMarkingTask::work(unsigned int)+0xa6
V  [libjvm.dylib+0x7f250b]  GangWorker::loop()+0x43
V  [libjvm.dylib+0x77e8c4]  Thread::call_run()+0x68
V  [libjvm.dylib+0x62f7a7]  thread_native_entry(Thread*)+0x122
C  [libsystem_pthread.dylib+0x6202]  _pthread_start+0x63
C  [libsystem_pthread.dylib+0x1bab]  thread_start+0xf

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label Feb 24, 2024
karianna (Contributor) commented:

@Adam- 11.0.22 has been released if you want to give that a go.

@karianna karianna removed the stale label Feb 25, 2024

Adam- commented Feb 26, 2024

I've seen 40 crashes in the last week from various hosts on 11.0.22:

      1 host 12th Gen Intel(R) Core(TM) i5-12600K, 16 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host 12th Gen Intel(R) Core(TM) i7-12700K, 20 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host 13th Gen Intel(R) Core(TM) i5-13600K, 20 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G  , 4 cores, 6G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3393) jdk 11.0.22+7
      1 host AMD Ryzen 5 2600 Six-Core Processor            , 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host AMD Ryzen 5 3600 6-Core Processor              , 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      7 host AMD Ryzen 5 5600X 6-Core Processor             , 12 cores, 15G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host AMD Ryzen 7 3700X 8-Core Processor             , 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD Ryzen 7 5800X 8-Core Processor             , 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD Ryzen 7 5800X 8-Core Processor             , 16 cores, 31G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host AMD Ryzen 9 7950X3D 16-Core Processor          , 32 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      5 host Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      5 host Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.1348) jdk 11.0.22+7
      3 host Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 8 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 8 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-9700KF CPU @ 3.60GHz, 8 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz, 20 cores, 31G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.2546) jdk 11.0.22+7

Here they are, along with their stacktraces:
g1-crashes-Feb24.tar.gz

Only about 14% of my total VMs are on 11.0.22.

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label May 27, 2024
@karianna karianna removed the stale label May 28, 2024
karianna (Contributor) commented:

@Adam- Unfortunately, we'll need some crash reports generated with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC to send to the OpenJDK folks.


Adam- commented May 28, 2024

I don't think I am able to get those. It is fine to just close this if you want.
