
G1 crash in mark_in_next_bitmap #867

Closed
Adam- opened this issue Aug 11, 2023 · 16 comments
Labels
bug (Something isn't working) · jbs:reported (Someone from our org has reported it to OpenJDK) · Waiting on OP

Comments


Adam- commented Aug 11, 2023

Please provide a brief summary of the bug

We observe rare G1 crashes in G1ConcurrentMark::mark_in_next_bitmap in at least AdoptOpenJDK/Temurin 11.0.4, 11.0.8, 11.0.16, 11.0.16.1, 11.0.18, and 11.0.19. We don't have a way to reproduce the issue, and it seemingly happens at random based on the reports sent to us by users. We have observed this specific crash 200 times on 144 different machines in the last 3 weeks.

I have included the full crash report of one of these crashes here; they are all nearly identical and share the same native frame stack.

They look like this:

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

Current thread (0x000001e522b25000):  ConcurrentGCThread "G1 Conc#0" [stack: 0x0000003ed5500000,0x0000003ed5600000] [id=2332]

Stack: [0x0000003ed5500000,0x0000003ed5600000],  sp=0x0000003ed55ffb90,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x3041e5]
V  [jvm.dll+0x3040c2]
V  [jvm.dll+0x2fbdcd]
V  [jvm.dll+0x2fb491]
V  [jvm.dll+0x30ce70]
V  [jvm.dll+0x30c764]
V  [jvm.dll+0x310c36]
V  [jvm.dll+0x812840]
V  [jvm.dll+0x79d6e4]
V  [jvm.dll+0x64e915]
C  [ucrtbase.dll+0x29363]
C  [KERNEL32.DLL+0x126ad]
C  [ntdll.dll+0x5aa68]


siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160

Mapping the dll offsets to symbols via the pdb files Adoptium provides yields this stack:

bool G1ConcurrentMark::mark_in_next_bitmap(unsigned int,class oopDesc * __ptr64 const,unsigned __int64) __ptr64
bool G1CMTask::make_reference_grey(class oopDesc * __ptr64) __ptr64
static void OopOopIterateDispatch<class G1CMOopClosure>::Table::oop_oop_iterate<class ObjArrayKlass,unsigned int>(class G1CMOopClosure * __ptr64,class oopDesc * __ptr64,class Klass * __ptr64)
int oopDesc::oop_iterate_size<class G1CMOopClosure>(class G1CMOopClosure * __ptr64) __ptr64
void G1CMTask::drain_local_queue(bool) __ptr64
void G1CMTask::do_marking_step(double,bool,bool) __ptr64
virtual void G1CMConcurrentMarkingTask::work(unsigned int) __ptr64
virtual void GangWorker::loop(void) __ptr64
void Thread::call_run(void) __ptr64
static __int64 os::thread_cpu_time(class Thread * __ptr64,bool)

For reference, the code surrounding the crash is:

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
  HeapRegion* const hr = _g1h->heap_region_containing(obj);
  return mark_in_next_bitmap(worker_id, hr, obj, obj_size);
}

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, HeapRegion* const hr, oop const obj, size_t const obj_size) {
  assert(hr != NULL, "just checking");
  assert(hr->is_in_reserved(obj), "Attempting to mark object at " PTR_FORMAT " that is not contained in the given region %u", p2i(obj), hr->hrm_index());

  if (hr->obj_allocated_since_next_marking(obj)) {
    return false;
  }
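
For context, obj_allocated_since_next_marking is the first thing the overload does after the asserts, and its read of _next_top_at_mark_start is what turns out to matter. In the JDK 11 sources it is roughly the following (paraphrased from memory, so treat the exact spelling as approximate):

// heapRegion.hpp (JDK 11, paraphrased): objects at or above the region's
// next top-at-mark-start (TAMS) were allocated after marking began, are
// implicitly live, and need no bitmap mark.
bool HeapRegion::obj_allocated_since_next_marking(oop obj) const {
  return (HeapWord*) obj >= next_top_at_mark_start();  // loads _next_top_at_mark_start
}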

I have disassembled jvm.dll to determine what is happening.

       1803041be 48 8b 41 08            MOV        RAX,qword ptr [RCX + this->_g1h]       RAX=this->_g1h
       1803041c2 4c 8b f9               MOV        R15,this
       1803041c5 4d 8b d0               MOV        R10,param_2                            R10=param_2
       1803041c8 44 8b e2               MOV        R12D,param_1
       1803041cb 49 8b e9               MOV        RBP,param_3
       1803041ce 49 8b f8               MOV        RDI,param_2
       1803041d1 8b 88 c0 02 00 00      MOV        this,dword ptr [RAX + 0x2c0]           RCX=_regions._shift_by
       1803041d7 48 8b 80 b0 02 00 00   MOV        RAX,qword ptr [RAX + 0x2b0]            RAX=_regions._biased_base
       1803041de 49 d3 ea               SHR        R10,this                               R10=param_2 >> RCX
       1803041e1 4e 8b 14 d0            MOV        R10,qword ptr [RAX + R10*0x8]          R10=biased_base[R10*8]
       1803041e5 4d 3b 82 60 01 00 00   CMP        param_2,qword ptr [R10 + 0x160]        crash here; cmp param_2 and ->_next_top_at_mark_start

The compiled code is somewhat dense because the compiler inlines the calls to heap_region_containing, addr_to_region, get_by_address, shift_by, and biased_base, as well as the overloaded call to mark_in_next_bitmap. (The disassembler renders the RCX register as this throughout.)

Inlining them in source form would look something like this:

inline bool G1ConcurrentMark::mark_in_next_bitmap(uint const worker_id, oop const obj, size_t const obj_size) {
  if (obj >= _g1h->_hrm._regions._biased_base[obj >> _g1h->_hrm._regions._shift_by]->_next_top_at_mark_start) {
    return false;
  }

  ...
}
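
For background, the _biased_base trick means a shifted heap address can index the region table directly, without first subtracting the heap's start address. A simplified sketch of the idea (the names and layout here are my own, modeled on G1BiasedMappedArray, not the exact JDK code):

#include <cstdint>

struct HeapRegion;  // opaque here

// Sketch of G1's biased region table: biased_base is precomputed as
// base - (heap_start >> shift), so any heap address can be shifted
// and used as an index directly.
struct RegionTable {
  HeapRegion** biased_base;  // == base - (heap_start >> shift)
  unsigned     shift;        // log2 of the region size in bytes

  HeapRegion* region_for(uintptr_t addr) const {
    // The fast path does no bounds or null check; a zero entry is
    // returned as-is, which is consistent with hr being null below.
    return biased_base[addr >> shift];
  }
};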

Note that the CMP instruction accesses qword ptr [R10 + 0x160], and the crash log shows EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0x0000000000000160. As far as I can tell, this means the value loaded from the _biased_base array is 0x0, i.e. hr is null, and the crash is a null pointer dereference when reading _next_top_at_mark_start at offset 0x160.
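
A fault address equal to a small field offset is the classic signature of a null base pointer. A minimal standalone sketch of the effect (the struct is hypothetical, padded only to reproduce the 0x160 offset):

#include <cstddef>

// Hypothetical stand-in for HeapRegion, sized so the field of interest
// sits at offset 0x160, matching the faulting CMP above.
struct FakeHeapRegion {
  char  pad[0x160];
  void* next_top_at_mark_start;
};
static_assert(offsetof(FakeHeapRegion, next_top_at_mark_start) == 0x160, "layout");

int main() {
  FakeHeapRegion* hr = nullptr;  // the zero entry loaded from _biased_base
  // Dereferencing a null pointer at a member offset faults at
  // 0x0 + 0x160, i.e. "reading address 0x0000000000000160".
  return hr->next_top_at_mark_start != nullptr;  // deliberately crashes
}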

I have almost no understanding of the G1 GC or most of the JDK, so I don't know where to go from here.

Please provide steps to reproduce where possible

No response

Expected Results

No crash

Actual Results

Crash

What Java Version are you using?

Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (11.0.19+7, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)

What is your operating system and platform?

No response

How did you install Java?

No response

Did it work before?

No response

Did you test with the latest update version?

No response

Did you test with other Java versions?

No response

Relevant log output

No response

@Adam- Adam- added the bug Something isn't working label Aug 11, 2023
karianna (Contributor) commented:

https://bugs.openjdk.org/browse/JDK-8210557 is a possible culprit (and is only fixed in 12+). There's nothing related in the 11.0.20 release notes, so it's unlikely that upgrading to that point release will fix this. Are you able to run with 17.0.8?


Adam- commented Aug 16, 2023

> https://bugs.openjdk.org/browse/JDK-8210557 is a possible culprit (and is only fixed in 12+). There's nothing related in the 11.0.20 release notes, so it's unlikely that upgrading to that point release will fix this. Are you able to run with 17.0.8?

I think https://bugs.openjdk.org/browse/JDK-8210557 cannot be the culprit, because it only removes an assert() (https://hg.openjdk.org/jdk/jdk/rev/b177af763b82), and asserts are not compiled into the official release builds from Adoptium. As my software is deployed to many end-user systems rather than to servers I own, I cannot easily change the Java version en masse.
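
For context on why that matters: HotSpot's assert() is its own macro, not <cassert>, and it is compiled out of product builds entirely. Simplified from memory of share/utilities/debug.hpp, so treat the exact spelling as approximate:

// Sketch of HotSpot's assert (simplified; real definition in debug.hpp).
// ASSERT is defined only in debug/fastdebug builds, so in the product
// (release) builds Adoptium ships, the whole check disappears.
#ifdef ASSERT
#define assert(p, ...)                                              \
  do {                                                              \
    if (!(p)) {                                                     \
      report_vm_error(__FILE__, __LINE__, "assert(" #p ") failed",  \
                      __VA_ARGS__);                                 \
    }                                                               \
  } while (0)
#else
#define assert(p, ...)  // expands to nothing
#endif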

@karianna karianna added jbs:needs-report Waiting for someone from our org to report to OpenJDK and removed Waiting on OP labels Aug 16, 2023

Adam- commented Aug 17, 2023

Here is a similar crash in 11.0.20; however, its stack is different:

bool G1ConcurrentMark::mark_in_next_bitmap(unsigned int,class oopDesc * __ptr64 const,unsigned __int64) __ptr64
bool G1CMTask::make_reference_grey(class oopDesc * __ptr64) __ptr64
static void OopOopIterateDispatch<class G1CMOopClosure>::Table::oop_oop_iterate<class InstanceKlass,unsigned int>(class G1CMOopClosure * __ptr64,class oopDesc * __ptr64,class Klass * __ptr64)
int oopDesc::oop_iterate_size<class G1CMOopClosure>(class G1CMOopClosure * __ptr64) __ptr64
void G1CMTask::scan_task_entry(class G1TaskQueueEntry) __ptr64
bool G1CMBitMapClosure::do_addr(class HeapWord * __ptr64 const) __ptr64
bool G1CMBitMap::iterate(class G1CMBitMapClosure * __ptr64,class MemRegion) __ptr64
void G1CMTask::do_marking_step(double,bool,bool) __ptr64
virtual void G1CMConcurrentMarkingTask::work(unsigned int) __ptr64
virtual void GangWorker::loop(void) __ptr64
void Thread::call_run(void) __ptr64
static __int64 os::thread_cpu_time(class Thread * __ptr64,bool)

g1_crash_11020.txt

@karianna karianna added jbs:reported Someone from our org has reported it to OpenJDK and removed jbs:needs-report Waiting for someone from our org to report to OpenJDK labels Aug 20, 2023
karianna (Contributor) commented:

@Adam- Can you try running with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC and send in the results after a crash?
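
(Note: VerifyBeforeGC and VerifyAfterGC are diagnostic flags in HotSpot, so as far as I know they also need to be unlocked, along these lines:)

java -XX:+UnlockDiagnosticVMOptions -XX:+VerifyBeforeGC -XX:+VerifyAfterGC ...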


Adam- commented Aug 21, 2023

> @Adam- Can you try running with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC and send in the results after a crash?

The performance degradation caused by these options is not something we can deploy widely. I will try to get some individual users who are experiencing this issue to run with them.

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label Nov 20, 2023
karianna (Contributor) commented:

@Adam- Any luck debugging this? It might also be worth trying 11.0.21 and seeing if it was fixed some other way.

@karianna karianna removed the stale label Nov 20, 2023

Adam- commented Nov 20, 2023

> @Adam- Any luck debugging this? It might also be worth trying 11.0.21 and seeing if it was fixed some other way.

I have not made any further progress debugging this. However, we have linked AMD CPB (Core Performance Boost) to some of the lower-volume crashes we receive (not this crash specifically; I only filed a bug for this one issue because it is observed on many machines with various processors, including many Intel ones). I still see this crash commonly, with 84 crashes in the last 2 weeks. I only just began deploying 11.0.21 yesterday, so only about 3% of my VMs have it, but I have not observed this crash on it yet.


Adam- commented Nov 25, 2023

I still see this crash; here is one from a macOS machine: g1-11.0.21.txt. The stack is slightly different, but it still crashes in mark_in_next_bitmap:

V  [libjvm.dylib+0x3113a2]  G1ConcurrentMark::mark_in_next_bitmap(unsigned int, HeapRegion*, oopDesc*, unsigned long)+0xa
V  [libjvm.dylib+0x31c0f7]  G1CMTask::make_reference_grey(oopDesc*)+0x43
V  [libjvm.dylib+0x312833]  void OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, unsigned int>(G1CMOopClosure*, oopDesc*, Klass*)+0x7b
V  [libjvm.dylib+0x31b194]  void G1CMTask::process_grey_task_entry<true>(G1TaskQueueEntry)+0x158
V  [libjvm.dylib+0x315b57]  G1CMBitMapClosure::do_addr(HeapWord*)+0x27
V  [libjvm.dylib+0x31a8d0]  G1CMBitMap::iterate(G1CMBitMapClosure*, MemRegion)+0x1fe
V  [libjvm.dylib+0x31a1d4]  G1CMTask::do_marking_step(double, bool, bool)+0x1da
V  [libjvm.dylib+0x31b5ce]  G1CMConcurrentMarkingTask::work(unsigned int)+0xa6
V  [libjvm.dylib+0x7f250b]  GangWorker::loop()+0x43
V  [libjvm.dylib+0x77e8c4]  Thread::call_run()+0x68
V  [libjvm.dylib+0x62f7a7]  thread_native_entry(Thread*)+0x122
C  [libsystem_pthread.dylib+0x6202]  _pthread_start+0x63
C  [libsystem_pthread.dylib+0x1bab]  thread_start+0xf

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label Feb 24, 2024
karianna (Contributor) commented:

@Adam- 11.0.22 has been released if you want to give that a go.

@karianna karianna removed the stale label Feb 25, 2024

Adam- commented Feb 26, 2024

I've seen 40 crashes in the last week from various hosts on 11.0.22:

      1 host 12th Gen Intel(R) Core(TM) i5-12600K, 16 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host 12th Gen Intel(R) Core(TM) i7-12700K, 20 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host 13th Gen Intel(R) Core(TM) i5-13600K, 20 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G  , 4 cores, 6G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3393) jdk 11.0.22+7
      1 host AMD Ryzen 5 2600 Six-Core Processor            , 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host AMD Ryzen 5 3600 6-Core Processor              , 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      7 host AMD Ryzen 5 5600X 6-Core Processor             , 12 cores, 15G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host AMD Ryzen 7 3700X 8-Core Processor             , 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD Ryzen 7 5800X 8-Core Processor             , 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      2 host AMD Ryzen 7 5800X 8-Core Processor             , 16 cores, 31G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host AMD Ryzen 9 7950X3D 16-Core Processor          , 32 cores, 63G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      5 host Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      5 host Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.1348) jdk 11.0.22+7
      3 host Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 4 cores, 7G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 8 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 8 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 12 cores, 15G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i7-9700KF CPU @ 3.60GHz, 8 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.3636) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz, 20 cores, 31G,  Windows 11 , 64 bit Build 22621 (10.0.22621.3085) jdk 11.0.22+7
      1 host Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 16 cores, 31G,  Windows 10 , 64 bit Build 19041 (10.0.19041.2546) jdk 11.0.22+7

Here they are, along with their stacktraces:
g1-crashes-Feb24.tar.gz

Only about 14% of my total VMs are on 11.0.22.

github-actions bot commented:

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable.
It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

@github-actions github-actions bot added the stale label May 27, 2024
@karianna karianna removed the stale label May 28, 2024
karianna (Contributor) commented:

@Adam- Unfortunately, we'll need some crash reports generated with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC to send to the OpenJDK folks.


Adam- commented May 28, 2024

I don't think I am able to get those. It is fine to just close this if you want.
