Crashes from Todd Coxeter #1479
Comments
I cannot reproduce this, neither in master nor in stable. How long do you wait (roughly) till you abort |
On 32-bit Linux I get
not a SEGFAULT, though, but still a bit pathological. |
@dimpase I also get that, and it's exactly what I'd expect to get on a 32bit machine. So I don't think it is whatever issue @frankluebeck is seeing... But then, who knows, given that he has told us relatively little about this problem :-/. |
Why would the 2nd run lead to the error, unless there is a memory leak?
This is what I found strange.
|
By the time GAP shows "Error, reached the pre-set memory limit", it has already increased the workspace size. This is because the code which shows the error message itself needs memory, and GAP has just run out of it. So it increases the workspace, then shows the message. I assume what happens here is that the second time you run out of memory (note that the first time, the coset enumeration tried a bigger table 4 times, now it does so 5 times), it cannot double the workspace again due to the limited 32bit address space. BTW, I just ran the test again, but this time on a Linux machine, and there, instead of the "gap: cannot extend the workspace any more!" message, it actually does segfault! But at least on a cursory glance, I doubt this has anything to do with Todd-Coxeter; it looks more like a bug in how we enlarge the workspace, or perhaps a problem on/with Linux... But perhaps I am wrong. |
So on the Linux machines, when GAP segfaults it has ca 1.5 GB of RAM allocated; I assume doubling again puts it around 3 GB, and 32bit Linux can probably not accommodate that. But still, this simpler test "works" correctly, so I am surprised:
So perhaps something in the Todd-Coxeter code still is at fault here. Hm. |
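The doubling argument in the two comments above can be sketched in C. This is only an illustration under stated assumptions: the function name `try_double_workspace` and the idea of a single "usable bytes" limit are hypothetical and not taken from the GAP kernel, which may grow its workspace differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (not GAP's code): the workspace doubles on each
 * growth request. Returns 1 and stores the doubled size in *out if it
 * still fits below `usable`, the number of bytes the process can map;
 * returns 0 and leaves *out untouched otherwise. */
static int try_double_workspace(uint64_t current, uint64_t usable,
                                uint64_t *out)
{
    assert(out != 0);
    /* Compare against usable / 2 so the check itself cannot overflow. */
    if (current > usable / 2)
        return 0;
    *out = current * 2;
    return 1;
}
```

With roughly 3 GB of usable address space, a 768 MB workspace can still double to 1.5 GB, but the next doubling is refused, which matches the point at which the failures are reported.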
Seems the crash is inside GASMAN's
|
Just checking, are these crashes coming from running in Virtual Machines? I ask because I'm having trouble reproducing them, but if you were running on a VM with a small amount of RAM and no swap, we might just be running the machine out of memory and triggering Linux's OOM killer (which we can't really do anything about). |
No VM involved over here. |
I also cannot reproduce this on 64-bit Linux (non-VM). |
Well, of course, you'll be hard pressed to exhaust a 64bit address space |
If anyone can reproduce this in a debugger, please provide a backtrace. Also, a guide to reproducing it would be useful. |
Just to say, there are (I feel) two different issues here. The segfault shouldn't be happening. The |
I haven't duplicated this bug exactly, but I have got a similar crash by making increasingly large plists. The problem is that So, I would suggest for safety |
Here is a fast way I can make GAP crash (sometimes):
Note: don't reset |
So, I would suggest for safety GrowPlist should check the size it is being asked for, and also ResizeBag should for safety also check the memory allocation being asked for.
Sounds like a good plan.
|
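The suggestion that GrowPlist and ResizeBag should check the requested size could look roughly like the following. This is a hedged sketch, not the code from GAP or from any PR: the helper `plist_bytes_checked`, the constant `MAX_BAG_BYTES`, and the "one length slot plus `len` entries" layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on the size of a single bag, in bytes. The value is
 * a placeholder chosen for this sketch. */
#define MAX_BAG_BYTES ((uint64_t)1 << 28)

/* Compute the byte size of a list with `len` entries plus one length
 * slot, each `word_size` bytes wide. Returns 1 and stores the size in
 * *bytes if it stays within MAX_BAG_BYTES; returns 0 otherwise. */
static int plist_bytes_checked(uint64_t len, uint64_t word_size,
                               uint64_t *bytes)
{
    assert(bytes != 0 && word_size != 0);
    /* Reject before multiplying, so the product can never wrap. */
    if (len >= MAX_BAG_BYTES / word_size)
        return 0;
    *bytes = (len + 1) * word_size;
    return 1;
}
```

The point of the design is that the limit is tested on the element count, before any multiplication, so an oversized request is refused instead of silently wrapping to a small allocation.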
My temptation is to cap all bag sizes at |
What is 32 bit GAP still really needed for? On a 16GB machine you can do anything in 64 bit that you can do in 32 bit. If we could withdraw 32 bit support life would be significantly easier. If not, can we simplify it by doing as Chris suggested and maybe limiting total workspace size to 2GB (avoids a whole lot of signed/unsigned issues) ? Are there tasks for which one really needs a 32 bit GAP with a 3GB workspace? |
On Sun, Jul 16, 2017 at 6:05 AM, Steve Linton wrote:
> What is 32 bit GAP still really needed for?
Things like (older?) Raspberry Pi and other low-end ARM boxes are 32-bit.
> On a 16GB machine you can do anything in 64 bit that you can do in 32 bit. If we could withdraw 32 bit support life would be significantly easier. If not, can we simplify it by doing as Chris suggested and maybe limiting total workspace size to 2GB (avoids a whole lot of signed/unsigned issues)?
This makes perfect sense to me.
|
Anybody planning to work on this? @ChrisJefferson ? |
I'll pick this back up. |
I can produce in all versions of 32-bit GAP similar behaviour to the original example with
I would certainly not give up support for 32-bit GAP in general (which is still faster in many cases and needs less memory in many cases, even on 64-bit systems). Restricting bags to 2^28 bytes may be ok if this helps to resolve this issue. (Although, when working on strings I made sure that strings up to 2^30 bytes can be handled.) |
On Mac OS X, I get this:
On Linux, I get:
With PR #2039 however, I get:
So that seems to have fixed the issue. (Note that the PR only restricts plists, so a string should still be able to grow to 2^30 bytes, assuming that indeed worked flawlessly before). |
Can anyone who can reproduce this crash try running in gdb (preferably after turning off optimisation in GAP -- go into GNUmakefile and delete |
There are two crashes here. The first is fixed by PR #2039. You'll never even get to the memmove crash unless you build with that PR. Forget about ward and HPC-GAP, they have nothing to do with this issue. |
On the TC example, with PR #2039 and ABI=32 with debugging on I still get a crash at memmove called from GC:
|
@dimpase but is this the same crash as before, or a new one? Because the above suggests a crash inside of To get better debugging output, I also suggest to configure GAP via Also, when starting GAP, try starting it with the option |
I don't know whether it's a new crash or not (figuring it out now, these tests are very slow). We don't seem to have a detailed record of where exactly the TC crash occurs. (runs with As the glibc patch only affects 32-bit applications, no surprise that on a 64-bit system |
@dimpase ohhh, you were testing the TC crash -- but that's not at all what I've been doing. Rather, there is a much, much simpler test case which triggers the problem, and which I used for all testing, namely this one (mentioned throughout this issue and also PR #2039): l:=[1]; while true do Print("*\c");Append(l,l);od; Using that makes it a matter of seconds to reproduce the crash (you have to re-execute it 2-3 times to get the crash). |
on
|
@dimpase thanks for your continued help, however, please (re)read the discussion here and on PR #2039, we are repeating a lot of things which were already settled before. Specifically: there are two crashes. The first one is caused by a plist growing "too big", leading to an overflow in The crash you just pasted was made with git revision Finally, could you please clarify whether these recent results are with the patched |
It is the patched glibc. The commit number on the GAP tree is misleading, it is a somewhat clumsy merge commit of the 1st version of the PR #2039 (i.e. without 6b07c8d). With Now I am trying to get a more meaningful trace of the TC crash in this settings, with |
so, on GAP built with
I suppose I can also build glibc with |
It seems that with PR #2039 and the glibc patch there is still a bug in (or triggered by) the workspace increase request. Namely, with
|
It's quite possible that there is a third bug at work here (or even more). Will investigate further. |
In that backtrace of yours, is memmove((void *)dst, (void *)DATA(header),
sizeof(Obj) * WORDS_BAG(old_size)); If so, the crash seems to be inside |
OK, turns out there is indeed a third bug lurking here:
    BagHeader * newHeader = (BagHeader *)AllocBags;
    AllocBags = DATA(newHeader) + WORDS_BAG(new_size); // <-- this overflows
Now, that overflow should have been detected and avoided by this check:
    /* check that enough storage for the new bag is available */
    if ( SizeAllocationArea < WORDS_BAG(sizeof(BagHeader)+new_size)
      && CollectBags( new_size, 0 ) == 0 ) {
        return 0;
    }
But this fails to prevent it, because
    Bag * stopBags = AllocBags + WORDS_BAG(sizeof(BagHeader)+size);
I guess those might be the overflows @ChrisJefferson mentioned? Anyway, I'll try to think about how to deal with this. |
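To make the overflow in the comment above concrete, here is a small model that treats 32-bit addresses as `uint32_t` values. It is a sketch under assumptions: `bump_end_naive` and `bump_alloc_checked` are hypothetical names, and the real GASMAN code works on `Bag *` pointers and word counts rather than on byte addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Naive end-of-allocation computation: on a 32-bit address space the
 * sum wraps around modulo 2^32 when alloc + nbytes is too large. */
static uint32_t bump_end_naive(uint32_t alloc, uint32_t nbytes)
{
    return alloc + nbytes;
}

/* Checked variant: compare the request against the room that is left,
 * instead of forming alloc + nbytes first. Returns 1 and stores the
 * new allocation pointer on success, 0 if the request does not fit. */
static int bump_alloc_checked(uint32_t alloc, uint32_t end,
                              uint32_t nbytes, uint32_t *new_alloc)
{
    assert(alloc <= end && new_alloc != 0);
    if (nbytes > end - alloc)   /* end - alloc cannot wrap here */
        return 0;
    *new_alloc = alloc + nbytes;
    return 1;
}
```

For example, with the allocation pointer at 0xF0000000 and a request of 0x20000000 bytes, the naive sum wraps to 0x10000000, which looks like a valid address below the end of the area, while the checked variant rejects the request.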
We still have the problem of |
As @ChrisJefferson's comment already suggests, this is not fully "fixed" -- although I'd say the remaining issue now is really an operating system bug, as it only appears when using certain GNU libc versions (all released in the past few years...). So, what is missing then is a workaround for that. Normally, the "proper" way to address such things is to write a C program testing for the issue, then use that to create an autoconf test to detect if the problem appears, and then activate the workaround. However, such a test would require us to allocate 2 GB of RAM or so, and I don't think that's practical here, as e.g. the active system might only have, say, 1 GB RAM (think about people trying to compile and run GAP on their Raspberry Pi or similar systems). The workaround itself could consist of a custom memmove function, aye. However, we'd have to take great care with that, as compilers really try hard to detect memory move patterns, and "optimize" them into custom builtin code and/or calls to memcpy and memmove, bringing back the original problem. In fact, this exact thing happened to me when I tried to track down the bug -- I had to tune down the optimization level of the compiler, otherwise my alternative memory moving code also ran into errors. However, we can't stop people from turning on arbitrary compiler optimizations for GAP, so we would need to use some other method to prevent the compiler from optimizing our custom memmove. And of course that would also mean our custom memmove would be slower -- so e.g. garbage collection on 32bit systems for a GAP with a large heap will take a performance hit. I am afraid there is not much we can do about that (unless we were willing to add our own assembler implementations of BTW, @ChrisJefferson this is one of the reasons I recently changed some |
How would we feel about limiting 32 bit systems to 2GB of workspace? It's potentially limiting for people with between 2 and 8GB of RAM available, and might slow down some computations that can be completed in 2-4GB on a 32 bit system, but it continues the main purpose of 32 bit GAP (supporting really small systems) and avoids a whole load of bugs. |
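One possible shape for such a custom memmove, sketched here only as an assumption and not as what GAP adopted, is a byte-wise copy through `volatile` pointers. Volatile accesses must be performed as written, so a compiler should not collapse the loops back into a library `memmove` or `memcpy` call. The name `gap_memmove_sketch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical replacement for memmove (not from the GAP sources).
 * Copies n bytes from src to dst and handles overlapping ranges. */
static void *gap_memmove_sketch(void *dst, const void *src, size_t n)
{
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;
    size_t i;

    assert(n == 0 || (dst != 0 && src != 0));
    if ((uintptr_t)dst < (uintptr_t)src) {
        /* Destination below source: copy forwards. */
        for (i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)dst > (uintptr_t)src) {
        /* Destination above source: copy backwards so that overlapping
         * bytes are read before they are overwritten. */
        for (i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

As the comment above anticipates, this trades speed for predictability: one byte per iteration is far slower than an optimized library routine, so it would only make sense on affected 32-bit builds.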
I think Windows users might be most disappointed if this happens today, but if by GAP 4.10 we will make 64-bit version for Windows a standard part of GAP distribution (#2112) then fine. |
Note that any 32-bit program has at best a 3GB memory space (on linux or windows), so we are "only" limiting from 3GB down to 2GB. |
No, my default GAP is a 32-bit compiled GAP on a large 64-bit machine, and there I can start it with 3.9GB. (The mentioned limitation is true on a native 32-bit operating system.) Nevertheless, if it solved a tricky problem I would find a limitation to 2 GB for 32-bit GAP acceptable. But I cannot confirm this statement:
My system (x86_64 with Debian 8) is not affected by the |
I don't see how limiting the heap to 2GB would help, it could still end up crossing the boundary between positive and negative addresses, no? |
@frankluebeck : I think there are further bugs to discover, related to filling up memory, we are just fixing each in turn. I didn't know you could have a 4GB memory space, it turns out (on googling) that is an option that some linux distributions enable. @fingolfin : I think we meant limiting the heap to the bottom 2GB, which might in practice mean you can only have 1GB of GAP heap, as I think by default mmap starts at 1GB (at least for me). |
@fingolfin You're right, of course. I first encountered this when I ran into bugs about signed pointer differences within the workspace, which is about total workspace size. If the problem is with workspaces crossing the 2GB address, then, in principle, any size of workspace could have the problem, and my idea is no use. @ChrisJefferson limiting to 1GB of workspace seems pretty useless, and there is no guarantee of the workspace landing up at 1GB start when there are shared libraries etc. around. |
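The distinction drawn in the last few comments, between a workspace that is too large for a signed 32-bit difference and a workspace that merely straddles the 2GB address, can be modelled as follows. Both helper names are hypothetical, and addresses are represented as plain `uint32_t` values rather than real pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Does the distance end - start fit into a signed 32-bit integer?
 * This depends only on the total size of the span. */
static int span_fits_signed32(uint32_t start, uint32_t end)
{
    assert(start <= end);
    return (end - start) <= (uint32_t)INT32_MAX;
}

/* Does the range [start, start + size) cross the 2GB address, i.e. the
 * point where a 32-bit address read as signed turns negative? This
 * depends on where the span lies, not on how big it is. */
static int crosses_2gb(uint32_t start, uint32_t size)
{
    uint64_t end = (uint64_t)start + size;   /* 64-bit sum cannot wrap */
    return start < 0x80000000u && end > 0x80000000u;
}
```

A 512 MB workspace starting at 0x70000000 passes the first test but still crosses the boundary, which is why a cap on the total workspace size alone does not rule out the second class of problems.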
What's the status of this? @frankluebeck you wrote that you can still reproduce "all" the issues reported here. Well, on x86_64 running Ubuntu, I cannot reproduce crashes with
Thanks! |
pinging @frankluebeck |
I have tried the problematic inputs discussed here in the current master branch and could no longer reproduce the problems. So, I guess this can be closed then? |
For reference, the |
Observed behaviour
Calling the following code twice in the same GAP session crashes GAP with a segmentation fault (leave the break loop after the first time):
This occurs in various versions of GAP compiled in 32-bit mode (including current master).
Expected behaviour
The code should run into a break loop every time.
Remark
Maybe something is not cleaned up correctly in the kernel part of the Todd-Coxeter enumeration when a calculation is interrupted?