Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes from Todd Coxeter #1479

Closed
frankluebeck opened this issue Jul 6, 2017 · 77 comments
Closed

Crashes from Todd Coxeter #1479

frankluebeck opened this issue Jul 6, 2017 · 77 comments
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them
Milestone

Comments

@frankluebeck
Copy link
Member

Observed behaviour

Calling the following code twice in the same GAP session crashes GAP with a segmentation fault (leave the break loop after the first time):

f := FreeGroup( 3 );;
g := f/[f.1^2,f.2^2,f.3^2,(f.1*f.2)^3,(f.1*f.3)^3,(f.2*f.3)^3];
Size(g);

This occurs in various versions of GAP compiled in 32-bit mode (including current master).

Expected behaviour

The code should run into a break loop every time.

Remark

Maybe something is not cleaned up correctly in the kernel part of the Todd-Coxeter enumeration when a calculation is interrupted?

@fingolfin fingolfin added kind: bug Issues describing general bugs, and PRs fixing them kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) labels Jul 8, 2017
@fingolfin
Copy link
Member

I cannot reproduce this, neither in master nor in stable.

How long do you wait (roughly) till you abort Size(g) ? Then, what do you do the second time -- just run Size(g); again, or really redo the whole computation (i.e. also redefine f and g) ?
Perhaps you could add a full log of your session (including the backtrace in the first break loop; knowing where it happened might allow inserting an if ... Error combo to trigger the break at the right point.

@fingolfin fingolfin added the status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author label Jul 10, 2017
@dimpase
Copy link
Member

dimpase commented Jul 13, 2017

On 32-bit Linux I get

 ┌───────┐   GAP 4.8.6, 12-Nov-2016, build of 2017-07-02 08:49:34 (BST)
 │  GAP  │   http://www.gap-system.org
 └───────┘   Architecture: i686-pc-linux-gnu-gcc-default32
 Libs used:  gmp, readline
 Loading the library and packages ...
 Packages:   GAPDoc 1.5.1
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
gap> f := FreeGroup( 3 );;
gap> g := f/[f.1^2,f.2^2,f.3^2,(f.1*f.2)^3,(f.1*f.3)^3,(f.2*f.3)^3];
<fp group on the generators [ f1, f2, f3 ]>
gap> Size(g);
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
Error, reached the pre-set memory limit
(change it with the -o command line option) in
  prev[2 * limit] := 2 * limit - 1
 ; at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:1201 called from 
TCENUM.CosetTableFromGensAndRels( fgens, grels, fsgens 
 ) at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:1032 called from
CosetTableFromGensAndRels( fgens, grels, List( trial, UnderlyingElement ) 
 ) at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:3699 called from
Attempt( trial 
 ) at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:3702 called from
Attempt( gens 
 ) at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:3724 called from
FinIndexCyclicSubgroupGenerator( G, infinity 
 ) at /home/dima/sage/dev/sage/local/gap/latest/lib/grpfp.gi:3773 called from
...  at line 3 of *stdin*
you can 'return;'
brk> 
#I  Options stack has been reset
gap> f := FreeGroup( 3 );;
gap> g := f/[f.1^2,f.2^2,f.3^2,(f.1*f.2)^3,(f.1*f.3)^3,(f.2*f.3)^3];
<fp group on the generators [ f1, f2, f3 ]>
gap> Size(g);
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
#I  Coset table calculation failed -- trying with bigger table limit
gap: cannot extend the workspace any more!

not a SEGFAULT, though, but still a bit pathological.

@fingolfin
Copy link
Member

@dimpase I also get that, and it's exactly what I'd expect to get on a 32bit machine. So I don't think it is whatever issue @frankluebeck is seeing... But then, who knows, given that he has told us relatively little about this problem :-/.

@dimpase
Copy link
Member

dimpase commented Jul 14, 2017 via email

@fingolfin
Copy link
Member

By the time GAP shows that "Error, reached the pre-set memory limit", it has already increased the workspace size. This is because in order to run the code which shows the error message, it needs memory, but it just runs out of memory. So increases the workspace, then shows the message.

I assume what happens here is that by the second time you run out of memory (note that the first time, the coset enumeration tried a bigger table 4 times, now it does it 5 times), it cannot double the workspace again due to the limited 32bit address space.

BTW, I just run the test again, but this time on a Linux machine, and there, instead of the "gap: cannot extend the workspace any more!" message, it actually does segfault!

But at least on a cursory glance, I doubt this has anything to do with Todd-Coxeter, and is rather more about a bug in how we enlarge the workspace, or perhaps a problem on/with Linux... But perhaps I am wrong.

@fingolfin
Copy link
Member

So on the Linux machines, when GAP segfaults it has ca 1.5 GB of RAM allocated; I assume doubling again puts it around 3 GB, and the 32bit Linux can probably not accommodate it.

But still, this simpler tests "works" correctly, so I am surprised:

gap> x:=10^10000000;;
gap> l:=[];; for i in [1..1000] do Add(l, x+1); od;
Error, reached the pre-set memory limit
(change it with the -o command line option) in
  Add( l, x + 1 ); at *stdin*:2 called from
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:2
you can 'return;'
brk>
gap> l:=[];; for i in [1..1000] do Add(l, x+1); od;
gap: cannot extend the workspace any more!

So perhaps something in the Todd-Coxeter code still is at fault here. Hm.

@fingolfin
Copy link
Member

Seems the crash is inside GASMAN's ResizeBag, which surprise me. Hohum.

Program received signal SIGSEGV, Segmentation fault.
ResizeBag (bag=0x999b6ea8, new_size=524288004) at ../../src/gasman.c:1376
1376                *dst++ = *src++;
(gdb) l
1371            Bag * dst = DATA(newHeader);
1372            PTR_BAG(bag) = dst;
1373
1374            /* copy the contents of the bag                                    */
1375            while ( src < end )
1376                *dst++ = *src++;
1377
1378        }
1379
1380        /* return success                                                      */

@ChrisJefferson
Copy link
Contributor

Just checking, are these crashes coming from running in Virtual Machines? I ask because I'm having trouble reproducing them, but if you were running on a VM with a small amount of RAM and no swap, we might just be running the machine out of memory and trigger linux's OOM killer (which we can't do anything about really)

@fingolfin
Copy link
Member

No VM involved over here.

@dimpase
Copy link
Member

dimpase commented Jul 15, 2017

I also cannot reproduce this on 64-bit Linux (non-VM).

@fingolfin
Copy link
Member

fingolfin commented Jul 15, 2017

Well, of course, you'll be hard pressed to exhaust a 64bit address space

@ChrisJefferson
Copy link
Contributor

If anyone can reproduce this in a debugger, please provide a backtrace. Also, a guide to reproducing it would be useful.

@ChrisJefferson
Copy link
Contributor

Just to say, there are (I feel) two different issues here.

The segfault, shouldn't be happening.

The gap: cannot extend the workspace any more! is (by my understanding) intended behaviour. Doing something else would require a careful plan on what we should do instead.

@ChrisJefferson
Copy link
Contributor

I haven't duplicated this bug exactly, but I have got a similar crash by making increasing large plists.

The problem is that GrowPlist can be persuaded to make a list longer than the maximum allowable length (2^28), which then causes us to ask for too much memory (2^31), which causes an overflow in ResizeBag.

So, I would suggest for safety GrowPlist should check the size it is being asked for, and also ResizeBag should for safety also check the memory allocation being asked for.

@ChrisJefferson
Copy link
Contributor

Here is a fast way I can make GAP crash (sometimes):

x := [2];
while true do Append(x,x); od;
(break loop entered)
while true do Append(x,x); od;

Note: don't reset x once you exit the break loop.

@fingolfin
Copy link
Member

fingolfin commented Jul 15, 2017 via email

@ChrisJefferson
Copy link
Contributor

My temptation is to cap all bag sizes at 2^28 (on 32-bit), because it wouldn't surprise me if somewhere we assume you can fit the size of a bag in a small int, and on a 32-bit computer, that doesn't seem like an unreasonable restriction.

@stevelinton
Copy link
Contributor

What is 32 bit GAP still really needed for? On a 16GB machine you can do anything in 64 bit that you can do in 32 bit. If we could withdraw 32 bit support life would be significantly easier. If not, can we simplify it by doing as Chris suggested and maybe limiting total workspace size to 2GB (avoids a whole lot of signed/unsigned issues) ? Are there tasks for which one really needs a 32 bit GAP with a 3GB workspace?

@dimpase
Copy link
Member

dimpase commented Jul 16, 2017 via email

@fingolfin fingolfin removed the status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author label Dec 19, 2017
@fingolfin
Copy link
Member

Anybody planning to work on this? @ChrisJefferson ?

@fingolfin fingolfin added this to the GAP 4.10.0 milestone Dec 19, 2017
@ChrisJefferson
Copy link
Contributor

I'll pick this back up.

@frankluebeck
Copy link
Member Author

I can produce in alll versions of 32-bit GAP similar behaviour to the original example with

l:=[1]; while true do Print("*\c");Append(l,l);od;

I would certainly not give up support for 32-bit GAP in general (which is still faster in many cases and needs less memory resources in many cases even on 64-bit systems).

Restricting bags to 2^28 byte may be ok if this helps to resolve this issue. (Although, when working on strings I made sure that strings up to 2^30 bytes can be handled.)

@fingolfin
Copy link
Member

On Mac OS X, I get this:

gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
****************************gap: cannot extend the workspace any more!!!

On Linux, I get:

gap>  l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
****************************Segmentation fault (core dumped)

With PR #2039 however, I get:

gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
****************************Error, GrowPlist: List size too large in
  Append( l, l ); at *stdin*:1 called from
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
type 'quit;' to quit to outer loop
brk>

So that seems to have fixed the issue. (Note that the PR only restricts plists, so a string should still be able to grow to 2^30 bytes, assuming that indeed worked flawlessly before).

@ChrisJefferson
Copy link
Contributor

Can anyone who can reproduce this crash try running in gdb (preferably after turning off optimisation in GAP -- go into GNUmakefile and delete -O2 from CFLAGS) and get a backtrace?

@fingolfin
Copy link
Member

There are two crashes here. The first is fixed by PR #2039. You'll never even get to the memmove crash unless you build with that PR.

Forget about ward and HPC-GAP, they have nothing to do with this issue.

@dimpase
Copy link
Member

dimpase commented Dec 26, 2017

On the TC example, with PR #2039 and ABI=32 with debugging on I still get a crash at memmove called from GC:

brk> return;
#I  Coset table calculation failed -- trying with bigger table limit

Program received signal SIGSEGV, Segmentation fault.
0xf7dd4fe4 in ?? () from /lib32/libc.so.6
(gdb) bt
#0  0xf7dd4fe4 in ?? () from /lib32/libc.so.6
#1  0x0822de2e in memmove (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>)
    at /usr/include/bits/string_fortified.h:40
#2  ResizeBag (bag=0x99812728, new_size=524288004) at src/gasman.c:1365
#3  0x08194a7a in GrowPlist (list=0x99812728, need=131072000) at src/plist.c:83
#4  0x08200346 in GROW_PLIST (plen=131072000, list=<optimized out>) at ./src/plist.h:110
#5  ExecAssList (stat=3492) at src/vars.c:549
#6  0x081d0805 in EXEC_STAT (stat=<optimized out>) at ./src/stats.h:53
#7  ExecSeqStat (stat=4064) at src/stats.c:162
...

@fingolfin
Copy link
Member

@dimpase but is this the same crash as before, or a new one? Because the above suggests a crash inside of memmove, which is different from what we saw before. So perhaps my memmove patch simply broke it (although then it's weird your system is running at all, given that a lot of stuff is going to use memcpy / memmove).

To get better debugging output, I also suggest to configure GAP via ../configure --enable-debug CFLAGS=-g ABI=32. This will disable optimizations (which means fewer <optimized out>), and enable additional asserts.

Also, when starting GAP, try starting it with the option-A -o 3g

@dimpase
Copy link
Member

dimpase commented Dec 27, 2017

I don't know whether it's a new crash or not (figuring it out now, these tests are very slow). We don't seem to have a detailed record of where exactly the TC crash occurs. (runs with -g are very slow, so it's precious data).
It also seems to be CPU-specific - on systems that don't support that sse2-unaligned stuff ssse3 stuff is used in memmove instead, as far as I can read the corresponding assembly code.

As the glibc patch only affects 32-bit applications, no surprise that on a 64-bit system
it won't break much, however wrong it might be.

@fingolfin
Copy link
Member

@dimpase ohhh, you were testing the TC crash -- but that's not at all what I've been doing. Rather, there is a much, much simpler test case which triggers the problem, and which I used for all testing, namely this one (mentioned throughout this issue and also PR #2039):

l:=[1]; while true do Print("*\c");Append(l,l);od;

Using that makes it a matter of seconds to reproduce the crash (you have to re-execute it 2-3 times to get the crash).

@dimpase
Copy link
Member

dimpase commented Dec 27, 2017

on l:=[1]; while true do Print("*\c");Append(l,l);od; I get a segfault in FuncAPPEND_LIST_INTR if I call GAP without -o 3g, as below.
And with -o -3g I don't get any segfaults, I only get Error, GrowPlist: List size too large in all the way (but this, I imagine, has nothing to do with the glibc patch, this is just what PR #2039 is doing, no?)

(gdb) ru -A
Starting program: /mnt/opt/gap/gap -A
 ┌───────┐   GAP 4.8.8-4845-g660995f-dirty of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-pc-linux-gnu-default32
 Configuration:  gmp 6.1.2, KernelDebug
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.1
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
****************************Error, GrowPlist: List size too large in
  Append( l, l ); at *stdin*:1 called from 
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
type 'quit;' to quit to outer loop
brk> 
gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
**************************Error, reached the pre-set memory limit
(change it with the -o command line option) in
  Print( "*\c" ); at *stdin*:1 called from 
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
you can 'return;'
brk> 
gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
***************************
Program received signal SIGSEGV, Segmentation fault.
0x0812de76 in FuncAPPEND_LIST_INTR (self=0x0, list1=0x98e2d958, list2=0x98e2d958) at src/listfunc.c:337
337	            ptr1[i] = ptr2[i];
(gdb) bt
#0  0x0812de76 in FuncAPPEND_LIST_INTR (self=0x0, list1=0x98e2d958, list2=0x98e2d958) at src/listfunc.c:337
#1  0x0812df27 in FuncAPPEND_LIST (self=0x97c8bfc8, list=0x98e2d958, obj=0x98e2d958) at src/listfunc.c:363
#2  0x0810578e in ExecProccall2args (call=148) at src/funcs.c:227
#3  0x081d7374 in EXEC_STAT (stat=148) at ./src/stats.h:53
#4  0x081da540 in ExecWhile2 (stat=168) at src/stats.c:1182
#5  0x081d7374 in EXEC_STAT (stat=168) at ./src/stats.h:53
#6  0x081d7ab4 in ExecSeqStat2 (stat=20) at src/stats.c:178
#7  0x0810484c in EXEC_STAT (stat=20) at ./src/stats.h:53
#8  0x0810a64c in ExecFuncHelper () at src/funcs.c:842
#9  0x0810a6e0 in DoExecFunc0args (func=0x98e2d9b8) at src/funcs.c:865
#10 0x0812051d in FinishAndCallFakeFuncExpr () at src/intrprtr.c:224
#11 0x08121420 in IntrWhileEnd () at src/intrprtr.c:821
#12 0x081c7a66 in ReadWhile (follow=3221225472) at src/read.c:2255
#13 0x081c8dd8 in ReadEvalCommand (context=0x97c88028, evalResult=0xffffb684, dualSemicolon=0xffffb688) at src/read.c:2704
#14 0x0810ceb7 in Shell (context=0x97c88028, canReturnVoid=0, canReturnObj=0, lastDepth=3, setTime=1, prompt=0xffffb74b "gap> ", 
    preCommandHook=0x97caf290, catchQUIT=1, inFile=0x9b4ccac0 "\005", outFile=0x9b4ccad8 "\005") at src/gap.c:240
#15 0x0810d753 in FuncSHELL (self=0x97c9006c, args=0x97caf294) at src/gap.c:437
#16 0x08107864 in ExecProccallXargs (call=328) at src/funcs.c:404
...

@fingolfin
Copy link
Member

@dimpase thanks for your continued help, however, please (re)read the discussion here and on PR #2039, we are repeating a lot of things which were already settled before.

Specifically: there are two crashes. The first one is caused by a plist growing "too big", leading to an overflow in GrowPlist, which is fixed by PR #2039. The second crash (or rather: memory corruption, which usually quickly leads to a crash) is masked by the first, i.e. you'll never see it unless you are using PR #2039. With -o -3g, you'll need to run the test snippet l:=[1]; while true do Print("*\c");Append(l,l);od; at least three times to get a crash -- well, or if the memmove fix worked, then hopefully no crash.

The crash you just pasted was made with git revision 660995f. I do not have such a commit locally. Could you please repeat it with PR #2039?

Finally, could you please clarify whether these recent results are with the patched memmove, or the "stock" version?

@dimpase
Copy link
Member

dimpase commented Dec 28, 2017

It is the patched glibc. The commit number on the GAP tree is misleading, it is a somewhat clumsy merge commit of the 1st version of the PR #2039 (i.e. without 6b07c8d).

With -o -3g I did run the test snippet 20 times or so, no crash.


Now I am trying to get a more meaningful trace of the TC crash in this settings, with -g in CFLAGS. It takes quite long...

@dimpase
Copy link
Member

dimpase commented Dec 28, 2017

so, on GAP built with -g, the trace looks as follows:

brk> return;
#I  Coset table calculation failed -- trying with bigger table limit

Program received signal SIGSEGV, Segmentation fault.
0xf7dd4fe4 in ?? () from /lib32/libc.so.6
(gdb) bt
#0  0xf7dd4fe4 in ?? () from /lib32/libc.so.6
#1  0x0823b2b6 in ResizeBag (bag=0x99812ed0, new_size=524288004) at src/gasman.c:1365
#2  0x08198371 in GrowPlist (list=0x99812ed0, need=131072000) at src/plist.c:83
#3  0x08206c1f in GROW_PLIST (list=0x99812ed0, plen=131072000) at ./src/plist.h:110
#4  0x0820832f in ExecAssList (stat=3492) at src/vars.c:549
#5  0x081d7374 in EXEC_STAT (stat=3492) at ./src/stats.h:53
#6  0x081d7a65 in ExecSeqStat (stat=4064) at src/stats.c:162
#7  0x081d7374 in EXEC_STAT (stat=4064) at ./src/stats.h:53
#8  0x081d8050 in ExecIf (stat=4116) at src/stats.c:292
...

I suppose I can also built glibc with -g, to get a better idea about the place of the crash.

@dimpase
Copy link
Member

dimpase commented Jan 1, 2018

It seems that with PR #2039 and the glibc patch there is still a bug in (or triggered by) the workspace increase request. Namely, with -o 3g the fragment l:=[1]; while true do Print("*\c");Append(l,l);od; does not lead to a crash after many iterations, while without -o I get a crash

gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
****************************Error, GrowPlist: List size too large in
  Append( l, l ); at *stdin*:1 called from 
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
type 'quit;' to quit to outer loop
brk> 
gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
**************************Error, reached the pre-set memory limit
(change it with the -o command line option) in
  Print( "*\c" ); at *stdin*:1 called from 
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
you can 'return;'
brk> return;
**Error, GrowPlist: List size too large in
  Append( l, l ); at *stdin*:1 called from 
<function "unknown">( <arguments> )
 called from read-eval loop at *stdin*:1
type 'quit;' to quit to outer loop
brk> 
gap> l:=[1]; while true do Print("*\c");Append(l,l);od;
[ 1 ]
***************************
Program received signal SIGSEGV, Segmentation fault.
0xf7dd7bd4 in ?? () from /lib32/libc.so.6
(gdb) bt
#0  0xf7dd7bd4 in ?? () from /lib32/libc.so.6
#1  0x0823b2b6 in ResizeBag (bag=0x990dc27c, new_size=536870916) at src/gasman.c:1365
#2  0x08198371 in GrowPlist (list=0x990dc27c, need=134217728) at src/plist.c:83
#3  0x0812cf6d in GROW_PLIST (list=0x990dc27c, plen=134217728) at ./src/plist.h:110
#4  0x0812ddf5 in FuncAPPEND_LIST_INTR (self=0x0, list1=0x990dc27c, list2=0x990dc27c) at src/listfunc.c:328
#5  0x0812df27 in FuncAPPEND_LIST (self=0x97c8efc8, list=0x990dc27c, obj=0x990dc27c) at src/listfunc.c:363
#6  0x0810578e in ExecProccall2args (call=148) at src/funcs.c:227
#7  0x081d7374 in EXEC_STAT (stat=148) at ./src/stats.h:53
#8  0x081da540 in ExecWhile2 (stat=168) at src/stats.c:1182
#9  0x081d7374 in EXEC_STAT (stat=168) at ./src/stats.h:53
#10 0x081d7ab4 in ExecSeqStat2 (stat=20) at src/stats.c:178
#11 0x0810484c in EXEC_STAT (stat=20) at ./src/stats.h:53
#12 0x0810a64c in ExecFuncHelper () at src/funcs.c:842
...

@fingolfin
Copy link
Member

It's quite possible that there is a third bug at work here (or even more). Will investigate further.

@fingolfin
Copy link
Member

In that backtraces of yours, is src/gasman.c:1365 this line?

        memmove((void *)dst, (void *)DATA(header),
                sizeof(Obj) * WORDS_BAG(old_size));

If so, the crash seems to be inside memmove again? It would be helpful to know what dst etc. are at this point (i.e. whether we are feeding valid or invalid input into memmove)

@fingolfin
Copy link
Member

OK, turns out there is indeed a third bug lurking here: AllocBags can overflow, which is the cause for the FuncAPPEND_LIST_INTR crash. Specifically:

        BagHeader * newHeader = (BagHeader *)AllocBags;
        AllocBags = DATA(newHeader) + WORDS_BAG(new_size);  // <-- this overflows

Now, that overflow should have been detected and avoided by this check:

        /* check that enough storage for the new bag is available          */
        if ( SizeAllocationArea <  WORDS_BAG(sizeof(BagHeader)+new_size)
              && CollectBags( new_size, 0 ) == 0 ) {
            return 0;
        }

But this fails to prevent it, because CollectBags actually returns 1... due to an overflow in there, specifically in this line:

    Bag * stopBags = AllocBags + WORDS_BAG(sizeof(BagHeader)+size);

I guess those might be the overflows @ChrisJefferson mentioned? Anyway, I'll try to think about how to deal with this.

@olexandr-konovalov
Copy link
Member

What is the status of this? With #2064 and #2067 merged, was it fixed?

@ChrisJefferson
Copy link
Contributor

We still have the problem of memmove being buggy :( Not sure of the best way of working around that.. I can't think of anything better than writing our own implementation, and checking neither gcc or clang optimise it back to a memmove.

@fingolfin
Copy link
Member

As @ChrisJefferson's comment already suggest, this is not fully "fixed" -- although I'd say the remaining issue now is really an operating system bug, as it is only appears when using certain GNU libc versions (all released in the past few years...). So, what is missing then is a workaround for that.

Normally, the "proper' way to address such things is to write a C program testing for the issue, then use that to create an autoconf test to detect if the problem appears, and then activate the workaround. However, such a test would require us to allocate 2 GB of RAM or so, and I don't think that's practical here, as e.g. the active system might only have, say 1 GB RAM (think about people trying to compile and run GAP on their Raspberry Pi or similar systems).
So, I'd settle for the not-as-nice "hard code range of affected systems" test: If the current system is 32bit, and Linux*, and uses GNU libc version >= 2.21, then activate the workaround (and once a fixed glibc has been released, also add an upper bound to the version check).

The workaround itself could consist of a custom memmove function, aye. However, we'd have to take great care with that, as compilers really try hard to detect memory move patterns, and "optimize" them into custom builtin code and/or calls to memcpy and memmove, bringing back the original problem. In fact, this exact thing happened to me when I tried to track down the bug -- I had to tune down the optimization level of the compiler, otherwise my alternative memory moving code also did run into errors.

However, we can't stop people from turning on arbitrary compiler optimizations for GAP, so we would need to use some other method to prevent the compiler from optimizing our custom memmove. And of course that would also mean our custom memmove would be slower -- so e.g. garbage collection on 32bit systems for a GAP with a large heap, will take a performance hit. I am afraid there is not much we can do about that (unless we were willing to add our own assembler implementations of memmove -- which I am not willing to do).

BTW, @ChrisJefferson this is one of the reasons I recently changed some memmove calls in our codebase to memcpy: That's a few places less to worry about regarding performance impact of a custom memmove... :-)

@stevelinton
Copy link
Contributor

How would we feel about limiting 32 bit systems to 2GB of workspace? It's potentially limited for people with between 2 and 8GB or RAM available, and might slow down some computations that can be completed in 2-4GB on a 32 bit system, but it continues the main purpose of 32 bit GAP (supporting really small systems) and avoids a whole load of bugs.

@olexandr-konovalov
Copy link
Member

I think Windows users might be most disappointed if this happens today, but if by GAP 4.10 we will make 64-bit version for Windows a standard part of GAP distribution (#2112) then fine.

@ChrisJefferson
Copy link
Contributor

Note that any 32-bit program has at best a 3GB memory space (on linux or windows), so we are "only" limiting from 3GB down to 2GB.

@frankluebeck
Copy link
Member Author

Note that any 32-bit program has at best a 3GB memory space (on linux or windows), so we are "only" limiting from 3GB down to 2GB.

No, my default GAP is a 32-bit compiled GAP on a large 64-bit machine, and there I can start it with 3.9GB. (The mentioned limitation is true on a native 32-bit operating system.)

Nevertheless, if it solved a tricky problem I would find a limitation to 2 GB for 32-bit GAP acceptable.

But I cannot confirm this statement:

this is not fully "fixed" -- although I'd say the remaining issue now is really an operating system bug, as it is only appears when using certain GNU libc versions (all released in the past few years...).

My system (x86_64 with Debian 8) is not affected by the memmove bug. But nevertheless all the examples discussed in this thread still lead to segfaults as before. (I agree that #2064 and #2067 fixed some bugs, but not this issue.) I just checked this with new builds of the current master and the current stable-4.9 branches.

@fingolfin
Copy link
Member

I don't see how limiting the heap to 2GB would help, it could still end up crossing the boundary between positive and negative addresses, no?

@ChrisJefferson
Copy link
Contributor

ChrisJefferson commented Jan 31, 2018

@frankluebeck : I think there are further bugs to discover, related to filling up memory, we are just fixing each in turn. I didn't know you could have a 4GB memory space, it turns out (on googling) that is an option that some linux distributions enable.

@fingolfin : I think we meant limiting the heap to the bottom 2GB, which might in practice mean you can only have 1GB of GAP heap, as I think by default mmap starts at 1GB (at least for me).

@stevelinton
Copy link
Contributor

stevelinton commented Jan 31, 2018

@fingolfin You're right, of course. I first encountered this when I ran into bugs about signed pointer differences within the workspace, which is about total workspace size. If the problem is with workspaces crossing the 2GB address, then, in principle, any size of workspace could have the problem, and my idea is no use. @ChrisJefferson limiting to 1GB of workspace seems pretty useless, and there is no guarantee of the workspace landing up at 1GB start when there are shared libraries etc. around.

@fingolfin
Copy link
Member

What's the status of this? @frankluebeck you wrote that you can still reproduce "all" the issues reported here.

Well, on x86_64 running Ubuntu, I cannot reproduce crashes with l:=[1]; while true do Print("*\c");Append(l,l);od; anymore (and couldn't back when I made my previous comment). I've not yet tested the Todd-Coxeter example, but let's try to settle this one bug at a time: If you still can reproduce the crash with l:=[1]; while true do Print("*\c");Append(l,l);od;, it would be helpful to know how exactly:

  • which options did you pass to configure? Just ABI=32 or also anything else, like --enable-debug?
  • which options do you pass to GAP? Can you reproduce it using just -A -r?
  • can you perhaps paste the text of a fresh session where you reproduce the crash?

Thanks!

@olexandr-konovalov
Copy link
Member

pinging @frankluebeck

@frankluebeck
Copy link
Member Author

I have tried the problematic inputs discussed here in the current master branch and could no longer reproduce the problems. So, I guess this can be closed then?

@fingolfin
Copy link
Member

For reference, the memmove crash in glibc is fixed in glibc 2.28, and is CVE-2017-18269.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them
Projects
None yet
Development

No branches or pull requests

6 participants