
# Comparison between the BFS, DFS and smart strategies

**Executive summary:** It does not look like there is a big fundamental difference between the strategies. The smart strategy found one very interesting test case, whereas the BFS strategy found the highest number of incompleteness cases. The data is moderately flawed due to a (now fixed) bug that made vdiff crash on some files.

This analysis is based on the data set random100_fast. First of all: due to a bug (that is now fixed), a few segfaults happened during the executions, which explains the differences in size between the databases.

> du -sh *
2.9M    bfs.db
2.5M    dfs.db
1.8M    smart.db

Or in absolute numbers (`vdiff-viewer --stat`):

| data     | runs  | programs used | source files |
|----------|-------|---------------|--------------|
| bfs.db   | 10429 | 1187          | 96           |
| dfs.db   | 8867  | 1055          | 91           |
| smart.db | 7280  | 970           | 88           |

The invocation happened through `random100_fast.sh`, which ran each verifier with the following parameters: budget = 20, timeout = 15, seed = 123.

## Results

`vdiff-viewer --count --unsound` (resp. `--incomplete`) counts all cases in which a verifier reached the verdict “Unsat” (“Sat”) although the majority of the verifiers said “Sat” (“Unsat”). One has to be careful when interpreting this data: on the interesting cases, verifiers often time out, so the consensus is then based on only (too) few participants. Anyhow, here’s the raw data:

| data     | `--unsound`  | `--incomplete` | `--disagreement` | `--unsound-klee-cbmc` |
|----------|--------------|----------------|------------------|-----------------------|
| bfs.db   | 259 (2.48%)  | 12             | 220 (3.16%)      | 0                     |
| dfs.db   | 211 (2.38%)  | 6              | 181 (2.04%)      | 0                     |
| smart.db | 161 (2.21%)  | 1              | 156 (2.14%)      | 1                     |

The percentages in the unsound column relate the count to the total number of runs.
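For example, for bfs.db this gives 259 / 10429 ≈ 2.48 %.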

The query `--unsound-klee-cbmc` lists the special case where klee or cbmc says “Sat” (meaning reachable) and some other verifier says “Unsat”. In those cases we have a strong tendency to believe KLEE or CBMC.
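To make these definitions concrete, here is a minimal sketch of the classification logic as described above. The types and function names are assumptions for illustration only and do not mirror vdiff's actual implementation.

```haskell
-- Minimal sketch of the counting logic described above; the types and
-- names are illustrative assumptions, not vdiff's actual code.
data Verdict = Sat | Unsat | Unknown deriving (Eq, Show)

-- One run: a verifier name together with its verdict on one program.
type Run = (String, Verdict)

-- Consensus = majority verdict among the conclusive (Sat/Unsat) runs.
consensus :: [Run] -> Maybe Verdict
consensus rs
  | sats > unsats = Just Sat
  | unsats > sats = Just Unsat
  | otherwise     = Nothing
  where
    sats   = length [() | (_, Sat)   <- rs]
    unsats = length [() | (_, Unsat) <- rs]

-- "Unsound" run: the verifier says Unsat although the consensus is Sat.
unsoundRuns :: [Run] -> [Run]
unsoundRuns rs = [r | consensus rs == Just Sat, r@(_, Unsat) <- rs]

-- "Incomplete" run: the verifier says Sat although the consensus is Unsat.
incompleteRuns :: [Run] -> [Run]
incompleteRuns rs = [r | consensus rs == Just Unsat, r@(_, Sat) <- rs]

-- "--unsound-klee-cbmc": klee or cbmc says Sat while another verifier says Unsat.
unsoundKleeCbmc :: [Run] -> [Run]
unsoundKleeCbmc rs =
  [ r | any (\(v, d) -> v `elem` ["klee", "cbmc"] && d == Sat) rs
      , r@(v, Unsat) <- rs
      , v `notElem` ["klee", "cbmc"] ]
```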

## Unsoundness

All cases of unsoundness in bfs.db and dfs.db were due to KLEE and therefore not a very surprising finding. More interesting is the one case in smart.db that is not caused by KLEE. Let’s have a closer look (`vdiff-viewer --list --unsound-klee-cbmc -d smart.db`):

| runId | verifierName | originalFn | programHash | unsats | sats |
|-------|--------------|------------|-------------|--------|------|
| 3105  | cpachecker   | ./c/ldv-regression/mutex_lock_int_true-termination.c_true-unreach-call_1.i | 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678 | 1 | 6 |

Let’s have a look at the complete run information (`vdiff-viewer --runs 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):

| verifier   | verdict | time (s)    | memory      |
|------------|---------|-------------|-------------|
| cbmc       | sat     | Just 0.24   | Just 29156  |
| klee       | sat     | Just 1.0e-2 | Just 18888  |
| seacrab    | sat     | Just 0.17   | Just 36664  |
| seahorn    | sat     | Just 0.18   | Just 36312  |
| uautomizer | sat     | Just 4.16   | Just 308984 |
| utaipan    | sat     | Just 3.99   | Just 299444 |
| cpachecker | unsat   | Just 2.37   | Just 298736 |
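Plugging this run table into the toy definitions from the sketch in the Results section (purely an illustration, not vdiff output): six verifiers say Sat, one says Unsat, so the consensus is Sat and cpachecker’s verdict gets flagged as unsound.

```haskell
-- Hypothetical reconstruction of run 3105 using the toy types defined above.
runs3105 :: [Run]
runs3105 =
  [ ("cbmc", Sat), ("klee", Sat), ("seacrab", Sat), ("seahorn", Sat)
  , ("uautomizer", Sat), ("utaipan", Sat), ("cpachecker", Unsat) ]

-- consensus   runs3105  ==  Just Sat
-- unsoundRuns runs3105  ==  [("cpachecker", Unsat)]
```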

The program looks like this (`vdiff-viewer --hash 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):

extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void err()
{
ERROR:
    __VERIFIER_error();
}
void mutex_lock(int * a)
{
    if (*a == 1)
    {
        err();
    }
    *a = 1;
}
void mutex_unlock(int * b)
{
    if (*b != 1)
    {
        err();
    }
    *b = 0;
}
int main()
{
    int m;
    m = 0;
    mutex_lock(&m);
    __VERIFIER_assert(m != 1);
    mutex_unlock(&m);
}

Since `mutex_lock(&m)` sets `m` to 1 before the assertion, `m != 1` is false and the ERROR location is indeed reachable; the “Sat” majority therefore appears to be right, and cpachecker’s “Unsat” is a genuine case of unsoundness.

## Incompleteness

In short: all cases of incompleteness in bfs.db, dfs.db and smart.db were caused by cpachecker. It’s not quite clear to me why. Just picking one of those cases:

| runId | verifierName | originalFn | programHash | unsats | sats |
|-------|--------------|------------|-------------|--------|------|
| 7834  | cpachecker   | ./c/array-examples/data_structures_set_multi_proc_trivial_true-unreach-call_ground_true-termination.i | f7c19fdee1100d0c353e1849c602764affdb8586 | 5 | 1 |

runs:

| verifier   | verdict | time (s)  | memory      |
|------------|---------|-----------|-------------|
| cpachecker | sat     | Just 8.84 | Just 828100 |
| cbmc       | unknown | Nothing   | Nothing     |
| klee       | unsat   | Just 4.48 | Just 19320  |
| seacrab    | unsat   | Just 0.11 | Just 36836  |
| seahorn    | unsat   | Just 0.12 | Just 36872  |
| uautomizer | unsat   | Just 4.28 | Just 302160 |
| utaipan    | unsat   | Just 3.81 | Just 309040 |
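For comparison, the same toy definitions from the Results section applied to this run (again an illustration, not vdiff output); note that cbmc’s “unknown” does not count towards the consensus.

```haskell
-- Hypothetical reconstruction of run 7834 using the toy types defined above.
runs7834 :: [Run]
runs7834 =
  [ ("cpachecker", Sat), ("cbmc", Unknown), ("klee", Unsat), ("seacrab", Unsat)
  , ("seahorn", Unsat), ("uautomizer", Unsat), ("utaipan", Unsat) ]

-- consensus      runs7834  ==  Just Unsat
-- incompleteRuns runs7834  ==  [("cpachecker", Sat)]
```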

and the instrumented code:

extern void __VERIFIER_error() __attribute__((__noreturn__));
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
int insert(int set[], int size, int value)
{
    set[size] = value;
    return size + 1;
}
int elem_exists(int set[], int size, int value)
{
    int i;
    for (i = 0; i < size; i++)
    {
        if (set[i] == value)
        {
            return 1;
        }
    }
    return 0;
}
int main()
{
    int n = 0;
    int set[100000];
    int x;
    int y;
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    int values[100000];
    int v;
    for (v = 0; v < 100000; v++)
    {
        __VERIFIER_assert(set != 2127964294); /* compares the address of set (after array-to-pointer decay) with an integer constant */
        if (elem_exists(set, n, values[v]))
        {
            n = insert(set, n, values[v]);
        }
        for (x = 0; x < n; x++)
        {
            for (y = x + 1; y < n; y++)
            {
                __DUMMY_VERIFIER_assert(set[x] != set[y]);
            }
        }
    }
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    return 0;
}

TODO: Double-check this: It seems that cpachecker is the only correct one here.

## Disagreement

`--disagreement` collects all test cases in which the verdicts do not all (weakly) agree. No discussion here, but in our last meeting we picked some cases and saw a lot of interesting things, especially when sorted by “ambivalence” (i.e. how close the fraction of sat verdicts is to 0.5). For those cases it will be important to use some crowd-sourcing tool like oDesk etc. (maybe on some hand-picked cases).
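To make the notion of “ambivalence” concrete, here is a minimal sketch of one way such a ranking could be computed; the function names and the exact formula are assumptions for illustration, not vdiff's implementation.

```haskell
import Data.List (sortOn)

-- Ambivalence of a program, given its (sats, unsats) counts: distance of the
-- Sat fraction from 0.5. Smaller means more ambivalent.
ambivalence :: (Int, Int) -> Double
ambivalence (sats, unsats) =
  abs (fromIntegral sats / fromIntegral (sats + unsats) - 0.5)

-- Sort programs (identified by hash) so that the most ambivalent come first.
rankByAmbivalence :: [(String, (Int, Int))] -> [(String, (Int, Int))]
rankByAmbivalence = sortOn (ambivalence . snd)
```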