
# Comparison between the BFS, DFS and smart strategies

**Executive summary:** It does not look like there is a big fundamental difference between the strategies. The smart strategy found one very interesting test case, whereas the BFS strategy found the highest number of incompleteness cases. The data is moderately flawed due to a (now fixed) bug that made vdiff crash on some files.

This analysis is based on the data set random100_fast. First of all: due to a bug (that is now fixed), a few segfaults happened during the executions, which explains the differences in size between the databases.

> du -sh *
2.9M    bfs.db
2.5M    dfs.db
1.8M    smart.db

Or in absolute numbers (`vdiff-viewer --stat`):

| data     | runs  | programs used | source files |
|----------|-------|---------------|--------------|
| bfs.db   | 10429 | 1187          | 96           |
| dfs.db   | 8867  | 1055          | 91           |
| smart.db | 7280  | 970           | 88           |

The invocation happened through `random100_fast.sh`, which ran each verifier with the following parameters: budget = 20, timeout = 15, seed = 123.

## Results

`vdiff-viewer --count --unsound` (resp. `--incomplete`) counts all cases in which a verifier reached the verdict “Unsat” (“Sat”) although the majority of the verifiers said “Sat” (“Unsat”). One has to be careful when interpreting this data: on the interesting cases, verifiers often time out, so the consensus is then based on only (too) few participants. Anyhow, here’s the raw data:

| data     | `--unsound`  | `--incomplete` | `--disagreement` | `--unsound-klee-cbmc` |
|----------|--------------|----------------|------------------|-----------------------|
| bfs.db   | 259 (2.48%)  | 12             | 220 (3.16%)      | 0                     |
| dfs.db   | 211 (2.38%)  | 6              | 181 (2.04%)      | 0                     |
| smart.db | 161 (2.21%)  | 1              | 156 (2.14%)      | 1                     |

The percentages in the unsound column relate the count to the total number of runs.
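For example, for bfs.db this gives 259 / 10429 ≈ 2.48 %.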

The query `--unsound-klee-cbmc` lists the special case where klee or cbmc says “Sat” (meaning reachable) and some other verifier says “Unsat”. In those cases we have a strong tendency to believe KLEE or CBMC.
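To make these definitions concrete, here is a minimal sketch of the classification logic as described above. The types and function names are assumptions for illustration only and do not mirror vdiff's actual implementation.

```haskell
-- Minimal sketch of the counting logic described above; the types and
-- names are illustrative assumptions, not vdiff's actual code.
data Verdict = Sat | Unsat | Unknown deriving (Eq, Show)

-- One run: a verifier name together with its verdict on one program.
type Run = (String, Verdict)

-- Consensus = majority verdict among the conclusive (Sat/Unsat) runs.
consensus :: [Run] -> Maybe Verdict
consensus rs
  | sats > unsats = Just Sat
  | unsats > sats = Just Unsat
  | otherwise     = Nothing
  where
    sats   = length [() | (_, Sat)   <- rs]
    unsats = length [() | (_, Unsat) <- rs]

-- "Unsound" run: the verifier says Unsat although the consensus is Sat.
unsoundRuns :: [Run] -> [Run]
unsoundRuns rs = [r | consensus rs == Just Sat, r@(_, Unsat) <- rs]

-- "Incomplete" run: the verifier says Sat although the consensus is Unsat.
incompleteRuns :: [Run] -> [Run]
incompleteRuns rs = [r | consensus rs == Just Unsat, r@(_, Sat) <- rs]

-- "--unsound-klee-cbmc": klee or cbmc says Sat while another verifier says Unsat.
unsoundKleeCbmc :: [Run] -> [Run]
unsoundKleeCbmc rs =
  [ r | any (\(v, d) -> v `elem` ["klee", "cbmc"] && d == Sat) rs
      , r@(v, Unsat) <- rs
      , v `notElem` ["klee", "cbmc"] ]
```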

## Unsoundness

All cases of unsoundness in bfs.db and dfs.db were due to KLEE and therefore not a very surprising finding. More interesting is the one case in smart.db that is not caused by KLEE. Let’s have a closer look (`vdiff-viewer --list --unsound-klee-cbmc -d smart.db`):

| runId | verifierName | originalFn | programHash | unsats | sats |
|-------|--------------|------------|-------------|--------|------|
| 3105  | cpachecker   | ./c/ldv-regression/mutex_lock_int_true-termination.c_true-unreach-call_1.i | 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678 | 1 | 6 |

Let’s have a look at the complete run information (`vdiff-viewer --runs 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):

| verifier   | verdict | time (s)    | memory      |
|------------|---------|-------------|-------------|
| cbmc       | sat     | Just 0.24   | Just 29156  |
| klee       | sat     | Just 1.0e-2 | Just 18888  |
| seacrab    | sat     | Just 0.17   | Just 36664  |
| seahorn    | sat     | Just 0.18   | Just 36312  |
| uautomizer | sat     | Just 4.16   | Just 308984 |
| utaipan    | sat     | Just 3.99   | Just 299444 |
| cpachecker | unsat   | Just 2.37   | Just 298736 |
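Plugging this run table into the toy definitions from the sketch in the Results section (purely an illustration, not vdiff output): six verifiers say Sat, one says Unsat, so the consensus is Sat and cpachecker’s verdict gets flagged as unsound.

```haskell
-- Hypothetical reconstruction of run 3105 using the toy types defined above.
runs3105 :: [Run]
runs3105 =
  [ ("cbmc", Sat), ("klee", Sat), ("seacrab", Sat), ("seahorn", Sat)
  , ("uautomizer", Sat), ("utaipan", Sat), ("cpachecker", Unsat) ]

-- consensus   runs3105  ==  Just Sat
-- unsoundRuns runs3105  ==  [("cpachecker", Unsat)]
```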

The program looks like this (`vdiff-viewer --hash 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):

extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void err()
{
ERROR:
    __VERIFIER_error();
}
void mutex_lock(int * a)
{
    if (*a == 1)
    {
        err();
    }
    *a = 1;
}
void mutex_unlock(int * b)
{
    if (*b != 1)
    {
        err();
    }
    *b = 0;
}
int main()
{
    int m;
    m = 0;
    mutex_lock(&m);
    __VERIFIER_assert(m != 1);
    mutex_unlock(&m);
}

Since `mutex_lock(&m)` sets `m` to 1 before the assertion, `m != 1` is false and the ERROR location is indeed reachable; the “Sat” majority therefore appears to be right, and cpachecker’s “Unsat” is a genuine case of unsoundness.

## Incompleteness

In short: all cases of incompleteness in bfs.db, dfs.db and smart.db were caused by cpachecker. It’s not quite clear to me why. Just picking one of those cases:

| runId | verifierName | originalFn | programHash | unsats | sats |
|-------|--------------|------------|-------------|--------|------|
| 7834  | cpachecker   | ./c/array-examples/data_structures_set_multi_proc_trivial_true-unreach-call_ground_true-termination.i | f7c19fdee1100d0c353e1849c602764affdb8586 | 5 | 1 |

runs:

| verifier   | verdict | time (s)  | memory      |
|------------|---------|-----------|-------------|
| cpachecker | sat     | Just 8.84 | Just 828100 |
| cbmc       | unknown | Nothing   | Nothing     |
| klee       | unsat   | Just 4.48 | Just 19320  |
| seacrab    | unsat   | Just 0.11 | Just 36836  |
| seahorn    | unsat   | Just 0.12 | Just 36872  |
| uautomizer | unsat   | Just 4.28 | Just 302160 |
| utaipan    | unsat   | Just 3.81 | Just 309040 |
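For comparison, the same toy definitions from the Results section applied to this run (again an illustration, not vdiff output); note that cbmc’s “unknown” does not count towards the consensus.

```haskell
-- Hypothetical reconstruction of run 7834 using the toy types defined above.
runs7834 :: [Run]
runs7834 =
  [ ("cpachecker", Sat), ("cbmc", Unknown), ("klee", Unsat), ("seacrab", Unsat)
  , ("seahorn", Unsat), ("uautomizer", Unsat), ("utaipan", Unsat) ]

-- consensus      runs7834  ==  Just Unsat
-- incompleteRuns runs7834  ==  [("cpachecker", Sat)]
```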

and the instrumented code:

extern void __VERIFIER_error() __attribute__((__noreturn__));
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
int insert(int set[], int size, int value)
{
    set[size] = value;
    return size + 1;
}
int elem_exists(int set[], int size, int value)
{
    int i;
    for (i = 0; i < size; i++)
    {
        if (set[i] == value)
        {
            return 1;
        }
    }
    return 0;
}
int main()
{
    int n = 0;
    int set[100000];
    int x;
    int y;
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    int values[100000];
    int v;
    for (v = 0; v < 100000; v++)
    {
        __VERIFIER_assert(set != 2127964294); /* compares the address of set (after array-to-pointer decay) with an integer constant */
        if (elem_exists(set, n, values[v]))
        {
            n = insert(set, n, values[v]);
        }
        for (x = 0; x < n; x++)
        {
            for (y = x + 1; y < n; y++)
            {
                __DUMMY_VERIFIER_assert(set[x] != set[y]);
            }
        }
    }
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    return 0;
}

TODO: Double-check this: It seems that cpachecker is the only correct one here.

## Disagreement

`--disagreement` collects all test cases in which the verdicts do not all (weakly) agree. No discussion here, but in our last meeting we picked some cases and saw a lot of interesting things, especially when sorted by “ambivalence” (i.e. how close the fraction of sat verdicts is to 0.5). For those cases it will be important to use some crowd-sourcing tool like oDesk etc. (maybe on some hand-picked cases).
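To make the notion of “ambivalence” concrete, here is a minimal sketch of one way such a ranking could be computed; the function names and the exact formula are assumptions for illustration, not vdiff's implementation.

```haskell
import Data.List (sortOn)

-- Ambivalence of a program, given its (sats, unsats) counts: distance of the
-- Sat fraction from 0.5. Smaller means more ambivalent.
ambivalence :: (Int, Int) -> Double
ambivalence (sats, unsats) =
  abs (fromIntegral sats / fromIntegral (sats + unsats) - 0.5)

-- Sort programs (identified by hash) so that the most ambivalent come first.
rankByAmbivalence :: [(String, (Int, Int))] -> [(String, (Int, Int))]
rankByAmbivalence = sortOn (ambivalence . snd)
```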