# bfs dfs smart comparison

## Executive Summary

It does not look like there is a big fundamental difference between the strategies. The smart strategy found one very interesting test case, whereas the BFS strategy found the highest number of incompleteness cases. The data is moderately flawed due to a bug (that is now fixed) that made vdiff crash on some files.
This analysis is based on the data set `random100_fast`. First of all: due to a bug (that is now fixed), a few segfaults happened during the executions, which explains the difference in file size between the databases.
```
> du -sh *
2.9M bfs.db
2.5M dfs.db
1.8M smart.db
```
Or in absolute numbers (`vdiff-viewer --stat`):
data | runs | programs | used source files
---|---|---|---
bfs.db | 10429 | 1187 | 96
dfs.db | 8867 | 1055 | 91
smart.db | 7280 | 970 | 88
The invocation happened through `random100_fast.sh`, which ran each verifier with the following parameters: `budget = 20, timeout = 15, seed = 123`.
`vdiff-viewer --count --unsound` (resp. `--incomplete`) counts all cases in which a verifier reached the verdict “Unsat” (“Sat”) although the majority of the verifiers said “Sat” (“Unsat”). One has to be careful when interpreting this data, since verifiers tend to time out on exactly the interesting cases; the consensus is then based on only (too) few participants. Anyhow, here’s the raw data:
data | --unsound | --incomplete | --disagreement | --unsound-klee-cbmc
---|---|---|---|---
bfs.db | 259 (2.48%) | 12 | 220 (3.16%) | 0
dfs.db | 211 (2.38%) | 6 | 181 (2.04%) | 0
smart.db | 161 (2.21%) | 1 | 156 (2.14%) | 1
The percentages compare the respective counts with the total number of runs in that database.
The query `--unsound-klee-cbmc` lists the special case where klee or cbmc says “Sat” (meaning: reachable) and some other verifier says “Unsat”. In those cases we have a strong tendency to believe KLEE or CBMC.
All cases of unsoundness in bfs.db and dfs.db were due to KLEE and therefore not
a very surprising finding. More interesting is the one case in smart.db that is not caused by KLEE.
Let’s have a closer look (`vdiff-viewer --list --unsound-klee-cbmc -d smart.db`):

runId | verifierName | originalFn | programHash | unsats | sats
---|---|---|---|---|---
3105 | cpachecker | ./c/ldv-regression/mutex_lock_int_true-termination.c_true-unreach-call_1.i | 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678 | 1 | 6
Let’s have a look at the complete run information (`vdiff-viewer --runs 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):
verifier | verdict | time (s) | memory |
---|---|---|---|
cbmc | sat | Just 0.24 | Just 29156 |
klee | sat | Just 1.0e-2 | Just 18888 |
seacrab | sat | Just 0.17 | Just 36664 |
seahorn | sat | Just 0.18 | Just 36312 |
uautomizer | sat | Just 4.16 | Just 308984 |
utaipan | sat | Just 3.99 | Just 299444 |
cpachecker | unsat | Just 2.37 | Just 298736 |
The program looks like this (`vdiff-viewer --hash 20fca5f1d266f3d00b28e4a3d55b4fbe678ae678`):

```c
extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void err()
{
ERROR:
    __VERIFIER_error();
}
void mutex_lock(int * a)
{
    if (*a == 1)
    {
        err();
    }
    *a = 1;
}
void mutex_unlock(int * b)
{
    if (*b != 1)
    {
        err();
    }
    *b = 0;
}
int main()
{
    int m;
    m = 0;
    mutex_lock(&m);
    __VERIFIER_assert(m != 1);
    mutex_unlock(&m);
}
```
In short: all cases of incompleteness in bfs, dfs, and smart were caused by cpachecker. It’s not quite clear to me why. Just picking one of those cases:
runId | verifierName | originalFn | programHash | unsats | sats
---|---|---|---|---|---
7834 | cpachecker | ./c/array-examples/data_structures_set_multi_proc_trivial_true-unreach-call_ground_true-termination.i | f7c19fdee1100d0c353e1849c602764affdb8586 | 5 | 1
The runs:

verifier | verdict | time (s) | memory
---|---|---|---
cpachecker | sat | Just 8.84 | Just 828100
cbmc | unknown | Nothing | Nothing
klee | unsat | Just 4.48 | Just 19320
seacrab | unsat | Just 0.11 | Just 36836
seahorn | unsat | Just 0.12 | Just 36872
uautomizer | unsat | Just 4.28 | Just 302160
utaipan | unsat | Just 3.81 | Just 309040
and the instrumented code:
```c
extern void __VERIFIER_error() __attribute__((__noreturn__));
void __DUMMY_VERIFIER_assert(int condition)
{
}
extern void __VERIFIER_error() __attribute__((__noreturn__));
void __VERIFIER_assert(int cond)
{
    if (!cond)
    {
    ERROR:
        __VERIFIER_error();
    }
}
int insert(int set[], int size, int value)
{
    set[size] = value;
    return size + 1;
}
int elem_exists(int set[], int size, int value)
{
    int i;
    for (i = 0; i < size; i++)
    {
        if (set[i] == value)
        {
            return 1;
        }
    }
    return 0;
}
int main()
{
    int n = 0;
    int set[100000];
    int x;
    int y;
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    int values[100000];
    int v;
    for (v = 0; v < 100000; v++)
    {
        __VERIFIER_assert(set != 2127964294);
        if (elem_exists(set, n, values[v]))
        {
            n = insert(set, n, values[v]);
        }
        for (x = 0; x < n; x++)
        {
            for (y = x + 1; y < n; y++)
            {
                __DUMMY_VERIFIER_assert(set[x] != set[y]);
            }
        }
    }
    for (x = 0; x < n; x++)
    {
        for (y = x + 1; y < n; y++)
        {
            __DUMMY_VERIFIER_assert(set[x] != set[y]);
        }
    }
    return 0;
}
```
`--disagreement` collects all test cases in which the verdicts do not (at least weakly) agree. No discussion here, but in our last meeting we picked some cases and saw a lot of interesting things, especially when sorted by “ambivalence” (meaning the closeness of the fraction of sat verdicts to 0.5). For those cases it will be important to use some crowd-sourcing tool like oDesk etc. (maybe on some hand-picked cases).