\chapter{Evaluation}\label{sec:eval}
\section{Unrollability}\label{sec:eval:unrollability}
One of the primary goals of this thesis was to increase the number of loops that are unrollable with~\libFIRM.
To evaluate to what extent this goal was achieved, we ran the \texttt{spec2006} benchmark suite and logged how many loops we encountered, how many of them were innermost loops, how many could be unrolled using the old method, and how many previously non-unrollable loops can now be unrolled\footnote{N.B.: The test was conducted with a maximum loop size of $\infty$.}.
Since many loops are expected to have non-constant bounds, such as the length of a container data structure (e.g., a list or an unbounded array), we expect the new optimization to yield a significant increase in the number of unrollable loops.
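For illustration, a typical loop with a non-constant bound might look like the following C snippet (a hypothetical example, not taken from the benchmarks; the trip count \texttt{n} is only known at runtime, so constant-bound unrolling cannot handle it):
\begin{verbatim}
#include <stddef.h>

/* The bound n (e.g., the length of a container) is a runtime value,
 * not a compile-time constant, so the loop has a non-constant bound. */
int sum(const int *values, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}
\end{verbatim}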
\Cref{fig:eval:unrollability:cmp-unrollability} shows a table with the results.
We can see, as mentioned in \Cref{sec:basics:unrolling}, that prior to the new optimization, 5.87\% of the innermost loops could be unrolled.
With the new optimization, we can unroll an additional 7.37\% of the innermost loops.
Contrasted with the baseline of constant-bound unrollable loops, this is a 125.65\% increase, meaning that our approach more than doubles the number of unrollable loops.
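The reported increase follows from these two percentages (computed here from the rounded values; the exact figure of 125.65\% presumably uses the unrounded loop counts):
\[
\frac{7.37\%}{5.87\%} \approx 1.256,
\]
that is, roughly a 125\% relative increase over the constant-bound baseline.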
Furthermore, we note that more than 70\% of loops are innermost loops.
Thus, even if unrolling nested loops were advantageous -- which is doubtful -- we would not miss out on many loops by restricting ourselves to innermost loops.
\input{fig/result-unrollability.tex}
\section{Performance}\label{sec:eval:perf}
Even though a high unrollability is a noble goal, most compiler optimizations aim to improve the runtime of the binaries they produce.
To evaluate the optimization in this regard, we again use \texttt{spec2006} as the benchmark suite and run it on a machine with an Intel Core i7-6700 clocked at 3.4\,GHz.
The tests run on Ubuntu 16.04, with cparser~\cite{cparser} as the frontend for~\libFIRM{} and the native \texttt{x86} backend of~\libFIRM{}.
We use the same setup as the referenced work~\cite{aebi18bachelorarbeit} so that the results are as comparable as possible.
To evaluate the performance gain, we run the new optimization in conjunction with the old unrolling (see \Cref{sec:impl:unrollability}), given that it is intended as an extension.
Since there are two approaches for the fixup code, as described in \Cref{sec:impl:fixup}, we try both to see whether one yields better binary runtimes.
Furthermore, as described in~\Cref{sec:impl:sel-factor}, the maximum unrolled size determines the scope of the optimization.
Therefore, all sizes $l \in \{2^n \mid n \in \{5, \dots, 10\}\}$, i.e., $l \in \{32, 64, 128, 256, 512, 1024\}$, are tried for both fixup code strategies.
We chose 32 as the lower bound because even very small loops already comprise more than eight nodes and hence would not be unrolled with a maximum size that is a smaller power of two.
To compensate for measurement uncertainties, all benchmarks are run ten times, and the average ($\mu$) as well as the standard deviation ($\sigma$) are recorded and discussed.
We compare all results to the reference benchmark run, which is a run of \texttt{spec2006} with loop unrolling disabled entirely.
These reference results can be seen in \Cref{fig:eval:perf:ref}.
To put our performance findings into context, we compare them to the unrollability broken down by benchmark.
\Cref{fig:eval:unrollability:cmp-unrollability-bench} shows the number of unrollable loops\footnote{Constant and non-constant bound unrollable loops are counted together.} compared to the total number of loops.
As in \Cref{fig:eval:unrollability:cmp-unrollability}, we assume a maximum size of infinity when collecting this data.
Based on this data, we would expect \texttt{bzip2}, \texttt{mcf}, and, to a lesser extent, \texttt{h264ref} (even though it has the most unrollable loops in absolute terms) to show the most considerable speedups.
\input{fig/result-unrollability-bench.tex}
\input{fig/bench-result-ref.tex}
\subsection{Duff's device fixup}\label{sec:eval:perf:duff}
Figures~\ref{fig:eval:perf:duff:32} through~\ref{fig:eval:perf:duff:1024} show the results we obtained.
While for most benchmarks the results hover around the 100\% mark, with no significant benefit or drawback, \texttt{h264ref} seems to profit from unrolling with maximum sizes 32 and 64, running close to 4.5\% faster.
However, since the ratios of all other benchmarks deviate from the reference runtimes by three percent or less, unrolling does not seem to have a significant effect on performance.
The standard deviations are below $1\%$ across all tests, owing to the highly controlled test environment.
They do not entirely account for the percentage deltas, which are small yet measurable.
Furthermore, we have to expect a small amount of systematic error in our measurements due to system process scheduling and similar factors.
Since the runtimes lie within $\lbrack 99\%, 101\% \rbrack$ of the reference independently of the maximum loop size, we can still consider them to be within the margin of error.
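To make the comparison of the two strategies concrete, the following C sketch shows what a Duff's-device-style fixup looks like in principle for an unroll factor of four (an illustrative, hypothetical example only; \libFIRM{} constructs the corresponding control flow on its intermediate representation, and all names here are invented):
\begin{verbatim}
#include <stddef.h>

/* Duff's-device-style fixup (sketch): the switch executes the
 * n % 4 leftover iterations by falling through the copies of the
 * body; all remaining iterations run through the unrolled loop. */
void scale(int *a, size_t n, int c) {
    size_t i = 0;
    switch (n % 4) {        /* fixup: handle the remainder up front */
    case 3: a[i] *= c; i++; /* fall through */
    case 2: a[i] *= c; i++; /* fall through */
    case 1: a[i] *= c; i++; /* fall through */
    case 0: break;
    }
    for (; i < n; i += 4) { /* unrolled main loop, factor 4 */
        a[i]     *= c;
        a[i + 1] *= c;
        a[i + 2] *= c;
        a[i + 3] *= c;
    }
}
\end{verbatim}
The appeal of this variant is that the remainder is handled without an additional loop; a potential downside is the irregular control flow introduced by the switch.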
\subsection{Loop fixup}\label{sec:eval:perf:loop}
Figures~\ref{fig:eval:perf:loop:32} through~\ref{fig:eval:perf:loop:1024} show the results obtained for the unrolling run with the loop fixup code.
As in~\Cref{sec:eval:perf:duff}, there does not seem to be any noticeable performance gain or loss in any benchmark, except for \texttt{h264ref}, which again speeds up by up to 5\% through unrolling.
The other benchmarks stay, compared to the reference, within the interval $\lbrack 99\%, 103\% \rbrack$.
It further becomes evident that there is no clear correlation between unrollability and performance gain: while \texttt{h264ref} has one of the highest unrollabilities and gains performance, \texttt{bzip2} and \texttt{mcf} have even higher unrollabilities yet see no improvement.
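For comparison with the sketch in \Cref{sec:eval:perf:duff}, the loop fixup variant can be pictured roughly as follows (again a hypothetical C illustration, not \libFIRM{} code):
\begin{verbatim}
#include <stddef.h>

/* Loop fixup (sketch): the main loop executes the unrolled body as
 * long as four full iterations remain; a small epilogue loop then
 * runs the remaining n % 4 iterations. */
void scale(int *a, size_t n, int c) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) { /* unrolled main loop, factor 4 */
        a[i]     *= c;
        a[i + 1] *= c;
        a[i + 2] *= c;
        a[i + 3] *= c;
    }
    for (; i < n; i++) {         /* fixup loop for the remainder */
        a[i] *= c;
    }
}
\end{verbatim}
Both sketches perform the same work and differ only in how the leftover iterations are dispatched, which makes the similar runtimes observed in the two experiments plausible.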
\input{fig/bench-result-duff.tex}
\input{fig/bench-result-loop.tex}