This is the ParallelAccelerator Julia package, part of the High Performance Scripting project at Intel Labs. Please join our discussion group http://groups.google.com/group/julia-hps/ to ask questions or discuss features/issues.
-
Julia v0.4.x. If you don't have it yet, there are various ways to install Julia:
- Go to http://julialang.org/downloads/ and download the appropriate version listed under "Current Release".
- On Ubuntu or Debian variants,
sudo add-apt-repository ppa:staticfloat/juliareleases -y && sudo apt-get update -q && sudo apt-get install julia -y
should work. - On OS X with Homebrew,
brew update && brew tap staticfloat/julia && brew install julia
should work.
Check that you can run Julia and get to a
julia>
prompt. You will know you're running the correct version if when you run it, you seeVersion 0.4.0
orVersion 0.4.1
. -
Either
gcc
/g++
oricc
/icpc
. We recommend GCC 4.8.4 or later and ICC 15.0.3 or later. At package build time, ParallelAccelerator will check to see if you have ICC installed. If so, ParallelAccelerator will use it. Otherwise, it will use GCC. Clang with GCC front-end (default on Mac OS X) also works. -
Platforms we have tested on so far include Ubuntu 14.04, CentOS 6.6, and Mac OS X Yosemite with both GCC and ICC.
At the julia>
prompt, run these commands:
Pkg.add("ParallelAccelerator") # Install this package and its dependencies.
Since we add features and fix bugs very frequently, we
recommend switching to our master
branch.
Pkg.checkout("ParallelAccelerator") # Switch to master branch
Pkg.checkout("CompilerTools") # Switch to master branch
Pkg.build("ParallelAccelerator") # Build the C++ runtime component of the package.
Pkg.test("CompilerTools") # Run CompilerTools tests.
Pkg.test("ParallelAccelerator") # Run ParallelAccelerator tests.
If all of the above succeeded, you should be ready to use ParallelAccelerator.
The examples/
subdirectory has a few example programs demonstrating
how to use ParallelAccelerator. You can run them at the command line.
For instance:
$ julia ~/.julia/v0.4/ParallelAccelerator/examples/laplace-3d/laplace-3d.jl
Run laplace-3d with size 300x300x300 for 100 iterations.
SELFPRIMED 18.663935711
SELFTIMED 1.527286803
checksum: 0.49989778
The SELFTIMED
line in the printed output shows the running time,
while the SELFPRIMED
line shows the time it takes to compile the
accelerated code and run it with a small "warm-up" input.
Pass the --help
option to see usage information for each example:
$ julia ~/.julia/v0.4/ParallelAccelerator/examples/laplace-3d/laplace-3d.jl -- --help
laplace-3d.jl
Laplace 6-point 3D stencil.
Usage:
laplace-3d.jl -h | --help
laplace-3d.jl [--size=<size>] [--iterations=<iterations>]
Options:
-h --help Show this screen.
--size=<size> Specify a 3d array size (<size> x <size> x <size>); defaults to 300.
--iterations=<iterations> Specify a number of iterations; defaults to 100.
You can also run the examples at the julia>
prompt:
julia> include("$(homedir())/.julia/v0.4/ParallelAccelerator/examples/laplace-3d/laplace-3d.jl")
Run laplace-3d with size 300x300x300 for 100 iterations.
SELFPRIMED 18.612651534
SELFTIMED 1.355707121
checksum: 0.49989778
Some of the examples require additional Julia packages. The top-level
REQUIRE
file in this repository lists all registered packages that
examples depend on.
To start using ParallelAccelerator in your own program, first import
the package with using ParallelAccelerator
, and then put the @acc
macro before the function you want to accelerate. A trivial example is
given below:
julia> using ParallelAccelerator
julia> @acc f(x) = x .+ x .* x
f (generic function with 1 method)
julia> f([1,2,3,4,5])
5-element Array{Int64,1}:
2
6
12
20
30
You can also use @acc begin ... end
, and put multiple functions in the block
to have all of them accelerated. The @acc
macro only works for top-level
definitions.
ParallelAccelerator is essentially a domain-specific compiler written in Julia.
It performs additional analysis and optimization on top of the Julia compiler.
ParallelAccelerator discovers and exploits the implicit parallelism in source programs that
use parallel programming patterns such as map, reduce, comprehension, and
stencil. For example, Julia array operators such as .+, .-, .*, ./
are
translated by ParallelAccelerator internally into data-parallel map operations over all
elements of input arrays. For the most part, these patterns are already
present in standard Julia, so programmers can use ParallelAccelerator to run
the same Julia program without (significantly) modifying the source code.
The @acc
macro provided by ParallelAccelerator first intercepts Julia
functions at the macro level and substitutes the set of implicitly parallel
operations that we are targeting. @acc
points them to those supplied in the
ParallelAccelerator.API
module. It then creates a proxy function that when
called with concrete arguments (and known types) will try to compile the
original function to an optimized form. Therefore, there is some compilation
time the first time an accelerated function is called. The subsequent
calls to the same function will not have compilation time overhead.
ParallelAccelerator performs aggressive optimizations when they are safe depending on the program structure.
For example, it will automatically infer size equivalence relations among array
variables and skip array bounds check whenever it can safely do so. Eventually all
parallel patterns are lowered into explicit parallel for
loops which are internally
represented at the level of Julia's typed AST. Aggressive loop fusion will
try to combine adjacent loops into one and eliminate temporary array objects
that store intermediate results.
Finally, functions with parallel for loops are translated into a C program with OpenMP pragmas, and ParallelAccelerator will use an external C/C++ compiler to compile it into binary form before loading it back into Julia as a dynamic library for execution. This step of translating Julia to C currently imposes certain constraints (see details below), and therefore we can only run user programs that meet such constraints.
As mentioned above, ParallelAccelerator aims to optimize implicitly parallel Julia programs that are safe to parallelize. It also tries to be non-invasive, which means a user function or program should continue to work as expected even when only a part of it is accelerated. It is still important to know what parts are accelerated, however. As a general guideline, we encourage users to write programs using high-level array operations rather than writing explicit for-loops which can have unrestricted mutations or unknown side-effects. High-level operations are more amenable to analysis and optimization provided by ParallelAccelerator.
To help user verify program correctness, the optimizations of ParallelAccelerator
can be turned off by setting environment variable PROSPECT_MODE=none
before
running Julia. Programs that use ParallelAccelerator will still run
(including those that use runStencil
, described below), but no optimizations or
Julia-to-C translation will take place. Users can also use @noacc
at the function call site to use the original version of the function.
Array operations that work uniformly on all elements of input arrays and
produce an output array of equal size are called point-wise
operations.
Point-wise
binary operations in Julia usually have a .
prefix in the
operator name. These operations are translated internally into data-parallel map operations by
ParallelAccelerator. The following are recognized by @acc
as map
operations:
-
Unary functions:
-, +, acos, acosh, angle, asin, asinh, atan, atanh, cbrt, cis, cos, cosh, exp10, exp2, exp, expm1, lgamma, log10, log1p, log2, log, sin, sinh, sqrt, tan, tanh, abs, copy, erf
-
Binary functions:
-, +, .+, .-, .*, ./, .\, .%, .>, .<, .<=, .>=, .==, .<<, .>>, .^, div, mod, rem, &, |, $, min, max
Array assignments are also recognized and converted into in-place map
operations. Expressions like a = a .+ b
will be turned into an in-place map
that takes two inputs arrays, a
and b
, and updates a
in-place.
Array operations that compute a single result by repeating an associative
and commutative operator on all input array elements are called reduce operations.
The following are recognized by @acc
as reduce
operations:
minimum, maximum, sum, prod, any, all
We also support range operations to a limited extent. For example, a[r] = b[r]
where r
is either a BitArray
or UnitRange
(e.g., 1:s
) is
internally converted to parallel operations when the ranges can be inferred
statically to be compatible. However, such support is still
experimental, and occasionally ParallelAccelerator will complain about not
being able to optimize them. We are working on improving this feature
to provide more coverage and better error messages.
Array comprehensions in Julia are in general also parallelizable, because
unlike general loops, their iteration variables have no inter-dependencies.
So the @acc
macro will turn them into an internal form that we call
cartesianarray
:
A = Type[ f (x1, x2, ...) for x1 in r1, x2 in r2, ... ]
becomes
cartesianarray((i1,i2,...) -> begin x1 = r1[i1]; x2 = r2[i2]; f(x1,x2,...) end,
Type,
(length(r1), length(r2), ...))
This cartesianarray
function is also exported by ParallelAccelerator
and
can be directly used by the user. Both the above two forms are acceptable
programs, and equivalent in semantics. They both produce a N-dimensional array
whose element is of Type
, where N
is the number of x
s and r
s, and
currently only up-to-3 dimensions are supported.
It should be noted, however, that not all comprehensions are safe to parallelize.
For example, if the function f
above reads and writes to a variable outside of the comprehension,
then making it run in parallel can produce a non-deterministic
result. Therefore, it is the responsibility of the user to avoid using @acc
in such situations.
Another difference between parallel comprehension and the aforementioned map operation is that array indexing operations in the body of a parallel comprehension remain explicit and therefore should go through necessary bounds-checking to ensure safety. On the other hand, in all map operations such bounds-checking is skipped.
Stencils are commonly found in scientific computing and image processing. A stencil
computation is one that computes new values for all elements of an array based
on the current values of their neighboring elements. Since Julia's base library
does not provide such an API, ParallelAccelerator exports a general
runStencil
interface to help with stencil programming:
runStencil(kernel :: Function, buffer1, buffer2, ...,
iteration :: Int, boundaryHandling :: Symbol)
As an example, the following (taken from our Gaussian blur example)
performs a 5x5 stencil computation (note the use of Julia's do
syntax that lets
the user write a lambda function):
runStencil(buf, img, iterations, :oob_skip) do b, a
b[0,0] =
(a[-2,-2] * 0.003 + a[-1,-2] * 0.0133 + a[0,-2] * 0.0219 + a[1,-2] * 0.0133 + a[2,-2] * 0.0030 +
a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * 0.0983 + a[1,-1] * 0.0596 + a[2,-1] * 0.0133 +
a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * 0.1621 + a[1, 0] * 0.0983 + a[2, 0] * 0.0219 +
a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * 0.0983 + a[1, 1] * 0.0596 + a[2, 1] * 0.0133 +
a[-2, 2] * 0.003 + a[-1, 2] * 0.0133 + a[0, 2] * 0.0219 + a[1, 2] * 0.0133 + a[2, 2] * 0.0030)
return a, b
end
It takes two input arrays, buf
and img
, and performs an iterative stencil
loop (ISL) of the number of iterations given by iterations
.
The stencil kernel is specified by a lambda
function that takes two arrays a
and b
(that correspond to buf
and
img
), and computes the value of the output buffer using relative indices
as if a cursor is traversing all array elements. [0,0]
represents
the current cursor position. The return
statement in this lambda reverses
the position of a
and b
to specify a buffer rotation that should happen
in between the stencil iterations. runStencil
assumes that
all input and output buffers are of the same dimension and size.
Stencil boundary handling can be specified as one of the following symbols:
:oob_skip
: Writing to output is skipped when input indexing is out-of-bound.:oob_wraparound
: Indexing is "wrapped around" at the array boundaries so they are always safe.:oob_dst_zero
: Write 0 to the output array when any of the input indices is out-of-bounds.:oob_src_zero
. Assume 0 is returned by a read operation when indexing is out-of-bounds.
Just as with parallel comprehension, accessing the variables outside the body
of the runStencil
lambda expression is allowed.
However, accessing outside array values is
not supported, and reading/writing the same outside variable can cause
non-determinism.
All arrays that need to be relatively indexed can be specified as
input buffers. runStencil
does not impose any implicit buffer rotation
order, and the user can choose not to rotate buffers in return
. There
can be multiple output buffers as well. Finally, the call to runStencil
does
not have any return value, and inputs are rotated iterations - 1
times if rotation is specified.
ParallelAccelerator exports a naive Julia implementation of runStencil
that
runs without using @acc
. Its purpose is mostly for correctness checking.
When @acc
is being used with environment variable PROSPECT_MODE=none
,
instead of parallelizing the stencil computation @acc
will expand the call
to runStencil
to a fast sequential implementation.
It is possible to embed a binary/compiled version of the ParallelAccelerator compiler and CompilerTools into a Julia executable. This has the potential to greatly reduce the time it takes for our compiler to accelerate a given program. For this feature, the user needs to have the Julia source code and should be able to rebuild Julia. Hence, Julia installations using ready binaries are not suitable for this purpose.
To use this feature, start the Julia REPL and do the following:
importall ParallelAccelerator
ParallelAccelerator.embed()
This version of embed()
tries to embed ParallelAccelerator into the Julia
version used by the current REPL.
If you want to target a different Julia distribution, you can alternatively use the following version of embed.
ParallelAccelerator.embed("<your/path/to/the/root/of/the/julia/distribution>")
This embed
function takes a path which is expected to point to the root
directory of a Julia source distribution. embed
performs some simple checks to
try to verify this fact. Then, embed will try to create a file
base/userimg.jl
that will tell the Julia build how to embed the compiled
version into the Julia executable. Finally, embed
runs make
in the Julia root
directory to create the embedded Julia version.
If there is already a userimg.jl
file in the base directory then a new file is
created called ParallelAccelerator_userimg.jl
and it then becomes the user's
responsibility to merge that with the existing userimg.jl
and run make
.
After the call to embed
finishes, when trying to exit the Julia REPL, you may
receive an exception like: ErrorException("ccall: could not find function git_libgit2_shutdown in library libgit2"
). This error does not effect the
embedding process but we are working towards a solution to this minor issue.
Currently, ParallelAccelerator tries to compile Julia to C, which puts some constraints on what can be successfully compiled and run:
-
We only support a limited subset of Julia language features currently. This includes basic numbers and dense array types, a subset of math functions, and basic control flow structures. Notably, we do not support
String
types and custom data types such as records and unions, since their translation to C is difficult. There is also no support for exceptions, I/O operations (only very limitedprintln
), and arbitraryccall
s. We also do not support keyword arguments. -
We do not support calling Julia functions from C in the optimized function. What this implies is that we transitively convert every Julia function in the call chain to C. If any of them is not translated properly, the target function with
@acc
will fail to compile. -
We do not support Julia's
Any
type in C, mostly to defend against erroneous translation. If the AST of a Julia function contains a variable withAny
type, our Julia-to-C translator will give up compiling the function. This is indeed more limiting than it sounds, because Julia does not annotate all expressions in a typed AST with complete type information. For example, this happens for some expressions that call Julia's own intrinsics. We are working on supporting more of them if we can derive the actual type to be notAny
, but this is still a work in progress.
At the moment ParallelAccelerator only supports the Julia-to-C back-end. We are working on alternatives that make use of Julia's upcoming threading implementation that hopefully can alleviate the above mentioned restrictions, without sacrificing much of the speed brought by quality C compilers and parallel runtimes such as OpenMP.
Apart from the constraints imposed by Julia-to-C translation, our current implementation of ParallelAccelerator has some other limitations:
-
We currently support a limited subset of Julia functions available in the
Base
library. However, not all Julia functions inBase
are supported yet, and using them may or may not work in ParallelAccelerator. For supported functions, we rely on capturing operator names to resolve array related functions and operators to our API module. This prevents them from being inlined by Julia which helps our translation. For unsupported functions such asmean(x)
, Julia's typed AST for the program that containsmean(x)
becomes a lowered call that is basically the the low-level sequential implementation which cannot be handled by ParallelAccelerator. Of course, adding support for functions likemean
is not a huge effort, and we are still in the process of expanding the coverage of supported APIs. -
ParallelAccelerator relies heavily on full type information being available in Julia's typed AST in order to work properly. Although we do not require user functions to be explicitly typed, it is in general a good practice to ensure the function that is being accelerated can pass Julia's type inference without leaving any parameters or internal variables with an
Any
type. There is currently no facility to help users understand whether something is being optimized or silently rejected. We plan to provide better report on what is going on under the hood.
Performance tuning is a hard problem, especially in high-performance scientific programs. ParallelAccelerator is but a first step of bringing reasonable parallel performance to a productivity language by utilizing both domain-specific and general compiler optimization techniques without much effort from the user. However, eventually there will be bottlenecks, and not all optimizations work in favor of each other. ParallelAccelerator is still a proof-of-concept at this stage and we hope to hear from all users. We welcome bug reports and code contributions. Please feel free to use our system, fork the project, contact us by email, and file bug reports on the issue tracker.