oneAPI Math Kernel Library (oneMKL) Interfaces Examples

oneAPI Math Kernel Library (oneMKL) Interfaces provides examples for the following routines:

  • blas: level3/gemm_usm
  • rng: uniform_usm
  • lapack: getrs_usm
  • dft: complex_fwd_usm, real_fwd_usm
  • sparse_blas: sparse_gemv_usm

Each routine has one run-time dispatching example and one compile-time dispatching example (which uses both the mklcpu and cuda backends), located in the examples/<$domain>/run_time_dispatching and examples/<$domain>/compile_time_dispatching subfolders, respectively.
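
The two modes differ only in how the backend is chosen. Below is a minimal sketch of a single-precision gemm call in each mode; the queues and pointers are hypothetical, so see the oneMKL Interfaces documentation for exact signatures.

#include <cstdint>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

void dispatch_demo(sycl::queue &cpu_queue, sycl::queue &nvidia_queue,
                   std::int64_t m, std::int64_t n, std::int64_t k,
                   float alpha, const float *A, std::int64_t lda,
                   const float *B, std::int64_t ldb,
                   float beta, float *C, std::int64_t ldc) {
    using oneapi::mkl::transpose;

    // Run-time dispatching: the backend is picked from the queue's device
    // when the program runs (which is why it needs the shared-library build).
    oneapi::mkl::blas::column_major::gemm(
        cpu_queue, transpose::trans, transpose::nontrans, m, n, k,
        alpha, A, lda, B, ldb, beta, C, ldc);

    // Compile-time dispatching: the backend is fixed in the source through
    // a backend_selector, so the call goes straight to that backend.
    oneapi::mkl::blas::column_major::gemm(
        oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{nvidia_queue},
        transpose::trans, transpose::nontrans, m, n, k,
        alpha, A, lda, B, ldb, beta, C, ldc);
}

Run-time dispatching lets one binary target any enabled backend through ONEAPI_DEVICE_SELECTOR, while compile-time dispatching links the named backend libraries directly.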

To build the examples, use the CMake option -DBUILD_EXAMPLES=true.
The compile-time dispatching examples are built when -DBUILD_EXAMPLES=true is set and the cuda backend is enabled, because they run on both the mklcpu and cuda backends. The run-time dispatching examples are built when both -DBUILD_EXAMPLES=true and -DBUILD_SHARED_LIBS=true are set.

Example executables follow the naming convention example_<$domain>_<$routine>_<$backend> for compile-time dispatching examples and example_<$domain>_<$routine> for run-time dispatching examples, e.g. example_blas_gemm_usm_mklcpu_cublas and example_blas_gemm_usm.

Example outputs (blas, lapack, rng, dft, sparse_blas)

blas

Run-time dispatching examples with mklcpu backend

$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
$ ./bin/example_blas_gemm_usm

########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
#   gemm
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running BLAS GEMM USM example on CPU device.
Device name is: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
Running with single precision real data type:

                GEMM parameters:
                        transA = trans, transB = nontrans
                        m = 45, n = 98, k = 67
                        lda = 103, ldB = 105, ldC = 106
                        alpha = 2, beta = 3

                Outputting 2x2 block of A,B,C matrices:

                        A = [ 0.340188, 0.260249, ...
                            [ -0.105617, 0.0125354, ...
                            [ ...


                        B = [ -0.326421, -0.192968, ...
                            [ 0.363891, 0.251295, ...
                            [ ...


                        C = [ 0.00698781, 0.525862, ...
                            [ 0.585167, 1.59017, ...
                            [ ...

BLAS GEMM USM example ran OK.
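
Behind this output, the example allocates the matrices with USM, launches one gemm on the queue, and waits on the returned event. Here is a hedged sketch using the parameters from the run above, with initialization and error handling omitted:

#include <cstdint>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue q;  // device chosen at run time via ONEAPI_DEVICE_SELECTOR
    const std::int64_t m = 45, n = 98, k = 67;
    const std::int64_t lda = 103, ldb = 105, ldc = 106;
    const float alpha = 2.0f, beta = 3.0f;

    // Unified Shared Memory: one allocation visible to host and device.
    float *A = sycl::malloc_shared<float>(lda * m, q);  // transA = trans, so A is stored k x m
    float *B = sycl::malloc_shared<float>(ldb * n, q);  // B is stored k x n
    float *C = sycl::malloc_shared<float>(ldc * n, q);  // C is stored m x n
    // ... initialize A, B, C on the host ...

    using oneapi::mkl::transpose;
    auto done = oneapi::mkl::blas::column_major::gemm(
        q, transpose::trans, transpose::nontrans, m, n, k,
        alpha, A, lda, B, ldb, beta, C, ldc);
    done.wait();  // USM APIs return a sycl::event to synchronize on

    sycl::free(A, q);
    sycl::free(B, q);
    sycl::free(C, q);
}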

Run-time dispatching examples with mklgpu backend

$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
$ ./bin/example_blas_gemm_usm

########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
#   gemm
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running BLAS GEMM USM example on GPU device.
Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
Running with single precision real data type:

                GEMM parameters:
                        transA = trans, transB = nontrans
                        m = 45, n = 98, k = 67
                        lda = 103, ldB = 105, ldC = 106
                        alpha = 2, beta = 3

                Outputting 2x2 block of A,B,C matrices:

                        A = [ 0.340188, 0.260249, ...
                            [ -0.105617, 0.0125354, ...
                            [ ...


                        B = [ -0.326421, -0.192968, ...
                            [ 0.363891, 0.251295, ...
                            [ ...


                        C = [ 0.00698781, 0.525862, ...
                            [ 0.585167, 1.59017, ...
                            [ ...

BLAS GEMM USM example ran OK.

Compile-time dispatching example with both mklcpu and cublas backends

(Note that the mklcpu and cublas result matrices differ slightly in the last digits. This is expected: single-precision floating-point arithmetic has limited precision and is not associative, so backends that accumulate the products in a different order produce slightly different results.)
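
A tiny standalone illustration of that non-associativity (not part of the example code):

#include <cstdio>

int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    // The same three addends in two evaluation orders:
    std::printf("%g vs %g\n", (a + b) + c, a + (b + c));  // prints "1 vs 0"
}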

$ ./bin/example_blas_gemm_usm_mklcpu_cublas

########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
#   gemm
#
# Using single precision (float) data type
#
# Running on both Intel CPU and Nvidia GPU devices
#
########################################################################

Running BLAS GEMM USM example
Running with single precision real data type on:
        CPU device: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
        GPU device: TITAN RTX

                GEMM parameters:
                        transA = trans, transB = nontrans
                        m = 45, n = 98, k = 67
                        lda = 103, ldB = 105, ldC = 106
                        alpha = 2, beta = 3

                Outputting 2x2 block of A,B,C matrices:

                        A = [ 0.340188, 0.260249, ...
                            [ -0.105617, 0.0125354, ...
                            [ ...


                        B = [ -0.326421, -0.192968, ...
                            [ 0.363891, 0.251295, ...
                            [ ...


                        (CPU) C = [ 0.00698781, 0.525862, ...
                            [ 0.585167, 1.59017, ...
                            [ ...


                        (GPU) C = [ 0.00698793, 0.525862, ...
                            [ 0.585168, 1.59017, ...
                            [ ...

BLAS GEMM USM example ran OK on MKLCPU and CUBLAS

lapack

Run-time dispatching example with mklgpu backend

$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
$ ./bin/example_lapack_getrs_usm

########################################################################
# LU Factorization and Solve Example:
#
# Computes LU Factorization A = P * L * U
# and uses it to solve for X in a system of linear equations:
#   AX = B
# where A is a general dense matrix and B is a matrix whose columns
# are the right-hand sides for the systems of equations.
#
# Using apis:
#   getrf and getrs
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running LAPACK getrs example on GPU device.
Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
Running with single precision real data type:

                GETRF and GETRS parameters:
                        trans = nontrans
                        m = 23, n = 23, nrhs = 23
                        lda = 32, ldb = 32

                Outputting 2x2 block of A and X matrices:

                        A = [ 0.340188, 0.304177, ...
                            [ -0.105617, -0.343321, ...
                            [ ...


                        X = [ -1.1748, 1.84793, ...
                            [ 1.47856, 0.189481, ...
                            [ ...

LAPACK GETRS USM example ran OK
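
The sequence behind this run is a factorization followed by a solve, with caller-allocated scratchpads sized by query. Here is a hedged sketch using the parameters from the run above, with initialization and cleanup elided:

#include <cstdint>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue q;
    const std::int64_t n = 23, nrhs = 23, lda = 32, ldb = 32;

    float *A = sycl::malloc_shared<float>(lda * n, q);     // n x n system matrix
    float *B = sycl::malloc_shared<float>(ldb * nrhs, q);  // right-hand sides, solved in place
    std::int64_t *ipiv = sycl::malloc_shared<std::int64_t>(n, q);  // pivot indices
    // ... fill A and B on the host ...

    // LAPACK USM APIs need caller-provided scratchpads, sized by query.
    std::int64_t rf_size = oneapi::mkl::lapack::getrf_scratchpad_size<float>(q, n, n, lda);
    std::int64_t rs_size = oneapi::mkl::lapack::getrs_scratchpad_size<float>(
        q, oneapi::mkl::transpose::nontrans, n, nrhs, lda, ldb);
    float *rf_scratch = sycl::malloc_device<float>(rf_size, q);
    float *rs_scratch = sycl::malloc_device<float>(rs_size, q);

    // Factor A = P * L * U, then solve A * X = B using the factors.
    auto rf_done = oneapi::mkl::lapack::getrf(q, n, n, A, lda, ipiv,
                                              rf_scratch, rf_size);
    oneapi::mkl::lapack::getrs(q, oneapi::mkl::transpose::nontrans, n, nrhs,
                               A, lda, ipiv, B, ldb, rs_scratch, rs_size,
                               {rf_done}).wait();  // X overwrites B
    // ... free allocations ...
}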

Compile-time dispatching example with both mklcpu and cusolver backends

$ ./bin/example_lapack_getrs_usm_mklcpu_cusolver

########################################################################
# LU Factorization and Solve Example:
#
# Computes LU Factorization A = P * L * U
# and uses it to solve for X in a system of linear equations:
#   AX = B
# where A is a general dense matrix and B is a matrix whose columns
# are the right-hand sides for the systems of equations.
#
# Using apis:
#   getrf and getrs
#
# Using single precision (float) data type
#
# Running on both Intel CPU and NVIDIA GPU devices
#
########################################################################

Running LAPACK GETRS USM example
Running with single precision real data type on:
        CPU device :Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
        GPU device :TITAN RTX

                GETRF and GETRS parameters:
                        trans = nontrans
                        m = 23, n = 23, nrhs = 23
                        lda = 32, ldb = 32

                Outputting 2x2 block of A,B,X matrices:

                        A = [ 0.340188, 0.304177, ...
                            [ -0.105617, -0.343321, ...
                            [ ...


                        (CPU) X = [ -1.1748, 1.84793, ...
                            [ 1.47856, 0.189481, ...
                            [ ...


                        (GPU) X = [ -1.1748, 1.84793, ...
                            [ 1.47856, 0.189481, ...
                            [ ...

LAPACK GETRS USM example ran OK on MKLCPU and CUSOLVER

rng

Run-time dispatching example with mklgpu backend

$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
$ ./bin/example_rng_uniform_usm

########################################################################
# Generate uniformly distributed random numbers with philox4x32x10
# generator example:
#
# Using APIs:
#   default_engine uniform
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running RNG uniform usm example on GPU device
Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
Running with single precision real data type:
                generation parameters:
                        seed = 777, a = 0, b = 10
                Output of generator:
                        first 10 numbers of 1000:
8.52971 1.76033 6.04753 3.68079 9.04039 2.61014 3.75788 3.94859 7.93444 8.60436
Random number generator with uniform distribution ran OK
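
The generation itself takes only a few lines: construct an engine and a distribution, then generate into a USM allocation. A hedged sketch matching the parameters above:

#include <cstdint>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue q;
    const std::int64_t n = 1000;
    const std::uint64_t seed = 777;

    float *r = sycl::malloc_shared<float>(n, q);

    // default_engine is philox4x32x10; uniform over [a, b) = [0, 10).
    oneapi::mkl::rng::default_engine engine(q, seed);
    oneapi::mkl::rng::uniform<float> distribution(0.0f, 10.0f);

    oneapi::mkl::rng::generate(distribution, engine, n, r).wait();
    // r[0..9] corresponds to the "first 10 numbers of 1000" printed above.

    sycl::free(r, q);
}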

Compile-time dispatching example with both mklcpu and curand backends

$ ./bin/example_rng_uniform_usm_mklcpu_curand

########################################################################
# Generate uniformly distributed random numbers with philox4x32x10
# generator example:
#
# Using APIs:
#   default_engine uniform
#
# Using single precision (float) data type
#
# Running on both Intel CPU and Nvidia GPU devices
#
########################################################################

Running RNG uniform usm example
Running with single precision real data type:
        CPU device: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
        GPU device: TITAN RTX
                generation parameters:
                        seed = 777, a = 0, b = 10
                Output of generator on CPU device:
                        first 10 numbers of 1000:
8.52971 1.76033 6.04753 3.68079 9.04039 2.61014 3.75788 3.94859 7.93444 8.60436
                Output of generator on GPU device:
                        first 10 numbers of 1000:
3.52971 6.76033 1.04753 8.68079 4.48229 0.501966 6.78265 8.99091 6.39516 9.67955
Random number generator example with uniform distribution ran OK on MKLCPU and CURAND

dft

Compile-time dispatching example with MKLGPU backend

$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_complex_fwd_buffer_mklgpu

########################################################################
# Complex out-of-place forward transform for Buffer API's example:
#
# Using APIs:
#   Compile-time dispatch API
#   Buffer forward complex out-of-place
#
# Using single precision (float) data type
#
# For Intel GPU with Intel MKLGPU backend.
#
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
########################################################################

Running DFT Complex forward out-of-place buffer example
Using compile-time dispatch API with MKLGPU.
Running with single precision real data type on:
	GPU device :Intel(R) UHD Graphics 750 [0x4c8a]
DFT Complex USM example ran OK on MKLGPU
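
All the DFT examples follow the same descriptor workflow: create, configure, commit, compute. Below is a hedged sketch of the compile-time variant above; the transform length is hypothetical and configuration enum names have varied slightly across oneMKL versions.

#include <complex>
#include <cstdint>
#include <vector>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue q;
    const std::int64_t N = 16;  // hypothetical transform length

    // Single-precision complex forward descriptor, out-of-place.
    oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE,
                                 oneapi::mkl::dft::domain::COMPLEX> desc(N);
    desc.set_value(oneapi::mkl::dft::config_param::PLACEMENT,
                   oneapi::mkl::dft::config_value::NOT_INPLACE);

    // Compile-time dispatch commits against a named backend; the run-time
    // examples below instead call desc.commit(q) and let the queue's device decide.
    desc.commit(oneapi::mkl::backend_selector<oneapi::mkl::backend::mklgpu>{q});

    std::vector<std::complex<float>> in(N), out(N);
    // ... fill in ...
    {
        sycl::buffer<std::complex<float>, 1> in_buf(in.data(), sycl::range<1>(N));
        sycl::buffer<std::complex<float>, 1> out_buf(out.data(), sycl::range<1>(N));
        oneapi::mkl::dft::compute_forward(desc, in_buf, out_buf);
    }  // buffers synchronize back to the vectors on destruction
}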

Run-time dispatching example with MKLGPU, cuFFT, rocFFT and portFFT backends

$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_real_fwd_usm

########################################################################
# DFT complex in-place forward transform with USM API example:
#
# Using APIs:
#   USM forward complex in-place
#   Run-time dispatch
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running DFT complex forward example on GPU device
Device name is: Intel(R) UHD Graphics 750 [0x4c8a]
Running with single precision real data type:
DFT example run_time dispatch
DFT example ran OK
$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_real_fwd_usm

########################################################################
# DFT complex in-place forward transform with USM API example:
#
# Using APIs:
#   USM forward complex in-place
#   Run-time dispatch
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running DFT complex forward example on GPU device
Device name is: NVIDIA A100-PCIE-40GB
Running with single precision real data type:
DFT example run_time dispatch
DFT example ran OK
$ ./bin/example_dft_real_fwd_usm

########################################################################
# DFT complex in-place forward transform with USM API example:
#
# Using APIs:
#   USM forward complex in-place
#   Run-time dispatch
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running DFT complex forward example on GPU device
Device name is: AMD Radeon PRO W6800
Running with single precision real data type:
DFT example run_time dispatch
DFT example ran OK
$ LD_LIBRARY_PATH=lib/:$LD_LIBRARY_PATH ./bin/example_dft_real_fwd_usm
########################################################################
# DFT complex in-place forward transform with USM API example:
#
# Using APIs:
#   USM forward complex in-place
#   Run-time dispatch
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
#
########################################################################

Running DFT complex forward example on GPU device
Device name is: Intel(R) UHD Graphics 750
Running with single precision real data type:
DFT example run_time dispatch
Unsupported Configuration:
	oneMKL: dft/backends/portfft/commit: function is not implemented portFFT only supports complex to complex transforms

sparse_blas

Run-time dispatching examples with mklcpu backend

$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
$ ./bin/example_sparse_blas_gemv_usm

########################################################################
# Sparse Matrix-Vector Multiply Example: 
# 
# y = alpha * op(A) * x + beta * y
# 
# where A is a sparse matrix in CSR format, x and y are dense vectors
# and alpha, beta are floating point type precision scalars.
# 
# Using apis:
#   sparse::gemv
# 
# Using single precision (float) data type
# 
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
# 
########################################################################

Running Sparse BLAS GEMV USM example on CPU device.
Device name is: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Running with single precision real data type:

		sparse::gemv parameters:
			transA = nontrans
			nrows = 64
			alpha = 1, beta = 0

		 sparse::gemv example passed
	Finished
Sparse BLAS GEMV USM example ran OK.
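
The core of the example wraps the CSR arrays in a matrix handle and issues one gemv. A hedged sketch follows, with array setup elided; these signatures have changed across oneMKL Interfaces versions, so treat this as an outline rather than a definitive implementation.

#include <cstdint>
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"

// ia: row pointers, ja: column indices, a: values of the CSR matrix.
void sparse_gemv_demo(sycl::queue &q, std::int64_t nrows,
                      std::int64_t *ia, std::int64_t *ja, float *a,
                      float *x, float *y) {
    const float alpha = 1.0f, beta = 0.0f;

    // Wrap the CSR arrays in an opaque matrix handle.
    oneapi::mkl::sparse::matrix_handle_t A = nullptr;
    oneapi::mkl::sparse::init_matrix_handle(&A);
    oneapi::mkl::sparse::set_csr_data(q, A, nrows, nrows,
                                      oneapi::mkl::index_base::zero,
                                      ia, ja, a);

    // y = alpha * op(A) * x + beta * y
    auto done = oneapi::mkl::sparse::gemv(q, oneapi::mkl::transpose::nontrans,
                                          alpha, A, x, beta, y);
    done.wait();

    oneapi::mkl::sparse::release_matrix_handle(q, &A);
}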

Run-time dispatching examples with mklgpu backend

$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
$ ./bin/example_sparse_blas_gemv_usm

########################################################################
# Sparse Matrix-Vector Multiply Example: 
# 
# y = alpha * op(A) * x + beta * y
# 
# where A is a sparse matrix in CSR format, x and y are dense vectors
# and alpha, beta are floating point type precision scalars.
# 
# Using apis:
#   sparse::gemv
# 
# Using single precision (float) data type
# 
# Device will be selected during runtime.
# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
# available devices
# 
########################################################################

Running Sparse BLAS GEMV USM example on GPU device.
Device name is: Intel(R) HD Graphics 530 [0x1912]
Running with single precision real data type:

		sparse::gemv parameters:
			transA = nontrans
			nrows = 64
			alpha = 1, beta = 0

		 sparse::gemv example passed
	Finished
Sparse BLAS GEMV USM example ran OK.

Compile-time dispatching example with mklcpu backend

$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
$ ./bin/example_sparse_blas_gemv_usm_mklcpu

########################################################################
# Sparse Matrix-Vector Multiply Example: 
# 
# y = alpha * op(A) * x + beta * y
# 
# where A is a sparse matrix in CSR format, x and y are dense vectors
# and alpha, beta are floating point type precision scalars.
# 
# Using apis:
#   sparse::gemv
# 
# Using single precision (float) data type
# 
# Running on Intel CPU device
# 
########################################################################

Running Sparse BLAS GEMV USM example on CPU device.
Device name is: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Running with single precision real data type:

		sparse::gemv parameters:
			transA = nontrans
			nrows = 64
			alpha = 1, beta = 0

		 sparse::gemv example passed
	Finished
Sparse BLAS GEMV USM example ran OK.