From 59f726a52e75a7d731dc6fad191db73f255ab959 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Thu, 12 Dec 2024 20:44:37 +0000 Subject: [PATCH] build based on 4412bc8 --- previews/PR63/.documenter-siteinfo.json | 2 +- previews/PR63/generic/index.html | 8 ++++---- previews/PR63/index.html | 12 ++++++------ previews/PR63/options/index.html | 2 +- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/previews/PR63/.documenter-siteinfo.json b/previews/PR63/.documenter-siteinfo.json index 6fe5490..345b34f 100644 --- a/previews/PR63/.documenter-siteinfo.json +++ b/previews/PR63/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2024-12-12T20:36:28","documenter_version":"1.8.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2024-12-12T20:44:28","documenter_version":"1.8.0"}} \ No newline at end of file diff --git a/previews/PR63/generic/index.html b/previews/PR63/generic/index.html index f088370..29f0210 100644 --- a/previews/PR63/generic/index.html +++ b/previews/PR63/generic/index.html @@ -1,5 +1,5 @@ -Generic API · CUDSS.jl

LLᵀ and LLᴴ

LinearAlgebra.choleskyMethod
solver = cholesky(A::CuSparseMatrixCSR{T,Cint}; view::Char='F')

Compute the LLᴴ factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse Hermitian positive definite matrix stored in the CuSparseMatrixCSR format.

Keyword argument

  • view: A character that specifies which triangle of the sparse matrix is provided. Possible options are L for the lower triangle, U for the upper triangle, and F for the full matrix.

Output argument

  • solver: Opaque structure CudssSolver that stores the factors of the LLᴴ decomposition.
source
LinearAlgebra.cholesky!Method
solver = cholesky!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LLᴴ factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
+Generic API · CUDSS.jl

LLᵀ and LLᴴ

LinearAlgebra.choleskyMethod
solver = cholesky(A::CuSparseMatrixCSR{T,Cint}; view::Char='F')

Compute the LLᴴ factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse Hermitian positive definite matrix stored in the CuSparseMatrixCSR format.

Keyword argument

  • view: A character that specifies which triangle of the sparse matrix is provided. Possible options are L for the lower triangle, U for the upper triangle, and F for the full matrix.

Output argument

  • solver: Opaque structure CudssSolver that stores the factors of the LLᴴ decomposition.
source
LinearAlgebra.cholesky!Method
solver = cholesky!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LLᴴ factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
 using CUDSS
 using LinearAlgebra
 using SparseArrays
@@ -33,7 +33,7 @@
 
 R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
 norm(R_gpu)
Note

If we only store one triangle of A_gpu, we can also use the wrappers Symmetric and Hermitian instead of using the keyword argument view in cholesky. For real matrices, both wrappers are allowed but only Hermitian can be used for complex matrices.

H_gpu = Hermitian(A_gpu, :U)
-F = cholesky(H_gpu)

LDLᵀ and LDLᴴ

LinearAlgebra.ldltMethod
solver = ldlt(A::CuSparseMatrixCSR{T,Cint}; view::Char='F')

Compute the LDLᴴ factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse Hermitian matrix stored in the CuSparseMatrixCSR format.

Keyword argument

  • view: A character that specifies which triangle of the sparse matrix is provided. Possible options are L for the lower triangle, U for the upper triangle, and F for the full matrix.

Output argument

  • solver: Opaque structure CudssSolver that stores the factors of the LDLᴴ decomposition.
source
LinearAlgebra.ldlt!Method
solver = ldlt!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LDLᴴ factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
+F = cholesky(H_gpu)

LDLᵀ and LDLᴴ

LinearAlgebra.ldltMethod
solver = ldlt(A::CuSparseMatrixCSR{T,Cint}; view::Char='F')

Compute the LDLᴴ factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse Hermitian matrix stored in the CuSparseMatrixCSR format.

Keyword argument

  • view: A character that specifies which triangle of the sparse matrix is provided. Possible options are L for the lower triangle, U for the upper triangle, and F for the full matrix.

Output argument

  • solver: Opaque structure CudssSolver that stores the factors of the LDLᴴ decomposition.
source
LinearAlgebra.ldlt!Method
solver = ldlt!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LDLᴴ factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
 using CUDSS
 using LinearAlgebra
 using SparseArrays
@@ -67,7 +67,7 @@
 
 R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
 norm(R_gpu)
Note

If we only store one triangle of A_gpu, we can also use the wrappers Symmetric and Hermitian instead of using the keyword argument view in ldlt. For real matrices, both wrappers are allowed but only Hermitian can be used for complex matrices.

S_gpu = Symmetric(A_gpu, :L)
-F = ldlt(S_gpu)

LU

LinearAlgebra.luMethod
solver = lu(A::CuSparseMatrixCSR{T,Cint})

Compute the LU factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse square matrix stored in the CuSparseMatrixCSR format.

Output argument

  • solver: an opaque structure CudssSolver that stores the factors of the LU decomposition.
source
LinearAlgebra.lu!Method
solver = lu!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LU factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
+F = ldlt(S_gpu)

LU

LinearAlgebra.luMethod
solver = lu(A::CuSparseMatrixCSR{T,Cint})

Compute the LU factorization of a sparse matrix A on an NVIDIA GPU. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

Input argument

  • A: a sparse square matrix stored in the CuSparseMatrixCSR format.

Output argument

  • solver: an opaque structure CudssSolver that stores the factors of the LU decomposition.
source
LinearAlgebra.lu!Method
solver = lu!(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})

Compute the LU factorization of a sparse matrix A on an NVIDIA GPU, reusing the symbolic factorization stored in solver. The type T can be Float32, Float64, ComplexF32 or ComplexF64.

source
using CUDA, CUDA.CUSPARSE
 using CUDSS
 using LinearAlgebra
 using SparseArrays
@@ -96,4 +96,4 @@
 ldiv!(x_gpu, F, c_gpu)
 
 r_gpu = c_gpu - A_gpu * x_gpu
-norm(r_gpu)
+norm(r_gpu)
diff --git a/previews/PR63/index.html b/previews/PR63/index.html index d8f4b27..f59a99f 100644 --- a/previews/PR63/index.html +++ b/previews/PR63/index.html @@ -6,10 +6,10 @@ matrix = CudssMatrix(A::CuSparseMatrixCSR{T,Cint}, struture::String, view::Char; index::Char='O') matrix = CudssMatrix(v::Vector{CuVector{T}}) matrix = CudssMatrix(A::Vector{CuMatrix{T}}) -matrix = CudssMatrix(A::Vector{CuSparseMatrixCSR{T,Cint}}, struture::String, view::Char; index::Char='O')

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

CudssMatrix is a wrapper for CuVector, CuMatrix and CuSparseMatrixCSR. CudssMatrix is used to pass the matrix of the linear system, as well as the solution and the right-hand side.

structure specifies the structure for sparse matrices:

view specifies matrix view for sparse matrices:

index specifies indexing base for sparse matrix indices:

source
CUDSS.CudssConfigType
config = CudssConfig()

CudssConfig stores configuration settings for the solver.

source
CUDSS.CudssDataType
data = CudssData()
-data = CudssData(cudss_handle::cudssHandle_t)

CudssData holds internal data (e.g., LU factors arrays).

source
CUDSS.CudssSolverType
solver = CudssSolver(A::CuSparseMatrixCSR{T,Cint}, structure::String, view::Char; index::Char='O')
+matrix = CudssMatrix(A::Vector{CuSparseMatrixCSR{T,Cint}}, struture::String, view::Char; index::Char='O')

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

CudssMatrix is a wrapper for CuVector, CuMatrix and CuSparseMatrixCSR. CudssMatrix is used to pass the matrix of the linear system, as well as the solution and the right-hand side.

structure specifies the structure for sparse matrices:

  • "G": General matrix – LDU factorization;
  • "S": Real symmetric matrix – LDLᵀ factorization;
  • "H": Complex Hermitian matrix – LDLᴴ factorization;
  • "SPD": Symmetric positive-definite matrix – LLᵀ factorization;
  • "HPD": Hermitian positive-definite matrix – LLᴴ factorization.

view specifies matrix view for sparse matrices:

  • 'L': Lower-triangular matrix and all values above the main diagonal are ignored;
  • 'U': Upper-triangular matrix and all values below the main diagonal are ignored;
  • 'F': Full matrix.

index specifies indexing base for sparse matrix indices:

  • 'Z': 0-based indexing;
  • 'O': 1-based indexing.
source
CUDSS.CudssConfigType
config = CudssConfig()

CudssConfig stores configuration settings for the solver.

source
CUDSS.CudssDataType
data = CudssData()
+data = CudssData(cudss_handle::cudssHandle_t)

CudssData holds internal data (e.g., LU factors arrays).

source
CUDSS.CudssSolverType
solver = CudssSolver(A::CuSparseMatrixCSR{T,Cint}, structure::String, view::Char; index::Char='O')
 solver = CudssSolver(A::Vector{CuSparseMatrixCSR{T,Cint}}, structure::String, view::Char; index::Char='O')
-solver = CudssSolver(matrix::CudssMatrix{T}, config::CudssConfig, data::CudssData)

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

CudssSolver contains all structures required to solve linear systems with cuDSS. One constructor of CudssSolver takes as input the same parameters as CudssMatrix.

structure specifies the structure for sparse matrices:

  • "G": General matrix – LDU factorization;
  • "S": Real symmetric matrix – LDLᵀ factorization;
  • "H": Complex Hermitian matrix – LDLᴴ factorization;
  • "SPD": Symmetric positive-definite matrix – LLᵀ factorization;
  • "HPD": Hermitian positive-definite matrix – LLᴴ factorization.

view specifies matrix view for sparse matrices:

  • 'L': Lower-triangular matrix and all values above the main diagonal are ignored;
  • 'U': Upper-triangular matrix and all values below the main diagonal are ignored;
  • 'F': Full matrix.

index specifies indexing base for sparse matrix indices:

  • 'Z': 0-based indexing;
  • 'O': 1-based indexing.

CudssSolver can also be constructed from the three structures CudssMatrix, CudssConfig and CudssData if needed.

source

Functions

CUDSS.cudss_setFunction
cudss_set(matrix::CudssMatrix{T}, v::CuVector{T})
+solver = CudssSolver(matrix::CudssMatrix{T}, config::CudssConfig, data::CudssData)

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

CudssSolver contains all structures required to solve linear systems with cuDSS. One constructor of CudssSolver takes as input the same parameters as CudssMatrix.

structure specifies the structure for sparse matrices:

  • "G": General matrix – LDU factorization;
  • "S": Real symmetric matrix – LDLᵀ factorization;
  • "H": Complex Hermitian matrix – LDLᴴ factorization;
  • "SPD": Symmetric positive-definite matrix – LLᵀ factorization;
  • "HPD": Hermitian positive-definite matrix – LLᴴ factorization.

view specifies matrix view for sparse matrices:

  • 'L': Lower-triangular matrix and all values above the main diagonal are ignored;
  • 'U': Upper-triangular matrix and all values below the main diagonal are ignored;
  • 'F': Full matrix.

index specifies indexing base for sparse matrix indices:

  • 'Z': 0-based indexing;
  • 'O': 1-based indexing.

CudssSolver can also be constructed from the three structures CudssMatrix, CudssConfig and CudssData if needed.

source
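
As a minimal sketch of the last constructor, the three structures can be created separately and then assembled into a solver. The test matrix and the names A_cpu and A_gpu are illustrative assumptions, not part of the API, and a CUDA-capable GPU is required.

```julia
using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra
using SparseArrays

T = Float64
n = 100

# Illustrative test matrix: adding n*I makes it strictly diagonally
# dominant, hence nonsingular.
A_cpu = sprand(T, n, n, 0.05) + n * I
A_gpu = CuSparseMatrixCSR(A_cpu)

# Create the three structures separately ...
matrix = CudssMatrix(A_gpu, "G", 'F')  # general matrix, full view, default index 'O'
config = CudssConfig()
data = CudssData()

# ... and assemble them into a solver.
solver = CudssSolver(matrix, config, data)
```

Building the pieces by hand is mainly useful when config or data has to be adjusted before the solver exists; otherwise the constructor that takes A_gpu directly is shorter.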

Functions

CUDSS.cudss_setFunction
cudss_set(matrix::CudssMatrix{T}, v::CuVector{T})
 cudss_set(matrix::CudssMatrix{T}, A::CuMatrix{T})
 cudss_set(matrix::CudssMatrix{T}, A::CuSparseMatrixCSR{T,Cint})
 cudss_set(solver::CudssSolver{T}, A::CuSparseMatrixCSR{T,Cint})
@@ -19,10 +19,10 @@
 cudss_set(solver::CudssSolver{T}, A::Vector{CuSparseMatrixCSR{T,Cint}})
 cudss_set(solver::CudssSolver, parameter::String, value)
 cudss_set(config::CudssConfig, parameter::String, value)
-cudss_set(data::CudssData, parameter::String, value)

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

The available configuration parameters are:

  • "reordering_alg": Algorithm for the reordering phase ("default", "algo1", "algo2" or "algo3");
  • "factorization_alg": Algorithm for the factorization phase ("default", "algo1", "algo2" or "algo3");
  • "solve_alg": Algorithm for the solving phase ("default", "algo1", "algo2" or "algo3");
  • "matching_type": Type of matching;
  • "solve_mode": Potential modifier on the system matrix (transpose or adjoint);
  • "ir_n_steps": Number of steps during the iterative refinement;
  • "ir_tol": Iterative refinement tolerance;
  • "pivot_type": Type of pivoting ('C', 'R' or 'N');
  • "pivot_threshold": Pivoting threshold which is used to determine whether a diagonal element is subject to pivoting;
  • "pivot_epsilon": Pivoting epsilon, absolute value to replace singular diagonal elements;
  • "max_lu_nnz": Upper limit on the number of nonzero entries in LU factors for non-symmetric matrices;
  • "hybrid_mode": Memory mode – 0 (default = device-only) or 1 (hybrid = host/device);
  • "hybrid_device_memory_limit": User-defined device memory limit (number of bytes) for the hybrid memory mode;
  • "use_cuda_register_memory": A flag to enable (1) or disable (0) usage of cudaHostRegister() by the hybrid memory mode.

The available data parameters are:

  • "user_perm": User permutation to be used instead of running the reordering algorithms;
  • "comm": Communicator for Multi-GPU multi-node mode.
source
CUDSS.cudss_getFunction
value = cudss_get(solver::CudssSolver, parameter::String)
+cudss_set(data::CudssData, parameter::String, value)

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

The available configuration parameters are:

  • "reordering_alg": Algorithm for the reordering phase ("default", "algo1", "algo2" or "algo3");
  • "factorization_alg": Algorithm for the factorization phase ("default", "algo1", "algo2" or "algo3");
  • "solve_alg": Algorithm for the solving phase ("default", "algo1", "algo2" or "algo3");
  • "matching_type": Type of matching;
  • "solve_mode": Potential modifier on the system matrix (transpose or adjoint);
  • "ir_n_steps": Number of steps during the iterative refinement;
  • "ir_tol": Iterative refinement tolerance;
  • "pivot_type": Type of pivoting ('C', 'R' or 'N');
  • "pivot_threshold": Pivoting threshold which is used to determine whether a diagonal element is subject to pivoting;
  • "pivot_epsilon": Pivoting epsilon, absolute value to replace singular diagonal elements;
  • "max_lu_nnz": Upper limit on the number of nonzero entries in LU factors for non-symmetric matrices;
  • "hybrid_mode": Memory mode – 0 (default = device-only) or 1 (hybrid = host/device);
  • "hybrid_device_memory_limit": User-defined device memory limit (number of bytes) for the hybrid memory mode;
  • "use_cuda_register_memory": A flag to enable (1) or disable (0) usage of cudaHostRegister() by the hybrid memory mode.

The available data parameters are:

  • "user_perm": User permutation to be used instead of running the reordering algorithms;
  • "comm": Communicator for Multi-GPU multi-node mode.
source
CUDSS.cudss_getFunction
value = cudss_get(solver::CudssSolver, parameter::String)
 value = cudss_get(config::CudssConfig, parameter::String)
-value = cudss_get(data::CudssData, parameter::String)

The available configuration parameters are:

  • "reordering_alg": Algorithm for the reordering phase;
  • "factorization_alg": Algorithm for the factorization phase;
  • "solve_alg": Algorithm for the solving phase;
  • "matching_type": Type of matching;
  • "solve_mode": Potential modifier on the system matrix (transpose or adjoint);
  • "ir_n_steps": Number of steps during the iterative refinement;
  • "ir_tol": Iterative refinement tolerance;
  • "pivot_type": Type of pivoting;
  • "pivot_threshold": Pivoting threshold which is used to determine whether a diagonal element is subject to pivoting;
  • "pivot_epsilon": Pivoting epsilon, absolute value to replace singular diagonal elements;
  • "max_lu_nnz": Upper limit on the number of nonzero entries in LU factors for non-symmetric matrices;
  • "hybrid_mode": Memory mode – 0 (default = device-only) or 1 (hybrid = host/device);
  • "hybrid_device_memory_limit": User-defined device memory limit (number of bytes) for the hybrid memory mode;
  • "use_cuda_register_memory": A flag to enable (1) or disable (0) usage of cudaHostRegister() by the hybrid memory mode.

The available data parameters are:

  • "info": Device-side error information;
  • "lu_nnz": Number of non-zero entries in LU factors;
  • "npivots": Number of pivots encountered during factorization;
  • "inertia": Tuple of positive and negative indices of inertia for symmetric and Hermitian non positive-definite matrix types;
  • "perm_reorder_row": Reordering permutation for the rows;
  • "perm_reorder_col": Reordering permutation for the columns;
  • "perm_row": Final row permutation (which includes effects of both reordering and pivoting);
  • "perm_col": Final column permutation (which includes effects of both reordering and pivoting);
  • "diag": Diagonal of the factorized matrix;
  • "hybrid_device_memory_min": Minimal amount of device memory (number of bytes) required in the hybrid memory mode;
  • "memory_estimates": Memory estimates (in bytes) for host and device memory required for the chosen memory mode.

The data parameters "info", "lu_nnz", "perm_reorder_row", "perm_reorder_col", "hybrid_device_memory_min" and "memory_estimates" require that the "analysis" phase has been performed by cudss. The data parameters "npivots", "inertia" and "diag" require that both the "analysis" and "factorization" phases have been performed by cudss. The data parameters "perm_row" and "perm_col" are available but not yet functional.

source
CUDSS.cudssFunction
cudss(phase::String, solver::CudssSolver{T}, x::CuVector{T}, b::CuVector{T})
+value = cudss_get(data::CudssData, parameter::String)

The available configuration parameters are:

  • "reordering_alg": Algorithm for the reordering phase;
  • "factorization_alg": Algorithm for the factorization phase;
  • "solve_alg": Algorithm for the solving phase;
  • "matching_type": Type of matching;
  • "solve_mode": Potential modifier on the system matrix (transpose or adjoint);
  • "ir_n_steps": Number of steps during the iterative refinement;
  • "ir_tol": Iterative refinement tolerance;
  • "pivot_type": Type of pivoting;
  • "pivot_threshold": Pivoting threshold which is used to determine whether a diagonal element is subject to pivoting;
  • "pivot_epsilon": Pivoting epsilon, absolute value to replace singular diagonal elements;
  • "max_lu_nnz": Upper limit on the number of nonzero entries in LU factors for non-symmetric matrices;
  • "hybrid_mode": Memory mode – 0 (default = device-only) or 1 (hybrid = host/device);
  • "hybrid_device_memory_limit": User-defined device memory limit (number of bytes) for the hybrid memory mode;
  • "use_cuda_register_memory": A flag to enable (1) or disable (0) usage of cudaHostRegister() by the hybrid memory mode.

The available data parameters are:

  • "info": Device-side error information;
  • "lu_nnz": Number of non-zero entries in LU factors;
  • "npivots": Number of pivots encountered during factorization;
  • "inertia": Tuple of positive and negative indices of inertia for symmetric and Hermitian non positive-definite matrix types;
  • "perm_reorder_row": Reordering permutation for the rows;
  • "perm_reorder_col": Reordering permutation for the columns;
  • "perm_row": Final row permutation (which includes effects of both reordering and pivoting);
  • "perm_col": Final column permutation (which includes effects of both reordering and pivoting);
  • "diag": Diagonal of the factorized matrix;
  • "hybrid_device_memory_min": Minimal amount of device memory (number of bytes) required in the hybrid memory mode;
  • "memory_estimates": Memory estimates (in bytes) for host and device memory required for the chosen memory mode.

The data parameters "info", "lu_nnz", "perm_reorder_row", "perm_reorder_col", "hybrid_device_memory_min" and "memory_estimates" require that the "analysis" phase has been performed by cudss. The data parameters "npivots", "inertia" and "diag" require that both the "analysis" and "factorization" phases have been performed by cudss. The data parameters "perm_row" and "perm_col" are available but not yet functional.

source
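
The sketch below shows one possible way to combine cudss_set and cudss_get: a configuration parameter is set and read back, and a data parameter is queried once the phases it depends on have run. The test matrix, the variable names and the choice of one refinement step are illustrative assumptions; a CUDA-capable GPU is required.

```julia
using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra
using SparseArrays

T = Float64
n = 100

# Illustrative nonsingular test problem (strictly diagonally dominant).
A_cpu = sprand(T, n, n, 0.05) + n * I
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)
x_gpu = CUDA.zeros(T, n)

solver = CudssSolver(A_gpu, "G", 'F')

# Configuration parameter: request one step of iterative refinement,
# then read the value back.
cudss_set(solver, "ir_n_steps", 1)
ir = cudss_get(solver, "ir_n_steps")

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)

# Data parameter: "lu_nnz" is only meaningful after the analysis phase.
lu_nnz = cudss_get(solver, "lu_nnz")
```

Configuration parameters can be changed at any point before the phase they affect, whereas data parameters such as "lu_nnz" should only be queried after the phases they depend on.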
CUDSS.cudssFunction
cudss(phase::String, solver::CudssSolver{T}, x::CuVector{T}, b::CuVector{T})
 cudss(phase::String, solver::CudssSolver{T}, X::CuMatrix{T}, B::CuMatrix{T})
 cudss(phase::String, solver::CudssSolver{T}, x::Vector{CuVector{T}}, b::Vector{CuVector{T}})
 cudss(phase::String, solver::CudssSolver{T}, X::Vector{CuMatrix{T}}, B::Vector{CuMatrix{T}})
-cudss(phase::String, solver::CudssSolver{T}, X::CudssMatrix{T}, B::CudssMatrix{T})

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

The available phases are "analysis", "factorization", "refactorization" and "solve". The phases "solve_fwd", "solve_diag" and "solve_bwd" are available but not yet functional.

source
+cudss(phase::String, solver::CudssSolver{T}, X::CudssMatrix{T}, B::CudssMatrix{T})

The type T can be Float32, Float64, ComplexF32 or ComplexF64.

The available phases are "analysis", "factorization", "refactorization" and "solve". The phases "solve_fwd", "solve_diag" and "solve_bwd" are available but not yet functional.
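
The functional phases are typically chained as in the following sketch, which also shows "refactorization" after the numerical values of the matrix change while its sparsity pattern stays the same. The test matrix, the in-place scaling of A_gpu.nzVal and the variable names are illustrative assumptions; a CUDA-capable GPU is required.

```julia
using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra
using SparseArrays

T = Float64
n = 100

# Illustrative nonsingular test problem (strictly diagonally dominant).
A_cpu = sprand(T, n, n, 0.05) + n * I
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)
x_gpu = CUDA.zeros(T, n)

solver = CudssSolver(A_gpu, "G", 'F')

# Symbolic analysis, numerical factorization, then the solve.
cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
r_gpu = b_gpu - A_gpu * x_gpu
res1 = norm(r_gpu)

# Change the values but keep the sparsity pattern, then reuse the analysis.
A_gpu.nzVal .*= T(2)
cudss_set(solver, A_gpu)
cudss("refactorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
r_gpu = b_gpu - A_gpu * x_gpu
res2 = norm(r_gpu)
```

Reusing the analysis avoids repeating the reordering and symbolic work, which is why "refactorization" is only appropriate when the sparsity pattern is unchanged.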

source diff --git a/previews/PR63/options/index.html b/previews/PR63/options/index.html index aafc008..7955aa1 100644 --- a/previews/PR63/options/index.html +++ b/previews/PR63/options/index.html @@ -86,4 +86,4 @@ cudss("solve", solver, x_gpu, b_gpu) r_gpu = b_gpu - CuSparseMatrixCSR(A_cpu) * x_gpu -norm(r_gpu) +norm(r_gpu)