pytorch-4-test/aten/src/ATen/native/cpu at main · bla2ej/pytorch-4-test

Name
Activation.cpp
AdaptiveAvgPoolKernel.cpp
AdaptiveMaxPoolKernel.cpp
AmpGradScalerKernels.cpp
AtomicAddFloat.h
AvgPoolKernel.cpp
BinaryOpsKernel.cpp
BlasKernel.cpp
CatKernel.cpp
CatKernel.h
ChannelShuffleKernel.cpp
ChannelShuffleKernel.h
ComplexKernel.cpp
CopyKernel.cpp
CopyKernel.h
CrossKernel.cpp
DepthwiseConvKernel.cpp
DepthwiseConvKernel.h
DistanceOpsKernel.cpp
DistributionKernels.cpp
DistributionTemplates.h
FillKernel.cpp
FlashAttentionKernel.cpp
FunctionOfAMatrixUtilsKernel.cpp
FusedAdagradKernel.cpp
FusedAdamKernel.cpp
FusedSGDKernel.cpp
GridSamplerKernel.cpp
GridSamplerKernel.h
HistogramKernel.cpp
IndexKernel.cpp
IndexKernelUtils.h
Intrinsics.h
IsContiguous.h
LerpKernel.cpp
LinearAlgebraKernel.cpp
LogAddExp.h
Loops.h
MaxPoolKernel.cpp
MaxPooling.cpp
MaxUnpoolKernel.cpp
MaxUnpoolKernel.h
MultinomialKernel.cpp
NativeMultiheadAttnKernel.cpp
PaddingKernel.cpp
PixelShuffleKernel.cpp
PixelShuffleKernel.h
PointwiseOpsKernel.cpp
PowKernel.cpp
README.md
RangeFactoriesKernel.cpp
Reduce.h
ReduceAllOpsKernel.cpp
ReduceOpsKernel.cpp
ReduceUtils.h
RenormKernel.cpp
SampledAddmmKernel.cpp
SampledAddmmKernel.h
ScatterGatherKernel.cpp
SerialStackImpl.h
SoftMaxKernel.cpp
SoftmaxKernel.h
SortingKernel.cpp
SparseFactories.cpp
SpmmReduceKernel.cpp
SpmmReduceKernel.h
StackKernel.cpp
StackKernel.h
SumKernel.cpp
TensorCompareKernel.cpp
UnaryOpsKernel.cpp
Unfold2d.cpp
UnfoldBackwardKernel.cpp
UpSampleKernel.cpp
UpSampleKernelAVXAntialias.h
UpSampleMoreKernel.cpp
WeightNormKernel.cpp
WeightNormKernel.h
airy_ai.cpp
avx_mathfun.h
batch_norm_kernel.cpp
group_norm_kernel.cpp
int4mm_kernel.cpp
int8mm_kernel.cpp
int_mm_kernel.h
layer_norm_kernel.cpp
mixed_data_type.h
moments_utils.h
scaled_modified_bessel_k0.cpp
scaled_modified_bessel_k1.cpp
spherical_bessel_j0.cpp
utils.h
zmath.h

README.md

The most important things to know:

Don't add a kernel to this folder unless you want it to be compiled multiple times for different instruction sets. Yes, this folder is named cpu, but that doesn't mean you should put any old CPU kernel in it. Only put CPU kernels here that need to be compiled multiple times so that they can use AVX512/AVX2/SSE instructions on the processors that support them.

Ensure that all implementations in this folder are put in an anonymous namespace. The files in this folder are compiled multiple times with different headers. It's important that these functions have internal linkage so that kernels for different architectures don't get combined during linking. It's sufficient to label functions "static", but class methods must be in an unnamed namespace to have internal linkage (since static means something different in the context of classes).
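A minimal sketch of the two ways to get internal linkage described above. The names add_one, Accumulator and sum_to are hypothetical and only serve as an illustration; they do not appear in this folder.

```cpp
#include <cstdint>

// Option 1: a free function marked static has internal linkage.
static int64_t add_one(int64_t x) {
  return x + 1;
}

// Option 2: everything declared inside an unnamed namespace has internal
// linkage, including the member functions of classes defined there.
namespace {

struct Accumulator {  // hypothetical helper class
  int64_t total = 0;
  void add(int64_t x) { total += x; }
};

// Sums the integers 1..n using the helper class above.
int64_t sum_to(int64_t n) {
  Accumulator acc;
  for (int64_t i = 1; i <= n; ++i) {
    acc.add(i);
  }
  return acc.total;
}

} // namespace
```

Because neither symbol is visible outside its translation unit, compiling the same file several times with different flags cannot produce clashing definitions at link time.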

The basic recipe is to define your kernel, and then register it using DECLARE/REGISTER DISPATCH. Writing a kernel requires four steps:

1. Declare your dispatch in a header file using DECLARE_DISPATCH(fn_type, fnNameImpl); where fn_type is the function pointer type of the kernel (e.g., defined as using fn_type = void(*)(Tensor&, const Tensor&)) and fnNameImpl is the name of your dispatch registry. (It doesn't really matter where you put this declaration.)
2. Define your dispatch in a C++ file that is NOT in the cpu directory (dispatch must be defined exactly once) using DEFINE_DISPATCH(fnNameImpl) (matching the name of your declaration). Include the header file that declares the dispatch in this C++ file. Conventionally, we define the dispatch in the same file we will define our native function in.
3. Define a native function which calls into the dispatch using fnNameImpl(kCPU, arguments...), where the arguments are the arguments according to the fn_type you defined in the declaration.
4. Write your actual kernel (e.g., your_kernel) in the cpu directory and register it to the dispatch. If the kernel does not perform as well with AVX512 as it does with AVX2, register it using REGISTER_DISPATCH(fnNameImpl, &your_kernel). If it does perform well with AVX512, register it with ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel) instead. Compute-intensive kernels tend to perform better with AVX512 than with AVX2. To compare the AVX2 and AVX512 variants of a kernel, register it with ALSO_REGISTER_AVX512_DISPATCH(fnNameImpl, &your_kernel), build from source, and then run a benchmarking script with the environment variables ATEN_CPU_CAPABILITY=avx2 and ATEN_CPU_CAPABILITY=avx512, respectively. tcmalloc/jemalloc can be preloaded to reduce run-to-run variation.
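The shape of this recipe can be sketched in plain C++ without the ATen headers. Everything below is a hypothetical stand-in: AddStub plays the role of the registry that DECLARE_DISPATCH/DEFINE_DISPATCH would create, the Registrar object plays the role of REGISTER_DISPATCH, and std::vector<float> stands in for Tensor. It is a toy model of the pattern, not the real macros.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device tag passed to the dispatch.
enum class DeviceType { CPU };
constexpr DeviceType kCPU = DeviceType::CPU;

// Step 1 (header): the kernel's function pointer type and the registry type.
using add_fn = void (*)(std::vector<float>& out, const std::vector<float>& in);

struct AddStub {
  add_fn cpu_kernel = nullptr;
  // Calling the registry forwards to whichever kernel was registered.
  void operator()(DeviceType, std::vector<float>& out,
                  const std::vector<float>& in) const {
    cpu_kernel(out, in);
  }
};

// Step 2 (a .cpp file outside cpu/): define the registry exactly once.
AddStub addImpl;

// Step 4 (a .cpp file inside cpu/): the kernel, kept in an unnamed
// namespace, plus a static object whose constructor registers it.
namespace {

void add_kernel(std::vector<float>& out, const std::vector<float>& in) {
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] += in[i];
  }
}

struct Registrar {
  Registrar() { addImpl.cpu_kernel = &add_kernel; }
} registrar;

} // namespace

// Step 3: the native function, which only calls into the registry.
// Assumes out and in have the same length.
std::vector<float>& add_(std::vector<float>& out, const std::vector<float>& in) {
  addImpl(kCPU, out, in);
  return out;
}
```

In this toy version there is a single kernel slot; the real registry keeps one slot per CPU capability and picks among them at runtime.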

There are plenty of existing examples; look at them for more details.

TODO: Clarify and add more documentation all around.

All of the *.cpp files in this folder will be compiled under all compiler flags specified by CPU_CAPABILITY_FLAGS in aten/src/ATen/CMakeLists.txt.

This allows the same source to be compiled with different compiler flags that enable features such as AVX2 or AVX512 instructions, while runtime dispatch makes sure that only valid instructions are used on any given platform.

vec.h provides a generic implementation of the vec type, which allows the programmer to write code that packs various primitives (such as floats) into 256-bit and 512-bit registers. vec defines various operators such as + and * and provides functions for operations such as max, min, etc.
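To make the idea concrete, here is a toy, scalar stand-in for such a vector type. Vec8f, loadu, store, maximum and mul_add_relu are hypothetical names chosen for this sketch; it is not the vec.h API, and it uses plain loops where the real type would use intrinsics.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // 8 floats = 256 bits

// Hypothetical fixed-width vector of floats with elementwise operators.
struct Vec8f {
  std::array<float, kLanes> lane{};

  static Vec8f loadu(const float* p) {
    Vec8f r;
    std::copy(p, p + kLanes, r.lane.begin());
    return r;
  }
  void store(float* p) const { std::copy(lane.begin(), lane.end(), p); }

  friend Vec8f operator+(const Vec8f& a, const Vec8f& b) {
    Vec8f r;
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
  }
  friend Vec8f operator*(const Vec8f& a, const Vec8f& b) {
    Vec8f r;
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
  }
};

// Elementwise maximum of two vectors.
inline Vec8f maximum(const Vec8f& a, const Vec8f& b) {
  Vec8f r;
  for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = std::max(a.lane[i], b.lane[i]);
  return r;
}

// out[i] = max(a[i] * b[i] + a[i], 0), written once against the vector
// type, with a scalar loop for the leftover elements.
void mul_add_relu(const float* a, const float* b, float* out, std::size_t n) {
  const Vec8f zero{};  // all lanes are 0
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    const Vec8f va = Vec8f::loadu(a + i);
    const Vec8f vb = Vec8f::loadu(b + i);
    maximum(va * vb + va, zero).store(out + i);
  }
  for (; i < n; ++i) {
    out[i] = std::max(a[i] * b[i] + a[i], 0.0f);
  }
}
```

The kernel body never mentions a register width directly, which is what lets the same source be recompiled for different instruction sets.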

As an example, ReduceOpsKernel.cpp implements a generic kernel_ that reduces an entire array using a given associative binary operation such as +.

More explicitly, calling kernel_ with template argument std::plus will cause it to sum up the entire array into a single value.
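A minimal sketch of a reduction templated on the binary operation, assuming a hypothetical function reduce_all that takes the operation as a template argument in the same spirit. It is not the kernel_ from ReduceOpsKernel.cpp. The several partial accumulators show why the operation needs to be associative: elements are combined in a different order than a plain left-to-right loop would use.

```cpp
#include <cstddef>
#include <functional>

// Reduces data[0..n) with Op<float>, starting from the given identity
// element (0 for std::plus, 1 for std::multiplies).
template <template <typename> class Op>
float reduce_all(const float* data, std::size_t n, float identity) {
  constexpr std::size_t kLanes = 4;  // number of partial accumulators
  Op<float> op;

  float acc[kLanes];
  for (float& a : acc) a = identity;

  // Main loop: each accumulator handles every kLanes-th element.
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t k = 0; k < kLanes; ++k) {
      acc[k] = op(acc[k], data[i + k]);
    }
  }

  // Combine the partial results, then fold in the leftover elements.
  float result = identity;
  for (std::size_t k = 0; k < kLanes; ++k) result = op(result, acc[k]);
  for (; i < n; ++i) result = op(result, data[i]);
  return result;
}
```

For example, reduce_all<std::plus>(data, n, 0.0f) sums the array, and reduce_all<std::multiplies>(data, n, 1.0f) computes its product.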

ReduceOpsKernel.cpp uses the CPU_CAPABILITY_* macros to "know" under which compiler flags it is currently compiled. This allows the programmer to write generic code, which will be compiled under multiple compilation settings.
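As a rough illustration, a source file can branch on such macros with the preprocessor. The macro names CPU_CAPABILITY_AVX512 and CPU_CAPABILITY_AVX2 below are assumed to follow the CPU_CAPABILITY_* pattern, and kFloatsPerVector, kVariant, padded_length and variant_name are hypothetical. The scalar fallback width of 1 is an assumption of this sketch, not a statement about the default build.

```cpp
#include <cstddef>

namespace {

// Exactly one branch is active per compilation, depending on which
// macro (if any) the build system defined for this pass.
#if defined(CPU_CAPABILITY_AVX512)
constexpr std::size_t kFloatsPerVector = 16;  // 512 bits / 32-bit float
constexpr const char* kVariant = "avx512";
#elif defined(CPU_CAPABILITY_AVX2)
constexpr std::size_t kFloatsPerVector = 8;   // 256 bits / 32-bit float
constexpr const char* kVariant = "avx2";
#else
constexpr std::size_t kFloatsPerVector = 1;   // scalar fallback (assumption)
constexpr const char* kVariant = "default";
#endif

// Rounds n up to a whole number of vectors for this compilation pass.
std::size_t padded_length(std::size_t n) {
  return (n + kFloatsPerVector - 1) / kFloatsPerVector * kFloatsPerVector;
}

// Reports which variant this translation unit was compiled as.
const char* variant_name() {
  return kVariant;
}

} // namespace
```

The generic code (padded_length here) is written once; only the constants it depends on change between compilation passes.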

../ReduceOps.cpp now includes the header ReduceOpsKernel.h, which contains a generic definition of sumImplAll. This function allows the user to reduce over a dimension or all dimensions. The appropriate capability is chosen at runtime using cpuinfo. If the current platform has AVX2, sumImpl will be set to sumImplAll<CPUCapability::AVX2>.
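The selection step might look like the following sketch. The enum values, sum_all_impl, sum_fn and choose_sum_impl are hypothetical, and the has_avx2 parameter stands in for whatever cpuinfo reports. In the real build each instantiation would come from a separately compiled copy of the kernel file, whereas here both live in one translation unit.

```cpp
#include <cstddef>

enum class CPUCapability { DEFAULT, AVX2 };

// One instantiation per capability; the bodies are identical in this
// sketch, but each would be compiled with different flags in practice.
template <CPUCapability C>
float sum_all_impl(const float* data, std::size_t n) {
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    total += data[i];
  }
  return total;
}

using sum_fn = float (*)(const float*, std::size_t);

// Picks the instantiation matching the detected capability.
sum_fn choose_sum_impl(bool has_avx2) {
  if (has_avx2) {
    return &sum_all_impl<CPUCapability::AVX2>;
  }
  return &sum_all_impl<CPUCapability::DEFAULT>;
}
```

A caller would store the returned pointer once and call through it afterwards, so the capability check is not repeated on every call.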

At runtime, the following environment variables control which codepath is taken:

x64 options:

    ATEN_CPU_CAPABILITY=avx2     # Force AVX2 codepaths to be used
    ATEN_CPU_CAPABILITY=avx      # Force AVX codepaths to be used
    ATEN_CPU_CAPABILITY=default  # Use oldest supported vector instruction set
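One way such an override could be read is sketched below. parse_capability and capability_from_env are hypothetical helpers, not the library's own code, and falling back to the detected capability for an unset or unrecognized value is an assumption of this sketch.

```cpp
#include <cstdlib>
#include <cstring>

enum class CPUCapability { DEFAULT, AVX, AVX2 };

// Maps the value of the environment variable to a capability. A null
// or unrecognized value keeps the detected capability (assumption).
CPUCapability parse_capability(const char* value, CPUCapability detected) {
  if (value == nullptr) return detected;
  if (std::strcmp(value, "avx2") == 0) return CPUCapability::AVX2;
  if (std::strcmp(value, "avx") == 0) return CPUCapability::AVX;
  if (std::strcmp(value, "default") == 0) return CPUCapability::DEFAULT;
  return detected;
}

// Reads ATEN_CPU_CAPABILITY and applies it on top of the detected value.
CPUCapability capability_from_env(CPUCapability detected) {
  return parse_capability(std::getenv("ATEN_CPU_CAPABILITY"), detected);
}
```

Keeping the parsing separate from the std::getenv call makes the mapping easy to check without touching the process environment.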
