Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCLomatic] Fix sycl to SYCLomatic pull down failed. #2148

Open
wants to merge 3,158 commits into
base: SYCLomatic
Choose a base branch
from

Conversation

daiyaan-ahmed6
Copy link
Contributor

[SYCLomatic] Fix sycl to SYCLomatic pull down failed

Signed-off-by: Daiyaan Ahmed [email protected]

VyacheslavLevytskyy and others added 30 commits June 25, 2024 10:56
…#96514)

This PR fixes llvm/llvm-project#96513.

The way of creation of array type constant was incorrect: instead of
creating [1, 1, 1] or [1, 1, 1, 1, 1, ....] constants, the same [1]
constant was always created, substituting original composite constants.
This in its turn led to a situation when only one of constants might
exist in the code without emitting invalid code, the second constant
would be eventually rewritten to the first constant, because a key to
address both was an array of a single element (like [1]).

This PR fixes the issue and purges from the code unneeded copy/pasted
clone of the function that creates an array constant.
…nes (#95269)

This patch extends #73964 and adds optimisation of load SVE intrinsics
when predicate is zero.
The handling of `PointerType` is similar to `HeapType`. The only
difference is that allocated flag is generated for `HeapType` and
associated flag for `PointerType`. The tests for pointer to allocatable
strings are disabled for now. I will enable them once #95906 is merged.

The debugging in GDB looks like this:
    
      integer, pointer :: par2(:)
      integer, target, allocatable :: ar2(:) 
      integer, target :: sc
      integer, pointer :: psc
      allocate(ar2(4))
      par2 => ar2
      psc => sc
    
    19        par2 => ar2
    (gdb) p par2
    $3 = <not associated>
    (gdb) n
    20        do i=1,5
    (gdb) p par2
    $4 = (0, 0, 0, 0)
    (gdb) ptype par2
    type = integer (4)
    (gdb) p sc
    $5 = 3
    (gdb) p psc
    $6 = (PTR TO -> ( integer )) 0x7fffffffda24
    (gdb) p *psc
    $7 = 3
…ing for generic types (#89217)

This patch is intended to be the first of a series with end goal to
adapt atomic optimizer pass to support i64 and f64 operations (along
with removing all unnecessary bitcasts). This legalizes 64 bit readlane,
writelane and readfirstlane ops pre-ISel

---------

Co-authored-by: vikramRH <[email protected]>
This change adds methods like buildGetFPEnv and similar for opcodes that
represent manipulation on floating-point state.
This changes the behaviour in C++03 mode because we'll now use the
builtin on Clang, but I don't think that's much of a problem.
This header used three-space indentation in a number of places.
Reformat it completely.
This FIXME has already been addressed in #89358
Instead for iterating over all VFs when computing costs, simply iterate
over the VFs available in the created VPlans.

Split off from llvm/llvm-project#92555.

This also prepares for moving the check if any vector instructions will
be generated to be based on VPlan, to unblock recommitting
llvm/llvm-project#92555.
Without the store, the vector loop body is empty. Add a store to avoid
that, while not impacting the induction resume values that are created.
This patch implements lowering of the GlobalAddress, BlockAddress,
JumpTable and BR_JT. Also patch adds legal support of the BR_CC
operation for i32 type.
Some of these are just old, while others previously did not use
UTC due to missing features that have since been implemented
(such as signature matching).
Since we mark the pseudos as mayLoad but do not provide any MMOs,
isSafeToMove conservatively returns false, stopping MachineLICM from
hoisting the instructions. PseudoLA_TLS_{LD,GD} does not actually expand
to a load, so stop marking that as mayLoad to allow it to be hoisted,
and for the others make sure to add MMOs during lowering to indicate
they're GOT loads and thus can be freely moved.
… (#95061)

This patch augments the HIPAMD driver to allow it to target AMDGCN
flavoured SPIR-V compilation. It's mostly straightforward, as we re-use
some of the existing SPIRV infra, however there are a few notable
additions:

- we introduce an `amdgcnspirv` offload arch, rather than relying on
using `generic` (this is already fairly overloaded) or simply using
`spirv` or `spirv64` (we'll want to use these to denote unflavoured
SPIRV, once we bring up that capability)
- initially it is won't be possible to mix-in SPIR-V and concrete AMDGPU
targets, as it would require some relatively intrusive surgery in the
HIPAMD Toolchain and the Driver to deal with two triples
(`spirv64-amd-amdhsa` and `amdgcn-amd-amdhsa`, respectively)
- in order to retain user provided compiler flags and have them
available at JIT time, we rely on embedding the command line via
`-fembed-bitcode=marker`, which the bitcode writer had previously not
implemented for SPIRV; we only allow it conditionally for AMDGCN
flavoured SPIRV, and it is handled correctly by the Translator (it ends
up as a string literal)

Once the SPIRV BE is no longer experimental we'll switch to using that
rather than the translator. There's some additional work that'll come
via a separate PR around correctly piping through AMDGCN's
implementation of `printf`, for now we merely handle its flags
correctly.
Extends llvm/llvm-project#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - its just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but lower instruction count).

Fixes llvm/llvm-project#90748
This adds a new pattern that can legalize a multi-tile transfer_write as
a single store loop. This is done as part of type decomposition as at
this level we know each tile write is disjoint, but that information is
lost after decomposition (without analysis to reconstruct it).

Example (pseudo-MLIR):

```
vector.transfer_write %vector, %dest[%y, %x], %mask
  : vector<[16]x[8]xi16>, memref<?x?xi16>
```
Is rewritten to:
```
scf.for %slice_idx = %c0 to %c8_vscale step %c1 {
  %upper_slice_mask = vector.extract %mask[%slice_idx] ─┐
    : vector<[8]xi1> from vector<[16]x[8]xi1>           |
  %upper_slice = vector.extract %upper_tile[%slice_idx] |- Store upper tile
    : vector<[8]xi16> from vector<[8]x[8]xi16>          |
  vector.transfer_write %upper_slice,                   |
    %dest[%slice_idx + %y, %x], %upper_slice_mask       |
    : vector<[8]xi16>, memref<?x?xi16>                  ┘
  %lower_slice_idx = %slice_idx + %c8_vscale                 ─┐
  %lower_slice_mask = vector.extract %mask[%lower_slice_idx]  |
    : vector<[8]xi1> from vector<[16]x[8]xi1>                 |
  %lower_slice = vector.extract %lower_tile[%slice_idx]       |- Store lower
    : vector<[8]xi16> from vector<[8]x[8]xi16>                |  tile
  vector.transfer_write %lower_slice,                         |
    %dest[%lower_slice_idx + %y, %x], %lower_slice_mask       |
    : vector<[8]xi16>, memref<?x?xi16>                        ┘
}
```
…(#95424)

Summary:
This patch adds `CMAKE_REQUIRED_FLAGS` for the GPU build so checks like
`check_cxx_compiler_flags` work as expected. This is required because we
need to hack around the potential lack of `nvlink` and `ptxas` for NVPTX
targets and the fact that the AMDGPU target needs `-nogpulib` to avoid
errors on lack of ROCm. This makes a few of the checks pass and also
allows us to just check `-mcpu=native` for architecture detection
instead of finding the tools manually.
Introduce a Loop::getLocStr stolen from LoopVectorize's static function
getDebugLocString in order to have uniform debug output headers across
LoopVectorize, LoopAccessAnalysis, and LoopDistribute. The motivation
for this change is to have UpdateTestChecks recognize the headers and
automatically generate CHECK lines for debug output, with minimal
special-casing.
-1 mask elements are specified to return poison rather than undef
nowadays , so update the reorderScalars() implementation to match.
The lowering of copyprivate clauses with allocatable or pointer
variables was incorrect. This happened because the values passed to
copyVar() are always wrapped in SymbolBox::Intrinsic, which
resulted in allocatable/pointer variables being handled as regular
ones.

This is fixed by providing to copyVar() the attributes of the
variables being copied, to make it possible to detect and handle
allocatable/pointer variables correctly.

Fixes #95801
I've not added any new tests for these, because the original conditions
were wrong (they did not consider streaming mode) and we have tests for
the positive cases.
And rename an argument to avoid an upper/lowercase clash.
AD2605 and others added 25 commits July 8, 2024 08:33
…ster_group (#14113)

This PR is a partial implementation of
[`sycl_ext_oneapi_cuda_cluster_group`](intel/llvm#13594),
introducing the `cluster_size` property to launch a kernel with CUDA's
thread block clusters

Only a small part of the extension specification described in
intel/llvm#13594 is used in this implementation.
To be specific everything after the section "Launching a kernel with a
`cluster_group`" is not included in this PR. A very important point to
note is that this PR still fully represents a functional use case of
using Nvidia's cuda driver cluster launch feature for its primary
purpose which is to accelerate cross-work-group collective operations
(particularly for GEMM), leveraging cross-work group asynchronous
multi-casting of distributed shared memory across work-groups.
This is a high priority feature that is targeted for the next release.

The other parts of the extension specification described in
intel/llvm#13594, primarily related to the
"cluster_group" abstraction is a (user-facing) convenience abstraction
that is not required to be exposed in libraries that optimize such
library collective operations (GEMM). Please therefore focus reviews of
this PR on the relevant aspects of the extension that are required for
the implementation in this PR and the library based application of it as
described in this message.

---------

Signed-off-by: JackAKirk <[email protected]>
…roup.hpp` (#13760)

Not needed anymore after deprecated shuffles/collectives were removed.
We cannot currently update the kernel binary of a node in Whole Graph
Update. Rather than silently accepting inconsistent kernel functions,
which indicates the graphs aren't topologically identical, throw an
error when the kernel types of two nodes are mismatched.

This change requires removing the unittest for barrier nodes in Whole
Graph Update as the mock infrastructure does not
setup the internal `CG` class to the depth required to test working
functionality. This
functionality is already covered by
`test-e2e/Graph/Update/whole_update_barrier_node.cpp`
…n (#14407)

pre-commit PR for
oneapi-src/unified-runtime#1809

---------

Signed-off-by: Neil R. Spruit <[email protected]>
Co-authored-by: Kenneth Benzie (Benie) <[email protected]>
This PR:
1. Adds a new Intel FPGA experimental SYCL extension that introduces
support for `task_sequences` which provides a sub kernel task level
parallelism interface.
2. Updates the fpga_kernel_interface SYCL extension to add the
`stall_enable_clusters` and `stall_free_clusters` properties. These are
used with task sequences as well.

---------

Co-authored-by: Steffen Larsen <[email protected]>
Co-authored-by: GarveyJoe <[email protected]>
Co-authored-by: Abhishek Tiwari <[email protected]>
Co-authored-by: Xue, Bowen <[email protected]>
Co-authored-by: Adel Ejjeh <[email protected]>
Co-authored-by: Ejjeh, Adel <[email protected]>
…424)

It would appear that I missed a file in #14175. I looked over the files
again and did not catch anything else, but do let me know if there is
more.
Couldn't do this earlier because of a driver issue.

Signed-off-by: Sarnie, Nick <[email protected]>
Bump certifi to 2024.07.04

Certifi 2024.07.04 removes root certificates from "GLOBALTRUST" from the
root store. These are in the process of being removed from Mozilla's
trust store.

GLOBALTRUST's root certificates are being removed pursuant to an
investigation which identified "long-running and unresolved compliance
issues". Conclusions of Mozilla's investigation can be found
[here](https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/XpknYMPO8dI).

Upgrading will resolve this issue.
I missed this in
intel/llvm@4a87b2c
so the change didn't actually do anything.

Signed-off-by: Sarnie, Nick <[email protected]>
…(#14425)

I verified manually this option fixes the issue that was being tracked
internally.

Thanks to @aelovikov-intel for help finding the option that was
required.

Signed-off-by: Sarnie, Nick <[email protected]>
Needed until KhronosGroup/SYCL-CTS#907 is merged

Signed-off-by: Sarnie, Nick <[email protected]>
This is mostly relevant when sycl-ls might be run in the context of
really old drivers/lzloader. It is a workaround for some Win test
systems.
One of the SYCL 1.2 exception subclasses being removed during the ABI
breaking window since these subclasses have been removed in SYCL 2020.

We still want to be able to carry information about backend error code,
so `MPIErr` data member in `sycl::exception` remains. However, since
none of the standard ctors can accept it (per the specification), I've
introduced a new helper `detail::set_pi_error` to set it. It should only
be used when propagating an error comming from the PI/UR level and *NOT*
for the error conditions originated in the SYCL RT (which is a change
from previous implementation but a logical/justifiable one).
`assert.cpp`: replace use of deprecated
`get_info<info::device::extensions>` with `opencl::has_extension`

`string_test.cpp`: replace use of deprecated `get_pointer` with
`get_multi_ptr`
Not removing the exception subclass itself as that would result in merge
conflicts with our PRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.