JIT: Advance cursors through stores in strength reduction #105267

Merged
merged 16 commits into from
Oct 15, 2024

Conversation

@jakobbotsch jakobbotsch commented Jul 22, 2024

This allows us to look through CSE'd locals when an IV was CSE'd.

Example diff:

private static void CopyInts(int[] arr1, int[] arr2)
{
    if (arr2.Length != arr1.Length)
        return;

    for (int i = 0; i < arr1.Length; i++)
    {
        arr2[i] = arr1[i];
    }
}
@@ -1,14 +1,16 @@
 G_M55903_IG03:  ;; offset=0x000C
-       xor      eax, eax
        test     r8d, r8d
-       jle      SHORT G_M55903_IG05
-						;; size=7 bbWeight=0.50 PerfScore 0.75
+       jle      SHORT G_M55903_IG06
+						;; size=5 bbWeight=0.50 PerfScore 0.62
 
-G_M55903_IG04:  ;; offset=0x0013
-       mov      r10d, eax
-       mov      r9d, dword ptr [rcx+4*r10+0x10]
-       mov      dword ptr [rdx+4*r10+0x10], r9d
-       inc      eax
-       cmp      r8d, eax
-       jg       SHORT G_M55903_IG04
-						;; size=20 bbWeight=3.96 PerfScore 18.81
+G_M55903_IG04:  ;; offset=0x0011
+       mov      eax, 16
+						;; size=5 bbWeight=0.25 PerfScore 0.06
+
+G_M55903_IG05:  ;; offset=0x0016
+       mov      r10d, dword ptr [rcx+rax]
+       mov      dword ptr [rdx+rax], r10d
+       add      rax, 4
+       dec      r8d
+       jne      SHORT G_M55903_IG05
+						;; size=17 bbWeight=3.96 PerfScore 17.82
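
As a rough source-level reading of the diff above, the following C++ sketch contrasts the two loop shapes. It is an illustration only, not JIT output: the names `kDataOffset`, `make_array`, `copy_indexed`, and `copy_strength_reduced` are hypothetical, and the 16-byte header merely mimics the `0x10` displacement seen in the disassembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: mimics the 0x10-byte offset of the first element in the diff.
constexpr std::size_t kDataOffset = 16;

// Hypothetical helper: lays out a buffer as [16-byte header][int32 elements].
std::vector<unsigned char> make_array(const std::vector<std::int32_t>& values)
{
    std::vector<unsigned char> bytes(kDataOffset + values.size() * sizeof(std::int32_t));
    if (!values.empty())
    {
        std::memcpy(bytes.data() + kDataOffset, values.data(),
                    values.size() * sizeof(std::int32_t));
    }
    return bytes;
}

// "Before" shape: the index is the IV, and each access recomputes
// base + 4 * i + 16, like [rcx+4*r10+0x10] in the old code.
void copy_indexed(unsigned char* dst, const unsigned char* src, std::int32_t length)
{
    assert(dst != nullptr && src != nullptr);
    for (std::int32_t i = 0; i < length; i++)
    {
        std::size_t offset = kDataOffset + sizeof(std::int32_t) * static_cast<std::size_t>(i);
        std::memcpy(dst + offset, src + offset, sizeof(std::int32_t));
    }
}

// "After" shape: a byte-offset cursor starts at 16 and advances by 4,
// while the trip count is counted down to zero (add rax, 4 / dec r8d / jne).
void copy_strength_reduced(unsigned char* dst, const unsigned char* src, std::int32_t length)
{
    assert(dst != nullptr && src != nullptr);
    if (length <= 0)
    {
        return; // corresponds to the test/jle guard in front of the loop
    }

    std::size_t offset = kDataOffset;
    do
    {
        std::memcpy(dst + offset, src + offset, sizeof(std::int32_t));
        offset += sizeof(std::int32_t);
    } while (--length != 0);
}
```

Both functions copy the same elements; the second one simply removes the per-iteration multiply-and-add from the address computation and replaces the compare against the length with a down-counting loop.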

Arm64 especially benefits from strength reduction; for example, here is a benchmark for the above:

BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-GXMAGJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MLUOLU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method   | Toolchain | Mean     | Error     | Ratio |
|----------|-----------|----------|-----------|-------|
| CopyInts | Main      | 6.683 μs | 0.0010 μs | 1.00  |
| CopyInts | PR        | 5.014 μs | 0.0003 μs | 0.75  |

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 22, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jakobbotsch commented Jul 22, 2024

FYI @AndyAyersMS ... not sure if this makes preview 7 in time.

EgorBo commented Jul 22, 2024

@EgorBot -intel

using BenchmarkDotNet.Attributes;
using System.Linq;
using System.Runtime.CompilerServices;

namespace Loops
{
    public class StrengthReduction
    {
        private int[] _arrayInts;
        private int[] _dest;

        [GlobalSetup]
        public void Setup()
        {
            _arrayInts = Enumerable.Range(0, 10000).Select(i => i).ToArray();
            _dest = new int[_arrayInts.Length];
        }

        [Benchmark]
        public void CopyInts()
        {
            Copy(_dest, _arrayInts);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void Copy(int[] dest, int[] src)
        {
            if (src.Length != dest.Length)
                return;

            for (int i = 0; i < dest.Length; i++)
            {
                dest[i] = src[i];
            }
        }
    }
}

@jakobbotsch

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).

EgorBot commented Jul 22, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-NORFNC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-ACQJKF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method   | Toolchain | Mean     | Error     | Ratio |
|----------|-----------|----------|-----------|-------|
| CopyInts | Main      | 3.422 μs | 0.0148 μs | 1.00  |
| CopyInts | PR        | NA       | NA        | ?     |

Benchmarks with issues:
StrengthReduction.CopyInts: Job-ACQJKF(Toolchain=PR)

BDN_Artifacts.zip

@AndyAyersMS AndyAyersMS left a comment

I'm ok taking this. You have 45 mins to merge...

EgorBot commented Jul 22, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-CDNSFW : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-AWFJUW : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method   | Toolchain | Mean     | Error     | Ratio |
|----------|-----------|----------|-----------|-------|
| CopyInts | Main      | 3.423 μs | 0.0068 μs | 1.00  |
| CopyInts | PR        | 3.173 μs | 0.0085 μs | 0.93  |

BDN_Artifacts.zip

EgorBot commented Jul 23, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-GXMAGJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MLUOLU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method   | Toolchain | Mean     | Error     | Ratio |
|----------|-----------|----------|-----------|-------|
| CopyInts | Main      | 6.683 μs | 0.0010 μs | 1.00  |
| CopyInts | PR        | 5.014 μs | 0.0003 μs | 0.75  |

BDN_Artifacts.zip

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 6, 2024
@jakobbotsch jakobbotsch reopened this Oct 7, 2024
@dotnet dotnet unlocked this conversation Oct 7, 2024
@jakobbotsch

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch

cc @dotnet/jit-contrib PTAL @AndyAyersMS (you already approved, but this has changed a bit since then).

Diffs. In some cases a size regression, but almost always a perfscore improvement. For example, here's a canonical size regression:
[image: diff showing a canonical size regression]

Comment on lines +2005 to +2008
while ((cur != nullptr) && !cur->OperIs(GT_ARR_ADDR))
{
    cur = cur->gtGetParent(nullptr);
}
Member Author

This whole syntactic treatment of GT_ARR_ADDR to figure out which address ranges lie inside arrays is unfortunate. It's pretty fragile, and it is why e.g. #108706 does not just work out properly. I'm not sure how to improve this much, however; maybe we can keep some breadcrumbs about known managed ranges during the cursor advancement and query them any time we need to prove something here.

Member

For advanced loop opts (parallelization, vectorization) it's not uncommon to create a "raised up" representation that describes the array element as an affine function of the IVs, so maybe we can look into something like this?
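
To make the suggestion concrete, here is a minimal C++ sketch of what such a "raised" description could look like. Everything in it is hypothetical (`AffineAccess`, its fields, `AddressAt`, `CoversIterations`); it is not an existing JIT data structure, only an illustration of describing an element address as `base + offset + scale * iv` so that range questions are answered from the description rather than from tree shape.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical raised form of an array element access:
//   address(iv) = base + offset + scale * iv
struct AffineAccess
{
    std::uintptr_t base;   // address of the array object (assumed known here)
    std::int64_t   offset; // constant byte offset of element 0
    std::int64_t   scale;  // element size in bytes
    std::int64_t   length; // number of elements in the array

    // Address touched when the primary IV has the value `iv`.
    std::uintptr_t AddressAt(std::int64_t iv) const
    {
        assert(scale > 0);
        return base + static_cast<std::uintptr_t>(offset + scale * iv);
    }

    // True if every IV value in [firstIv, lastIv] indexes a real element,
    // i.e. all derived addresses stay inside the array's element range.
    bool CoversIterations(std::int64_t firstIv, std::int64_t lastIv) const
    {
        return (0 <= firstIv) && (firstIv <= lastIv) && (lastIv < length);
    }
};
```

With a description like this, a strength-reduced cursor is just `AddressAt(firstIv)` advanced by `scale` each iteration, and the "is this address inside the array" query no longer depends on finding a particular parent node.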

Member Author

Sounds like that could be a direction to go, although I guess it would mean that we go from INDEX_ADDR to expanded nodes in morph and then back to some equivalent of INDEX_ADDR as part of these opts. But maybe that's fine (and perhaps the structure could be recovered on VNs, similar to what the comment on ParseArrayAddress suggests).

@jakobbotsch

Ping @AndyAyersMS for a re-look

CursorInfo& cursor = m_intermediateIVStores.BottomRef(i);
GenTreeLclVarCommon* store = cursor.Tree->AsLclVarCommon();
JITDUMP(" Replacing [%06u] with a zero constant\n", Compiler::dspTreeID(store->Data()));
// We cannot remove these stores entirely as that will break
Member

We could set the SSA def trees to null, as that is a valid state (see e.g. #108548), though it might require some tweaking if a lot of code assumes a null def tree means initial value.

Member Author

Or perhaps in some way remove the def from the SSA information, though I suppose that requires updating SSA numbers which requires finding all uses, so not really tractable with our representation.

This seems to work fine for now though, so I'll probably keep it as is, but it definitely falls into our SSA updating woes...
