-
It should be simple enough to just change the check:

```cpp
void drcbe_arm64::call_arm_addr(a64::Assembler &a, const void *offs) const
{
	// Displacement of the target from the current code position
	const uint64_t codeoffs = a.code()->baseAddress() + a.offset();
	const int64_t reloffs = (int64_t)offs - codeoffs;
	if (is_valid_immediate_signed(reloffs, 26 + 2))
	{
		// Near enough for a direct BL
		a.bl(offs);
	}
	else
	{
		// Out of range: materialise the address and call through a register
		get_imm_relative(a, SCRATCH_REG1, uintptr_t(offs));
		a.blr(SCRATCH_REG1);
	}
}
```
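(BL encodes a 26-bit signed word offset, which is why the byte displacement is checked against 26 + 2 bits; that gives direct calls a reach of ±128 MiB from the call site.)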
-
```cpp
template <unsigned Shift>
void drcbe_arm64::emit_ldr_str_base_mem(a64::Assembler &a, a64::Inst::Id opcode, const a64::Reg &reg, const void *ptr) const
{
	// If it can fit as a constant offset from the base register
	const int64_t diff = (int64_t)ptr - (int64_t)m_baseptr;
	const int64_t size = int64_t(1) << Shift; // access size in bytes implied by Shift
	if (is_valid_immediate_signed(diff / size, 9 + Shift))
	{
		a.emit(opcode, reg, arm::Mem(BASE_REG, diff));
		return;
	}

	// If it can fit as an offset relative to PC
	const uint64_t codeoffs = a.code()->baseAddress() + a.offset();
	const int64_t reloffs = (int64_t)ptr - codeoffs;
	if (is_valid_immediate_signed(reloffs, 21))
	{
		a.adr(MEM_SCRATCH_REG, ptr);
		a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
		return;
	}

	// If the displacement fits in a 16-bit register offset
	if (diff > 0 && is_valid_immediate(diff, 16))
	{
		a.mov(MEM_SCRATCH_REG, diff);
		a.emit(opcode, reg, arm::Mem(BASE_REG, MEM_SCRATCH_REG));
		return;
	}

	// Try to materialise base plus/minus displacement in a couple of instructions
	if (diff > 0 && emit_add_optimized(a, MEM_SCRATCH_REG, BASE_REG, diff))
	{
		a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
		return;
	}
	else if (diff < 0 && emit_sub_optimized(a, MEM_SCRATCH_REG, BASE_REG, diff))
	{
		a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
		return;
	}

	// Try a scaled register offset: LDR/STR reg, [BASE_REG, Xm, LSL #shift]
	if (diff >= 0)
	{
		int shift = 0;
		int max_shift = 0;

		// Maximum scale the addressing mode allows for this access size
		if (opcode == a64::Inst::kIdLdrb || opcode == a64::Inst::kIdLdrsb)
			max_shift = 0;
		else if (opcode == a64::Inst::kIdLdrh || opcode == a64::Inst::kIdLdrsh)
			max_shift = 1;
		else if (opcode == a64::Inst::kIdLdrsw)
			max_shift = 2;
		else
			max_shift = (reg.isGpW() || reg.isVecS()) ? 2 : 3;

		// Find the lowest set bit of the target address to see how far the
		// displacement can be shifted down
		for (int i = 0; i < 64 && max_shift > 0; i++)
		{
			if ((uint64_t)ptr & ((uint64_t)(1) << i))
			{
				shift = i;
				break;
			}
		}

		if (shift > max_shift)
			shift = max_shift;

		if (is_valid_immediate(diff >> shift, 32))
		{
			a.mov(MEM_SCRATCH_REG, diff >> shift);
			if (shift)
				a.emit(opcode, reg, arm::Mem(BASE_REG, MEM_SCRATCH_REG, arm::Shift(arm::ShiftOp::kLSL, shift)));
			else
				a.emit(opcode, reg, arm::Mem(BASE_REG, MEM_SCRATCH_REG));
			return;
		}
	}

	// Try ADRP to the target's page plus a page offset
	const uint64_t pagebase = codeoffs & ~util::make_bitmask<uint64_t>(12);
	const int64_t pagerel = (int64_t)ptr - pagebase;
	if (is_valid_immediate_signed(pagerel, 33))
	{
		const uint64_t targetpage = (uint64_t)ptr & ~util::make_bitmask<uint64_t>(12);
		const uint64_t pageoffs = (uint64_t)ptr & util::make_bitmask<uint64_t>(12);

		a.adrp(MEM_SCRATCH_REG, targetpage);
		if (is_valid_immediate_signed(pageoffs, 9 + Shift))
		{
			a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG, pageoffs));
		}
		else
		{
			a.add(MEM_SCRATCH_REG, MEM_SCRATCH_REG, pageoffs);
			a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
		}
		return;
	}

	// Can't optimize it at all; most likely becomes four MOV instructions
	a.mov(MEM_SCRATCH_REG, ptr);
	a.emit(opcode, reg, arm::Mem(MEM_SCRATCH_REG));
}
```

Then update the
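For reference, both snippets rely on small range-check helpers along these lines (a minimal sketch; the actual helpers in MAME may differ in detail):

```cpp
#include <cstdint>

// Does value fit in a signed immediate field of the given width?
inline bool is_valid_immediate_signed(int64_t value, unsigned bits)
{
	return value >= -(int64_t(1) << (bits - 1)) && value < (int64_t(1) << (bits - 1));
}

// Does value fit in an unsigned immediate field of the given width?
inline bool is_valid_immediate(uint64_t value, unsigned bits)
{
	return value < (uint64_t(1) << bits);
}
```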
-
These kinds of masks can be generated with PowerPC.
-
It should be possible to implement the equivalent of memory_access_specific directly in the generated code (it's a mask, a shift, a lookup of an object pointer in an array, and a call to a virtual method of that object). The gain should be significant.
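Written out in C++, the sequence being described looks roughly like this (a hypothetical sketch; the names and types only illustrate the shape of the access, not MAME's exact memory_access_specific API):

```cpp
#include <cstdint>

// Stand-in for MAME's handler objects: one virtual call performs the access.
struct handler_entry
{
	virtual ~handler_entry() = default;
	virtual uint16_t read(uint32_t address, uint16_t mem_mask) = 0;
};

struct specific_access
{
	handler_entry *const *dispatch; // table indexed by the high address bits
	uint32_t mask;                  // selects the address bits used for the lookup
	int shift;                      // scales the masked address down to a table index

	uint16_t read_word(uint32_t address, uint16_t mem_mask = 0xffff) const
	{
		// Mask, shift, table lookup, virtual call: the four steps the DRC
		// back-end would emit inline instead of calling out to a C++ helper.
		return dispatch[(address & mask) >> shift]->read(address, mem_mask);
	}
};
```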
-
We don’t want to mess with the DRC back-ends too much before release now that they seem to be working. This discussion is to keep track of optimisations to look at later.