
A few questions about VexiiRiscv L1 caches and Tilelink #31

Open
petrmikheev opened this issue Oct 24, 2024 · 35 comments

@petrmikheev

Trying to migrate my hobby project SoC from VexRiscv to VexiiRiscv. I suspect there is a bug in FetchL1Plugin:

After fetching the first few instructions, FetchL1Plugin produces unknown ('xxxx) output if the cache memory was not zero-initialized.

Simulation in iverilog:
[screenshot of the iverilog simulation showing 'xxxx values]

After a few clock cycles the 'xxxx values propagate to registers in PcPlugin and the simulation gets stuck.

But if I either set bootMemClear = true or disable the cache (fetchL1Enable = false) in ParamSimple, it works fine.
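
A minimal sketch of the two workarounds, assuming a ParamSimple-based flow like mine (both fields are named above; the exact setup code is mine, not from the repo):

val param = new ParamSimple()
param.bootMemClear = true      // zero memories at boot: no 'xxxx in iverilog, at some hardware cost
// param.fetchL1Enable = false // alternative: run without the fetch L1 cache entirely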

@Dolu1990
Member

Hi,

bootMemClear is the intended way to avoid those x-props.
Overall, as far as I have looked into the x-props happening in VexiiRiscv (with bootMemClear = false), they all look like false bugs, induced only by x-prop creating impossible cases.
In other words, x-prop creates simulation-only bugs.

All the tests I run VexiiRiscv with use Verilator, which, instead of x-prop, uses 1s and 0s and randomizes all the values at init.

I kinda don't feel great about enabling bootMemClear by default, because it consumes hardware resources and hurts timing. Maybe instead I should print a big warning message at the very end of the VexiiRiscv hardware generation to warn about it?

Trying to migrate my hobby project SoC from VexRiscv to VexiiRiscv

Nice ^^
What SoC / memory bus are you working with?

@petrmikheev
Author

In other words, x-prop creates simulation-only bugs.
All the tests I run VexiiRiscv with use Verilator, which, instead of x-prop, uses 1s and 0s and randomizes all the values at init.

Thanks! Makes perfect sense then.

Maybe instead I should print a big warning message at the very end of the VexiiRiscv hardware generation to warn about it?

It could be enough to mention it in the documentation, somewhere near the "Run a simulation" section. It would have saved me some time.

What SoC / memory bus are you working with?

Previously I used Axi4; now I am trying Tilelink.
The SpinalHDL components for Tilelink look very nice, but it is not easy to follow all the magic that happens underneath. For example, I am struggling to understand the meaning of

// from vexiiriscv/soc/litex/Soc.scala
vexii.lsuL1Bus.setDownConnection(a = withCoherency.mux(StreamPipe.HALF, StreamPipe.FULL), b = StreamPipe.HALF_KEEP, c = StreamPipe.FULL, d = StreamPipe.M2S_KEEP, e = StreamPipe.HALF)
vexii.dBus.setDownConnection(a = StreamPipe.HALF, d = StreamPipe.M2S_KEEP)

and to understand why it should be written exactly this way.

By the way, what is the best place to ask questions? Here? Or maybe in the SpinalHDL google group?

Dolu1990 added a commit to SpinalHDL/VexiiRiscv-RTD that referenced this issue Oct 25, 2024
@Dolu1990
Member

It could be enough to mention it in the documentation, somewhere near the "Run a simulation" section. It would have saved me some time.

Added :)

but it is not easy to follow all the magic that happens underneath

Yes, I really need to provide a better tutorial on how to use the tilelink API. I got a bit of funding to do it and should start very soon.

vexii.lsuL1Bus.setDownConnection(a = withCoherency.mux(StreamPipe.HALF, StreamPipe.FULL), b = StreamPipe.HALF_KEEP, c = StreamPipe.FULL, d = StreamPipe.M2S_KEEP, e = StreamPipe.HALF)

So "up" means connections toward the masters, and "down" means connections toward the slaves:
cpu -> .. -> up -> lsuL1Bus -> down -> .. -> peripherals

setDownConnection is a way to ask the interconnect to add pipelining stages between lsuL1Bus and the address decoder which serves the down connections.

a, b, c, d, e refer to the 5 tilelink channels.

StreamPipe.HALF is a lightweight pipeline stage which fully cuts all combinatorial paths.
StreamPipe.FULL is a "heavyweight" one which also cuts all combinatorial paths.
StreamPipe.M2S_KEEP is a lightweight one which cuts the valid + data paths, but not the ready path.

So all of this is about timing optimization / synthesis results.
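
In plain Stream terms, these roughly map to the stream operators from the SpinalHDL library (a hedged sketch, not the exact interconnect code):

import spinal.core._
import spinal.lib._

case class PipeDemo() extends Component {
  val io = new Bundle {
    val up   = slave(Stream(UInt(8 bits)))
    val down = master(Stream(UInt(8 bits)))
  }
  io.down << io.up.halfPipe()            // ~ StreamPipe.HALF: registers everything, at most 1 beat per 2 cycles
  // io.down << io.up.m2sPipe().s2mPipe() // ~ StreamPipe.FULL: cuts all paths, keeps full throughput
  // io.down << io.up.m2sPipe()           // ~ the M2S variants: cut valid + payload; ready stays combinatorial
}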

By the way, what is the best place to ask questions? Here? Or maybe in the SpinalHDL google group?

GitHub issues are good; this repo is good as well ^^

@petrmikheev
Author

I have a few probably stupid questions.

setDownConnection is a way to ask the interconnect to add pipelining stages between lsuL1Bus and the address decoder which serves the down connections.

Do I understand correctly that lsuL1Bus is the bus from the cache to memory, and that when the cache is enabled dBus is used only for non-cacheable regions?
I.e.

     -> fetch L1 cache -> iBus     -> memory
cpu  -> LSU L1 cache   -> lsuL1Bus -> memory
     ->                   dBus     -> IO regions

StreamPipe.HALF is a lightweight pipeline stage which fully cuts all combinatorial paths.
StreamPipe.FULL is a "heavyweight" one which also cuts all combinatorial paths.
StreamPipe.M2S_KEEP is a lightweight one which cuts the valid + data paths, but not the ready path.

In https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Libraries/stream.html it is written that after halfPipe() the bandwidth is divided by two. A half pipe on channel A would then mean that during write operations each beat of data can be sent only every second clock cycle, so the bus works at half the speed it could. It seems I am missing something... Do maybe all writes go through channel C rather than A?

Also I cannot figure out what param.lsuL1Coherency actually changes. Let's take several use cases:

  1. A Tilelink bus with 2 masters (a. a VexiiRiscv core with a cache; b. a video controller) and one slave (memory). The video controller makes only READ requests, but should trigger a writeback if the requested data is cached and modified by the core.
  2. Same as before, but with a DMA controller added to the same bus. It does both READ and WRITE operations (which should trigger writeback & invalidate in the VexiiRiscv cache), but the DMA controller doesn't have its own cache.
  3. Add to the same bus a second VexiiRiscv core that also has a cache.

Do all these use cases require lsuL1Coherency to be enabled? Or only the last one?

My current use case is the first one. I have a video controller with an Axi4ReadOnly interface and I am trying to figure out how to connect it to tilelink and make sure that the CPU will write back cached data when needed.

@petrmikheev petrmikheev changed the title FetchL1Plugin relies that cache memory is initialized A few questions about VexiiRiscv L1 caches and Tilelink Oct 28, 2024
@petrmikheev
Author

By the way, what is the best place to ask questions? Here? Or maybe in the SpinalHDL google group?

GitHub issues are good; this repo is good as well ^^

I renamed the issue to better match the content :-)

@Dolu1990
Member

Hi,

Do I understand correctly that lsuL1Bus is the bus from the cache to memory, and that when the cache is enabled dBus is used only for non-cacheable regions?

Yes

Do maybe all writes go through channel C rather than A?

Yes, right, for coherent caches. They send "acquire" requests on channel A (single beat), receive "acquire" responses on channel D (multi-beat), and "release" dirty/clean cache lines on channel C.
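
In M2sTransfers terms, a coherent L1 advertises something roughly like this (a hedged sketch assuming 64-byte lines; not the exact VexiiRiscv parameters):

val coherentTransfers = tilelink.M2sTransfers(
  acquireT = tilelink.SizeRange(64), // acquire a line with write permission (channel A)
  acquireB = tilelink.SizeRange(64)  // acquire a line in shared / read-only state (channel A)
)
// responses arrive on channel D; dirty/clean lines are released on channel C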

A Tilelink bus with 2 masters (a. a VexiiRiscv core with a cache; b. a video controller) and one slave (memory). The video controller makes only READ requests, but should trigger a writeback if the requested data is cached and modified by the core.

Right

Same as before, but with a DMA controller added to the same bus. It does both READ and WRITE operations (which should trigger writeback & invalidate in the VexiiRiscv cache), but the DMA controller doesn't have its own cache.

Right

Do all these use cases require lsuL1Coherency to be enabled? Or only the last one?

All cases. As soon as you have multiple masters (CPU, DMA) and a given CPU has a data cache, that CPU needs lsuL1Coherency (or it will need to do cache flushes/invalidates in software).

My current use case is the first one. I have a video controller with an Axi4ReadOnly interface and I am trying to figure out how to connect it to tilelink and make sure that the CPU will write back cached data when needed.

One reference for that is the litex Soc.scala.

What matters is that the CPU / DMA buses are merged together and then go either into a CacheFiber (L2) or into a HubFiber (no L2).
That CacheFiber / HubFiber will handle all the coherency channels (a, b, c, d, e) and provide to the memory system (DDR) only the regular channels to do get/put (a, d).

@petrmikheev
Author

Thanks for the answers!

(or it will need to do cache flush/invalidate in software)

By the way, does VexiiRiscv support any cache flush instructions? I've seen only fence.i and sfence.vma, but they are not about flushing particular cache lines.

And how does the fetch L1 interact with the LSU L1 if lsuL1Coherency is disabled? Does fence.i flush the whole LSU L1 to ensure that iBus will see correct data?

What matters is that the CPU / DMA buses are merged together and then go either into a CacheFiber (L2) or into a HubFiber (no L2).

Ah! That's what I missed!
If the DMA uses only the A and D channels and only simple get/put requests, will the CacheFiber/HubFiber automatically generate flush/invalidate requests to the CPU?

@Dolu1990
Member

By the way, does VexiiRiscv support any cache flush instructions? I've seen only fence.i and sfence.vma, but they are not about flushing particular cache lines.

Currently, there is nothing to flush particular cache lines. Ideally, CMO would need to be implemented to allow using VexiiRiscv without memory coherency.
Todo ^^

Does fence.i flush the whole LSU L1 to ensure that iBus will see correct data?

Yes

If the DMA uses only the A and D channels and only simple get/put requests, will the CacheFiber/HubFiber automatically generate flush/invalidate requests to the CPU?

Yes

@Dolu1990
Member

The HubFiber will generate flush / invalidate requests for every request, while the CacheFiber has a dictionary which tracks which cache has what, and so will only generate useful flush / invalidate requests.
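
For comparison, swapping the hub for an L2 looks roughly like this (a hedged sketch; the sizing fields are assumptions modelled on the litex Soc.scala):

val l2 = new tilelink.coherent.CacheFiber()
l2.parameter.cacheBytes = 128 * 1024 // assumed knobs; check litex Soc.scala for the real names
l2.parameter.cacheWays = 8
l2.up << coherent_bus
mem_bus << l2.down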

@petrmikheev
Author

Ideally, CMO would need to be implemented to allow using VexiiRiscv without memory coherency.
Todo ^^

CMO would be great! I hope it will also make it possible to avoid useless read-before-write requests to memory - for example when memcpy is going to fill a whole cache line with new data, but the CPU reads this cache line from memory first.

The HubFiber will generate flush / invalidate requests for every request, while the CacheFiber has a dictionary which tracks which cache has what, and so will only generate useful flush / invalidate requests.

Interesting. Does it mean that if I use the HubFiber, it can make sense to connect iBus around it to reduce the number of invalidate requests?

                DMA  ------------------> TilelinkHub -\
    / ->  lsu L1 cache   -> lsuL1Bus -/                >--- memory
CPU | ->  fetch L1 cache -> iBus ---------------------/
    \ --------------------> dBus ----> IO

Or does fence.i flush the whole LSU L1 only if coherency is off?

@Dolu1990
Member

I hope it will also make it possible to avoid useless read-before-write requests to memory - for example when memcpy is going to fill a whole cache line with new data, but the CPU reads this cache line from memory first.

Ahhh, that is because right now the cache is implemented as "write-back".
So a cache line will always be read first, even if the CPU only writes into it. That read won't go away.

Write-through caches don't have that issue, but instead they spam the memory system with small write requests.

Interesting. Does it mean that if I use the HubFiber, it can make sense to connect iBus around it to reduce the number of invalidate requests?

Hmmm, if you have multiple CPUs, you would then need the other CPUs to manually flush themselves if they modify code sections.
But otherwise, yes.

fence.i always flushes the whole D$ + I$ in both cases.
Fundamentally, when coherency is on, it wouldn't be necessary to flush the D$.

@petrmikheev
Author

So a cache line will always be read first, even if the CPU only writes into it. That read won't go away.

I mean that there could be a special instruction that makes the L1 cache think that the cache line is already loaded, without an actual read. I've looked through the CMO spec and see that it doesn't have one. But it could be a custom instruction.

char *src = ..., *dst = ...;
for (int i = 0; i < size / 64; ++i) {
  fakeReadCacheLine(dst);  // custom instruction: allocate the line in D$ without fetching it from memory
  // then overwrite the whole 64-byte line, one 32-bit word at a time
  for (int j = 0; j < 64; j += 4) *(int*)(dst + j) = *(int*)(src + j);
  src += 64, dst += 64;
}

@Dolu1990
Member

Ahhhhh lol, I didn't know about that.
Yes, there could be a custom instruction for that.

@petrmikheev
Author

petrmikheev commented Oct 31, 2024

I am stuck with this. Could you please help me figure out why it doesn't work?

I want to connect a blackbox to a tilelink bus. It will use only 256-byte read requests.

class VideoController extends BlackBox {
  val io = new Bundle {
    ...
    val tl_bus = master(tilelink.Bus(tilelink.BusParameter(
      addressWidth = 32,
      dataWidth    = 64,
      sizeBytes    = 256,
      sourceWidth  = 0,
      sinkWidth    = 0,
      withBCE      = false, // no B/C/E channels: this master doesn't participate in coherency
      withDataA    = false,
      withDataB    = false,
      withDataC    = false,
      withDataD    = true,
      node         = null
    )))
  }
  noIoPrefix()
  mapClockDomain(clock=io.clk, reset=io.reset)
}
  val coherent_bus = tilelink.fabric.Node().forceDataWidth(64)
  val mem_bus = tilelink.fabric.Node().forceDataWidth(64)

  val tilelink_hub = new tilelink.coherent.HubFiber()
  tilelink_hub.up << coherent_bus
  mem_bus << tilelink_hub.down

  val video_ctrl = new VideoController()
  val video_bus = tilelink.fabric.Node.down()
  val video_bus_fiber = fiber.Fiber build new Area {
    video_bus.m2s forceParameters tilelink.M2sParameters(
      addressWidth = 32,
      dataWidth = 64,
      masters = List(
        tilelink.M2sAgent(
          name = video_ctrl,
          mapping = List(
            tilelink.M2sSource(
              id = SizeMapping(0, 1),
              emits = tilelink.M2sTransfers(get = tilelink.SizeRange(256, 256))
            )
          )
        )
      )
    )
    video_bus.s2m.supported load tilelink.S2mSupport.none()
    video_bus.bus << video_ctrl.io.tl_bus
  }
  coherent_bus << video_bus  // this line causes error

The error (the first one; there are many of them) is:

[error]  Error detected in phase PhaseCreateComponent
[error] ********************************************************************************
[error] ********************************************************************************
[error] 
[error] Bundle assignment is not complete. toplevel/video_bus_noDecoder_toDown_d_payload : ChannelD need 'data' but toplevel/video_bus_to_coherent_bus_up_bus_d_payload : ChannelD doesn't provide it.
[error]     spinal.lib.Stream.connectFrom(Stream.scala:317)
[error]     spinal.lib.Stream.$less$less(Stream.scala:117)
[error]     spinal.lib.Stream.$greater$greater(Stream.scala:122)
[error]     spinal.lib.bus.tilelink.Bus.$less$less(Bus.scala:418)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2$$anon$6.<init>(Node.scala:213)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2.$anonfun$noDecoder$1(Node.scala:207)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2.<init>(Node.scala:207)
[error]     spinal.lib.bus.tilelink.fabric.Node.$anonfun$thread$1(Node.scala:77)

I used https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Libraries/Bus/tilelink/tilelink_fabric.html#example-cpufiber as an example.

@Dolu1990
Member

Ahhhh, I think the reason is that the memory coherency hub / L2 can't handle memory requests bigger than one cache line. So no memory request larger than 64 bytes is allowed.

So either:

  • you need to use tilelink.SizeRange(64) (see the snippet below), or
  • a bridge would need to be designed to split 256-byte memory accesses into 64-byte ones. But overall, I think it is really best to stick to "no more than 64 bytes"; that allows streamlining a lot of things.
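
For the first option, the change in the M2sSource from your snippet would be just:

emits = tilelink.M2sTransfers(get = tilelink.SizeRange(64)) // one 64-byte cache line per Get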

@petrmikheev
Author

Yes, I changed it to 64 and now it successfully generates verilog! Thanks!

By the way, would it be possible to generate a more explanatory error message in such a situation? I have no idea how I would have debugged it without your help.

@Dolu1990
Member

I will take a look tomorrow ^^

The situation right now is that the Hub filters memory accesses to only allow the compatible ones, which creates a video DMA with no memory access at all => no read / no write => withDataD = false.

@petrmikheev
Author

A bridge would need to be designed to split 256-byte memory accesses into 64-byte ones. But overall, I think it is really best to stick to "no more than 64 bytes"; that allows streamlining a lot of things.

The only reason to use big requests was that memory throughput is a bit higher that way. But it is not that significant.

@Dolu1990
Member

One more thing: I recently updated the documentation, in particular around the MicroSoc. You may find interesting things there, in particular:
https://spinalhdl.github.io/VexiiRiscv-RTD/master/VexiiRiscv/Soc/microsoc.html#adding-a-custom-peripheral

I recently added a tilelink video DMA, see :
https://github.com/SpinalHDL/VexiiRiscv/blob/dev/src/main/scala/vexiiriscv/soc/litex/Soc.scala#L152

Good luck :D

@petrmikheev
Author

petrmikheev commented Oct 31, 2024

Thanks!

I used both MicroSoc and litex SoC as examples.

Things like demo.node at 0x10003000 of bus32 look like magic. It is really nice and convenient. But when I do something wrong and the magic breaks, it is sometimes hard to dig into the details. So for now my preferred approach is to write custom modules in verilog and connect them together using SpinalHDL.

@Dolu1990
Member

But when I do something wrong and the magic breaks, it is sometimes hard to dig into the details

Yes, I agree. Error reporting isn't great when it gets into the Fiber / Tilelink stuff.

For the raw SpinalHDL things using components and so on, it should be good :D

@petrmikheev
Author

fence.i always flushes the whole D$ + I$ in both cases.

Actually fence.i doesn't flush the D$, at least when coherency is enabled. I checked experimentally. So I have to connect iBus to the tilelink hub rather than directly to memory.

@Dolu1990
Member

Dolu1990 commented Nov 1, 2024

Ahhh right, I just checked the code again: when coherency is enabled, the LsuPlugin will instead just wait until the store buffer is drained.

    // fence.i with coherency on: the invalidation is accepted once the store buffer is empty
    val coherentFlusher = l1.coherency.get generate new Area{
      invalidationPorts.foreach(_.cmd.ready := withStoreBuffer.mux(storeBuffer.empty, True))
    }

@petrmikheev
Author

Hi! Sometimes I get deadlocks somewhere in my SoC. I suppose they are caused by attempts to read or write invalid memory regions that are not connected to any slave interface. The problem is that I don't have a way to debug it: in my setup there is no integration with standard tools and no easy way to use things like a logic analyzer over JTAG. And the scenarios are too complicated for simulation (e.g. booting linux and running gcc on the device).

Is there a way to add to the tilelink hub a default slave interface that will be used if an address is not mapped to anything else? I want to add a module that catches all invalid requests and sends debug information via UART.

@Dolu1990
Member

Hi,

Yes, it is by design. The memory masters have ways to figure out the exact memory mapping at hardware elaboration.
So the idea is that the masters check that in hardware and only issue legal memory requests.

So, what is done for DMAs which don't integrate those checks is to directly add a spinal.lib.bus.tilelink.fabric.TransferFilter on their DMA bus, as:

val filter = new fabric.TransferFilter()

Is there a way to add to the tilelink hub a default slave interface that will be used if an address is not mapped to anything else?

Hmm, that TransferFilter behaves like a default slave. The difference is that it intercepts the traffic, like a bridge (instead of behaving like a regular peripheral).
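
Hedged wiring sketch (up/down as on the other fabric bridges; dmaNode is a placeholder for your DMA's node):

val filter = new tilelink.fabric.TransferFilter()
filter.up << dmaNode          // hypothetical DMA-side node
coherent_bus << filter.down   // unmapped / unsupported accesses get an error response instead of hanging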

@petrmikheev
Author

It seems that this time I faced an actual bug in VexiiRiscv.

param.lanes = 2 causes a deadlock in some cases.

Unfortunately I can't provide a simple reproducing example. The deadlock happens when I run gcc on the device (compiling a trivial hello world). I can't simulate it in iverilog - it would take days of single-threaded simulation. However, with param.lanes = 1 everything works fine, and with param.lanes = 2 there is always a deadlock when I run gcc (though linux boots and some other programs I tried work fine).

VexiiParams: https://github.com/petrmikheev/endeavour/blob/master/rtl/spinal/src/main/scala/endeavour/VexiiCore.scala#L40
The script I use to compile rv32 gcc: https://github.com/petrmikheev/endeavour/blob/master/scripts/build_toolchain.sh

@Dolu1990
Member

Ahhh,
I will try dual issue on hardware, running debian, to see how well it goes and whether it crashes. That should be a good test case :)
Hopefully I can reproduce it.

How does it freeze - I mean, can you reproduce it easily? Does it always deadlock in the same app?

@Dolu1990
Member

Another question: are all the timing paths passing place / route with positive slack?

@petrmikheev
Author

How does it freeze - I mean, can you reproduce it easily? Does it always deadlock in the same app?

Yes, it always deadlocks at the same place on every attempt. And it is definitely not just an infinite loop in software - it stops reacting to timer interrupts.

Another question: are all the timing paths passing place / route with positive slack?

The timing analyzer reports that it should work fine even at 85°C.

@Dolu1990
Member

In particular, which app is freezing?

@Dolu1990
Member

Can you send me your linux / buildroot binaries?
Sometimes bugs only appear with specific binary versions.

@petrmikheev
Author

In particular, which app is freezing?

gcc run from the console!
In particular, the binary /usr/libexec/gcc/riscv32-unknown-linux-gnu/13.2.0/cc1.
I built it using https://github.com/riscv/riscv-gnu-toolchain (commit 6d7b5b7).

Can you send me your linux / buildroot binaries ?

I don't use buildroot... And an attempt to reproduce my setup in all its details would require soldering.

The goal of my project was to go the whole way on my own: design an FPGA board, solder it, design an SoC capable of running linux, write linux drivers for my custom peripherals, implement SBI and a bootloader, manually bring together the linux kernel and all the software (to better understand how linux works), and finally be able to write and compile code directly on the device.

And (if param.lanes = 1) it works!
[photo of the working setup]

But reproducing the bug on your side could be a big problem. My only hope is that if you use the same VexiiRiscv params, run 32-bit debian, and try to compile a hello world with gcc, the freeze will happen too.

@Dolu1990
Member

I tried some configs running 64-bit debian:

--mmu-sync-read  --with-mul --with-div --allow-bypass-from=0 --performance-counters=0 --fetch-l1 --fetch-l1-ways=2 --lsu-l1 --lsu-l1-ways=2  --with-lsu-bypass --relaxed-branch --with-rva --with-supervisor --fetch-l1-ways=4 --fetch-l1-mem-data-width-min=64 --lsu-l1-ways=4 --lsu-l1-mem-data-width-min=64 --xlen=64 --with-rvc --with-rvf --with-rvd --fma-reduced-accuracy --fpu-ignore-subnormal --with-btb --with-ras --with-gshare --fetch-l1-hardware-prefetch=nl --fetch-l1-refill-count=3 --fetch-l1-mem-data-width-min=128 --lsu-l1-mem-data-width-min=128 --lsu-software-prefetch --lsu-hardware-prefetch rpt --performance-counters 9 --lsu-l1-store-buffer-ops=32 --lsu-l1-refill-count 4 --lsu-l1-writeback-count 4 --lsu-l1-store-buffer-slots=4 --lanes=2 --decoders=2 --with-late-alu --relaxed-div 

So, mainly: dual issue, with the late ALU, and a few other things.
So far everything seems to work well. But I forgot to try with --with-dispatcher-buffer.
Synthesis in progress XD

@Dolu1990
Member

One thing I can see on your side is that you only have 1 way of 4 KB for each I$ and D$.
This may lead to some deadlocks when you have the store buffer enabled (it seems that isn't your case).
But still, can you try with 4 ways for both the I$ and D$?

@Dolu1990
Member

Also, which version of linux do you use?
