Software and compilers for RISC-V

What would have helped me if somebody had told me right from the beginning.

The RISC-V gcc compilers

RISC-V comes in different 'flavors'. The starting point is the base integer instruction set (RV32I), which has optional extensions: multiplication and division (M), single-precision and double-precision floating point (F, D), and atomic instructions (A).

The compiler to use depends of course on which instructions the processor supports. (Note: it is also possible to emulate instructions by catching illegal instructions in an exception handler and doing the instruction's work in software, but exceptions are not supported yet by FemtoRV32.) In our case we need a compiler targeting RV32I or RV32IM, depending on what is configured in RTL/femtosoc_config.v (on the IceStick it cannot be anything other than RV32I). In the future, I plan to implement support for F, and maybe D.

There is something else to know about RISC-V gcc compilers (it took me some time to understand). Besides the targeted instruction set, gcc also exists in two different flavors, depending on whether it targets a Linux system (then gcc is called riscv64-linux-gnu-gcc) or a microcontroller (riscv64-unknown-elf-gcc). All the executables of the toolchain start with the same prefix. Note that the prefix starts with riscv64, but the toolchain supports both 32-bit RISC-V targets (which we are using) and 64-bit ones.

In our case we need a compiler for microcontrollers (FemtoRV32 is not Linux-capable yet). The packages generally available in Linux distributions target a Linux system, so we cannot use them directly (they can still be used for bare metal firmware, but not for generating FemtoRV32-compatible ELF, more on this below). It is also a good thing to have all the possible combinations of RV32I(M)(A)(F)(D) installed, if you want to play with different core configurations and implement an FPU in Verilog or in an exception handler.

Some precompiled toolchains are available on SiFive's website, here. They can be used directly under Linux or Windows 10/WSL. FemtoRV32's Makefile (FIRMWARE/makefile.inc) downloads one automatically. It is a big package (300 MB or so), but then you have all possible combinations of instruction sets and, more importantly, the associated libraries. This matters because, for instance, if you use RV32I, then you do not have hardware multiplication. You need both a function for that (link with the right gcc library) and you need to tell the compiler to generate a call to that function instead of a MUL instruction. The same applies to floating point operations, whose software implementation can either use the integer multiplication instruction or call its software version. There are many combinations. This is what you choose at the beginning of FIRMWARE/makefile.inc, by setting ARCH (to rv32i or rv32im for now, but everything is ready for all the other extensions, such as c, a, f...).
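To check which choice the compiler was actually given, one option is to look at the preprocessor macros that RISC-V gcc predefines for the selected architecture. The sketch below is only an illustration (the function name mul_support is made up for this page, it is not part of the FemtoRV32 sources); it relies on __riscv and __riscv_mul, which RISC-V gcc defines when targeting RISC-V and when the M extension is enabled, respectively.

```c
/* Reports how integer multiplication will be compiled for the current target.
 * mul_support() is a hypothetical helper, not part of FemtoRV32. */
const char* mul_support(void) {
#if !defined(__riscv)
    /* Compiled with a host compiler, not a RISC-V toolchain. */
    return "not a RISC-V build";
#elif defined(__riscv_mul)
    /* e.g. ARCH=rv32im: a*b becomes a MUL instruction. */
    return "hardware MUL (M extension)";
#else
    /* e.g. ARCH=rv32i: a*b becomes a call to __mulsi3 in libgcc. */
    return "software multiply (__mulsi3 from libgcc)";
#endif
}
```

Printing the returned string from a test program built with ARCH=rv32i and then with ARCH=rv32im is a quick way to confirm that the setting in FIRMWARE/makefile.inc reached the compiler.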

If you want to know more / to have an even more baremetal approach:

It is also possible to use gcc only to produce object .o files and then use ld to link with a library that implements the needed functions (instead of using the standard C runtime). If you do that, it will work even if your compiler targets a Linux runtime (because you do not use the C libraries). This is what I was doing at the beginning, but I do not recommend it, because everything gets painful (you need to reinvent the wheel, reimplement memcpy() etc., and there is no floating point). If you really want to do that, take a look at FIRMWARE/LIBFEMTOC/Makefile and uncomment MISSING_OBJECTS and MISSING_OBJECTS_WITH_DIR. If you want to learn more about how this works and how to find these functions, the sources for the toolchain and libraries are here. I found the sources for the multiplication function here: riscv-gcc/libgcc/config/riscv/muldi.S. Note that it is the same function for 64-bit (muldi) and 32-bit (mulsi) multiplication, with a macro to select the right one. Division and modulo are in the same directory. Another interesting function you will find there is the C runtime startup function: riscv-newlib/libgloss/riscv/crt0.S. It is good to read it, because when we run ELF executables, we will know what the C runtime expects from us.
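To give an idea of what a software multiplication routine has to do on a core without the M extension, here is the classic shift-and-add algorithm written in C. This is only a sketch of the general technique: it is not a transcription of muldi.S (which is written in assembly), and the name soft_mul32 is made up for this example.

```c
#include <stdint.h>

/* Multiplies two 32-bit integers using only shifts, adds and tests,
 * i.e. operations available on plain RV32I. The result wraps modulo 2^32,
 * like the C '*' operator on uint32_t. */
uint32_t soft_mul32(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;   /* this bit of b contributes a * 2^k */
        }
        a <<= 1;           /* next power of two */
        b >>= 1;           /* next bit of b */
    }
    return result;
}
```

The loop runs once per significant bit of b, so a multiplication costs up to 32 iterations. This is why the choice between rv32i and rv32im has a visible impact on programs that multiply a lot.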

Producing RISC-V executables: bare metal ELF and standard ELF

By default, gcc produces executables in the ELF format (Executable and Linkable Format). Now we want to convert it into something we can load into our RISC-V processor on the FPGA. There are three different ways of doing that (and we will look at each of them):

  1. convert the ELF executable into FIRMWARE/firmware.hex, an ASCII file in hexadecimal that can be directly loaded by Verilog's $readmemh() function to initialize the RAM, as done in RTL/femtosoc.v. This solution is used by the smaller devices (IceStick) that do not have an SDCard reader or do not have enough RAM.

  2. use the ELF executable as is, copied to an SDCard, and start the program using a minimalistic/crappy OS (FemtOS). Note that this solution requires the OS to be preloaded in the RAM (this is why we need solution 1 as well).

  3. copy the executable to the device's SPI Flash. FPGA boards often have a tiny 8-legged chip used to store the configuration of the FPGA. It is flash memory, with a few megabytes of capacity, that is accessed through a serial protocol. The configuration of the FPGA only takes a few tens of kilobytes. There is a lot of space left that we can use to store code, and FemtoRV32 can execute it directly from there! This requires a special linker script and C startup function.

Bare metal ELF

Let us start by taking a deeper look at solution 1). There is something important that we need to deal with: by default, gcc produces an executable with a certain memory map, which does not necessarily fit our needs. For instance, the text segment starts at address 0x10000 (= 64 KB). We do not even have this quantity of RAM on the IceStick. Note that it would be possible to wire the address decoder in femtosoc.v in such a way that RAM artificially starts at this address, but it is also possible to tell gcc to use a different memory map. To do so, I am using a linker script, FIRMWARE/CRT_BAREMETAL/femtorv32.ld, with the following contents:

MEMORY
{
   BRAM (RWX) : ORIGIN = 0x0000, LENGTH = 0x40000
}
SECTIONS
{
    .text :
    {
        crt0.o (.text) 
        *(.text)
    }
}

Disclaimer: I do not fully understand what I'm doing here. Linker scripts seem to be a scientific discipline of their own, but at least what I've done here seems to fit my needs!

In the same directory, there is also crt0.S, my function to initialize the C runtime, that replaces gcc's default one:

.include "femtorv32.inc"

.text
.global _start
.type _start, @function

_start:
.option push
.option norelax
     li gp,IO_BASE       #   Base address of memory-mapped IO
.option pop

     lw sp,IO_RAM(gp)      # Read RAM size in hw config register and
        #   initialize SP one position past end of RAM
	
     # Should find a way of clearing BSS here...
	
     call main
     tail exit

It does different things:

  • it loads the IO base address in the global pointer gp, so that reading and writing to/from memory-mapped peripherals can be done in one instruction.
  • it initializes the stack pointer sp at the end of the RAM. The end of the RAM is queried from a memory-mapped hardware configuration register (see RTL/DEVICES/HardwareConfig.v).
  • it calls main
  • and finally it calls exit
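The startup code above leaves clearing BSS as a TODO. One possible way of doing it, sketched here in C rather than assembly, is a simple word-by-word loop between two addresses. This assumes the linker script exports two symbols marking the beginning and the end of the .bss area; the names __bss_start and __bss_end used in the comment are hypothetical (the femtorv32.ld shown above does not define them), and zero_words is a made-up name.

```c
#include <stdint.h>

/* Clears the 32-bit words in the half-open range [begin, end).
 * On the target, one would call it before main(), for instance:
 *   extern uint32_t __bss_start, __bss_end;   (hypothetical linker symbols)
 *   zero_words(&__bss_start, &__bss_end);
 */
void zero_words(uint32_t* begin, uint32_t* end) {
    while (begin < end) {
        *begin++ = 0;
    }
}
```

Both addresses need to be aligned on 32-bit words for this to work, which is something the linker script would have to guarantee.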

For our bare metal scenario, it is absolutely necessary to replace the default crt0.S, because the default one does not initialize the stack pointer. If you don't believe me and you have installed the sources of the RISC-V toolchain, you can take a look at riscv-newlib/libgloss/riscv/crt0.S. The reason is that initializing the stack pointer is the job of the OS (and we are writing sort of an OS, so it is our job!).

Then, in FIRMWARE/EXAMPLES/Makefile I'm generating the executable as follows:

%.bm_elf: %.o $(RV_BINARIES)
	$(RVLD) $(RVLDFLAGS) -T$(FIRMWARE_DIR)/CRT_BAREMETAL/femtorv32.ld $< -o $@ \
	-L$(FIRMWARE_DIR)/CRT_BAREMETAL -L$(FIRMWARE_DIR)/LIBFEMTORV32 -L$(FIRMWARE_DIR)/LIBFEMTOC \
	-lfemtorv32 -lfemtoc $(RVGCC_LIB)

All the used macros are defined in FIRMWARE/makefile.inc. More explanations below:

  • I call it bm_elf for bare metal ELF, to distinguish it from standard executables (that will come later).
  • The dependency on $(RV_BINARIES) is used to automatically download the precompiled toolchain from SiFive.
  • RVLDFLAGS=-m elf32lriscv -b elf32-littleriscv --no-relax makes sure the right ELF format will be used. The --no-relax flag makes sure that the global pointer gp, which I'm using to store the mapped IO page for faster IO, will not be used for something else (with relaxation enabled, the linker may use gp to address global data in one instruction rather than two).
  • The -T option specifies a linker script.
  • Important and tricky: note that ld automatically links any crt0.o file it finds in its linker path. Since we have included -L$(FIRMWARE_DIR)/CRT_BAREMETAL in the link path, it will do so! (I figured this out by linking crt0.o manually: ld then complained about duplicated symbols.)
  • Then I'm linking my FemtoRV32 support library (-lfemtorv32) and my femtolibc (-lfemtoc), which has some libc replacement functions (printf is a big beast, mine is much smaller, though it does not have all the functionalities).
  • Finally, RVGCC_LIB, also defined in FIRMWARE/makefile.inc, points to libc.a, libm.a and libgcc.a (libgcc.a has the integer multiplication / division and floating point functions) for the specified architecture (RV32I or RV32IM). This is why it is good to have the complete toolchain for embedded systems. If you don't have it, you can comment out the definition of RVGCC_LIB in FIRMWARE/makefile.inc, then edit FIRMWARE/LIBFEMTOC/Makefile and uncomment MISSING_OBJECTS and MISSING_OBJECTS_WITH_DIR. This will compile what's necessary for most included demos (except the floating point ones).

OK, so at this point we are able to produce an ELF from a C program, that will be placed at address 0x00000000. Now we need to generate from it an ASCII hex file that can be understood by Verilog's $readmemh() function. There is an objcopy command that can help:

$ riscv64-unknown-elf-objcopy -O verilog firmware.bm_elf firmware.hex    

... unfortunately, we are not there yet, because it does not format the output exactly in the way Verilog expects it (or at least I did not find any way of doing that). There are several ways of fixing that. For instance, Claire Wolf (picorv32 author) uses a Python script. I decided to write a small C++ program called firmware_words that does the job. In addition, it checks that everything fits in the memory declared in RTL/femtosoc_config.v. Then, the generated file FIRMWARE/firmware.hex is used to initialize the RAM in RTL/femtosoc.v, using the $readmemh() Verilog command.
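To make the target format concrete: $readmemh() reads whitespace-separated hexadecimal numbers and stores each of them in the next memory word. The sketch below formats a little-endian memory image as one 32-bit word per line. It is not the source of firmware_words, only an illustration of the idea, assuming the RAM is declared as an array of 32-bit words; image_to_hex is a made-up name.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats a little-endian memory image as text, one 32-bit word per line
 * (8 hex digits + newline). Returns the number of characters written,
 * or -1 if nbytes is not a multiple of 4 or if out is too small. */
int image_to_hex(const uint8_t* image, size_t nbytes,
                 char* out, size_t out_size) {
    size_t pos = 0;
    if (out_size > 0) {
        out[0] = '\0';
    }
    if (nbytes % 4 != 0) {
        return -1;
    }
    for (size_t i = 0; i < nbytes; i += 4) {
        /* Assemble the word from its four bytes, least significant first. */
        uint32_t word = (uint32_t)image[i]
                      | ((uint32_t)image[i + 1] << 8)
                      | ((uint32_t)image[i + 2] << 16)
                      | ((uint32_t)image[i + 3] << 24);
        if (out_size - pos < 10) {   /* 8 digits + '\n' + '\0' */
            return -1;
        }
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                "%08" PRIx32 "\n", word);
    }
    return (int)pos;
}
```

For instance, the four bytes 93 00 10 00 (the encoding of addi x1, x0, 1) come out as the line 00100093.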

The Makefiles in the FIRMWARE subdirectories do all these steps. They are used as follows:

$ cd FIRMWARE/EXAMPLES
$ make xxx.hex
$ cd ..

(where xxx is the name of the program you want to compile). This updates FIRMWARE/firmware.hex; then you are ready to program the FPGA, with make ULX3S or make ICESTICK or ...

Well, I'm happy, the problem is solved, but since we needed an additional program, could we not make it read the ELF directly and generate the Verilog hex from it ? (and next, if we know how to read the ELF format, could we not include that in FemtOS, so that it can directly select and run programs from the SDCard ?). The answer to both questions is YES !

Reading ELF executables

Olof Kindgren wrote a Verilog plugin here for FuseSoC, that uses the standard libelf. However, it is a nightmare to compile under Windows (I don't know if it is even possible). So my idea was different: can we write the minimal amount of code that fits our needs? ELF (Executable and Linkable Format) is complicated, because it does what its name says: it contains what's necessary for loading programs that can be dynamically linked. In our case, we are only using statically linked executables, so we can ignore most of the information in the ELF file, besides the code sections of course! Let us take a look at the contents of an ELF file:

$ readelf -a firmware.bm_elf

Wow, lots of things in there. Let us take a look and try to guess:

  • first, there is an ELF header, with magic numbers (expected to see this), header sizes (can be used for sanity checks: do we have the correct structures declared in the read_elf function we are writing?), architecture, 32 or 64 bits, byte order, OS, and then the number of program headers, the offset of the program headers, the number of section headers and the offset of the section headers. So we know we are going to open the file, fread() the header, then fseek() to the offsets where there is something that interests us (program headers or section headers, no idea which for now), then fread() them.
  • then we see a list of section headers. The names are interesting: .text, read-only data .rodata, and uninitialized data .bss and .sbss. Yes, our code is probably there. BTW, what is .sbss? Google tells me it is "small" data, that is, data that can be put in a page that is faster to access. OK, so we will need to load these sections, or zero them for .bss and .sbss. Then there are many other sections, with debug information, symbol tables, string tables, a shared string table. We probably do not need them. How can we figure it out?
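The sanity checks mentioned in the first point can be sketched as follows. The function works on the raw bytes of the header, using the field offsets and constants of the 32-bit ELF header layout (52-byte header, 40-byte section header entries, machine code 243 for RISC-V). The function and helper names are made up for this example; the actual checks in femto_elf.c may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian 16-bit field from a byte buffer. */
static uint16_t read_u16_le(const uint8_t* p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 1 if the buffer looks like the header of a 32-bit,
 * little-endian RISC-V ELF file, 0 otherwise. */
int elf32_riscv_header_ok(const uint8_t* h, size_t size) {
    if (size < 52) return 0;                      /* sizeof(Elf32_Ehdr)    */
    if (h[0] != 0x7f || h[1] != 'E' ||
        h[2] != 'L'  || h[3] != 'F') return 0;    /* magic number          */
    if (h[4] != 1) return 0;                      /* EI_CLASS: ELFCLASS32  */
    if (h[5] != 1) return 0;                      /* EI_DATA: ELFDATA2LSB  */
    if (read_u16_le(h + 18) != 243) return 0;     /* e_machine: EM_RISCV   */
    if (read_u16_le(h + 40) != 52) return 0;      /* e_ehsize              */
    if (read_u16_le(h + 46) != 40) return 0;      /* e_shentsize           */
    return 1;
}
```

Checking e_ehsize and e_shentsize against the sizes of our own structures is a cheap way of detecting a wrong structure declaration before reading anything else.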

Next step: read some code. Let us take a look at /usr/include/elf.h on a Linux system. Well, it is very general: it has the definitions for both 64-bit and 32-bit systems, and for any architecture. BTW, we used the system's readelf command instead of the one from the RISC-V toolchain and it worked! The meta-information is completely independent of the architecture and system, nice. Wow, elf.h is a priceless source of information, all the fields and constants are documented, love it! There we learn that what we need is to fread() the Elf32_Ehdr structure at the beginning of the file, then fseek() in the file to the offset given by the e_shoff field. Then we read e_shnum section headers. Good. Then, later in elf.h, we find the Elf32_Shdr structure, with the explanations for all the columns we could see in the table output by readelf:

| Field     | Description                                                                        |
|-----------|------------------------------------------------------------------------------------|
| sh_type   | we are interested in PROGBITS (load the section) and NOBITS (clear the memory)     |
| sh_flags  | SHF_ALLOC tells us which sections should really be allocated in memory            |
| sh_addr   | where the section will be mapped in memory                                         |
| sh_offset | where the section data is in the file                                              |
| sh_size   | number of bytes in the section                                                     |

Great! Now we know exactly what we have to do: for each section of type PROGBITS, if SHF_ALLOC is set in the flags, we need to fseek() to sh_offset, then fread() sh_size bytes that we will store at sh_addr. For each section of type NOBITS, if SHF_ALLOC is set in the flags, we need to clear sh_size bytes at sh_addr. This is implemented in FIRMWARE/LIBFEMTORV32/femto_elf.h/.c. I have declared a small structure to keep track of some information. In particular, I can change the base address, because I'm using the same code to load an ELF into a buffer (then base_address points to the buffer) or to load an ELF in FemtOS (then base_address is NULL). I also keep track of the beginning of the text segment (because to execute the file, FemtOS jumps there), and of the maximum address, to make sure everything fits in memory before loading it:

typedef uint32_t elf32_addr; 
typedef struct {
  void*      base_address; /* Base memory address (NULL on normal operation). */
  elf32_addr text_address; /* The address of the text segment.                */
  elf32_addr max_address;  /* The maximum address of a segment.               */
} Elf32Info;
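Before looking at the complete loader, the per-section rule described above can be isolated as a small decision function. This is a simplified sketch, not code from femto_elf.c: the enum and function names are made up, and only the constant values (SHT_PROGBITS = 1, SHT_NOBITS = 8, SHF_ALLOC = 0x2) come from the ELF definitions in elf.h.

```c
#include <stdint.h>

#define SHT_PROGBITS 1u    /* section holds data present in the file  */
#define SHT_NOBITS   8u    /* section occupies memory but no file data */
#define SHF_ALLOC    0x2u  /* section occupies memory at run time      */

/* What the loader has to do with a section (hypothetical names). */
typedef enum { SECTION_SKIP, SECTION_LOAD, SECTION_CLEAR } SectionAction;

SectionAction section_action(uint32_t sh_type, uint32_t sh_flags) {
    if (!(sh_flags & SHF_ALLOC)) {
        return SECTION_SKIP;    /* debug info, symbol tables, ...      */
    }
    if (sh_type == SHT_PROGBITS) {
        return SECTION_LOAD;    /* copy sh_size bytes to sh_addr       */
    }
    if (sh_type == SHT_NOBITS) {
        return SECTION_CLEAR;   /* zero sh_size bytes at sh_addr       */
    }
    return SECTION_SKIP;
}
```

With this rule, .text (PROGBITS, flags AX) is loaded, .bss (NOBITS, flags WA) is cleared, and .symtab or .comment (no ALLOC flag) are ignored.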

Now, in femto_elf.c, the function that loads the ELF is as follows (if you look at the actual file, it has some sanity checks that I removed for legibility):

int elf32_parse(const char* filename, Elf32Info* info) {
  Elf32_Ehdr elf_header;
  Elf32_Shdr sec_header;
  FILE* f = fopen(filename,"rb"); /* binary mode: matters on Windows */
  uint8_t* base_mem = (uint8_t*)(info->base_address);
  info->text_address = 0;
  info->max_address = 0;
  
  /* read elf header */
  fread(&elf_header, 1, sizeof(elf_header), f);
  
  /* read all section headers */  
  for(int i=0; i<elf_header.e_shnum; ++i) {
    
    fseek(f,elf_header.e_shoff + i*sizeof(sec_header), SEEK_SET);
    fread(&sec_header, 1, sizeof(sec_header), f);    
    
    /* The sections we are interested in are the ALLOC sections. Skip the other ones. */
    if(!(sec_header.sh_flags & SHF_ALLOC)) continue;

    /* I assume that the first PROGBITS section is the text segment */
    if(sec_header.sh_type == SHT_PROGBITS && info->text_address == 0) {
      info->text_address = sec_header.sh_addr;
    }

    /* Update max address */ 
    info->max_address = MAX(
      info->max_address, 
      sec_header.sh_addr + sec_header.sh_size
    );
     
    /* PROGBITS, INIT_ARRAY and FINI_ARRAY need to be loaded. */
    if(
       sec_header.sh_type == SHT_PROGBITS ||
       sec_header.sh_type == SHT_INIT_ARRAY ||
       sec_header.sh_type == SHT_FINI_ARRAY
    ) {
      if(info->base_address != NO_ADDRESS) {
        fseek(f,sec_header.sh_offset, SEEK_SET);
        fread(
          base_mem + sec_header.sh_addr, 1,
          sec_header.sh_size, f
        );
      }
    }

    /* NOBITS need to be cleared. */    
    if(sec_header.sh_type == SHT_NOBITS && info->base_address != NO_ADDRESS) {	
      memset(base_mem + sec_header.sh_addr, 0, sec_header.sh_size);
    }
  }  
  fclose(f);
  
  return ELF32_OK;
}

(if you take a look at femto_elf.c, you will see that I've copied structure definitions and constants from elf.h, so that it compiles everywhere, even in uncivilized Windows countries).

Ok, here we are, so now I am using femto_elf.h/femto_elf.c in my firmware_words utility that outputs a Verilog ASCII .hex file to initialize the RAM.

FemtOS and standard ELF executables

Now we are equipped with what's necessary to write a very basic and crappy operating system. I'm doing that on the ULX3S. It is a bare metal executable, that displays a list of files on the OLED display, lets the user select them with the buttons, and execute the selected file. Its sources are in FIRMWARE/EXAMPLES/commander.c, so you can build it with:

$ cd FIRMWARE
$ ./make_firmware.sh EXAMPLES/commander.c
$ cd ..

(then run make ULX3S).

Now you can copy some programs on an SDCard, insert it into the ULX3S, and run them (up and down buttons to select, right to run). The reset button is the one near the SDCard.

The executables are produced by:

$ cd FIRMWARE/EXAMPLES
$ make xxx.elf

(you can also make everything and copy all the executables to the SDCard). Now if you look at the rule in FIRMWARE/EXAMPLES/Makefile, it is very simple:

%.elf: %.o $(RV_BINARIES)
	$(RVGCC) $(RVCFLAGS) $< -o $@ -Wl,-gc-sections \
	-L$(FIRMWARE_DIR)/LIBFEMTORV32 -L$(FIRMWARE_DIR)/LIBFEMTOC -lfemtorv32 -lfemtoc -lm
  • the macros are defined in FIRMWARE/makefile.inc
  • the -Wl,-gc-sections flag is just to make sure the linker eliminates the code that is not used (probably not mandatory)

Here we can directly use the default memory map, which places user code at address 0x10000 (that is, 64 KB). Since the FemtOS commander fits in 64 KB, it is perfect for us!

There is something stupid though: a lot of code is duplicated. For instance, if you run ST_NICCC, which accesses a file on the SDCard, the whole FAT32 library (by @ultraembedded) is loaded twice: once in FemtOS, and once in the program image. OK, it is only a few tens of KB, but I do not like it, it is not good practice.

There are two different things that we could do:

  • implement shared library support
  • implement system calls

For the first option, I will need to learn much more about the ELF format. For the second option, I will need to implement privileged instructions and exceptions. This is probably what I'll do next.

Running programs from the SPI Flash

The IceStick does not have much BRAM (8K in total, 6K available), so complex programs cannot fit, and the limit is quickly reached. However, it is possible to run code directly from the SPI Flash. In RTL/femtosoc_config.v, select RTL/CONFIGS/icestick_spi_flash_config.v, then edit it to select the devices that are installed. It uses the "mini-femtorv32" processor, that has a smaller LUT footprint, and that can run code directly from the SPI Flash. Then some examples are available in FIRMWARE/SPI_FLASH. To compile and install one of the programs, e.g. mandelbrot.c:

$ make mandelbrot.prog

Note that it is much sloooowwwwer than running code from BRAM directly, because the SPI Flash is accessed through a serial protocol that needs 52 cycles to read a 32-bit word.

It works as follows: the SPI Flash is mapped at address 0x800000, and FemtoRV32 is configured to jump directly to this address. Then there is a linker script, FIRMWARE/CRT_BAREMETAL/spi_flash.ld (taken from picorv32), that sends the different segments of the program either to the SPI Flash or to the BRAM. Read-only data (.rodata) and small read-only data (.srodata) segments are sent to the SPI Flash, and uninitialized data segments (.bss and .sbss) are sent to the BRAM. Initialized data segments (.data, .sdata) are sent to the BRAM, and have their initialization data stored in the SPI Flash. Then the C runtime startup, also inspired by picorv32, copies the initialization data from the SPI Flash to the BRAM. The start and end addresses of the memory zone to be copied are exported by the linker script. If you want to learn more about linker scripts, see the links at the end of this page.
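The copy done by the C runtime startup boils down to a small loop. Here it is sketched in C (the actual startup code is written in assembly, and copy_words is a made-up name). On the target, the three addresses would come from symbols exported by the linker script; their names depend on the script, so they are left as parameters here.

```c
#include <stdint.h>

/* Copies 32-bit words from src (initialization data in SPI Flash)
 * to the range [dst, dst_end) (initialized data area in BRAM). */
void copy_words(const uint32_t* src, uint32_t* dst, const uint32_t* dst_end) {
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}
```

Since every word read from the SPI Flash is slow, this copy is done once at startup, and the program then works on the fast BRAM copy.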

SPI Flash and Function attributes

The SPI Flash is much slower than the BRAM. It takes 44 cycles to fetch an instruction from it! (to be compared with 1 cycle for BRAM). Some functions are used often, and it may be useful to install them in fast memory instead of SPI Flash. The linker script spi_flash.ld in CRT_BAREMETAL installs one copy of the __mulsi3, __divsi3 and __udivsi3 functions, used to multiply and divide integer numbers on RV32I processors, and it is worth it (they are very short, so they do not eat up too much BRAM). Now you may want to install some of your own functions in BRAM. Here is an example of how to do it, in FIRMWARE/SPI_FLASH/riscv_logo_OLED.c:

void draw_frame(int frame) __attribute__((section(".fastcode")));

void draw_frame(int frame) {
....
}

The first line declares a GCC attribute indicating in which section the function should reside. Then the linker script FIRMWARE/CRT_BAREMETAL/spi_flash.ld will know that this function should be put in the .data_and_fastcode segment that resides in BRAM and that is copied there at program startup by the C runtime FIRMWARE/CRT_BAREMETAL/crt0_spiflash.S.

C++

It is also possible to compile C++ programs for FemtoRV. Some examples are included in the FIRMWARE/CPP_EXAMPLES subdirectory. The simplest one (cpp_test) works on the IceStick. However, the tiny raytracer does not: the C++ runtime eats up too much of the available 6kB BRAM, and does not leave sufficient stack space for the program (but it works on the ULX3S that has 256kB of BRAM). Note that the C version in EXAMPLES/tinyraytracer.c works on the IceStick (and fits in 6 KB).

Dynamic allocation works, and uses under the hood an implementation of sbrk(). There is an example in Claire Wolf's picorv32, here. It is quite simple. The only tricky thing is how to initialize the brk at the end of the code segment, but there is an _end symbol defined by the linker, somewhere in the linker script, that does the job. Now the C++ standard libraries are huge (more than 1 megabyte), and we only have (for now) 256 KB. But if we do not use iostreams, then we can link with libsupc++ instead, which has the bare minimum needed to run C++ programs. You will also need to pass the -nostdlib flag when linking. Some examples and the Makefile are here. Note that on the ULX3S, the 256 KB limit is quickly reached. We will need an SDRAM controller to have more space!
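To show how little sbrk() needs to do, here is a minimal sketch of the idea. It is not the picorv32 nor the FemtoRV32 implementation: to keep it runnable on any machine, the heap is a fixed-size static array (demo_heap) instead of the memory that follows the linker's _end symbol, and the function is called demo_sbrk to avoid clashing with the C library.

```c
#include <stddef.h>

#define DEMO_HEAP_SIZE 1024

/* Stand-in for the free RAM after the program image. On the target, the
 * break would instead be initialized from the linker symbol, e.g.:
 *   extern char _end;  static char* brk = &_end;                        */
static char demo_heap[DEMO_HEAP_SIZE];
static char* demo_brk = demo_heap;

/* Moves the program break by 'increment' bytes and returns the previous
 * break, or (void*)-1 if the request does not fit in the heap. */
void* demo_sbrk(ptrdiff_t increment) {
    char* old_brk = demo_brk;
    ptrdiff_t used = demo_brk - demo_heap;
    if (increment > (ptrdiff_t)DEMO_HEAP_SIZE - used || increment < -used) {
        return (void*)-1;
    }
    demo_brk += increment;
    return old_brk;
}
```

malloc() and operator new call such a function whenever they need more memory, so this is the only piece that has to know where the free RAM starts and ends.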

References and links