main.tex

\documentclass[12pt]{article}
% my packages
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{verbatim}
\usepackage{color}
\usepackage{boxedminipage}
\usepackage{enumitem}
\usepackage[caption=false]{subfig}
\usepackage{url}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{hyperref}

\definecolor{commentcolor}{rgb}{0,0.6,0}
\definecolor{keywordcolor}{rgb}{0,0,0.8}
\definecolor{numbercolor}{rgb}{0.5,0.5,0.5}
\definecolor{stringcolor}{rgb}{0.58,0,0.82}
\lstset{%
    language=Verilog,                        % closest to BSV
    backgroundcolor=\color{white},           % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}
    basicstyle=\ttfamily\bfseries,              % the size of the fonts that are used for the code
    belowskip=0.5\baselineskip,            % from: http://tex.stackexchange.com/questions/118730/avoid-empty-vert-space-after-lstlisting
    breakatwhitespace=false,                 % sets if automatic breaks should only happen at whitespace
    breaklines=true,                         % sets automatic line breaking
    captionpos=b,                            % sets the caption-position to bottom
    commentstyle=\color{commentcolor},       % comment style
    deletekeywords={...},                    % if you want to delete keywords from the given language
    escapeinside={\%*}{*)},                  % if you want to add LaTeX within your code
    extendedchars=true,                      % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8
    frame=single,                            % adds a frame around the code
    %frame=none,                              % doesn't add a frame around the code
    keepspaces=true,                         % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
    %columns=fixed,                           %
    columns=flexible,                        %
    keywordstyle=\color{keywordcolor},       % keyword style
    morekeywords={%
        call,%Called method
        def,%Defined method
        inverted,%Inverted interface
        type,typedef,valueOf,                                    % Type related keywords
        method,endmethod,action,endaction,                  % Methods and actions
        Action,ActionValue,interface,endinterface,          % Interfaces
        Vector,replicate,replicateM,                        % Vector stuff
        Bit,Int,UIng,Reg,Integer,let,tagged,union,struct,   % Basic types
        TAdd,TMul,TDiv,                                     % Type operations
        rule,endrule,return,                                % Rules keywords
        pack,unpack,zeroExtend,signExtend,                  % Common bit functions
        case,matches,endcase                                % Case statements
        synthesize,True,False,Empty,*,...},                 % Etc.
    numbers=left,                            % where to put the line-numbers; possible values are (none, left, right)
    numbersep=5pt,                           % how far the line-numbers are from the code
    numberstyle=\tiny\color{numbercolor},    % the style that is used for the line-numbers
    rulecolor=\color{black},                 % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
    showspaces=false,                        % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
    showstringspaces=false,                  % underline spaces within strings only
    showtabs=false,                          % show tabs within strings adding particular underscores
    stepnumber=1,                            % the step between two line-numbers. If it's 1, each line will be numbered
    stringstyle=\color{stringcolor},         % string literal style
    tabsize=2,                               % sets default tabsize to 2 spaces
    title=\lstname                           % show the filename of files included with \lstinputlisting; also try caption instead of title
}

\newcommand{\mycomment}[1]{\emph{\textcolor{red}{[#1]}}}

\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\inst}[1]{\textsf{#1}}

\begin{document}

\title{RiscyOO Design Document}
\author{Sizhuo Zhang \\ szzhang@csail.mit.edu \\ MIT CSAIL}
\date{}
\maketitle

\section{Overview}

This document describes the design of the RiscyOO processor, which has been released publicly at \url{https://github.com/csail-csg/riscy-OOO}.
As a convention, we use ``\code{//}'' to denote the root of the git repo when referring to the path of source files.

RiscyOO implements the RISC-V 64-bit instruction set with the IMAFD extensions, i.e., RV64G.
It is a out-of-order superscalar multiprocessor which can boot multicore Linux and AWS F1 FPGA.

\begin{figure}[!htb]
    \centering
    \includegraphics[width=\columnwidth]{fig/system_crop.pdf}
    \caption{Overall system}\label{fig:sys}
\end{figure}

Figure~\ref{fig:sys} shows the overall system.
The RiscyOO processor is implemented on the AWS FPGA.
The FPGA is connected to a x86 host machine via PCIe.
The x86 host machine runs a piece of software to do the following:
\begin{itemize}
    \item Initialize the processor on FPGA (e.g., copy Linux image to the DRAM of FPGA).
    \item Receive character outputs from the processor on FPGA and display the characters on the terminal of the x86 host.
    This communication is using the HTIF protocol, which is used in the RISC-V ISA simulator Spike~\cite{spike} (our fork of Spike is at \code{//tools/riscv-isa-sim}).
    \item Send keyboard input on the x86 host to the processor on FPGA.
    This communication is also using the HTIF protocol.
    \item Query performance counters in the processor on FPGA.
    \item Receive terminate signals from FPGA.
\end{itemize}
The communication between FPGA and x86 software via PCIe is using the Connectal framework~\cite{connectal}, our fork of Connectal is at \code{//connectal}).

\noindent\textbf{Document Organization:}
Section~\ref{sec:core} describes the processor-core microarchitectures.
In particular, we describe the interface (including conflict matrices), implementation, future improvement and the path of source files for important modules.
Section~\ref{sec:uncore} describes the uncore, including the shared L2 cache, DRAM, etc.
Section~\ref{sec:sw} describes the software running on the x86 host machine.
Section~\ref{sec:bench} lists the git repos for the benchmark programs that have been run on the processor.

This document assumes the readers know the details of RISC-V and Bluespec SystemVerilog (BSV), and will use RISC-V/BSV terms without explanations.

\section{Processor Core}\label{sec:core}

Figure~\ref{fig:core} shows the overall structure of the processor core.
Modules are represented by boxes.
The main processor pipeline is the following:
\begin{itemize}
    \item Fetch pipeline $\rightarrow$ Rename stage $\rightarrow$ ROB and all execution pipelines (ALU/Branch, FPU/Int-Mul/Int-Div, and Mem) $\rightarrow$ Commit stage.
\end{itemize}
The rename stage divides the processor core into the front-end and the back-end.
The front-end, i.e., the fetch pipeline, is an in-order pipeline, while out-of-order execution happens all in the back-end.

An instruction is fetched and decoded in the fetch pipeline.
The rename stage performs register renaming and enters the instruction int ROB and one of the execution pipelines.
When the instruction finishes execution and leaves the execution pipeline, it notifies the ROB that it is executed.
Finally the commit stage commits instructions from ROB in order.

Below we introduce some basic information about the core, which is not tied to each specific module.

\noindent\textbf{Superscalarity:}
The processor core is also superscalar.
The fetch pipeline, rename stage and commit stage can all handle multiple instructions in one cycle.
\emph{The size of superscalarity is represented by numeric type \code{SupSize} in the source code and in the rest of this document.}
The core also has multiple execution pipelines.
The numbers of ALU/Branch (abbreviated as ALU in the following) execution pipelines and FPU/Int-Mul/Int-Div execution pipelines (abbreviated as FPU execution pipeline in the following) are parametrized.
However, there can only be one memory execution pipeline

\begin{figure}
    \centering
    \includegraphics[width=\columnwidth]{fig/core_crop.pdf}
    \caption{Overall structure of the processor core}\label{fig:core}
\end{figure}

\noindent\textbf{Scheduling Convention:}
Since there are many modules in the core, assigning conflict matrices to different module interfaces to maximize the concurrency of rules can be difficult.
To simplify the problem, we following the convention of \emph{reverse pipeline ordering} in assigning conflict matrices.
\emph{That is, rules representing later stages in the processor pipeline (e.g., commit) are ordered before rules representing earlier stages (e.g., rename), and thus, interface methods called in later stages are ordered before methods called in earlier stages.}

Another principle in scheduling is to have common rules fire concurrently.
\emph{It is ok to have uncommon rules (e.g., mis-speculation) conflict with each other and even common rules.}

\noindent\textbf{System Instructions:}
Some instructions can change the context of the processor and are difficult to be executed out of order or concurrently with other instructions.
These instructions are called \emph{system instructions}, and include the following instructions:
\begin{itemize}
    \item \inst{ECALL} which makes a system call,
    \item \inst{EBREAK} which traps for debugger,
    \item \inst{SRET} and \inst{MRET} which return from trap handling,
    \item \inst{SFENCE.VMA} which flushes the TLBs, and
    \item \inst{FENCE.I} which is used for self-modifying code.
\end{itemize}
System instructions are executed in a blocking way, i.e., the instruction will be the only instruction in the ROB.


\noindent\textbf{Detecting Interrupt:}
We detect interrupts at \emph{rename} stage (instead of commit stage).
When an interrupt is detected, the current instruction at the rename stage will be turned into a special instruction which represents the interrupt.
The instruction will be entered into ROB and get processed at the commit stage.
Here we assume the fetch pipeline will keep delivering instructions to the rename stage, i.e., there is no low-power mode which turns off instruction fetch.

The benefits of detecting interrupts at rename stage is to ensure that all instructions in ROB are not affected by interrupts.
As a result, out-of-order commits may be possible, though we did not implement any out-of-order commits.

\noindent\textbf{Read-After-Write Hazards for CSRs:}
Our implementation ensures that there is no read-after-write hazards on CSRs in the processor.
CSRs can be modified by system instructions, exceptions, interrupts, and floating-point instructions.
The invariant of no read-after-write hazards on CSRs is guaranteed in the following way: at the rename stage, if we see a system instruction (i.e., \inst{CSRRW}), we enters the instruction into the ROB only if ROB is empty, and we flush the fetch pipeline at the same time.
As a result, there cannot be any hazards regarding the writes on CSRs by system instructions, exceptions or interrupts.

As for the hazards regarding the writes on CSRs by floating-point instructions.
We notice that only the \inst{CSRRW} instruction, which is a system instruction, can read the CSRs modified by floating-point instructions.
Therefore, the writes by floating-point instructions cannot cause any hazards.

It should be noted that there is no write-after-write or write-after-read hazards on CSRs, because all CSR writes are made at commit stage.

\noindent\textbf{Organization:}
In the rest of this section, we first introduce the general mechanism to control speculation in the back-end (Section~\ref{sec:specupdate}), and then describe each modules and top-level rules.

\input{kill_scheme}

\input{epoch_manager}

\input{spec_tag_manager}

\input{global_spec_update}

\input{spec_fifo}

\input{spec_poison_fifo}

\input{sb_aggr}

\input{rf_sbcons}

\input{rename_table}

\input{rob}

\input{reservation_station}

\input{lsq}

\input{store_buffer}

\input{d_cache}

\input{i_cache}

\input{d_tlb}

\input{i_tlb}

\input{l2_tlb}

\input{fpu}

\input{int_mul_div}

\input{csr_file}

\input{mmio_core}

\input{fetch}

\input{rename}

\input{alu_exe_pipe}

\input{fpu_exe_pipe}

\input{mem_exe_pipe}

\input{commit}

\input{misc_rule}

\input{core}

\section{Uncore}\label{sec:uncore}

Figure~\ref{fig:uncore} shows the uncore system.
The L2 cache connects to the DRAM and is shared by all the L1 caches.
The MMIO platform module is connected to all the MMIO handler modules (Section~\ref{sec:mmio-core}) in the processor cores.
It handles all the MMIO requests from the cores to access all the MMIO devices, and post interrupts to the cores.
The boot rom is a MMIO device which contains a piece of RISC-V code that is run at the very beginning when the processor starts up.
The code zeros the architectural registers, configures the CSRs, and requests the x86 host to initialize the memory content through the memory loader module.
The memory loader module is another MMIO device that transfers data (Linux image) from the x86 host the memory system (via the coherent L2 cache).
In the following, we will explain each module in the uncore system.

\begin{figure}[!htb]
    \centering
    \includegraphics[width=\columnwidth]{fig/uncore_crop.pdf}
    \caption{Uncore system}\label{fig:uncore}
\end{figure}


\input{l2_cache}

\input{l2_dram_convert}

\input{dram}

\input{mmio_platform}

\input{boot_rom}

\input{mem_loader}

\subsection{Putting it All Together}
Module \code{mkProcDmaWrapper} in \code{//procs/lib/ProcDmaWrapper.bsv} is the top-level module on FPGA, i.e., it contains both cores and uncores.
Its interface is for communicating with the x86 host software.
The major part of the interface has been shown in Figure~\ref{fig:uncore}.
The rest is for debugging or quering performance counters, and we will skip these details.

\section{X86 Host Software}\label{sec:sw}
The x86 host software is a C++ project which uses the Connectal framework to communicate with FPGA.
The main function is at \code{//procs/cpp/testproc.cpp}.
The main thread does the following:
\begin{enumerate}
    \item Parse the command-line arguments.
    The two most important arguments are the binary of the boot-rom program and the BBL (Linux image).
    \item Initialize the boot rom on FPGA.
    We also attach a flattened device tree to the end of the boot-rom binary before the initialization.
    \item Start the processor, and then wait for the processor to signal terminate.
    While waiting, the main thread also checks for keyboard inputs, and sends them to FPGA.
\end{enumerate}
The Connectal-indication thread runs callback functions when the FPGA sends messages to the x86 host.
Below are the major classes that implement the callbacks:
\begin{itemize}
    \item Class \code{ProIndication}: receives the HTIF messages, i.e., the data written into \code{tohost} MMIO location by the RiscyOO processor.
    The handling is done in another thread (class \code{ToHostHandler}) to avoid software deadlock (because the handling may incur more messages between FPGA and host).
    The \code{ProIndication} class also receives the terminate signal from the FPGA.
    \item Class \code{MemLoaderIndication}: receives the request from the FPGA to initialize DRAM.
    Upon receiving the request, the class spawns a new thread to transfer the content of BBL to FPGA.
    \item Class \code{BootRomIndication}: receives the complete signal for the initialization of boot rom from the FPGA.
    The main thread will not start the processor until this complete signal is received.
\end{itemize}

\section{Benchmarks}\label{sec:bench}
We have cross-compiled (part of) the following benchmarks to RISC-V and they are available publicly:
\begin{itemize}
    \item PARSEC~\cite{bienia11benchmarking}: \url{https://github.com/csail-csg/parsec}, branch \code{master}.
    \item GAP (graph algorithms)~\cite{beamer2015gap}: \url{https://github.com/csail-csg/gapbs}, branch \code{riscy-OOO}.
\end{itemize}
The following benchmarks are private to MIT-CSG lab:
\begin{itemize}
    \item SPECINT2006: \url{https://github.mit.edu/riscy/SPEC}, branch \code{master}.
    \item x264 (the x264 benchmark in PARSEC does not work, and this one directly compiles from the original x264 source): \url{https://github.mit.edu/riscy/x264}, branch \code{riscv}.
\end{itemize}
The following pre-built BBLs are private to MIT-CSG lab on the AWS EFS:
\begin{itemize}
    \item PARSEC (including x264): \code{/efs/szzhang/run/parsec\_bbls/normal}.
    \item GAP: \code{/efs/szzhang/run/gapbs\_bbls/normal}.
    \item SPECCINT2006: \code{/efs/szzhang/run/bbls} (one input per benchmark), and  \\ \code{/efs/szzhang/run/ref\_bbls} (all ref inputs).
\end{itemize}

\bibliographystyle{plain}
\bibliography{ref}

\end{document}