-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathmain.tex
336 lines (263 loc) · 16.9 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
\documentclass[12pt]{article}
% my packages
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{verbatim}
\usepackage{color}
\usepackage{boxedminipage}
\usepackage{enumitem}
\usepackage[caption=false]{subfig}
\usepackage{url}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{hyperref}
\definecolor{commentcolor}{rgb}{0,0.6,0}
\definecolor{keywordcolor}{rgb}{0,0,0.8}
\definecolor{numbercolor}{rgb}{0.5,0.5,0.5}
\definecolor{stringcolor}{rgb}{0.58,0,0.82}
\lstset{%
language=Verilog, % closest to BSV
backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}
basicstyle=\ttfamily\bfseries, % the size of the fonts that are used for the code
belowskip=0.5\baselineskip, % from: http://tex.stackexchange.com/questions/118730/avoid-empty-vert-space-after-lstlisting
breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
breaklines=true, % sets automatic line breaking
captionpos=b, % sets the caption-position to bottom
commentstyle=\color{commentcolor}, % comment style
deletekeywords={...}, % if you want to delete keywords from the given language
escapeinside={\%*}{*)}, % if you want to add LaTeX within your code
extendedchars=true, % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8
frame=single, % adds a frame around the code
%frame=none, % doesn't add a frame around the code
keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
%columns=fixed, %
columns=flexible, %
keywordstyle=\color{keywordcolor}, % keyword style
morekeywords={%
call,%Called method
def,%Defined method
inverted,%Inverted interface
type,typedef,valueOf, % Type related keywords
method,endmethod,action,endaction, % Methods and actions
Action,ActionValue,interface,endinterface, % Interfaces
Vector,replicate,replicateM, % Vector stuff
Bit,Int,UIng,Reg,Integer,let,tagged,union,struct, % Basic types
TAdd,TMul,TDiv, % Type operations
rule,endrule,return, % Rules keywords
pack,unpack,zeroExtend,signExtend, % Common bit functions
case,matches,endcase % Case statements
synthesize,True,False,Empty,*,...}, % Etc.
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
numberstyle=\tiny\color{numbercolor}, % the style that is used for the line-numbers
rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
showtabs=false, % show tabs within strings adding particular underscores
stepnumber=1, % the step between two line-numbers. If it's 1, each line will be numbered
stringstyle=\color{stringcolor}, % string literal style
tabsize=2, % sets default tabsize to 2 spaces
title=\lstname % show the filename of files included with \lstinputlisting; also try caption instead of title
}
\newcommand{\mycomment}[1]{\emph{\textcolor{red}{[#1]}}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\inst}[1]{\textsf{#1}}
\begin{document}
\title{RiscyOO Design Document}
\author{Sizhuo Zhang \\ [email protected] \\ MIT CSAIL}
\date{}
\maketitle
\section{Overview}
This document describes the design of the RiscyOO processor, which has been released publicly at \url{https://github.com/csail-csg/riscy-OOO}.
As a convention, we use ``\code{//}'' to denote the root of the git repo when referring to the path of source files.
RiscyOO implements the RISC-V 64-bit instruction set with the IMAFD extensions, i.e., RV64G.
It is a out-of-order superscalar multiprocessor which can boot multicore Linux and AWS F1 FPGA.
\begin{figure}[!htb]
\centering
\includegraphics[width=\columnwidth]{fig/system_crop.pdf}
\caption{Overall system}\label{fig:sys}
\end{figure}
Figure~\ref{fig:sys} shows the overall system.
The RiscyOO processor is implemented on the AWS FPGA.
The FPGA is connected to a x86 host machine via PCIe.
The x86 host machine runs a piece of software to do the following:
\begin{itemize}
\item Initialize the processor on FPGA (e.g., copy Linux image to the DRAM of FPGA).
\item Receive character outputs from the processor on FPGA and display the characters on the terminal of the x86 host.
This communication is using the HTIF protocol, which is used in the RISC-V ISA simulator Spike~\cite{spike} (our fork of Spike is at \code{//tools/riscv-isa-sim}).
\item Send keyboard input on the x86 host to the processor on FPGA.
This communication is also using the HTIF protocol.
\item Query performance counters in the processor on FPGA.
\item Receive terminate signals from FPGA.
\end{itemize}
The communication between FPGA and x86 software via PCIe is using the Connectal framework~\cite{connectal}, our fork of Connectal is at \code{//connectal}).
\noindent\textbf{Document Organization:}
Section~\ref{sec:core} describes the processor-core microarchitectures.
In particular, we describe the interface (including conflict matrices), implementation, future improvement and the path of source files for important modules.
Section~\ref{sec:uncore} describes the uncore, including the shared L2 cache, DRAM, etc.
Section~\ref{sec:sw} describes the software running on the x86 host machine.
Section~\ref{sec:bench} lists the git repos for the benchmark programs that have been run on the processor.
This document assumes the readers know the details of RISC-V and Bluespec SystemVerilog (BSV), and will use RISC-V/BSV terms without explanations.
\section{Processor Core}\label{sec:core}
Figure~\ref{fig:core} shows the overall structure of the processor core.
Modules are represented by boxes.
The main processor pipeline is the following:
\begin{itemize}
\item Fetch pipeline $\rightarrow$ Rename stage $\rightarrow$ ROB and all execution pipelines (ALU/Branch, FPU/Int-Mul/Int-Div, and Mem) $\rightarrow$ Commit stage.
\end{itemize}
The rename stage divides the processor core into the front-end and the back-end.
The front-end, i.e., the fetch pipeline, is an in-order pipeline, while out-of-order execution happens all in the back-end.
An instruction is fetched and decoded in the fetch pipeline.
The rename stage performs register renaming and enters the instruction int ROB and one of the execution pipelines.
When the instruction finishes execution and leaves the execution pipeline, it notifies the ROB that it is executed.
Finally the commit stage commits instructions from ROB in order.
Below we introduce some basic information about the core, which is not tied to each specific module.
\noindent\textbf{Superscalarity:}
The processor core is also superscalar.
The fetch pipeline, rename stage and commit stage can all handle multiple instructions in one cycle.
\emph{The size of superscalarity is represented by numeric type \code{SupSize} in the source code and in the rest of this document.}
The core also has multiple execution pipelines.
The numbers of ALU/Branch (abbreviated as ALU in the following) execution pipelines and FPU/Int-Mul/Int-Div execution pipelines (abbreviated as FPU execution pipeline in the following) are parametrized.
However, there can only be one memory execution pipeline
\begin{figure}
\centering
\includegraphics[width=\columnwidth]{fig/core_crop.pdf}
\caption{Overall structure of the processor core}\label{fig:core}
\end{figure}
\noindent\textbf{Scheduling Convention:}
Since there are many modules in the core, assigning conflict matrices to different module interfaces to maximize the concurrency of rules can be difficult.
To simplify the problem, we following the convention of \emph{reverse pipeline ordering} in assigning conflict matrices.
\emph{That is, rules representing later stages in the processor pipeline (e.g., commit) are ordered before rules representing earlier stages (e.g., rename), and thus, interface methods called in later stages are ordered before methods called in earlier stages.}
Another principle in scheduling is to have common rules fire concurrently.
\emph{It is ok to have uncommon rules (e.g., mis-speculation) conflict with each other and even common rules.}
\noindent\textbf{System Instructions:}
Some instructions can change the context of the processor and are difficult to be executed out of order or concurrently with other instructions.
These instructions are called \emph{system instructions}, and include the following instructions:
\begin{itemize}
\item \inst{ECALL} which makes a system call,
\item \inst{EBREAK} which traps for debugger,
\item \inst{SRET} and \inst{MRET} which return from trap handling,
\item \inst{SFENCE.VMA} which flushes the TLBs, and
\item \inst{FENCE.I} which is used for self-modifying code.
\end{itemize}
System instructions are executed in a blocking way, i.e., the instruction will be the only instruction in the ROB.
\noindent\textbf{Detecting Interrupt:}
We detect interrupts at \emph{rename} stage (instead of commit stage).
When an interrupt is detected, the current instruction at the rename stage will be turned into a special instruction which represents the interrupt.
The instruction will be entered into ROB and get processed at the commit stage.
Here we assume the fetch pipeline will keep delivering instructions to the rename stage, i.e., there is no low-power mode which turns off instruction fetch.
The benefits of detecting interrupts at rename stage is to ensure that all instructions in ROB are not affected by interrupts.
As a result, out-of-order commits may be possible, though we did not implement any out-of-order commits.
\noindent\textbf{Read-After-Write Hazards for CSRs:}
Our implementation ensures that there is no read-after-write hazards on CSRs in the processor.
CSRs can be modified by system instructions, exceptions, interrupts, and floating-point instructions.
The invariant of no read-after-write hazards on CSRs is guaranteed in the following way: at the rename stage, if we see a system instruction (i.e., \inst{CSRRW}), we enters the instruction into the ROB only if ROB is empty, and we flush the fetch pipeline at the same time.
As a result, there cannot be any hazards regarding the writes on CSRs by system instructions, exceptions or interrupts.
As for the hazards regarding the writes on CSRs by floating-point instructions.
We notice that only the \inst{CSRRW} instruction, which is a system instruction, can read the CSRs modified by floating-point instructions.
Therefore, the writes by floating-point instructions cannot cause any hazards.
It should be noted that there is no write-after-write or write-after-read hazards on CSRs, because all CSR writes are made at commit stage.
\noindent\textbf{Organization:}
In the rest of this section, we first introduce the general mechanism to control speculation in the back-end (Section~\ref{sec:specupdate}), and then describe each modules and top-level rules.
\input{kill_scheme}
\input{epoch_manager}
\input{spec_tag_manager}
\input{global_spec_update}
\input{spec_fifo}
\input{spec_poison_fifo}
\input{sb_aggr}
\input{rf_sbcons}
\input{rename_table}
\input{rob}
\input{reservation_station}
\input{lsq}
\input{store_buffer}
\input{d_cache}
\input{i_cache}
\input{d_tlb}
\input{i_tlb}
\input{l2_tlb}
\input{fpu}
\input{int_mul_div}
\input{csr_file}
\input{mmio_core}
\input{fetch}
\input{rename}
\input{alu_exe_pipe}
\input{fpu_exe_pipe}
\input{mem_exe_pipe}
\input{commit}
\input{misc_rule}
\input{core}
\section{Uncore}\label{sec:uncore}
Figure~\ref{fig:uncore} shows the uncore system.
The L2 cache connects to the DRAM and is shared by all the L1 caches.
The MMIO platform module is connected to all the MMIO handler modules (Section~\ref{sec:mmio-core}) in the processor cores.
It handles all the MMIO requests from the cores to access all the MMIO devices, and post interrupts to the cores.
The boot rom is a MMIO device which contains a piece of RISC-V code that is run at the very beginning when the processor starts up.
The code zeros the architectural registers, configures the CSRs, and requests the x86 host to initialize the memory content through the memory loader module.
The memory loader module is another MMIO device that transfers data (Linux image) from the x86 host the memory system (via the coherent L2 cache).
In the following, we will explain each module in the uncore system.
\begin{figure}[!htb]
\centering
\includegraphics[width=\columnwidth]{fig/uncore_crop.pdf}
\caption{Uncore system}\label{fig:uncore}
\end{figure}
\input{l2_cache}
\input{l2_dram_convert}
\input{dram}
\input{mmio_platform}
\input{boot_rom}
\input{mem_loader}
\subsection{Putting it All Together}
Module \code{mkProcDmaWrapper} in \code{//procs/lib/ProcDmaWrapper.bsv} is the top-level module on FPGA, i.e., it contains both cores and uncores.
Its interface is for communicating with the x86 host software.
The major part of the interface has been shown in Figure~\ref{fig:uncore}.
The rest is for debugging or quering performance counters, and we will skip these details.
\section{X86 Host Software}\label{sec:sw}
The x86 host software is a C++ project which uses the Connectal framework to communicate with FPGA.
The main function is at \code{//procs/cpp/testproc.cpp}.
The main thread does the following:
\begin{enumerate}
\item Parse the command-line arguments.
The two most important arguments are the binary of the boot-rom program and the BBL (Linux image).
\item Initialize the boot rom on FPGA.
We also attach a flattened device tree to the end of the boot-rom binary before the initialization.
\item Start the processor, and then wait for the processor to signal terminate.
While waiting, the main thread also checks for keyboard inputs, and sends them to FPGA.
\end{enumerate}
The Connectal-indication thread runs callback functions when the FPGA sends messages to the x86 host.
Below are the major classes that implement the callbacks:
\begin{itemize}
\item Class \code{ProIndication}: receives the HTIF messages, i.e., the data written into \code{tohost} MMIO location by the RiscyOO processor.
The handling is done in another thread (class \code{ToHostHandler}) to avoid software deadlock (because the handling may incur more messages between FPGA and host).
The \code{ProIndication} class also receives the terminate signal from the FPGA.
\item Class \code{MemLoaderIndication}: receives the request from the FPGA to initialize DRAM.
Upon receiving the request, the class spawns a new thread to transfer the content of BBL to FPGA.
\item Class \code{BootRomIndication}: receives the complete signal for the initialization of boot rom from the FPGA.
The main thread will not start the processor until this complete signal is received.
\end{itemize}
\section{Benchmarks}\label{sec:bench}
We have cross-compiled (part of) the following benchmarks to RISC-V and they are available publicly:
\begin{itemize}
\item PARSEC~\cite{bienia11benchmarking}: \url{https://github.com/csail-csg/parsec}, branch \code{master}.
\item GAP (graph algorithms)~\cite{beamer2015gap}: \url{https://github.com/csail-csg/gapbs}, branch \code{riscy-OOO}.
\end{itemize}
The following benchmarks are private to MIT-CSG lab:
\begin{itemize}
\item SPECINT2006: \url{https://github.mit.edu/riscy/SPEC}, branch \code{master}.
\item x264 (the x264 benchmark in PARSEC does not work, and this one directly compiles from the original x264 source): \url{https://github.mit.edu/riscy/x264}, branch \code{riscv}.
\end{itemize}
The following pre-built BBLs are private to MIT-CSG lab on the AWS EFS:
\begin{itemize}
\item PARSEC (including x264): \code{/efs/szzhang/run/parsec\_bbls/normal}.
\item GAP: \code{/efs/szzhang/run/gapbs\_bbls/normal}.
\item SPECCINT2006: \code{/efs/szzhang/run/bbls} (one input per benchmark), and \\ \code{/efs/szzhang/run/ref\_bbls} (all ref inputs).
\end{itemize}
\bibliographystyle{plain}
\bibliography{ref}
\end{document}