forked from jserv/xv6-x86_64
README.CLS
CPU LOCAL STORAGE
This is essentially the same concept as thread-local storage, except
that in Xv6's case the storage is per-CPU rather than per-thread.
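To make the idea concrete, here is a minimal user-space sketch in C. It assumes a made-up two-slot layout (cpu in slot 0, proc in slot 1) and a made-up NCPU; the names cpu_local, cls_base and cls_switch_to are hypothetical and do not come from Xv6. Each CPU owns one block, and a base pointer stands in for the GS segment base, so the same code reaches its own CPU's slots without knowing its CPU number.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 4                      /* hypothetical CPU count */

struct cpu;                         /* opaque here; defined by the kernel */
struct proc;

/* One block of local storage per CPU (illustrative layout). */
struct cpu_local {
  struct cpu  *cpu;                 /* slot 0 */
  struct proc *proc;                /* slot 1 */
};

static struct cpu_local cls_blocks[NCPU];

/* Stand-in for the GS base. In a kernel each CPU would load its own base
   once at boot; here it is switched by hand to mimic another CPU. */
static struct cpu_local *cls_base = &cls_blocks[0];

static void cls_switch_to(int cpuid) {
  assert(cpuid >= 0 && cpuid < NCPU);
  cls_base = &cls_blocks[cpuid];
}

/* Same code on every CPU, different storage behind it. */
static struct proc *cls_get_proc(void) { return cls_base->proc; }
static void cls_set_proc(struct proc *p) { cls_base->proc = p; }
```

Setting proc after cls_switch_to(0) and reading it after cls_switch_to(1) yields two independent values, which is the whole point of per-CPU storage.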
GLOBAL REGISTER VARIABLES
-------------------------
Xv6:32 uses this gcc trick to generate GS:-relative accesses to a few
globals for cpu-local storage:
extern struct cpu *cpu asm("%gs:0");
Sadly this does not work on x86-64: gcc generates a pc-relative load
instead, and various unhappiness results. For this and each of the
other options I explored, I looked at the code generated for a common
expression that uses a structure from cpu-local storage:
if (proc->killed) ...
with asm("%gs:4") on i386
: 65 a1 04 00 00 00 mov %gs:0x4,%eax
: 8b 40 24 mov 0x24(%eax),%eax
: 85 c0 test %eax,%eax
with asm("%gs:8") on x86-64
: 65 48 8b 05 08 00 00 mov %gs:0x8(%rip),%rax
: 00
: 8b 40 50 mov 0x50(%rax),%eax
: 85 c0 test %eax,%eax
This results in rax = [ gs + rip + 8 ] which is never what we want...
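The difference between the two addressing forms can be spelled out as plain arithmetic. The sketch below is a user-space illustration only: the helper names are hypothetical, the base and instruction addresses are made-up numbers, and it relies on the fact that rip-relative addressing adds the address of the next instruction to the displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Address read by the i386 form %gs:disp: segment base plus displacement. */
static uint64_t ea_absolute(uint64_t gs_base, int32_t disp) {
  return gs_base + (uint64_t)(int64_t)disp;
}

/* Address read by the x86-64 form %gs:disp(%rip): the address of the
   next instruction gets added in as well. */
static uint64_t ea_rip_relative(uint64_t gs_base, uint64_t next_rip,
                                int32_t disp) {
  return gs_base + next_rip + (uint64_t)(int64_t)disp;
}

/* Worked example with made-up numbers. */
static void ea_example(void) {
  assert(ea_absolute(0x1000, 8) == 0x1008);
  assert(ea_rip_relative(0x1000, 0x400000, 8) == 0x401008);
}
```

The second address moves with the location of the instruction itself, so it can never name a fixed slot in local storage.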
With -O1, in both cases the mov and test are combined into something like
: 65 a1 04 00 00 00 mov %gs:0x4,%eax
: 83 78 24 00 cmpl $0x0,0x24(%eax)
__THREAD MODIFIER
-----------------
gcc supports a construct for thread-local variables:
extern __thread struct cpu *cpu;
with __thread and -mtls-direct-seg-refs on x86-64
: 48 c7 c0 f8 ff ff ff mov $0xfffffffffffffff8,%rax
: 64 48 8b 00 mov %fs:(%rax),%rax
: 8b 40 50 mov 0x50(%rax),%eax
: 85 c0 test %eax,%eax
with __thread and -mtls-direct-seg-refs on i386
: b8 fc ff ff ff mov $0xfffffffc,%eax
: 65 8b 00 mov %gs:(%eax),%eax
: 8b 40 24 mov 0x24(%eax),%eax
: 85 c0 test %eax,%eax
The choice of segment register appears to be baked into gcc and
depends only on the compilation mode: gs for 32-bit, fs for 64-bit.
The linker generates a TLS segment:
Type   Offset             VirtAddr           PhysAddr
       FileSiz            MemSiz             Flags  Align
LOAD   0x0000000000001000 0xffffffff80100000 0x0000000000100000
       0x000000000000da00 0x0000000000016778 RWE    1000
TLS    0x000000000000ea00 0xffffffff8010da00 0x000000000010da00
       0x0000000000000000 0x0000000000000010 R      8
and TLS symbols:
168: 0000000000000008 8 TLS GLOBAL DEFAULT 5 proc
233: 0000000000000000 8 TLS GLOBAL DEFAULT 5 cpu
In this model I just point fs (or gs) at the top of a page of local
storage space I allocate (since I only have a handful of local items
to track).
These are a bit less convenient because of the negative indexing and
the fact that you're at the compiler and linker's whim for where things
end up. Also they require longer (and probably slower) instruction
sequences to allow the local index to be patched up by the linker.
Lack of control over which segment register is used is a further
downside.
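The negative indices in the listings follow from the TLS block being placed just below the address the segment base points at. Here is a small sketch of that arithmetic, assuming the usual local-exec rule on x86 that a variable's offset is its symbol value minus the TLS segment size rounded up to the segment alignment. The helper tls_offset is hypothetical and not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to a multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t n, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (n + align - 1) & ~(align - 1);
}

/* Offset of a TLS symbol relative to the segment base (fs or gs). */
static int64_t tls_offset(uint64_t sym_value, uint64_t memsz,
                          uint64_t align) {
  return (int64_t)sym_value - (int64_t)round_up(memsz, align);
}
```

With the values shown above (MemSiz 0x10, Align 8), proc at symbol value 8 lands at -8, which is the 0xfffffffffffffff8 in the 64-bit listing, and cpu at symbol value 0 lands at -16. Assuming a 32-bit build with an 8-byte TLS segment and 4-byte alignment (not shown here), proc at symbol value 4 would land at -4, matching the 0xfffffffc in the 32-bit listing.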
MACROS AND INLINE ASSEMBLY
--------------------------
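/* Read the word at offset n from the GS base (n must be a literal). */
#define __local_get(n) ({ \
  uint64 res; \
  asm ("mov %%gs:" #n ",%0" : "=r" (res)); \
  res; \
})
/* Store v at offset n from the GS base. */
#define __local_put(n, v) ({ \
  uint64 val = (v); \
  asm ("mov %0, %%gs:" #n : : "r" (val)); \
})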
#define __proc() ((struct proc*) __local_get(4))
if (__proc()->killed) ...
x86-64 without optimization:
: 65 48 8b 04 25 08 00 mov %gs:0x8,%rax
: 00 00
: 48 89 45 d0 mov %rax,-0x30(%rbp)
: 48 8b 45 d0 mov -0x30(%rbp),%rax
: 8b 40 50 mov 0x50(%rax),%eax
: 85 c0 test %eax,%eax
x86-64 with -O1:
: 65 48 8b 04 25 08 00 mov %gs:0x8,%rax
: 00 00
: 83 78 50 00 cmpl $0x0,0x50(%rax)
i386 without optimization:
: 65 8b 1d 04 00 00 00 mov %gs:0x4,%ebx
: 89 5d f4 mov %ebx,-0xc(%ebp)
: 8b 45 f4 mov -0xc(%ebp),%eax
: 8b 40 24 mov 0x24(%eax),%eax
: 85 c0 test %eax,%eax
i386 with -O1:
: 65 a1 04 00 00 00 mov %gs:0x4,%eax
: 83 78 24 00 cmpl $0x0,0x24(%eax)
These are less efficient than the others when compiling unoptimized
(though that's an unusual state), but with optimization they cost no
more than the global register variable trick originally used, and they
have the benefit of generating correct code in both 32-bit and 64-bit
modes.
They do have the downside that you can't use one construct
for both setting and getting the contents of a local storage
variable.
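One way to soften this downside is to generate a matched getter/setter pair per variable, so call sites at least share a name. What follows is only a sketch under stated assumptions: CLS_ACCESSORS, sim_local_get and sim_local_put are hypothetical names, and a byte array stands in for the page GS points at so the code runs in user space. In the kernel the two sim_ helpers would presumably be replaced by __local_get and __local_put.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t uint64;

/* Stand-in for the page of local storage that GS would point at. */
static unsigned char sim_page[64];

/* User-space substitutes for __local_get / __local_put. */
static uint64 sim_local_get(size_t off) {
  uint64 res;
  assert(off + sizeof res <= sizeof sim_page);
  memcpy(&res, sim_page + off, sizeof res);
  return res;
}

static void sim_local_put(size_t off, uint64 v) {
  assert(off + sizeof v <= sizeof sim_page);
  memcpy(sim_page + off, &v, sizeof v);
}

/* Emit get_<name>() and set_<name>() for one pointer-typed slot. */
#define CLS_ACCESSORS(type, name, off) \
  static inline type get_##name(void) { \
    return (type)(uintptr_t)sim_local_get(off); \
  } \
  static inline void set_##name(type v) { \
    sim_local_put(off, (uint64)(uintptr_t)v); \
  }

struct cpu;
struct proc;

CLS_ACCESSORS(struct cpu *, cpu, 0)    /* slot offsets as on x86-64 */
CLS_ACCESSORS(struct proc *, proc, 8)
```

This still is not a single construct for both directions, but get_proc() and set_proc(p) keep the slot offset in one place instead of scattering it across call sites.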