llvm-irc.txt
11:15 Joining chat room...
Started talking in maya on Friday 02/07/2010 11:15:51 AM
Room topic is: uniwide is the shits!
11:16 davidt_ Hi all!
davidt_ rl: Hey, are you there?
11:25 rl hey davidt_
11:27 davidt_ ahh cool
11:28 davidt_ I've been working on improving the aliasing in the llvm backend
11:29 rl good!
davidt_ but, I don't think there is any improvement in the code you sent me
davidt_ sadly
rl hmm
11:30 davidt_ what was it that you were hoping to see changed? Simon and I thought you were hoping that with more alias info LLVM could reorder some instructions and improve the register allocation to get rid of the spills and reloads
davidt_ is that right?
rl lemme find the code
davidt_ k
11:32 rl ok got it
rl davidt_: here is an example
11:33 rl hmm, sec
11:36 rl I ought to look at the entire loop actually
rl but
11:37 rl mostly it's about spills and reloads, that's right
davidt_ ok sure
rl davidt_: one thing where llvm is confused is this
11:38 rl aaargh damn x86 assembly
11:39 davidt_ hehhe, yeah it really hurts the brain with all the addressing modes
11:41 rl but unless I'm mistaken it often does this: load [Sp + n] into a reg, use that reg, then spill that reg somewhere, then reload that reg, modify it and then store it in [Sp + n]
11:42 rl or rather load, mod, spill, reload, store
11:43 rl right, here's an example from the code
rl movl (%ebp), %eax
rl movl %eax, 28(%esp) ## 4-byte Spill
rl ...
rl incl 28(%esp) ## 4-byte Folded Spill
rl ...
rl movl 28(%esp), %ecx ## 4-byte Reload
rl ...
rl movl %ecx, (%ebp)
11:44 rl this is not really optimal
11:45 rl I suspect it does that because it thinks that other stores can modify (%ebp) (which is [Sp + 0])
davidt_ ok, one sec, need to browse through a few assembly files to see if this has changed with my changes
11:46 rl although now I can't see why it should think that any more
11:50 davidt_ yeah that behavior is still occurring
rl hmmm
11:51 davidt_ one thing I did though, to test how much room for improvement there was from giving llvm more alias info
davidt_ was simply to modify llvm's alias analysis pass to always return NoAlias
davidt_ and then see how that affected the generated code
rl and?
11:52 davidt_ still same behaviour
davidt_ code is very similar still, still has the spills and reloads
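(Aside: the NoAlias experiment above amounts to forcing every query of the kind sketched below to answer NoAlias; a rough illustration in the old-style LLVM IR quoted later in this log, with invented value names. If the two locations provably cannot overlap, the optimiser is free to assume the store does not clobber the later load.)
    %slot  = getelementptr i32* %Sp, i32 19   ; I32[Sp + 76]
    %field = getelementptr i32* %R1, i32 1    ; I32[R1 + 4]
    store i32 %x, i32* %field
    %v = load i32* %slot                      ; reorderable only if the store is NoAlias with %slot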
11:53 rl hmm
rl well, let's forget about aliasing for a second
rl this is the Cmm code:
rl _sges::I32 = _sbe6::I32 - 1;
rl _sget::I32 = I32[Sp + 76] + 1;
rl _sgeu::I32 = I32[Sp + 72] + 1;
rl _sgev::I32 = _sbdY::I32 - 1;
rl _sgew::I32 = I32[Sp + 48] + 1;
rl _sgex::I32 = _sbdT::I32 - 1;
rl _sgey::I32 = I32[Sp + 24] + 1;
rl _sgez::I32 = _sbdO::I32 - 1;
rl _sgeA::I32 = I32[Sp + 0] + 1;
rl I32[Sp + 96] = _sges::I32;
rl I32[Sp + 76] = _sget::I32;
rl I32[Sp + 72] = _sgeu::I32;
rl I32[Sp + 52] = _sgev::I32;
rl I32[Sp + 48] = _sgew::I32;
rl I32[Sp + 28] = _sgex::I32;
rl I32[Sp + 24] = _sgey::I32;
rl I32[Sp + 4] = _sgez::I32;
rl I32[Sp + 0] = _sgeA::I32;
11:54 rl this is what llvm generates:
rl movl (%ebp), %eax
rl movl %eax, 28(%esp) ## 4-byte Spill
rl movl 24(%ebp), %ecx
rl movl 76(%ebp), %ebx
rl decl 40(%esp) ## 4-byte Folded Spill
rl movl 72(%ebp), %edi
rl movl 48(%ebp), %edx
rl incl 28(%esp) ## 4-byte Folded Spill
rl movl 40(%esp), %eax ## 4-byte Reload
rl incl %ebx
rl incl %edi
rl incl %edx
rl incl %ecx
rl movl %eax, 96(%ebp)
rl movl %ebx, 76(%ebp)
rl movl 36(%esp), %ebx ## 4-byte Reload
rl movl %edi, 72(%ebp)
rl movl 16(%esp), %edi ## 4-byte Reload
rl decl %ebx
rl decl %edi
rl movl %ebx, 52(%ebp)
rl movl %edx, 48(%ebp)
rl movl 32(%esp), %edx ## 4-byte Reload
rl movl %edi, 28(%ebp)
rl movl %ecx, 24(%ebp)
rl movl 28(%esp), %ecx ## 4-byte Reload
rl decl %edx
rl movl %edx, 4(%ebp)
rl movl %ecx, (%ebp)
11:55 rl I would have thought that this would improve if you just use getelemwhatever instead of explicit pointer arithmetic
11:56 rl I mean, it's a bit ridiculous for llvm to generate this for what should be a couple of inc/dec instructions
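(Aside: a hedged sketch, with invented value names, of the two ways of addressing I32[Sp + 76] that rl is contrasting: going through integer arithmetic hides the base-plus-offset structure from the alias analysis, while getelementptr keeps it visible.)
    ; explicit pointer arithmetic:
    %a = ptrtoint i32* %Sp to i32
    %b = add i32 %a, 76
    %p = inttoptr i32 %b to i32*
    %v = load i32* %p
    ; the same slot via getelementptr (offset counted in i32 units):
    %q = getelementptr i32* %Sp, i32 19
    %w = load i32* %q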
11:57 davidt_ one sec, I'll send you an email with the llvm code I now generate and the assembly now generated
rl ok
12:01 rl davidt_: btw, *ideally* i'd like llvm to notice that we're actually working with arrays and simply use pointers in the loop instead of base address + counter
12:03 rl but I guess we'll just have to fix that in the haskell code
12:04 -> K_1 has joined maya
12:09 davidt_ ok nearly done, just had to find the right loop again which took some time
12:10 rl
12:16 davidt_ ok sent
12:17 davidt_ i'm off to lunch now sorry, so if you feel like waiting around i'll be back in about 30 - 45 min
davidt_ or just send me an email
davidt_ the other thing is, Simon thinks the fabled new code generator for GHC should bring some good improvements to this code
rl davidt_: ok, I'll look at it and I'll probably be here in an hour or so (food now!)
12:18 davidt_ since it binds stack variables and free variables to local cmm variables at the start of a function instead of just loading them each time
davidt_ so it'll have something like x = I32[Sp + 20] at the start
davidt_ instead of pasting I32[Sp + 20] everywhere
01:06 davidt_ ok im back
01:22 rl davidt_: in the code that you sent me, what is %sh8J?
01:24 davidt_ %sh8J = alloca i32, i32 1
davidt_ the optimiser will remove the stack allocation though and convert it to SSA form
01:25 rl ic
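(Aside: roughly what "convert it to SSA form" means here: LLVM's mem2reg pass promotes such allocas to plain SSA values, so the store/load pair below simply disappears; names taken from the surrounding snippets.)
    ; before mem2reg:
    %sh8J = alloca i32, i32 1
    store i32 %num6, i32* %sh8J
    %numv = load i32* %sh8J
    ; after mem2reg the alloca is gone and every use of %numv is replaced by %num6.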
01:26 rl this is the llvm code:
rl store i32 %num6, i32* %sh8J
rl ...
rl %numv = load i32* %sh8J
rl %numw = load i32** %Sp_Var
rl %numx = getelementptr i32* %numw, i32 0
rl store i32 %numv, i32* %numx
01:27 rl anyway
rl I completely fail to understand why llvm feels compelled to generate this:
01:28 rl movl (%ebp), %eax
rl movl %eax, 16(%esp) # 4-byte Spill
rl ...
rl incl 16(%esp) # 4-byte Folded Spill
rl movl 16(%esp), %eax # 4-byte Reload
rl movl %eax, (%ebp)
rl this is awful
01:29 rl actually, it's even worse
rl it starts off with this:
rl movl (%ebp), %eax
rl movsd %xmm2, 8(%ecx,%eax,8)
rl movl (%ebp), %eax
01:30 rl barring an utterly broken optimiser, the only explanation is that it thinks that the store might alias (%ebp) and thus has to reload it
01:31 rl davidt_: this is actually exactly the kind of problem that alias annotation should solve
01:32 rl is it perhaps not running some pass necessary to make use of aliasing info?
01:33 davidt_ i dont think so, I had a little look at the optimisation passes and it seems like it should
rl hmm
davidt_ the code I sent you I simply used the -O3 class
davidt_ so I cant see how that wouldn't use alias analysis
01:34 rl this last bit of assembly is *very* weird then
davidt_ and you can add a flag to llvm to get it to print out the alias analysis information, when it's accessed,
davidt_ and that prints out sane looking stuff
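(Aside: the exact flag isn't named in the log; presumably something along the lines of opt's alias-analysis evaluator, e.g. the invocation below, where Loop.ll is a made-up file name and the flag spelling may differ between LLVM versions.)
    opt -basicaa -aa-eval -print-all-alias-modref-info -disable-output Loop.ll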
01:35 rl my nose tells me that if you can get rid of the second load here
rl things ought to improve significantly
01:36 davidt_ second load where sorry?
rl movl (%ebp), %eax
rl movsd %xmm2, 8(%ecx,%eax,8)
rl movl (%ebp), %eax
davidt_ k sure, I'm not that sure what I can change in the code I generate to improve the situation though
rl it loads (%ebp), stores something, then loads (%ebp) again
01:37 rl it must be thinking that the store might modify (%ebp)
rl otherwise it would eliminate the second load (I hope)
01:42 rl davidt_: what does this mean: %nulE = bitcast i32* %nulD to i32*
davidt_ thats a redundant instruction
davidt_ i'll fix that up at some point but it shouldn't cause any problems
01:43 davidt_ the optimiser will just drop it
rl ok
01:44 rl anyway, there must be some way to find out why it isn't eliminating the second load in the assembly snippet
davidt_ sure but I think we may need to involve the llvm developers to figure out why
rl imo it would be worthwhile to chase that because it seems to be a symptom of a deeper problem
01:45 rl and I suspect that we might get much better code if we solve that
01:46 rl davidt_: perhaps make the llvm code for the loop self-contained and ask?
01:47 davidt_ yeah, i'll have to do something like that
davidt_ try to minimise it as much as possible as well
rl yeah that would be cool
davidt_ i don't think it will go down that well if i just send them 100 lines of assembly
rl wimps
01:48 davidt_ and it'll be much easier for us to play around with then anyway
01:49 davidt_ i'm wondering if this cmm line is the cause of all the troubles though:
davidt_ F64[I32[R1 + 4] + 8 + (I32[Sp + 0] << 3)] = _sger::F64;
rl hmm looks ok to me
01:51 davidt_ well, I don't really have anything more than that :), I guess I'll start on minimising the test case
rl davidt_: you could try to move that store past all the loads in the llvm code and see what happens
01:52 rl or rather move all stores right to the end
01:55 davidt_ will try, one sec...
01:58 rl davidt_: actually, just moving that one store might be enough
02:00 davidt_ this is what the llvm code looks like after the optimiser has gotten through with it: http://gist.github.com/461320
davidt_ it does just leave that one store, I'll try moving it now
02:03 rl hmm
02:05 rl I'm beginning to wonder if llvm's reg allocator just sucks horribly
02:10 davidt_ i've had that suspicion before as well
02:11 davidt_ ok output when the store is moved: http://gist.github.com/461332
02:12 davidt_ doesn't seem any better to me
02:13 rl well, at least it doesn't load (%ebp) twice
02:14 rl it's still doing weird things though
02:15 rl 24(%ebp) -> eax -> 20(%esp) -> +1 -> edx -> 24(%ebp)
02:16 rl ok, then it's just the register allocator that sucks
davidt_ i'll try now quickly using the other allocator
davidt_ that PBQP one
02:17 rl perhaps we could benl23 to contribute his allocator to llvm
rl *get
02:18 davidt_ hehe yeah that would be cool
02:19 davidt_ i don't know much about allocators, but it has always surprised me that LLVM's main allocator is a linear scan allocator; I know it's good for JIT since it's fast, but the biggest llvm users are static compilers currently
02:20 rl well implementing allocators in C is hard work
02:21 rl perhaps benl23 can rewrite the whole thing in Haskell while he's at it
02:22 davidt_ well as long as benl23 is doing the work I'm happy to propose him for anything
02:24 rl that's the right attitude!
02:27 davidt_ the different allocator still makes very little difference; the movl (%ebp), %eax -> movsd %xmm2, 8(%ecx,%eax,8) -> movl (%ebp), %eax sequence is still there, for example
02:28 rl davidt_: are all your aliasing patches in the head now?
davidt_ yes
02:30 rl I'll try to look at a couple of small loops over the weekend, perhaps I can spot something
rl 4 eyes are better than 2
davidt_ indeed, thanks for the help. I'll try to reduce the example we have down to something more manageable and then we can get the llvm developers involved
rl it's fairly clear that the assembly that llvm generates at least in this case is horrible
02:31 rl it's interesting that it seems to reorder instructions quite extensively, and apparently that happens *before* the spills are introduced
02:32 rl perhaps it's doing instruction scheduling first, then reg allocation and then everything falls over because of the high register pressure
02:33 rl but that would be a bit stupid
02:35 davidt_ http://llvm.org/docs/CodeGenerator.html#high-level-design
02:36 rl hah
02:37 davidt_
rl it would explain a lot if it doesn't take into account how many live registers it produces during insn scheduling
02:40 rl it picks a nice ordering for the insns and then it finds out that it doesn't have enough registers and then it has to spill/reload and then we get slow code
02:41 rl whereas if it moved loads to where the data is actually needed and then scheduled taking the number of available regs into account we'd get fast code
02:42 davidt_ yeah I can see that happening, but you'd think that LLVM would be smarter about it
02:44 rl The code generator is based on the assumption that the instruction selector will use an optimal pattern matching selector to create high-quality sequences of native instructions. Alternative code generator designs based on pattern expansion and aggressive iterative peephole optimization are much slower. This design permits efficient compilation (important for JIT environments) and aggressive optimization (used when generating code offline) by allowing
rl components of varying levels of sophistication to be used for any step of compilation.
02:47 rl of course optimal scheduling is NP-complete even without the reg stuff, but still...
02:48 davidt_ so.... can we mark this down as mystery solved perhaps?
02:51 rl well this still doesn't explain why it loads (%ebp) twice in that snippet
rl there was a list of llvm passes somewhere...
02:53 davidt_ http://llvm.org/docs/Passes.html
rl cheers
02:56 rl davidt_: so GHC generates this
02:57 rl _sges::I32 = _sbe6::I32 - 1;
rl _sget::I32 = I32[Sp + 76] + 1;
rl _sgeu::I32 = I32[Sp + 72] + 1;
rl _sgev::I32 = _sbdY::I32 - 1;
rl _sgew::I32 = I32[Sp + 48] + 1;
rl _sgex::I32 = _sbdT::I32 - 1;
rl _sgey::I32 = I32[Sp + 24] + 1;
rl _sgez::I32 = _sbdO::I32 - 1;
rl _sgeA::I32 = I32[Sp + 0] + 1;
rl I32[Sp + 96] = _sges::I32;
rl I32[Sp + 76] = _sget::I32;
rl I32[Sp + 72] = _sgeu::I32;
rl I32[Sp + 52] = _sgev::I32;
rl I32[Sp + 48] = _sgew::I32;
rl I32[Sp + 28] = _sgex::I32;
rl I32[Sp + 24] = _sgey::I32;
rl I32[Sp + 4] = _sgez::I32;
rl I32[Sp + 0] = _sgeA::I32;
rl I suspect llvm might generate much better code if the loads and stores were adjacent
rl ie
02:58 rl _sget::I32 = I32[Sp + 76] + 1;
rl I32[Sp + 76] = _sget::I32;
rl ...
rl you'd think it ought to do that on its own but apparently it doesn't
02:59 rl I'm not sure this is something we want to do, though
rl I thought llvm might have a separate pass for lowering loads but it doesn't seem to
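(Aside: the hope presumably being that with each load/store pair adjacent, the backend could fold the pair into a single read-modify-write instruction, roughly:)
    incl 76(%ebp)
    incl 72(%ebp)
instead of the separate load, spill, reload, store sequence seen above.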
03:02 rl davidt_: anyway, let me fiddle a bit over the weekend
03:03 rl football now
davidt_ thanks, let me know how it goes
davidt_ yes indeed
davidt_ time to load up the internet tv next to my zsh shell
rl hehe
03:04 rl have fun