-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathmain-vignette.html
317 lines (284 loc) · 313 KB
/
main-vignette.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="Olivier François" />
<meta name="date" content="2018-01-15" />
<title>Simulating phenotypes and evaluating GWAS methods with naturalgwas</title>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20both%3B%0Amargin%3A%200%200%2010px%2010px%3B%0Apadding%3A%204px%3B%0Awidth%3A%20400px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Aborder%2Dradius%3A%205px%3B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Afont%2Dsize%3A%2013px%3B%0Aline%2Dheight%3A%201%2E3%3B%0A%7D%0A%23TOC%20%2Etoctitle%20%7B%0Afont%2Dweight%3A%20bold%3B%0Afont%2Dsize%3A%2015px%3B%0Amargin%2Dleft%3A%205px%3B%0A%7D%0A%23TOC%20ul%20%7B%0Apadding%2Dleft%3A%2040px%3B%0Amargin%2Dleft%3A%20%2D1%2E5em%3B%0Amargin%2Dtop%3A%205px%3B%0Amargin%2Dbottom%3A%205px%3B%0A%7D%0A%23TOC%20ul%20ul%20%7B%0Amargin%2Dleft%3A%20%2D2em%3B%0A%7D%0A%23TOC%20li%20%7B%0Aline%2Dheight%3A%2016px%3B%0A%7D%0Atable%20%7B%0Amargin%3A%201em%20auto%3B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dcolor%3A%20%23DDDDDD%3B%0Aborder%2Dstyle%3A%20outset%3B%0Aborder%2Dcollapse%3A%20collapse%3B%0A%7D%0Atable%20th%20%7B%0Aborder%2Dwidth%3A%202px%3B%0Apadding%3A%205px%3B%0Aborder%2Dstyle%3A%20inset%3B%0A%7D%0Atable%20td%20%7B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dstyle%3A%20inset%3B%0Aline%2Dheight%3A%2018px%3B%0Apadding%3A%205px%205px%3B%0A%7D%0Atable%2C%20table%20th%2C%20table%20td%20%7B%0Aborder%2Dleft%2Dstyle%3A%20none%3B%0Aborder%2Dright%2Dstyle%3A%20none%3B%0A%7D%0Atable%20thead%2C%20table%20tr%2Eeven%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Ap%20%7B%0Amargin%3A%200%2E5em%200%3B%0A%7D%0Ablockquote%20%7B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Apadding%3A%200%2E25em%200%2E75em%3B%0A%7D%0Ahr%20%7B%0Aborder%2Dstyle%3A%20solid%3B%0Aborder%3A%20none%3B%0Aborder%2Dtop%3A%201px%20solid%20%23777%3B%0Amargin%3A%2028px%200%3B%0A%7D%0Adl%20%7B%0Amargin%2Dleft%3A%200%3B%0A%7D%0Adl%20dd%20%7B%0Amargin%2Dbottom%3A%2013px%3B%0Amargin%2Dleft%3A%2013px%3B%0A%7D%0Adl%20dt%20%7B%0Afont%2Dweight%3A%20bold%3B%0A%7D%0Aul%20%7B%0Amargin%2Dtop%3A%200%3B%0A%7D%0Aul%20li%20%7B%0Alist%2Dstyle%3A%20circle%20outside%3B%0A%7D%0Aul%20ul%20%7B%0Amargin%2Dbottom%3A%200%3B%0A%7D%0Apre%2C%20code%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0Aborder%2Dradius%3A%203px%3B%0Acolor%3A%20%23333%3B%0Awhite%2Dspace%3A%20pre%2Dwrap%3B%20%0A%7D%0Apre%20%7B%0Aborder%2Dradius%3A%203px%3B%0Amargin%3A%205px%200px%2010px%200px%3B%0Apadding%3A%2010px%3B%0A%7D%0Apre%3Anot%28%5Bclass%5D%29%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Acode%20%7B%0Afont%2Dfamily%3A%20Consolas%2C%20Monaco%2C%20%27Courier%20New%27%2C%20monospace%3B%0Afont%2Dsize%3A%2085%25%3B%0A%7D%0Ap%20%3E%20code%2C%20li%20%3E%20code%20%7B%0Apadding%3A%202px%200px%3B%0A%7D%0Adiv%2Efigure%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0Aimg%20%7B%0Abackground%2Dcolor%3A%20%23FFFFFF%3B%0Apadding%3A%202px%3B%0Aborder%3A%201px%20solid%20%23DDDDDD%3B%0Aborder%2Dradius%3A%203px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Amargin%3A%200%205px%3B%0A%7D%0Ah1%20%7B%0Amargin%2Dtop%3A%200%3B%0Afont%2Dsize%3A%2035px%3B%0Aline%2Dheight%3A%2040px%3B%0A%7D%0Ah2%20%7B%0Aborder%2Dbottom%3A%204px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Apadding%2Dbottom%3A%202px%3B%0Afont%2Dsize%3A%20145%25%3B%0A%7D%0Ah3%20%7B%0Aborder%2Dbottom%3A%202px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Afont%2Dsize%3A%20120%25%3B%0A%7D%0Ah4%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23f7f7f7%3B%0Amargin%2Dleft%3A%208px%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Ah5%2C%20h6%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23ccc%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Aa%20%7B%0Acolor%3A%20%230033dd%3B%0Atext%2Ddecoration%3A%20none%3B%0A%7D%0Aa%3Ahover%20%7B%0Acolor%3A%20%236666ff%3B%20%7D%0Aa%3Avisited%20%7B%0Acolor%3A%20%23800080%3B%20%7D%0Aa%3Avisited%3Ahover%20%7B%0Acolor%3A%20%23BB00BB%3B%20%7D%0Aa%5Bhref%5E%3D%22http%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0Aa%5Bhref%5E%3D%22https%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0A%0Acode%20%3E%20span%2Ekw%20%7B%20color%3A%20%23555%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Edt%20%7B%20color%3A%20%23902000%3B%20%7D%20%0Acode%20%3E%20span%2Edv%20%7B%20color%3A%20%2340a070%3B%20%7D%20%0Acode%20%3E%20span%2Ebn%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Efl%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Ech%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Est%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Eco%20%7B%20color%3A%20%23888888%3B%20font%2Dstyle%3A%20italic%3B%20%7D%20%0Acode%20%3E%20span%2Eot%20%7B%20color%3A%20%23007020%3B%20%7D%20%0Acode%20%3E%20span%2Eal%20%7B%20color%3A%20%23ff0000%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Efu%20%7B%20color%3A%20%23900%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%20code%20%3E%20span%2Eer%20%7B%20color%3A%20%23a61717%3B%20background%2Dcolor%3A%20%23e3d2d2%3B%20%7D%20%0A" rel="stylesheet" type="text/css" />
</head>
<body>
<h1 class="title toc-ignore">Simulating phenotypes and evaluating GWAS methods with naturalgwas</h1>
<h4 class="author"><em>Olivier François</em></h4>
<h4 class="date"><em>2018-01-15</em></h4>
<div id="summary" class="section level4">
<h4>Summary</h4>
<p>Association studies of polygenic traits with genomic loci are notoriously difficult when those studies are conducted at large geographical scales. The difficulty arises as genotype frequencies often vary in geographic space and across distinct environments in a complex fashion. Those large-scale variations are known to yield false positives in association testing approaches. The R package <strong>naturalgwas</strong> allows researchers to simulate trait association from observed genotypes, modelling gene by environment interactions where environment is derived from a climate database. The R package includes an oracle method correcting false associations by using the true set of confounding variables. The simulated data and the oracle method can be used 1) to compare the ability of association methods to correctly remove confounding factors for a particular data set, 2) to evaluate power to detect causal variants, and 3) to assess the influence of various parameters including the number of causal variants, effect sizes and gene by environment interaction.</p>
<hr />
</div>
<div id="introduction" class="section level4">
<h4>Introduction</h4>
<p>This vignette presents a short tutorial on how to simulate phenotypes for “natural” organisms, and evaluate genome-wide association studies (GWAS) when field sampling has been performed at a large geographical scale. We illustrate the simulation approach with some phenotypic traits simulated from data of the plant species <em>Arabidopsis thaliana</em> (Atwell et al. 2010). To start with the R package <strong>naturalgwas</strong>, load the R functions in memory space as follows.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co">#install.packages("cate")</span>
<span class="co">#devtools::install_github("bcm-uga/lfmm")</span>
<span class="kw">library</span>(naturalgwas)</code></pre></div>
<div id="data-files" class="section level5">
<h5>Data files</h5>
<p>Running the main simulation function <strong>simu_pheno</strong>, and running the <strong>oracle</strong> method requires three files as input to the program: 1) a file encoding individual genotypes, 2) a file with individual geographic coordinates, and 3) a genetic map. Those data must be loaded in the R environment. The simulation could also work without geographic data by specifying a environmental variable. For example, we considered SNP data from the chromosome 5 of European accession/ecotypes of the plant species <em>A. thaliana</em>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">data</span>(A.thaliana)
genotype <-<span class="st"> </span>A.thaliana$genotype
chrpos <-<span class="st"> </span>A.thaliana$chrpos </code></pre></div>
<p>For SNPs, the <strong>genotype</strong> matrix encodes each individual genotype in a row. Each locus corresponds to a specific column. Genotypes are encoded as 0,1,2 for diploids, and 0,1 for haploids. Those numbers represent the number of reference or derived allele at each particular locus. <em>A. thaliana</em> is a diploid species with very high levels of inbreeding. In our example, 162 genotypes were encoded as haploids considering 53,859 SNPs from chromosome 5 (Atwell et al. 2010). Let’s print some genotypic values for three individuals at the first ten loci. We have a matrix of size 3 times 10 filled with 0 and 1 values.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dim</span>(genotype)</code></pre></div>
<pre><code>## [1] 162 53859</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">genotype[<span class="kw">sample</span>(<span class="dv">162</span>, <span class="dv">3</span>),<span class="dv">1</span>:<span class="dv">10</span>]</code></pre></div>
<pre><code>## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## [1,] 1 1 1 1 1 1 1 0 1 1
## [2,] 0 0 1 1 1 1 0 0 1 1
## [3,] 0 1 1 1 1 1 1 0 1 1</code></pre>
<p>The <strong>coordinate</strong> variable is a two-column matrix that contains longitude and latitude for each individual in the sample. Longitude (°E) and latitude (°N) must be encoded in the decimal format. Note that headers should be ignored when loading data into the R program. Let us show the location of the sampling sites in Europe (each point corresponds to an individual in the sample).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(maps)
coordinates <-<span class="st"> </span>A.thaliana$coord
<span class="kw">plot</span>(coordinates,
<span class="dt">pch =</span> <span class="dv">19</span>, <span class="dt">cex =</span> .<span class="dv">7</span>, <span class="dt">col =</span> <span class="st">"green4"</span>,
<span class="dt">xlab =</span> <span class="st">"Longitude (°E)"</span>,
<span class="dt">ylab =</span> <span class="st">"Latitude (°N)"</span>)
<span class="kw">map</span>(<span class="dt">add =</span> T, <span class="dt">interior =</span> <span class="ot">FALSE</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
</div>
<div id="simulating-artificial-phenotypes-from-real-genotypes" class="section level5">
<h5>Simulating artificial phenotypes from real genotypes</h5>
<p>For simulating artificial phenotypes from real genotypes and accounting for genotype by environment interactions, we need 1) a proxy variable for the enviromnent, 2) a reference set of weakly linked causal loci for which we wish to simulate trait association, 3) and a set of confounder factors typically arising from complex population structure and isolation-by-distance.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> env <-<span class="st"> </span><span class="kw">get_climate</span>(coordinates)
ref.set <-<span class="st"> </span><span class="kw">create_refset</span>(chrpos, <span class="dt">window =</span> <span class="dv">501</span>)
confounder <-<span class="st"> </span><span class="kw">create_factor</span>(genotype, <span class="dt">K =</span> <span class="dv">10</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>The above commands create an artificial environmental variable by combining 19 bioclimatic variables extracted from the ‘worldclim’ database, and they take a few minutes to download. The SNP reference set consists of picking one representant SNP in each window of size <strong>window</strong>. In the third line, 10 confounders are computed from the data. The confounders correspond to the first ten principal components of the genotype matrix using PCA from LEA (Frichot and Francois 2015). The PCA screeplot for the SNP data is displayed.</p>
<p>In addition, the <strong>create_factor</strong> function estimates the amount of residual variation in the data and an order of magnitude for effect sizes.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> confounder$sigma </code></pre></div>
<pre><code>## [1] 81.91204</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> confounder$base </code></pre></div>
<pre><code>## [1] 93.90588</code></pre>
<p>Simulation of phenotypic traits requires the following arguments: 1) a SNP genotype matrix, 2) a set of confounders created with <strong>create_factor</strong>, an environmental variable, a reference set of weakly linked SNPs created with <strong>create_refset</strong>, a number of causal SNPs to be drawn from the reference set, and values for effect size and GxE interaction parameters.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> n.causal <-<span class="st"> </span><span class="dv">6</span>
eff.size <-<span class="st"> </span><span class="fl">5.0</span>
sim <-<span class="st"> </span><span class="kw">simu_pheno</span>(A.thaliana$genotype,
confounder,
env,
ref.set, <span class="dt">ncausal =</span> n.causal,
<span class="dt">effect.size =</span> eff.size, <span class="dt">gxe =</span> .<span class="dv">5</span>)
ss <-<span class="st"> </span><span class="kw">sum</span>(<span class="kw">apply</span>(genotype[,sim$causal.set], <span class="dv">2</span>, var))
ss*(eff.size)^<span class="dv">2</span>*confounder$base^<span class="dv">2</span>/(ss*(eff.size)^<span class="dv">2</span>*confounder$base^<span class="dv">2</span> +<span class="st"> </span>confounder$sigma^<span class="dv">2</span> +<span class="st"> </span>confounder$base^<span class="dv">2</span>)
(ss*(eff.size)^<span class="dv">2</span>*confounder$base^<span class="dv">2</span> +<span class="st"> </span>confounder$sigma^<span class="dv">2</span>)/(ss*(eff.size)^<span class="dv">2</span>*confounder$base^<span class="dv">2</span> +<span class="st"> </span>confounder$sigma^<span class="dv">2</span> +<span class="st"> </span>confounder$base^<span class="dv">2</span>)</code></pre></div>
<p>In this simulation, a polygenic trait is associated with 6 causal variants, sampled from the SNP reference set. Each causal variant is associated with an effect size of 5 units. Gene by environment interactions are modelled by using the <em>env</em> variable created from ‘worldclim’.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> <span class="kw">hist</span>(<span class="kw">scale</span>(sim$phenotype), <span class="dt">main =</span> <span class="st">"Trait value"</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>To check whether the simulated trait displays any geographic variation, the trait variable can interpolated on a geographic map as follows.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> <span class="kw">library</span>(fields)
fit =<span class="st"> </span>fields::<span class="kw">Krig</span>(coordinates, <span class="kw">scale</span>(sim$phenotype), <span class="dt">theta =</span> <span class="dv">10</span>, <span class="dt">m =</span> <span class="dv">2</span>)
<span class="kw">surface</span>(fit, <span class="dt">extrap =</span> <span class="ot">TRUE</span>, <span class="dt">xlab =</span> <span class="st">"Longitude"</span>, <span class="dt">ylab =</span> <span class="st">"Latitude"</span>, <span class="dt">levels =</span> <span class="kw">c</span>(<span class="dv">0</span>))
<span class="kw">map</span>(<span class="dt">add =</span> <span class="ot">TRUE</span>, <span class="dt">interior =</span> F)</code></pre></div>
<p><img src="" /><!-- --></p>
</div>
<div id="running-the-oracle-method" class="section level5">
<h5>Running the oracle method</h5>
<p>Our next step is to perform a GWAS for the simulated phenotype, and to evaluate the relative power of methods to detect causal variants. To this objective, we use the ‘linear regression’ method as an oracle algorithm. The method is very close to the simulation model, and it becomes an oracle method when the true confounders are provided in arguments.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pv.oracle <-<span class="st"> </span><span class="kw">oracle</span>(<span class="kw">scale</span>(sim$phenotype),
genotype,
confounder, <span class="dt">K =</span> <span class="dv">10</span>)$pv</code></pre></div>
<p>Now visualize the Manhattan plot for the oracle method. The vertical bars correspond to the causal variants in the simulation.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> <span class="kw">plot</span>(-<span class="kw">log10</span>(pv.oracle), <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span>, <span class="dt">main =</span> <span class="st">"Manhattan plot (Oracle)"</span>)
<span class="kw">points</span>(sim$causal, -<span class="kw">log10</span>(pv.oracle)[sim$causal], <span class="dt">type =</span> <span class="st">"h"</span>, <span class="dt">lty =</span> <span class="dv">1</span>, <span class="dt">col =</span> <span class="st">"blue"</span>)
<span class="kw">abline</span>(-<span class="kw">log10</span>(<span class="fl">0.05</span>/<span class="kw">length</span>(pv.oracle)), <span class="dv">0</span>, <span class="dt">lty =</span> <span class="dv">2</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
</div>
<div id="heritability-and-mixed-linear-models" class="section level5">
<h5>Heritability and mixed linear models</h5>
<p>In this section, we used a mixed model approach and the R package <strong>rrBLUP</strong> to provide an heritability estimate for the simulated trait.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(rrBLUP)
G <-<span class="st"> </span>A.thaliana$genotype
<span class="kw">rownames</span>(G) <-<span class="st"> </span>A.thaliana$ecotype.id[,<span class="dv">1</span>]
data <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">y =</span> <span class="kw">scale</span>(sim$phenotype),
<span class="dt">gid =</span> A.thaliana$ecotype.id[,<span class="dv">1</span>])
<span class="co">#predict breeding values and compute h2</span>
kb <-<span class="st"> </span><span class="kw">kin.blup</span>(<span class="dt">data =</span> data,
<span class="dt">geno=</span><span class="st">"gid"</span>,
<span class="dt">pheno =</span> <span class="st">"y"</span>,
<span class="dt">K =</span> <span class="kw">A.mat</span>(G))
<span class="kw">cat</span>(<span class="st">"Heritability = "</span>, kb$Vg/(kb$Vg +<span class="st"> </span>kb$Ve), <span class="st">"</span><span class="ch">\n</span><span class="st">"</span>)</code></pre></div>
<pre><code>## Heritability = 0.9366343</code></pre>
<p>To provide a better estimate for the K matrix, we could use a latent factor model as follows.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## fit latent factors using an LFMM
mod.lfmm <-<span class="st"> </span>lfmm::<span class="kw">lfmm_ridge</span>(<span class="dt">Y =</span> A.thaliana$genotype,
<span class="dt">X =</span> <span class="kw">scale</span>(sim$phenotype),
<span class="dt">K =</span> <span class="dv">6</span>)
Kinship <-<span class="st"> </span>mod.lfmm$U %*%<span class="st"> </span><span class="kw">t</span>(mod.lfmm$U)/<span class="kw">nrow</span>(mod.lfmm$U)
<span class="kw">rownames</span>(Kinship) <-<span class="st"> </span>A.thaliana$ecotype.id[,<span class="dv">1</span>]
## predict breeding values and compute h2
kb <-<span class="st"> </span><span class="kw">kin.blup</span>(<span class="dt">data =</span> data,
<span class="dt">geno=</span><span class="st">"gid"</span>,
<span class="dt">pheno =</span> <span class="st">"y"</span>,
<span class="dt">K =</span> Kinship)
<span class="kw">cat</span>(<span class="st">"Heritability = "</span>, kb$Vg/(kb$Vg +<span class="st"> </span>kb$Ve), <span class="st">"</span><span class="ch">\n</span><span class="st">"</span>)</code></pre></div>
<pre><code>## Heritability = 0.03081077</code></pre>
<p>The simulated trait does not exhibit a very high heritability value. Let us perform an association study by using a mixed linear model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">geno <-<span class="st"> </span><span class="kw">t</span>(A.thaliana$genotype)
<span class="kw">colnames</span>(geno) <-<span class="st"> </span>A.thaliana$ecotype.id[,<span class="dv">1</span>]
G <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">marker =</span> <span class="dv">1</span>:<span class="kw">nrow</span>(A.thaliana$chrpos),
A.thaliana$chrpos,
geno,
<span class="dt">check.names =</span> <span class="ot">FALSE</span>)
pheno <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">line =</span> A.thaliana$ecotype.id[,<span class="dv">1</span>],
<span class="dt">y =</span> <span class="kw">scale</span>(sim$phenotype))
mlm <-<span class="st"> </span><span class="kw">GWAS</span>(<span class="dt">pheno =</span> pheno,
<span class="dt">geno =</span> G,
<span class="dt">K =</span> Kinship,
<span class="dt">n.PC =</span> <span class="dv">5</span>,
<span class="dt">plot =</span> <span class="ot">FALSE</span>)</code></pre></div>
<pre><code>## [1] "GWAS for trait: y"
## [1] "Variance components estimated. Testing markers."</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Manhattan plot
<span class="kw">plot</span>(mlm$y, <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span>,
<span class="dt">ylab =</span> <span class="st">"-log10(pvalues)"</span>,
<span class="dt">main =</span> <span class="st">"Manhattan plot (mlm)"</span> )
<span class="kw">points</span>(sim$causal, mlm$y[sim$causal],
<span class="dt">type =</span> <span class="st">"h"</span>, <span class="dt">lty =</span> <span class="dv">1</span>, <span class="dt">col =</span> <span class="st">"green4"</span>)
<span class="kw">abline</span>(-<span class="kw">log10</span>(<span class="fl">0.05</span>/<span class="kw">length</span>(pv.oracle)), <span class="dv">0</span>, <span class="dt">lty =</span> <span class="dv">2</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>The results of a mixed linear model using the LFMM kinskip estimate are very close to the oracle method.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">qqplot</span>(-<span class="kw">log10</span>(pv.oracle), mlm$y, <span class="dt">col =</span> <span class="st">"grey"</span>, <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">ylab =</span><span class="st">"mlm scores"</span>)
<span class="kw">abline</span>(<span class="dv">0</span>,<span class="dv">1</span>, <span class="dt">col =</span> <span class="st">'green4'</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>Let us compare the results of method that estimate confounders at the same time as it computes association with phenotype. The <strong>cate</strong> method implements latent factor mixed models. Based on a principal component analysis of the genotype data, we evaluate that around <span class="math inline">\(K = 6\)</span> factors explain population structure, and 6 hidden factors are used in cate. In cate, the pvalues are computed as follows.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> pheno <-<span class="st"> </span><span class="kw">scale</span>(sim$phenotype)
pv.cate <-<span class="st"> </span>cate::<span class="kw">cate</span>( ~<span class="st"> </span>pheno,
<span class="dt">X.data =</span> <span class="kw">data.frame</span>(pheno),
<span class="dt">Y =</span> genotype,
<span class="dt">r =</span> <span class="dv">6</span>,
<span class="dt">calibrate =</span> <span class="ot">TRUE</span>)$beta.p.value</code></pre></div>
<p>Now let us visualize the Manhattan plot for the cate method and compare it to the Manhattan plot for the oracle method. The vertical bars correspond to the causal variants in the simulation.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> <span class="kw">plot</span>(-<span class="kw">log10</span>(pv.cate), <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span>, <span class="dt">main =</span> <span class="st">"Manhattan plot (Latent factor model)"</span> )
<span class="kw">points</span>(sim$causal, -<span class="kw">log10</span>(pv.cate)[sim$causal], <span class="dt">type =</span> <span class="st">"h"</span>, <span class="dt">lty =</span> <span class="dv">1</span>, <span class="dt">col =</span> <span class="st">"orange"</span> )
<span class="kw">abline</span>(-<span class="kw">log10</span>(<span class="fl">0.05</span>/<span class="kw">length</span>(pv.oracle)), <span class="dv">0</span>, <span class="dt">lty =</span> <span class="dv">2</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>We eventually compare the results of the latent factor model with the oracle method.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> ## qqplot
<span class="kw">qqplot</span>(-<span class="kw">log10</span>(pv.oracle), -<span class="kw">log10</span>(pv.cate) , <span class="dt">pch =</span> <span class="dv">19</span>, <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span> )
<span class="kw">abline</span>( <span class="dv">0</span>, <span class="dv">1</span>, <span class="dt">lwd =</span> <span class="dv">2</span>, <span class="dt">col =</span> <span class="st">"orange"</span> )</code></pre></div>
<p><img src="" /><!-- --></p>
<p>In this situation, the latent factor model and the oracle method performed similarly. Other testing methods could be evaluated similarly (glm, mixed models, etc). We compare the mixed modeling approach and cate below.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> ## qqplot
<span class="kw">qqplot</span>(mlm$y, -<span class="kw">log10</span>(pv.cate) , <span class="dt">pch =</span> <span class="dv">19</span>, <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span> )
<span class="kw">abline</span>( <span class="dv">0</span>, <span class="dv">1</span>, <span class="dt">lwd =</span> <span class="dv">2</span>, <span class="dt">col =</span> <span class="st">"orange"</span> )</code></pre></div>
<p><img src="" /><!-- --></p>
</div>
<div id="comparison-with-lfmm" class="section level5">
<h5>Comparison with LFMM</h5>
<p>For comparisons with <strong>cate</strong>, we also use <span class="math inline">\(K = 6\)</span> factors in <strong>lfmm</strong>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">lfmm.mod <-<span class="st"> </span>lfmm::<span class="kw">lfmm_ridge</span>(<span class="dt">Y =</span> genotype,
<span class="dt">X =</span> <span class="kw">scale</span>(sim$phenotype),
<span class="dt">K =</span> <span class="dv">6</span>)</code></pre></div>
<p>We compute the pvalues as follows</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> p <-<span class="st"> </span>lfmm::<span class="kw">lfmm_test</span>(<span class="dt">Y =</span> genotype,
<span class="dt">X =</span> <span class="kw">scale</span>(sim$phenotype),
<span class="dt">lfmm =</span> lfmm.mod,
<span class="dt">calibrate =</span> <span class="st">"gif"</span>)
pv.lfmm <-<span class="st"> </span>p$calibrated.pvalue</code></pre></div>
<p>Now let us visualize the Manhattan plots for <strong>lfmm</strong> and <strong>cate</strong>. The vertical bars correspond to the causal variants in the simulation.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> <span class="kw">plot</span>( -<span class="kw">log10</span>(pv.lfmm), <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span>, <span class="dt">main =</span> <span class="st">"Manhattan plot (lfmm)"</span> )
<span class="kw">points</span>( sim$causal, -<span class="kw">log10</span>(pv.lfmm)[sim$causal], <span class="dt">type =</span> <span class="st">"h"</span>, <span class="dt">lty =</span> <span class="dv">1</span>, <span class="dt">col =</span> <span class="st">"blue"</span> )
<span class="kw">abline</span>(-<span class="kw">log10</span>(<span class="fl">0.05</span>/<span class="kw">length</span>(pv.oracle)), <span class="dv">0</span>, <span class="dt">lty =</span> <span class="dv">2</span>)</code></pre></div>
<p><img src="" /><!-- --></p>
<p>We eventually compare the results of the latent factor model with the oracle method.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> ## plot
<span class="kw">qqplot</span>(-<span class="kw">log10</span>(pv.oracle), -<span class="kw">log10</span>(pv.lfmm) , <span class="dt">pch =</span> <span class="dv">19</span>, <span class="dt">cex =</span> .<span class="dv">4</span>, <span class="dt">col =</span> <span class="st">"grey"</span> )
<span class="kw">abline</span>( <span class="dv">0</span>, <span class="dv">1</span>, <span class="dt">lwd =</span> <span class="dv">2</span>, <span class="dt">col =</span> <span class="st">"blue"</span> )</code></pre></div>
<p><img src="" /><!-- --></p>
</div>
<div id="conditional-tests-lfmm" class="section level5">
<h5>conditional tests LFMM</h5>
<p>Based on previous analysis, we understood that only a handful of SNP could be detected with those simulation parameters. Here, we experimented <em>forward conditional tests</em>, in which the top hits are recursively included as covariates in an LFMM estimation/testing approach (12 cycles).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">cond.test <-<span class="st"> </span>lfmm::<span class="kw">forward_test</span>(<span class="dt">Y =</span> A.thaliana$genotype,
<span class="dt">X =</span> <span class="kw">scale</span>(sim$phenotype),
<span class="dt">K =</span> <span class="dv">6</span>,
<span class="dt">niter =</span> <span class="dv">12</span>)</code></pre></div>
<p>We compare the results with the truth.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># proposed candidates</span>
<span class="kw">cat</span>(<span class="st">"Candidate loci:"</span>)</code></pre></div>
<pre><code>## Candidate loci:</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sort</span>(cond.test$candidates)</code></pre></div>
<pre><code>## [1] 8791 19536 19551 19588 19646 19663 19702 19703 19721 22796 26303
## [12] 34820</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># truth</span>
<span class="kw">cat</span>(<span class="st">"Causal SNPs:"</span>)</code></pre></div>
<pre><code>## Causal SNPs:</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sim$causal.set</code></pre></div>
<pre><code>## [1] 8768 22796 26303 29309 34820 46844</code></pre>
</div>
</div>
<div id="package-reference" class="section level4">
<h4>Package reference</h4>
<ul>
<li>Francois O, Caye K (2017). naturalgwas: An R package for simulating phenotypes and evaluating GWAS methods with large-scale geographic sampling.</li>
</ul>
</div>
<div id="references" class="section level4">
<h4>References</h4>
<ul>
<li><p>Atwell S, Huang YS, Vilhjalmsson BJ, et al. (2010). Genome-wide association study of 107 phenotypes in <em>Arabidopsis thaliana</em> inbred lines. Nature 465, 627-631.</p></li>
<li><p>Francois O, Martins H, Caye K, Schoville SD (2016). Controlling false discoveries in genome scans for selection. Molecular Ecology 25, 454-469.</p></li>
<li><p>Frichot E, Francois O (2015). LEA: an R package for Landscape and Ecological Association studies. Methods in Ecology and Evolution 6(8), 925-929.</p></li>
</ul>
</div>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>