# Introduction to R
R is free software that provides a complete programming language for statistical computing and graphics. It is open source and actively maintained, and it is used at many universities and companies.
Before starting your calculations, make sure that the version of R on your computer is up to date. You can also install RStudio, a graphical interface for R. RStudio uses the version of R installed on your computer, so you need to have R installed too. R and RStudio work in the same way, BUT RStudio has a better interface for non-programmers. Both R and RStudio are available for Windows, macOS and Linux.
R is command-line based and provides a wide variety of statistical methods (linear and nonlinear modelling, classical statistical tests, classification, clustering, ...). Advanced methods are available via extension packages (more than 10,000 at the moment).
Always make sure you have the latest version of both R and RStudio on your computer; update shortly before you need to use the programs.
## How to get started - understanding R (and RStudio)
A quick intro tour of RStudio is given here:
```{r, echo=FALSE}
vembedr::embed_youtube("FIrsOBy5k58")
```
And a short intro from our courses:
```{r, echo=FALSE}
vembedr::embed_youtube("MA0B4VzNeDM")
```
### Organise and save scripts
A script is a rundown of a data analysis from A to Z (start to end). A script should be self-contained, i.e. the first lines load the libraries and import the data. Thereafter you may want to wrangle the data a bit (changing features with as.numeric, as.factor, ..., renaming columns, etc.), and then the analysis starts.
Think of a script like making a meal:

- You need raw materials (carrots, onions, ...): that is the data.
- You need a kitchen: that is R, the software.
- You need knives, pots and pans: that is the packages.

All of these are needed for the analysis to work, and hence you need to specify them in the script.
In larger projects where the same dataset is used for several different analyses, it may be wise to have several scripts: one for importing and modifying the data (starting with the import and ending with save() to an .RData file), one for descriptive analysis, one for inference, one for plots, etc. Such a sequence of scripts helps you keep an overview. However, this is only needed for larger projects; in a small analysis you can easily include everything in one script. Remember to put a little narrative (after a "\#") at the top of your script explaining its purpose.
To get started, go to the upper left corner and open a new script. **Remember to save your script as well**. After you have created a new script, try out the code below:
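As a rough illustration of the structure described above, a small self-contained script could look like the sketch below. The data frame `buffet` and all its values are made up for this example; in a real script the import step would read your own file.

```{r}
# Purpose: compare mean consumption between two buffet days
# (hypothetical example data, invented for this sketch)

# 1. Packages: only base R is used here, so no library() calls are needed

# 2. Import: in a real script this would be read.csv2(), read_excel(), ...
buffet <- data.frame(
  Person      = c(1, 1, 2, 2, 3, 3),
  Day         = c(1, 2, 1, 2, 1, 2),
  Consumption = c(120, 150, 90, 110, 200, 180)
)

# 3. Wrangling: Day is a label, not a quantity, so make it a factor
buffet$Day <- as.factor(buffet$Day)

# 4. Analysis: mean consumption per day
day_means <- aggregate(Consumption ~ Day, data = buffet, FUN = mean)
day_means
```

Everything the analysis needs is stated in the script itself, so it can be run from the first line to the last on a fresh R session.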
```{r}
1+2
a <- 2+2
b <- 5+3
```
What happens if you write the letter a in the editor and run it? What about the letter b?
```{r}
a
b
a+b
```
... or this one?
```{r, eval=FALSE}
A+B
```
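R is (among other things) a calculator, but it is case sensitive: `A` and `a` are different names, so `A+B` gives an error because only `a` and `b` have been defined.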
In the Editor:
"\#" is the start of a comment (meaning: the rest of the line will not be evaluated/read by the program). This is how you can make comments in your script:
```{r}
# I want to add 2 and 5
2+5
# whoop it is 7!
```
":" generates a sequence (e.g. 1:10 is the numbers from 1 to 10):
```{r}
from1to10 <- 1:10
from10to1 <- 10:1
from1to10
from10to1
```
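The ":" operator always moves in steps of 1. For other step sizes, the base R function seq() can be used, and rev() reverses a sequence. A short sketch (the object name `odd_numbers` is just chosen for this example):

```{r}
# every second number from 1 to 10
odd_numbers <- seq(from = 1, to = 10, by = 2)
odd_numbers

# reverse a sequence
rev(1:5)

# count the elements in a sequence
length(1:10)
```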
In the Console:
"\>" indicates that R is ready for new code.
"+" instead of "\>" means that R is waiting for you to complete the command (you probably made a mistake in the code you tried to run, such as a missing closing parenthesis). Press [ESC] to turn the "+" into a "\>" again.
"NA" (Not Available) indicates a missing value.
"NaN" (Not a Number) is the result of an 'illegal' operation, e.g. log(-1).
Red text in the Console usually signals an error or a warning. R will stop calculating at the first error it meets.
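To see how "NA" and "NaN" behave in practice, the small sketch below uses a made-up vector `x` with one missing value. It only relies on base R functions.

```{r}
# a made-up vector with one missing value
x <- c(4, 8, NA, 6)

# most calculations return NA as soon as one value is missing
mean(x)

# na.rm = TRUE removes the missing values before calculating
mean(x, na.rm = TRUE)

# is.na() shows where the missing values are
is.na(x)

# an 'illegal' operation gives NaN (together with a warning)
log(-1)
```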
## How to import data
### Import data from R-package
In this book several datasets are used targeting different research questions. However, a fair part of the analysis tools are common. That is, descriptive analysis, plots, response correlations etc.
The data is included in the R package *data4consumerscience*, which you get by running the code below. Be aware that you need the devtools package to install packages from GitHub, so you need to run both lines of code.
```{r, eval=FALSE}
# install data-package
install.packages('devtools')
devtools::install_github('mortenarendt/data4consumerscience')
```
The data is also available as Excel sheets, and can be loaded using packages capable of reading from Excel.
Before you start your data import, make sure that the data set contains all the information you need and that the format of the data (columns and rows) is correct. You can import data in many different ways.
### Importing a csv file
If the data is not already a csv file but an Excel file (xls or xlsx format), you need to convert it. Open the file in Excel, choose "Save as" and then choose \*.csv as the file type. NB: Some data collection tools provide your data in csv format and some in xlsx/xls format.
Then move to R and write:
```{r, eval = F}
DATASET1 <- read.csv2(file.choose())
```
The file.choose() function lets you point to the file you want. You can also simply write the path to the file directly.
In fact, if you run file.choose() on its own the first time you import the data, the path is printed in the console. You can then copy and paste it into your script and avoid pointing and clicking every time you want to analyse these data.
```{r, eval = F}
DATASET1 <- read.csv2('~/path/to/the/data/myfile.csv')
```
You decide the names/titles of your datasets and models; just avoid spaces, symbols other than "." and "\_", and non-English letters. We called ours "DATASET1". When you use file.choose(), R opens a new window (sometimes hidden behind your other open windows) in which you choose the csv file you want. The data set will then appear as a line in the upper right corner. If you double click a data set in this box, it will open in the editor window.
Try it out with any \*.csv dataset.
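Troubleshooting:

- Try another csv format in Excel when saving the file as csv.
- Try to write read.csv(file.choose()) instead. read.csv2() expects semicolons between the columns and a decimal comma, whereas read.csv() expects commas between the columns and a decimal point.
- Try another import function (see below).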
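When a csv import gives one wide column instead of several, the separator is usually the cause. The sketch below writes a small made-up data set to a temporary file and uses readLines() and count.fields() to check which separator the file uses before choosing the import function. The object names and values are invented for the example.

```{r}
# made-up example data
example_data <- data.frame(Person = 1:3, Consumption = c(120.5, 90.5, 200.25))

# write a semicolon-separated file with decimal commas to a temporary location
csv_file <- tempfile(fileext = ".csv")
write.csv2(example_data, csv_file, row.names = FALSE)

# the raw lines show which separator and decimal mark the file uses
readLines(csv_file)

# number of fields per line for each candidate separator;
# the right separator gives the same number on every line
count.fields(csv_file, sep = ";")
count.fields(csv_file, sep = ",")

# read.csv2() matches this format
imported <- read.csv2(csv_file)
str(imported)
```

If the lines had been separated by commas with decimal points, read.csv() would be the matching choice.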
### Importing an Excel file/sheet
If you have data in Excel, you can use packages that import it directly, without the need to convert to csv.
If your Excel file contains more than one sheet, you have to import each sheet separately.
Here we use the package **readxl** with the function read_excel. If the data is not in the same folder as your script, then include the path to the data, or move the data to the script's location. The example below imports two sheets (_PastaBuffet_ and _PastaSurvey_) from an Excel file (_DatasetRbook.xlsx_) located in the current working directory. If your file sits in a subfolder (named e.g. _data_), include the folder in the path, as in './data/DatasetRbook.xlsx'.
You can download this dataset from [here](https://github.com/mortenarendt/dataanalyssisconsumerscience/blob/master/data/iBuffet.xlsx).
To find the path of the file on your computer, place your cursor between the quotes ('') in the command and press the Tab key. Your files will appear, and you can easily find the path to your file. If you cannot find the path, try the file.choose() command to locate the file, and then copy and paste the path from the Console (where you find your output).
```{r, eval=FALSE}
library(readxl)
BuffetConsumption <- read_excel('./DatasetRbook.xlsx', sheet = 'PastaBuffet')
BuffetSurvey <- read_excel('./DatasetRbook.xlsx',sheet = 'PastaSurvey')
```
The first part of each line is the name we give our dataset; here we chose "BuffetConsumption" in the first line. You decide the names/titles of your datasets and models; just avoid spaces, symbols other than "." and "\_", and non-English letters.
Try to import sheets from an excel file.
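Because each sheet is imported separately, it can help to list the sheet names first. The sketch below uses excel_sheets() from readxl. To stay self-contained it reads a demo workbook that ships with the readxl package (found via readxl_example()), not the buffet data; replace `xlsx_file` with the path to your own file.

```{r}
library(readxl)

# path to a demo workbook shipped with readxl (a stand-in for your own file)
xlsx_file <- readxl_example("datasets.xlsx")

# list the names of all sheets in the workbook
sheet_names <- excel_sheets(xlsx_file)
sheet_names

# import every sheet into a named list, one data frame per sheet
all_sheets <- lapply(sheet_names, function(s) read_excel(xlsx_file, sheet = s))
names(all_sheets) <- sheet_names

# pick one sheet from the list and inspect it
head(all_sheets[[1]])
```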
BuffetConsumption is consumption data in grams from a buffet. The data is from 16 different persons, who came on Day 1 and Day 2 to eat pasta with legumes and/or pasta with mushrooms. In the dataset, there is one line per buffet station per participant per experimental day.
BuffetSurvey is survey data collected in SurveyXact. The dataset contains data on liking, motivation, choices etc. linked to the particular buffet. A survey could also contain demographics for the participants such as age, gender, eating habits etc. These are general and differ from the former in that they have nothing to do with the current buffet. This type of data is not included in the survey data used here.
### Clipboard import
A last resort is to import via your clipboard. In Excel, mark the data you want to import, making sure that the headings are included, and copy the marked data to the clipboard. Go to the Editor and write the following command line:
```{r, eval = F}
DATASET2 <- read.table(file="clipboard", header=TRUE, sep="\t")
```
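This means: read the table you copied to your clipboard and save it under the name "DATASET2" (remember, you choose this name). `header=TRUE` tells R that the first row contains headings, and `sep="\t"` that the cells are separated by tabs, which is how Excel places them on the clipboard. Note that `file="clipboard"` works on Windows; on macOS, `pipe("pbpaste")` can be used in its place.
Regardless of import method, the dataset will appear as a line in the Environment in the upper right corner. Please check that it looks correct.
Try it out with any Excel dataset.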
### Looking at the imported elements
Have a look at the imported elements to ensure that they indeed mimic the Excel sheets. The functions head(), str() and View() are your tools here. They show the first rows of your data (including the headings), how your variables are categorised, and open the dataset in a new tab, respectively.
```{r, eval = F}
head(BuffetConsumption)
str(BuffetConsumption)
View(BuffetConsumption)
```
Try to use the BuffetConsumption dataset. If it does not look as expected, try to import it again using a different method.
### Numbers and factors - changing categorisation
During the import, R automatically categorises your variables according to whether they are read as numbers or letters. For instance, if the day of the experiment is coded as 1 and 2 in the data file, it is read as numeric (num). As Day 2 is not double the value of Day 1, we need to change this variable into a factor (Factor) or character (chr). Use the str() function to check your variables before you change them. You transform your variables using as.factor(), as.character() or as.numeric().
```{r, eval=FALSE}
BuffetConsumption$Day<-as.factor(BuffetConsumption$Day)
```
This means: take the variable Day in the dataset you called BuffetConsumption, make it a factor and store it under the same variable name (overwriting it). If you want a newly coded variable and still keep the old one, simply give it a new name, e.g. "DayFactor". The dataset will then be extended with one variable, but sometimes it is nice to have both versions.
```{r, eval=FALSE}
BuffetConsumption$DayFactor<-as.factor(BuffetConsumption$Day)
```
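The difference between a number and a factor can be seen in the small sketch below. The data frame `toy` and its values are made up for the example and are not the buffet data.

```{r}
# made-up data: Day is stored as a number after import
toy <- data.frame(Day = c(1, 2, 1, 2), Consumption = c(120, 150, 90, 110))
str(toy)

# as a number, R happily calculates a meaningless "average day"
mean(toy$Day)

# keep the original variable and add a factor version
toy$DayFactor <- as.factor(toy$Day)
str(toy)

# a factor has levels (categories), and observations can be counted per level
levels(toy$DayFactor)
table(toy$DayFactor)
```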
## How to edit and merge datasets
Sometimes you have to merge two data sets. This is needed if you have for instance consumption data in one Excel sheet and survey data in another Excel sheet.
Set up the data in Excel so that they follow the format described below.
What is important is:
- The first row is used for headings, and none of these are repeated, i.e. all headings are unique within a sheet.
- Data start in row 2 and continue downwards and to the right.
- All rows should contain data, so all empty rows (not cells) are removed. NB: an empty cell is also data, e.g. an unanswered question.
- Headings that refer to the same thing in different sheets (e.g. participant ID) should be exactly identical.
- If you have calculated something within Excel, such as a sum of the numbers in a column, it should be removed from the sheet. It is not data!
We suggest that you keep both the original version of the data as a sheet, and the ready-to-import version as a sheet, so you do not accidentally delete data.
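Some of the points above can also be checked in R after the import. The sketch below uses a made-up data frame `sheet` with one unanswered question (row 2) and one completely empty row (row 3); the names and values are invented for the example.

```{r}
# made-up import: row 2 has an unanswered question, row 3 is completely empty
sheet <- data.frame(
  Person = c(1, 2, NA, 3),
  Liking = c(7, NA, NA, 6)
)

# headings must be unique: 0 means no heading is repeated
anyDuplicated(names(sheet))

# find rows where every cell is empty
empty_row <- rowSums(!is.na(sheet)) == 0
which(empty_row)

# drop the empty rows, but keep rows that only have some empty cells
sheet_clean <- sheet[!empty_row, ]
sheet_clean
```

The unanswered question in row 2 is kept as a missing value, while the empty row is removed.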
### Edit using Tidyverse
The _Consumption_ data is optimal as is: the data is in long format with all responses in *one* column, and the other columns clarify the design, time, type, person etc. However, the _Survey_ data is not directly optimal. We need this data in both long and wide format.
There are several ways to do this, including editing in Excel. Here we show how it can be done in R using _tidyverse_. Tidyverse is a larger framework. For an introduction see:
```{r, echo=FALSE}
vembedr::embed_youtube("HPJn1CMvtmI")
```
and maybe visit [tidyverse-homepage](https://www.tidyverse.org/) for resources.
```{r, eval=FALSE}
library(tidyverse)
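# long format: stack the question columns into one 'question' and one 'answ' column
Surveylong <- BuffetSurvey %>%
  pivot_longer(cols = Pasta_with_legumes_is_visually_appealing_to_me:`I_like_the_taste_of_pasta_with_mushrooms!`,
               names_to = 'question', values_to = 'answ')
# wide format: spread the questions back into one column per question
Surveywide <- Surveylong %>%
  pivot_wider(names_from = question, values_from = answ)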
```
The code above does exactly that, with *Surveylong* and *Surveywide* as the resulting data sets. Try to compare BuffetSurvey with Surveywide - did we already have a wide version of the dataset?
What is also introduced here is the _pipe_ operator **%\>%**. It originates from the *magrittr* package, is made available when you load *dplyr* or the whole *tidyverse*, and is a handy tool for data manipulation. It works such that whatever is written on the **left** side of the operator is used as the first argument of the function written on the **right** side of the operator. This means that `x %>% f(y)` results in `f(x, y)`, or in our case:
``BuffetSurvey %>% pivot_longer(cols = Pasta_with_legumes_is_visually_appealing_to_me:`I_like_the_taste_of_pasta_with_mushrooms!`, names_to = 'question', values_to = 'answ')``
is equal to:
``pivot_longer(BuffetSurvey, cols = Pasta_with_legumes_is_visually_appealing_to_me:`I_like_the_taste_of_pasta_with_mushrooms!`, names_to = 'question', values_to = 'answ')``
The idea is then to "chain" several functions together with `%>%`, line after line (this is also known as "piping"), which makes the code more readable and often shorter as well.
**pivot_longer** lengthens the data by stacking the columns we specified on top of each other, so that each row is one single answer to one of the 4 questions. The answer will be in the *answ*-column, the question in the *question*-column, and the rest of the columns can then be used to e.g. group the data.
**pivot_wider** does the opposite of **pivot_longer**, and spreads one column into several columns depending on what the column contains. In our case the *question*-column is spread into 4 columns, one for each question from the survey, with the numerical values of the answers as their values.
You might also have noticed that when a variable name contains a space or a special character (such as the "!" above), R needs help to understand that it is indeed a variable, and the name must be wrapped in backticks. While you CAN type everything the way R knows how to read it, there is an easier way to make sure that everything is written correctly: call the variable from the data frame it originates from, by writing **"dataframe\$"** and then pressing TAB. A list of the variables in the data frame will appear, and you can choose the right one from it, always spelled correctly and in the way R knows how to interpret it.
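The reshaping and the use of backticks can be tried out on a small scale. The sketch below uses a made-up data frame `survey` with two invented question columns whose names contain a space and a "!"; it is not the BuffetSurvey data.

```{r}
library(dplyr)
library(tidyr)

# made-up survey: one row per person, one column per question
survey <- data.frame(
  Person = c(1, 2),
  `Looks good` = c(6, 4),
  `Tastes good!` = c(7, 5),
  check.names = FALSE   # keep the headings exactly as written
)

# long format: one row per person per question
survey_long <- survey %>%
  pivot_longer(cols = `Looks good`:`Tastes good!`,
               names_to = 'question', values_to = 'answ')
survey_long

# wide format again: one column per question
survey_wide <- survey_long %>%
  pivot_wider(names_from = question, values_from = answ)
survey_wide

# backticks are also needed after the dollar sign
survey$`Tastes good!`
```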
#### Merging datasets
For the sake of being able to compare consumption (obtained from the buffet data) with liking and motives (obtained from the survey data), these data frames need to be merged. There are several merge options. Here we use **left_join()**, but **full_join()** and **right_join()** might be more suited in some situations, depending on which rows you want to keep: left_join() keeps all rows of the first data set, right_join() all rows of the second, and full_join() all rows of both.
If you feel more comfortable with Excel, you can also merge the two data frames in one Excel sheet before importing it to R.
#### Adding survey to buffets
Merging should be done such that Person and Day in each separate sheet match. If you additionally have demographic data (gender, age, etc.) then obviously only Person should match, as the data is constant over Days.
```{r, eval = F}
Buffet_plus_survey <- BuffetConsumption %>%
left_join(BuffetSurvey, by = c('Person','Day'))
```
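**left_join** compares the columns "Person" and "Day" in *BuffetConsumption* and *BuffetSurvey*, and adds the columns from *BuffetSurvey* to the rows of *BuffetConsumption* where the values in both columns are the same. All rows of *BuffetConsumption* are kept; rows without a match in *BuffetSurvey* get missing values (NA) in the added columns.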
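What happens to rows without a match is easiest to see on a small example. In the sketch below, the data frames `consumption` and `survey` are made up, and person 2 has no survey answer on day 2.

```{r}
library(dplyr)

# made-up consumption data: one row per person per day
consumption <- data.frame(
  Person = c(1, 1, 2, 2),
  Day    = c(1, 2, 1, 2),
  Grams  = c(120, 150, 90, 110)
)

# made-up survey data: person 2 did not answer on day 2
survey <- data.frame(
  Person = c(1, 1, 2),
  Day    = c(1, 2, 1),
  Liking = c(7, 6, 5)
)

# keep every row of consumption and add the matching survey columns
joined <- consumption %>%
  left_join(survey, by = c('Person', 'Day'))
joined
```

The result has the same four rows as `consumption`, with a missing value in `Liking` for the row that has no match in `survey`.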
## How to save the data
Use **save.image()** to save everything in the *Environment* (all variables shown in the "Environment"-tab in the upper right corner of RStudio), or use **save()** to specify which elements to save using the "list"-input.
```{r, eval=FALSE}
# Saving everything to the folder of your choice
save.image(file = './data/FolderYouWantToSaveYourProjectTo/AllMyData.RData')
# Saving just the specified datasets and other elements to your folder of choice
# (the names in the list must match objects that exist in your Environment)
save(file = './data/FolderYouWantToSaveYourProjectTo/SomeOfMyData.RData', list = c('Survey','Surveylong_buffet','Surveylong','Buffet_survey'))
```
## How to export data / results to Excel and the like
You can export any data frame from R to Excel (for instance using the *rio* package), as well as saving it as .RData for further analysis.
This can obviously be used for exporting your data after some modifications. BUT it is also very useful for exporting data frames with *results* from analysis.
When exporting data, it is also important to tell R where to place the exported file. You do this by specifying the path to the desired folder, followed by the name that **you** choose for the exported file (often it makes sense to choose the same name as the data frame in R).
It is also important to specify the file-extension, to ensure that you create the right file type - in this case *.xlsx*, but rio can also export to other formats such as .txt or .csv.
```{r, eval=FALSE}
# export one data frame
rio::export(Surveylong_buffet,file = './data/YourFolderForNiceTables/Surveylong_buffet.xlsx')
```
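If the rio package is not installed, a csv export with base R is an alternative. The sketch below is only an illustration: the data frame `results` and the file name are made up, and tempdir() stands in for your own folder.

```{r}
# made-up table of results
results <- data.frame(Day = c('1', '2'), MeanConsumption = c(136.7, 146.7))

# tempdir() stands in for the folder you want to export to
out_file <- file.path(tempdir(), 'results.csv')

# write.csv2() writes semicolon-separated values with decimal commas,
# the csv convention Excel uses in some European language settings
write.csv2(results, file = out_file, row.names = FALSE)

# check that the file exists and can be read back
file.exists(out_file)
read.csv2(out_file)
```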
## How to load your RData
Once you have saved the data, you can simply load it directly, and you do not need to go through the import setup every time you want to analyse the data.
Loading is not part of the data import as such, but it is still a good idea to check that the data is indeed set up as expected.
```{r, eval=FALSE}
load('./data/FolderWhereYourDataAreStored/YourData.RData')
```
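A full round trip of saving and loading can be tried without touching your own folders. In the sketch below, the object `my_data` is made up and tempfile() stands in for a path such as the one used above.

```{r}
# made-up object to save
my_data <- data.frame(Person = 1:3, Grams = c(120, 90, 200))

# tempfile() stands in for a path to an .RData file in your own folder
rdata_file <- tempfile(fileext = '.RData')
save(list = c('my_data'), file = rdata_file)

# remove the object, then bring it back with load()
rm(my_data)
loaded_names <- load(rdata_file)

# load() returns the names of the objects it restored
loaded_names

# check that the data looks as expected
str(my_data)
```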
## How to clear your environment
When you want to start a new project or a new analysis, it can be useful to clear the environment of the data that you previously used. This can be done either with the code shown below, or by clicking the broom icon in the top-right part of the RStudio window.
But be aware: when you clear your environment, you will have to load the data again.
```{r, eval=FALSE}
rm(list = ls())
```
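If only some objects should go, rm() can also remove them by name and leave the rest in place. A small sketch with two made-up objects:

```{r}
# two made-up objects
old_result <- 1:3
keep_me <- 'still needed'

# remove only the object that is no longer needed
rm(old_result)

# check which of the two objects still exist
exists('old_result')
exists('keep_me')
```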
## How to use R Projects
R Projects offer a structured and organized way to manage your R-based projects, making it easier to keep your work tidy, reproducible, and collaborative.
Some of the benefits of R Projects are:
**Isolation and Dependency Management:** Each R Project has its own working directory and R environment.
**Data Management:** Each project has its own workspace, isolating variables and objects from other projects.
**Project structuring:** R Projects promote a structured approach to organizing files.
### How to Create an R Project
1. Click on "File" in the top menu.
2. Select "New Project..."
3. Select "New Directory".
4. Select the project type "New Project".
5. Specify the project path and project name.
6. Click "Create Project."
7. To switch or close projects, open the drop-down menu in the upper right corner and select the name of the project you want, or "Close Project".
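One practical consequence, assuming the project is opened through RStudio as described above, is that the working directory is the project folder, so paths can be written relative to it. In the sketch below, the folder name 'data' and the file name 'myfile.csv' are hypothetical.

```{r}
# where is R currently working? Inside a project this is the project folder
getwd()

# build a path relative to the working directory
# ('data' and 'myfile.csv' are hypothetical names)
data_path <- file.path('data', 'myfile.csv')
data_path

# check whether the file is there before trying to import it
file.exists(data_path)
```

Relative paths like this keep working when the whole project folder is moved or shared, which is not the case for full paths typed by hand.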