4.1 — Creating an R Notebook

narasi15 edited this page Feb 18, 2020 · 13 revisions

Objectives

An R Notebook is a report-like document that combines chunks of text explaining a dataset with executable R code whose output is displayed inline. This journal covers my experience writing my very first R Notebook!
Time estimated: 2 d; taken 2 w; date started: 2020-02-01; date completed: 2020-02-18

Progress

  • I had a problem creating a new R Notebook in RStudio: many of the required packages were not installed. Specifically, the error came from the yaml package, which RStudio was not able to install.

    • Turns out I had to manually install the package in R, and re-open RStudio.
    • > installed.packages() lists the packages that are already installed, which shows whether yaml is missing.
    • In R, go to Packages & Data > Package Installer, select the yaml package, and hit Install Selected.
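
The same manual install can be done from the R console instead of the Package Installer menu. This is a minimal sketch, assuming the package is available from CRAN; `install_if_missing` is a hypothetical helper name, not something from the course material.

```r
# Hypothetical helper: install a CRAN package only when it is not already present.
# Returns TRUE if an install was attempted, FALSE if the package was already there.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  install.packages(pkg)
  TRUE
}

# "stats" ships with R, so this call installs nothing.
install_if_missing("stats")

# For the problem described above the call would be:
# install_if_missing("yaml")
```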

  • I kept seeing prompts like `Update all/some/none? [a/s/n]:` and `Do you want to install from sources the package which needs compilation? (Yes/no/cancel)` on every run, which was annoying.
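
Prompts like these come from interactive installers, and they can usually be answered in advance. The sketch below assumes the packages were being installed with `install.packages` and `BiocManager::install`, which the notes above do not confirm; `quiet_bioc_install` is a hypothetical wrapper name.

```r
# Never offer to compile from source when a binary is available.
# (This option only matters on platforms that have binary packages.)
options(install.packages.compile.from.source = "never")

# Hypothetical wrapper: install Bioconductor packages without the
# "Update all/some/none?" question.
quiet_bioc_install <- function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # update = FALSE skips updating other packages; ask = FALSE suppresses the prompt.
  BiocManager::install(pkgs, update = FALSE, ask = FALSE)
}

# Example call (not run here, since it needs network access):
# quiet_bioc_install("edgeR")
```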

  • I realized the mistake was that I was not running RStudio from the Docker image that was provided for us with the necessary packages. Using the image was a much easier way to get RStudio up and running remotely!
  • I was able to download the data and supplementary files using the sample code provided.
  • My data contained over 63,000 rows, and it was hard to decide what to keep. I went about it in this order:
    • First check for any duplicate row name (there were none!)
    • Define the groups:
      • "Treatment": tells us what was transduced in this test; it could be either empty, ELF1, cell or R8A
      • "trial_num": the replicate number
      • "Test_run": a concatenation of the two columns Treatment and mock_or_IFN, giving the name of a sample test run
    • Then map the 63,678 Ensembl (ENSG) gene IDs to HUGO symbols.
    • Remove the rows with low counts (this reduced the data to 14,935 rows).
    • Keep the rows that do not map to a HUGO symbol.
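
The cleaning steps above can be sketched in base R on a toy table. Everything here is illustrative: the gene IDs, symbols, and counts are made up, the `Treatment_mockOrIFN_trial` column-name layout is an assumption, and the low-count rule (more than 1 count per million in at least 2 samples) is one common choice rather than necessarily the threshold I used.

```r
# Toy count table standing in for the real data (hypothetical IDs and values).
counts <- data.frame(
  ensembl_id   = c("ENSG_toy1", "ENSG_toy2", "ENSG_toy3", "ENSG_toy4"),
  empty_mock_1 = c(1500000, 0, 500000, 1),
  empty_mock_2 = c(1400000, 1, 600000, 0),
  ELF1_IFN_1   = c(1600000, 0, 400000, 0),
  ELF1_IFN_2   = c(1550000, 0, 450000, 2),
  stringsAsFactors = FALSE
)
sample_names <- colnames(counts)[-1]

# 1. Check for duplicate row names (gene IDs).
has_duplicates <- any(duplicated(counts$ensembl_id))

# 2. Define the groups by splitting the sample names on "_".
parts <- do.call(rbind, strsplit(sample_names, "_", fixed = TRUE))
samples <- data.frame(
  Treatment   = parts[, 1],
  mock_or_IFN = parts[, 2],
  trial_num   = as.integer(parts[, 3]),
  row.names   = sample_names,
  stringsAsFactors = FALSE
)
samples$Test_run <- paste(samples$Treatment, samples$mock_or_IFN, sep = "_")

# 3. Map IDs to symbols with a lookup table (hypothetical; a real table would
#    come from an annotation source). all.y = TRUE keeps unmapped rows, with NA symbols.
id_to_symbol <- data.frame(
  ensembl_id  = c("ENSG_toy1", "ENSG_toy2"),
  hgnc_symbol = c("SYMBOL1", "SYMBOL2"),
  stringsAsFactors = FALSE
)
annotated <- merge(id_to_symbol, counts, by = "ensembl_id", all.y = TRUE)

# 4. Remove rows with low counts: keep genes with CPM > 1 in at least 2 samples.
count_matrix <- as.matrix(annotated[, sample_names])
cpm <- t(t(count_matrix) / colSums(count_matrix)) * 1e6
keep <- rowSums(cpm > 1) >= 2
filtered <- annotated[keep, ]
```

In this toy example `filtered` retains one gene with a symbol and one without, which mirrors the decision to keep rows that do not map to a HUGO symbol.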

Conclusions and Outlook

  • Working with RStudio was challenging. It was hard to understand why something was not compiling. After some googling I tried uninstalling and re-installing R and RStudio to start fresh, but I still had problems with certain packages like `dplyr`.
  • The RPR-GEO2R.R script was a little helpful in guiding me initially, but it did not provide much information for solving the package issues I was seeing. Once I got past setting up RStudio, it was a great reference.
  • I found normalization to be the challenging part of the assignment. I attempted to follow the normalization methods provided in the lecture slides; however, I could not see any major differences between my normalized data and the original data, so I was not able to draw any meaningful interpretations from the normalized data.
  • My data included a lot of outliers at both extremes, so I assumed TMM (trimmed mean of M-values) would let me trim the upper and lower percentages and see a difference. However, I should have also tried another normalization approach.
  • Ultimately, I was able to produce a file that contained my cleaned, normalized data, including the Ensembl gene IDs with their associated HUGO symbols.
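
One way to check how much a scaling normalization can change the data is to compare log2 counts per million before and after applying the normalization factors. The sketch below is not TMM itself: the counts and the factors are hypothetical values typed in by hand, and `cpm_scaled` is an illustrative helper. It only shows that when the factors are close to 1, the normalized values cannot move far from the originals, which may be one reason the two looked so similar.

```r
# Toy counts (hypothetical): 5 genes x 3 samples.
counts <- matrix(
  c(  10,   20,   30,
     100,  210,  290,
    1000, 1900, 3100,
      50,  110,  140,
       5,    9,   16),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3"))
)

# Hypothetical normalization factors; in a real analysis these would be
# estimated from the data rather than typed in.
norm_factors <- c(s1 = 1.02, s2 = 0.99, s3 = 0.99)

# CPM using effective library size = library size * normalization factor.
cpm_scaled <- function(counts, factors = rep(1, ncol(counts))) {
  eff_lib <- colSums(counts) * factors
  t(t(counts) / eff_lib) * 1e6
}

before <- log2(cpm_scaled(counts) + 1)
after  <- log2(cpm_scaled(counts, norm_factors) + 1)

# Largest change in any log2 value after normalization.
max_shift <- max(abs(after - before))

# The change is bounded by the size of the factors on the log2 scale.
max_factor_shift <- max(abs(log2(norm_factors)))
```

If `max_factor_shift` is tiny, plots of `before` and `after` will look nearly identical, and trying a different normalization approach is a reasonable next step.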