yyingying00 committed Dec 11, 2023
1 parent ff3a801 commit f4b2c7c
Showing 2 changed files with 3 additions and 3 deletions.
@@ -1,7 +1,7 @@
{
"hash": "cf06dc470613197b4fa64b813e863718",
"hash": "caed767c8fd45e6e1d5ba0d98a6048d4",
"result": {
"markdown": "---\ntitle: \"Cheatsheet for R\"\nauthor: \"Yingying Yu\"\ndate: 2023-12-10\ndescription: Useful packages and their famous functions in data analysis\ncategories: [R, Data analysis]\n---\n\n\n# Introduction\n\nDuring the process of data analysis, especially in **data wrangling** and **data visualization**, we often perform similar workflows to reach our target. For example, some functions within the `tidyverse` are really useful to do exploratory data analysis (EDA); to fit our data to a statistical model, we need to first transform our raw data to a specific form of data frame. In addition, we might have a preferred way to present our graphs from the result. This article summarized some of my routine commands, and this page would be keep updating for the future.\n\n# Data wrangling\n\n## Package `tidyverse`\n\n::: callout-note\nThe package `tidyverse` included different small packages, below are what I use frequently:\n\n| Package Name | Core Usage |\n|--------------|------------------------------|\n| `dplyr` | Data manipulation |\n| `tidyr` | Tidy data |\n| `purrr` | Replace loops |\n| `stringr` | Convert strings |\n| `forcats` | Handle categorical variables |\n\nMore `tidyverse` packages can be found on this website: <https://www.tidyverse.org/packages/>.\n:::\n\n------------------------------------------------------------------------\n\n### Package `dplyr`\n\nWhen we first uploaded our data, the first instinct to do is look it up, the `glimpse()` function provide the number of rows and columns of the data frame, as well as the types of data of every column. The base R `table()` function can show the number of each kind of a variable.\n\nLet's use the 2020 U.S. Census data for example. 
`ca_race` contains the household income of different race/ethnicity by counties in California.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nca_race <- readRDS(\"race.RDS\")\nglimpse(ca_race)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 58\nColumns: 9\n$ GEOID <chr> \"06001\", \"06003\", \"06007\", \"06011\", \"06013\", \"06017\", \"06019…\n$ NAME <chr> \"Alameda\", \"Alpine\", \"Butte\", \"Colusa\", \"Contra Costa\", \"El …\n$ whiteE <dbl> 115050, 89583, 57606, 58780, 111149, 82988, 60299, 50559, 50…\n$ blackE <dbl> 56664, NA, 22302, 14946, 73306, 69519, 39621, 30263, 24107, …\n$ nativeE <dbl> 82661, 47321, 52734, 85208, 78815, NA, 52511, 32813, 41250, …\n$ asianE <dbl> 130236, NA, 49375, NA, 121358, 111964, 68274, 38456, 99030, …\n$ islanderE <dbl> 97500, NA, 118667, NA, 90565, 58516, 45595, NA, NA, 73750, 9…\n$ otherE <dbl> 76546, NA, 48707, 67303, 72866, 69948, 45525, 42094, 41571, …\n$ region <chr> \"Bay Area\", \"Capital Region\", \"Rest\", \"Capital Region\", \"Bay…\n```\n:::\n\n```{.r .cell-code}\ntable(ca_race$region)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n Bay Area Capital Region Central Valley Rest \n 9 9 6 27 \nSouthern California \n 7 \n```\n:::\n:::\n\n\n::: callout-tip\nOne of the most efficient way of using functions within the `dplyr` package is along with the pipe `%>%`. 
Since the function names are human-readable and easy to follow, stacking up the functions using pipe make your data wrangling process clean and concise.\n:::\n\n**The most common functions are**\\\n`mutate()`: adds new variables that are functions of existing variables\\\n`filter()`: picks cases based on their values in a variable\\\n`group_by()`: perform any operation by group\\\n`summarise()`: calculate the mean, median, sd, min, max, first, last, n, n_distinct of a variable\\\n`arrange()`: changes the ordering of the rows\\\n`case_when()`: vectorise multiple `if_else()` statements\\\n`na.rm = TRUE`: calculate by removing massing values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race <- ca_race %>% \n mutate(total = rowSums(select(.,whiteE, blackE, nativeE, asianE, islanderE, otherE), \n na.rm = TRUE)) %>%\n mutate(level = case_when(total < 50000 ~ \"Less than 50000\",\n total <= 150000 ~ \"50001 - 150000\",\n total <= 300000 ~ \"150001 - 300000\",\n total <= 500000 ~ \"300001 - 500000\",\n total > 500000 ~ \"More than 500000\",\n TRUE ~ \"Rest\"))\nsum_table <- ca_race %>%\n filter(region != \"Rest\") %>% \n group_by(region) %>%\n summarize(White = mean(whiteE),\n Black = mean(blackE, na.rm = TRUE),\n Native = mean(nativeE, na.rm = TRUE),\n Asian = mean(asianE, na.rm = TRUE),\n Islander = mean(islanderE, na.rm = TRUE),\n Other = mean(otherE, na.rm = TRUE)) %>% \n arrange(region)\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 7\n region White Black Native Asian Islander Other\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 Bay Area 114912. 69085 84676 120988 92294 72606.\n2 Capital Region 72706. 58022. 50799. 79919. 60816. 59030.\n3 Central Valley 59379. 45909. 47797. 68488. 88245. 48077.\n4 Southern California 77601. 59083. 64966. 95986. 82545. 
60632.\n```\n:::\n:::\n\n\nMore usage of `dyplr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)\n\n------------------------------------------------------------------------\n\n### Package `tidyr`\n\n**Common functions:**\\\n`pivot_longer()` and `pivot_wider()`: to converts data between long and wide forms\\\n`complete()`: make implicit missing values explicit\\\n`drop_na()`: make explicit missing values implicit\\\n`fill()`: replace missing values with next/previous value\\\n`replace_na()`: replace missing values with a known value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table <- sum_table %>% \n pivot_longer(cols = c(\"White\", \"Black\", \"Native\", \"Asian\", \"Islander\", \"Other\"),\n names_to = \"Race\",\n values_to = \"Value\")\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 24 × 3\n region Race Value\n <chr> <chr> <dbl>\n 1 Bay Area White 114912.\n 2 Bay Area Black 69085 \n 3 Bay Area Native 84676 \n 4 Bay Area Asian 120988 \n 5 Bay Area Islander 92294 \n 6 Bay Area Other 72606.\n 7 Capital Region White 72706.\n 8 Capital Region Black 58022.\n 9 Capital Region Native 50799.\n10 Capital Region Asian 79919.\n# ℹ 14 more rows\n```\n:::\n:::\n\n\nMore usage of `tidyr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf)\n\n------------------------------------------------------------------------\n\n### Package `forcats`\n\n**Common functions:**\\\n`fct_reorder()`: Reordering a factor by another variable\\\n`fct_infreq()`: Reordering a factor by the frequency of values\\\n`fct_relevel()`: Changing the order of a factor by hand\\\n`fct_lump()`: Collapsing the least/most frequent values of a factor into \"other\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race$level <- fct_relevel(ca_race$level, c(\"More than 500000\",\n \"300001 - 500000\",\n \"150001 - 300000\",\n \"50001 - 150000\",\n \"Less than 50000\"))\nlevels(ca_race$level)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"More than 500000\" \"300001 - 500000\" \"150001 - 300000\" \"50001 - 150000\" \n```\n:::\n:::\n\n\nMore usage of `forcats` can be found [here](https://forcats.tidyverse.org)\n\n------------------------------------------------------------------------\n\n# Data visualization\n\n## Package `ggplot2`\n\nThis is most famous package to visualize data in the R community. Although `ggplot2` is also part of the `tidyverse` package, we put it here for the sake of separate topics.\n\n**Common functions:**\\\n`geom_point`: adds points to the plot\\\n`geom_line`: connects points with lines\\\n`geom_histogram()`: creates histograms\\\n`geom_bar`: creates bar plots\\\n`geom_boxplot()`: generates boxplots\\\n`geom_vline`: generate reference lines (horizontal, vertical, or diagonal)\\\n`geom_smooth`: adds a smoothed line to a scatterplot\\\n`facet_wrap`: facets the plot into multiple panels based on a categorical variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table %>% \n ggplot(aes(x = Race, y = Value)) +\n facet_wrap(~ region, nrow = 2) +\n geom_bar(stat = \"Identity\", fill = \"#98103E\") +\n labs(title = \"Household income race and region in California\",\n subtitle = \"2016-2020 American Community Survey\",\n caption = \"Source data from the U.S. Census Bureau\",\n x = \"Categories\",\n y = \"Values\") + \n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](cheatsheet-R_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nMore usage of `ggplot2` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)\n",
"markdown": "---\ntitle: \"Cheatsheet for R\"\nauthor: \"Yingying Yu\"\ndate: 2023-12-10\ndescription: Useful packages and their famous functions in data analysis\ncategories: [R, tidyverse]\n---\n\n\n# Introduction\n\nDuring the process of data analysis, especially in **data wrangling** and **data visualization**, we often perform similar workflows to reach our target. For example, some functions within the `tidyverse` are really useful to do exploratory data analysis (EDA); to fit our data to a statistical model, we need to first transform our raw data to a specific form of data frame. In addition, we might have a preferred way to present our graphs from the result. This article summarized some of my routine commands, and this page would be keep updating for the future.\n\n# Data wrangling\n\n## Package `tidyverse`\n\n::: callout-note\nThe package `tidyverse` included different small packages, below are what I use frequently:\n\n| Package Name | Core Usage |\n|--------------|------------------------------|\n| `dplyr` | Data manipulation |\n| `tidyr` | Tidy data |\n| `purrr` | Replace loops |\n| `stringr` | Convert strings |\n| `forcats` | Handle categorical variables |\n\nMore `tidyverse` packages can be found on this website: <https://www.tidyverse.org/packages/>.\n:::\n\n------------------------------------------------------------------------\n\n### Package `dplyr`\n\nWhen we first uploaded our data, the first instinct to do is look it up, the `glimpse()` function provide the number of rows and columns of the data frame, as well as the types of data of every column. The base R `table()` function can show the number of each kind of a variable.\n\nLet's use the 2020 U.S. Census data for example. 
`ca_race` contains the household income of different race/ethnicity by counties in California.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nca_race <- readRDS(\"race.RDS\")\nglimpse(ca_race)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 58\nColumns: 9\n$ GEOID <chr> \"06001\", \"06003\", \"06007\", \"06011\", \"06013\", \"06017\", \"06019…\n$ NAME <chr> \"Alameda\", \"Alpine\", \"Butte\", \"Colusa\", \"Contra Costa\", \"El …\n$ whiteE <dbl> 115050, 89583, 57606, 58780, 111149, 82988, 60299, 50559, 50…\n$ blackE <dbl> 56664, NA, 22302, 14946, 73306, 69519, 39621, 30263, 24107, …\n$ nativeE <dbl> 82661, 47321, 52734, 85208, 78815, NA, 52511, 32813, 41250, …\n$ asianE <dbl> 130236, NA, 49375, NA, 121358, 111964, 68274, 38456, 99030, …\n$ islanderE <dbl> 97500, NA, 118667, NA, 90565, 58516, 45595, NA, NA, 73750, 9…\n$ otherE <dbl> 76546, NA, 48707, 67303, 72866, 69948, 45525, 42094, 41571, …\n$ region <chr> \"Bay Area\", \"Capital Region\", \"Rest\", \"Capital Region\", \"Bay…\n```\n:::\n\n```{.r .cell-code}\ntable(ca_race$region)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n Bay Area Capital Region Central Valley Rest \n 9 9 6 27 \nSouthern California \n 7 \n```\n:::\n:::\n\n\n::: callout-tip\nOne of the most efficient way of using functions within the `dplyr` package is along with the pipe `%>%`. 
Since the function names are human-readable and easy to follow, stacking up the functions using pipe make your data wrangling process clean and concise.\n:::\n\n**The most common functions are**\\\n`mutate()`: adds new variables that are functions of existing variables\\\n`filter()`: picks cases based on their values in a variable\\\n`group_by()`: perform any operation by group\\\n`summarise()`: calculate the mean, median, sd, min, max, first, last, n, n_distinct of a variable\\\n`arrange()`: changes the ordering of the rows\\\n`case_when()`: vectorise multiple `if_else()` statements\\\n`na.rm = TRUE`: calculate by removing massing values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race <- ca_race %>% \n mutate(total = rowSums(select(.,whiteE, blackE, nativeE, asianE, islanderE, otherE), \n na.rm = TRUE)) %>%\n mutate(level = case_when(total < 50000 ~ \"Less than 50000\",\n total <= 150000 ~ \"50001 - 150000\",\n total <= 300000 ~ \"150001 - 300000\",\n total <= 500000 ~ \"300001 - 500000\",\n total > 500000 ~ \"More than 500000\",\n TRUE ~ \"Rest\"))\nsum_table <- ca_race %>%\n filter(region != \"Rest\") %>% \n group_by(region) %>%\n summarize(White = mean(whiteE),\n Black = mean(blackE, na.rm = TRUE),\n Native = mean(nativeE, na.rm = TRUE),\n Asian = mean(asianE, na.rm = TRUE),\n Islander = mean(islanderE, na.rm = TRUE),\n Other = mean(otherE, na.rm = TRUE)) %>% \n arrange(region)\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 7\n region White Black Native Asian Islander Other\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 Bay Area 114912. 69085 84676 120988 92294 72606.\n2 Capital Region 72706. 58022. 50799. 79919. 60816. 59030.\n3 Central Valley 59379. 45909. 47797. 68488. 88245. 48077.\n4 Southern California 77601. 59083. 64966. 95986. 82545. 
60632.\n```\n:::\n:::\n\n\nMore usage of `dyplr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)\n\n------------------------------------------------------------------------\n\n### Package `tidyr`\n\n**Common functions:**\\\n`pivot_longer()` and `pivot_wider()`: to converts data between long and wide forms\\\n`complete()`: make implicit missing values explicit\\\n`drop_na()`: make explicit missing values implicit\\\n`fill()`: replace missing values with next/previous value\\\n`replace_na()`: replace missing values with a known value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table <- sum_table %>% \n pivot_longer(cols = c(\"White\", \"Black\", \"Native\", \"Asian\", \"Islander\", \"Other\"),\n names_to = \"Race\",\n values_to = \"Value\")\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 24 × 3\n region Race Value\n <chr> <chr> <dbl>\n 1 Bay Area White 114912.\n 2 Bay Area Black 69085 \n 3 Bay Area Native 84676 \n 4 Bay Area Asian 120988 \n 5 Bay Area Islander 92294 \n 6 Bay Area Other 72606.\n 7 Capital Region White 72706.\n 8 Capital Region Black 58022.\n 9 Capital Region Native 50799.\n10 Capital Region Asian 79919.\n# ℹ 14 more rows\n```\n:::\n:::\n\n\nMore usage of `tidyr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf)\n\n------------------------------------------------------------------------\n\n### Package `forcats`\n\n**Common functions:**\\\n`fct_reorder()`: Reordering a factor by another variable\\\n`fct_infreq()`: Reordering a factor by the frequency of values\\\n`fct_relevel()`: Changing the order of a factor by hand\\\n`fct_lump()`: Collapsing the least/most frequent values of a factor into \"other\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race$level <- fct_relevel(ca_race$level, c(\"More than 500000\",\n \"300001 - 500000\",\n \"150001 - 300000\",\n \"50001 - 150000\",\n \"Less than 50000\"))\nlevels(ca_race$level)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n[1] \"More than 500000\" \"300001 - 500000\" \"150001 - 300000\" \"50001 - 150000\" \n```\n:::\n:::\n\n\nMore usage of `forcats` can be found [here](https://forcats.tidyverse.org)\n\n------------------------------------------------------------------------\n\n# Data visualization\n\n## Package `ggplot2`\n\nThis is most famous package to visualize data in the R community. Although `ggplot2` is also part of the `tidyverse` package, we put it here for the sake of separate topics.\n\n**Common functions:**\\\n`geom_point`: adds points to the plot\\\n`geom_line`: connects points with lines\\\n`geom_histogram()`: creates histograms\\\n`geom_bar`: creates bar plots\\\n`geom_boxplot()`: generates boxplots\\\n`geom_vline`: generate reference lines (horizontal, vertical, or diagonal)\\\n`geom_smooth`: adds a smoothed line to a scatterplot\\\n`facet_wrap`: facets the plot into multiple panels based on a categorical variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table %>% \n ggplot(aes(x = Race, y = Value)) +\n facet_wrap(~ region, nrow = 2) +\n geom_bar(stat = \"Identity\", fill = \"#98103E\") +\n labs(title = \"Household income race and region in California\",\n subtitle = \"2016-2020 American Community Survey\",\n caption = \"Source data from the U.S. Census Bureau\",\n x = \"Categories\",\n y = \"Values\") + \n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](cheatsheet-R_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nMore usage of `ggplot2` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)\n",
"supporting": [
"cheatsheet-R_files"
],
2 changes: 1 addition & 1 deletion docs/projects/Cheatsheets/cheatsheet-R.html
@@ -152,7 +152,7 @@ <h1 class="title">Cheatsheet for R</h1>
</div>
<div class="quarto-categories">
<div class="quarto-category">R</div>
-<div class="quarto-category">Data analysis</div>
+<div class="quarto-category">tidyverse</div>
</div>
</div>
</div>