commit

yyingying00 · Dec 11, 2023 · f4b2c7c
1 parent ff3a801
Showing 2 changed files with 3 additions and 3 deletions.
diff --git a/_freeze/projects/Cheatsheets/cheatsheet-R/execute-results/html.json b/_freeze/projects/Cheatsheets/cheatsheet-R/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "cf06dc470613197b4fa64b813e863718",
+  "hash": "caed767c8fd45e6e1d5ba0d98a6048d4",
   "result": {
-    "markdown": "---\ntitle: \"Cheatsheet for R\"\nauthor: \"Yingying Yu\"\ndate: 2023-12-10\ndescription: Useful packages and their famous functions in data analysis\ncategories: [R, Data analysis]\n---\n\n\n# Introduction\n\nDuring the process of data analysis, especially in **data wrangling** and **data visualization**, we often perform similar workflows to reach our target. For example, some functions within the `tidyverse` are really useful to do exploratory data analysis (EDA); to fit our data to a statistical model, we need to first transform our raw data to a specific form of data frame. In addition, we might have a preferred way to present our graphs from the result. This article summarized some of my routine commands, and this page would be keep updating for the future.\n\n# Data wrangling\n\n## Package `tidyverse`\n\n::: callout-note\nThe package `tidyverse` included different small packages, below are what I use frequently:\n\n| Package Name | Core Usage                   |\n|--------------|------------------------------|\n| `dplyr`      | Data manipulation            |\n| `tidyr`      | Tidy data                    |\n| `purrr`      | Replace loops                |\n| `stringr`    | Convert strings              |\n| `forcats`    | Handle categorical variables |\n\nMore `tidyverse` packages can be found on this website: <https://www.tidyverse.org/packages/>.\n:::\n\n------------------------------------------------------------------------\n\n### Package `dplyr`\n\nWhen we first uploaded our data, the first instinct to do is look it up, the `glimpse()` function provide the number of rows and columns of the data frame, as well as the types of data of every column. The base R `table()` function can show the number of each kind of a variable.\n\nLet's use the 2020 U.S. Census data for example. 
`ca_race` contains the household income of different race/ethnicity by counties in California.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nca_race <- readRDS(\"race.RDS\")\nglimpse(ca_race)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 58\nColumns: 9\n$ GEOID     <chr> \"06001\", \"06003\", \"06007\", \"06011\", \"06013\", \"06017\", \"06019…\n$ NAME      <chr> \"Alameda\", \"Alpine\", \"Butte\", \"Colusa\", \"Contra Costa\", \"El …\n$ whiteE    <dbl> 115050, 89583, 57606, 58780, 111149, 82988, 60299, 50559, 50…\n$ blackE    <dbl> 56664, NA, 22302, 14946, 73306, 69519, 39621, 30263, 24107, …\n$ nativeE   <dbl> 82661, 47321, 52734, 85208, 78815, NA, 52511, 32813, 41250, …\n$ asianE    <dbl> 130236, NA, 49375, NA, 121358, 111964, 68274, 38456, 99030, …\n$ islanderE <dbl> 97500, NA, 118667, NA, 90565, 58516, 45595, NA, NA, 73750, 9…\n$ otherE    <dbl> 76546, NA, 48707, 67303, 72866, 69948, 45525, 42094, 41571, …\n$ region    <chr> \"Bay Area\", \"Capital Region\", \"Rest\", \"Capital Region\", \"Bay…\n```\n:::\n\n```{.r .cell-code}\ntable(ca_race$region)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n           Bay Area      Capital Region      Central Valley                Rest \n                  9                   9                   6                  27 \nSouthern California \n                  7 \n```\n:::\n:::\n\n\n::: callout-tip\nOne of the most efficient way of using functions within the `dplyr` package is along with the pipe `%>%`. 
Since the function names are human-readable and easy to follow, stacking up the functions using pipe make your data wrangling process clean and concise.\n:::\n\n**The most common functions are**\\\n`mutate()`: adds new variables that are functions of existing variables\\\n`filter()`: picks cases based on their values in a variable\\\n`group_by()`: perform any operation by group\\\n`summarise()`: calculate the mean, median, sd, min, max, first, last, n, n_distinct of a variable\\\n`arrange()`: changes the ordering of the rows\\\n`case_when()`: vectorise multiple `if_else()` statements\\\n`na.rm = TRUE`: calculate by removing massing values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race <- ca_race %>% \n  mutate(total = rowSums(select(.,whiteE, blackE, nativeE, asianE, islanderE, otherE), \n                         na.rm = TRUE)) %>%\n  mutate(level = case_when(total < 50000 ~ \"Less than 50000\",\n                           total <= 150000 ~ \"50001 - 150000\",\n                           total <= 300000 ~ \"150001 - 300000\",\n                           total <= 500000 ~ \"300001 - 500000\",\n                           total > 500000 ~ \"More than 500000\",\n                           TRUE ~ \"Rest\"))\nsum_table <- ca_race %>%\n  filter(region != \"Rest\") %>%      \n  group_by(region) %>%\n  summarize(White = mean(whiteE),\n            Black = mean(blackE, na.rm = TRUE),\n            Native = mean(nativeE, na.rm = TRUE),\n            Asian = mean(asianE, na.rm = TRUE),\n            Islander = mean(islanderE, na.rm = TRUE),\n            Other = mean(otherE, na.rm = TRUE)) %>% \n  arrange(region)\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 7\n  region                White  Black Native   Asian Islander  Other\n  <chr>                 <dbl>  <dbl>  <dbl>   <dbl>    <dbl>  <dbl>\n1 Bay Area            114912. 69085  84676  120988    92294  72606.\n2 Capital Region       72706. 58022. 50799.  79919.   60816. 
59030.\n3 Central Valley       59379. 45909. 47797.  68488.   88245. 48077.\n4 Southern California  77601. 59083. 64966.  95986.   82545. 60632.\n```\n:::\n:::\n\n\nMore usage of `dyplr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)\n\n------------------------------------------------------------------------\n\n### Package `tidyr`\n\n**Common functions:**\\\n`pivot_longer()` and `pivot_wider()`: to converts data between long and wide forms\\\n`complete()`: make implicit missing values explicit\\\n`drop_na()`: make explicit missing values implicit\\\n`fill()`: replace missing values with next/previous value\\\n`replace_na()`: replace missing values with a known value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table <- sum_table %>% \n  pivot_longer(cols = c(\"White\", \"Black\", \"Native\", \"Asian\", \"Islander\", \"Other\"),\n               names_to = \"Race\",\n               values_to = \"Value\")\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 24 × 3\n   region         Race       Value\n   <chr>          <chr>      <dbl>\n 1 Bay Area       White    114912.\n 2 Bay Area       Black     69085 \n 3 Bay Area       Native    84676 \n 4 Bay Area       Asian    120988 \n 5 Bay Area       Islander  92294 \n 6 Bay Area       Other     72606.\n 7 Capital Region White     72706.\n 8 Capital Region Black     58022.\n 9 Capital Region Native    50799.\n10 Capital Region Asian     79919.\n# ℹ 14 more rows\n```\n:::\n:::\n\n\nMore usage of `tidyr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf)\n\n------------------------------------------------------------------------\n\n### Package `forcats`\n\n**Common functions:**\\\n`fct_reorder()`: Reordering a factor by another variable\\\n`fct_infreq()`: Reordering a factor by the frequency of values\\\n`fct_relevel()`: Changing the order of a factor by hand\\\n`fct_lump()`: Collapsing the least/most frequent values of a factor 
into \"other\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race$level <- fct_relevel(ca_race$level, c(\"More than 500000\",\n                                              \"300001 - 500000\",\n                                              \"150001 - 300000\",\n                                              \"50001 - 150000\",\n                                              \"Less than 50000\"))\nlevels(ca_race$level)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"More than 500000\" \"300001 - 500000\"  \"150001 - 300000\"  \"50001 - 150000\"  \n```\n:::\n:::\n\n\nMore usage of `forcats` can be found [here](https://forcats.tidyverse.org)\n\n------------------------------------------------------------------------\n\n# Data visualization\n\n## Package `ggplot2`\n\nThis is most famous package to visualize data in the R community. Although `ggplot2` is also part of the `tidyverse` package, we put it here for the sake of separate topics.\n\n**Common functions:**\\\n`geom_point`: adds points to the plot\\\n`geom_line`: connects points with lines\\\n`geom_histogram()`: creates histograms\\\n`geom_bar`: creates bar plots\\\n`geom_boxplot()`: generates boxplots\\\n`geom_vline`: generate reference lines (horizontal, vertical, or diagonal)\\\n`geom_smooth`: adds a smoothed line to a scatterplot\\\n`facet_wrap`: facets the plot into multiple panels based on a categorical variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table %>% \n  ggplot(aes(x = Race, y = Value)) +\n  facet_wrap(~ region, nrow = 2) +\n  geom_bar(stat = \"Identity\", fill = \"#98103E\") +\n  labs(title = \"Household income race and region in California\",\n       subtitle = \"2016-2020 American Community Survey\",\n       caption = \"Source data from the U.S. 
Census Bureau\",\n       x = \"Categories\",\n       y = \"Values\") + \n  theme_minimal()\n```\n\n::: {.cell-output-display}\n![](cheatsheet-R_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nMore usage of `ggplot2` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)\n",
+    "markdown": "---\ntitle: \"Cheatsheet for R\"\nauthor: \"Yingying Yu\"\ndate: 2023-12-10\ndescription: Useful packages and their famous functions in data analysis\ncategories: [R, tidyverse]\n---\n\n\n# Introduction\n\nDuring the process of data analysis, especially in **data wrangling** and **data visualization**, we often perform similar workflows to reach our target. For example, some functions within the `tidyverse` are really useful to do exploratory data analysis (EDA); to fit our data to a statistical model, we need to first transform our raw data to a specific form of data frame. In addition, we might have a preferred way to present our graphs from the result. This article summarized some of my routine commands, and this page would be keep updating for the future.\n\n# Data wrangling\n\n## Package `tidyverse`\n\n::: callout-note\nThe package `tidyverse` included different small packages, below are what I use frequently:\n\n| Package Name | Core Usage                   |\n|--------------|------------------------------|\n| `dplyr`      | Data manipulation            |\n| `tidyr`      | Tidy data                    |\n| `purrr`      | Replace loops                |\n| `stringr`    | Convert strings              |\n| `forcats`    | Handle categorical variables |\n\nMore `tidyverse` packages can be found on this website: <https://www.tidyverse.org/packages/>.\n:::\n\n------------------------------------------------------------------------\n\n### Package `dplyr`\n\nWhen we first uploaded our data, the first instinct to do is look it up, the `glimpse()` function provide the number of rows and columns of the data frame, as well as the types of data of every column. The base R `table()` function can show the number of each kind of a variable.\n\nLet's use the 2020 U.S. Census data for example. 
`ca_race` contains the household income of different race/ethnicity by counties in California.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nca_race <- readRDS(\"race.RDS\")\nglimpse(ca_race)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 58\nColumns: 9\n$ GEOID     <chr> \"06001\", \"06003\", \"06007\", \"06011\", \"06013\", \"06017\", \"06019…\n$ NAME      <chr> \"Alameda\", \"Alpine\", \"Butte\", \"Colusa\", \"Contra Costa\", \"El …\n$ whiteE    <dbl> 115050, 89583, 57606, 58780, 111149, 82988, 60299, 50559, 50…\n$ blackE    <dbl> 56664, NA, 22302, 14946, 73306, 69519, 39621, 30263, 24107, …\n$ nativeE   <dbl> 82661, 47321, 52734, 85208, 78815, NA, 52511, 32813, 41250, …\n$ asianE    <dbl> 130236, NA, 49375, NA, 121358, 111964, 68274, 38456, 99030, …\n$ islanderE <dbl> 97500, NA, 118667, NA, 90565, 58516, 45595, NA, NA, 73750, 9…\n$ otherE    <dbl> 76546, NA, 48707, 67303, 72866, 69948, 45525, 42094, 41571, …\n$ region    <chr> \"Bay Area\", \"Capital Region\", \"Rest\", \"Capital Region\", \"Bay…\n```\n:::\n\n```{.r .cell-code}\ntable(ca_race$region)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n           Bay Area      Capital Region      Central Valley                Rest \n                  9                   9                   6                  27 \nSouthern California \n                  7 \n```\n:::\n:::\n\n\n::: callout-tip\nOne of the most efficient way of using functions within the `dplyr` package is along with the pipe `%>%`. 
Since the function names are human-readable and easy to follow, stacking up the functions using pipe make your data wrangling process clean and concise.\n:::\n\n**The most common functions are**\\\n`mutate()`: adds new variables that are functions of existing variables\\\n`filter()`: picks cases based on their values in a variable\\\n`group_by()`: perform any operation by group\\\n`summarise()`: calculate the mean, median, sd, min, max, first, last, n, n_distinct of a variable\\\n`arrange()`: changes the ordering of the rows\\\n`case_when()`: vectorise multiple `if_else()` statements\\\n`na.rm = TRUE`: calculate by removing massing values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race <- ca_race %>% \n  mutate(total = rowSums(select(.,whiteE, blackE, nativeE, asianE, islanderE, otherE), \n                         na.rm = TRUE)) %>%\n  mutate(level = case_when(total < 50000 ~ \"Less than 50000\",\n                           total <= 150000 ~ \"50001 - 150000\",\n                           total <= 300000 ~ \"150001 - 300000\",\n                           total <= 500000 ~ \"300001 - 500000\",\n                           total > 500000 ~ \"More than 500000\",\n                           TRUE ~ \"Rest\"))\nsum_table <- ca_race %>%\n  filter(region != \"Rest\") %>%      \n  group_by(region) %>%\n  summarize(White = mean(whiteE),\n            Black = mean(blackE, na.rm = TRUE),\n            Native = mean(nativeE, na.rm = TRUE),\n            Asian = mean(asianE, na.rm = TRUE),\n            Islander = mean(islanderE, na.rm = TRUE),\n            Other = mean(otherE, na.rm = TRUE)) %>% \n  arrange(region)\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 7\n  region                White  Black Native   Asian Islander  Other\n  <chr>                 <dbl>  <dbl>  <dbl>   <dbl>    <dbl>  <dbl>\n1 Bay Area            114912. 69085  84676  120988    92294  72606.\n2 Capital Region       72706. 58022. 50799.  79919.   60816. 
59030.\n3 Central Valley       59379. 45909. 47797.  68488.   88245. 48077.\n4 Southern California  77601. 59083. 64966.  95986.   82545. 60632.\n```\n:::\n:::\n\n\nMore usage of `dyplr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)\n\n------------------------------------------------------------------------\n\n### Package `tidyr`\n\n**Common functions:**\\\n`pivot_longer()` and `pivot_wider()`: to converts data between long and wide forms\\\n`complete()`: make implicit missing values explicit\\\n`drop_na()`: make explicit missing values implicit\\\n`fill()`: replace missing values with next/previous value\\\n`replace_na()`: replace missing values with a known value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table <- sum_table %>% \n  pivot_longer(cols = c(\"White\", \"Black\", \"Native\", \"Asian\", \"Islander\", \"Other\"),\n               names_to = \"Race\",\n               values_to = \"Value\")\nsum_table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 24 × 3\n   region         Race       Value\n   <chr>          <chr>      <dbl>\n 1 Bay Area       White    114912.\n 2 Bay Area       Black     69085 \n 3 Bay Area       Native    84676 \n 4 Bay Area       Asian    120988 \n 5 Bay Area       Islander  92294 \n 6 Bay Area       Other     72606.\n 7 Capital Region White     72706.\n 8 Capital Region Black     58022.\n 9 Capital Region Native    50799.\n10 Capital Region Asian     79919.\n# ℹ 14 more rows\n```\n:::\n:::\n\n\nMore usage of `tidyr` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf)\n\n------------------------------------------------------------------------\n\n### Package `forcats`\n\n**Common functions:**\\\n`fct_reorder()`: Reordering a factor by another variable\\\n`fct_infreq()`: Reordering a factor by the frequency of values\\\n`fct_relevel()`: Changing the order of a factor by hand\\\n`fct_lump()`: Collapsing the least/most frequent values of a factor 
into \"other\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\nca_race$level <- fct_relevel(ca_race$level, c(\"More than 500000\",\n                                              \"300001 - 500000\",\n                                              \"150001 - 300000\",\n                                              \"50001 - 150000\",\n                                              \"Less than 50000\"))\nlevels(ca_race$level)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"More than 500000\" \"300001 - 500000\"  \"150001 - 300000\"  \"50001 - 150000\"  \n```\n:::\n:::\n\n\nMore usage of `forcats` can be found [here](https://forcats.tidyverse.org)\n\n------------------------------------------------------------------------\n\n# Data visualization\n\n## Package `ggplot2`\n\nThis is most famous package to visualize data in the R community. Although `ggplot2` is also part of the `tidyverse` package, we put it here for the sake of separate topics.\n\n**Common functions:**\\\n`geom_point`: adds points to the plot\\\n`geom_line`: connects points with lines\\\n`geom_histogram()`: creates histograms\\\n`geom_bar`: creates bar plots\\\n`geom_boxplot()`: generates boxplots\\\n`geom_vline`: generate reference lines (horizontal, vertical, or diagonal)\\\n`geom_smooth`: adds a smoothed line to a scatterplot\\\n`facet_wrap`: facets the plot into multiple panels based on a categorical variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum_table %>% \n  ggplot(aes(x = Race, y = Value)) +\n  facet_wrap(~ region, nrow = 2) +\n  geom_bar(stat = \"Identity\", fill = \"#98103E\") +\n  labs(title = \"Household income race and region in California\",\n       subtitle = \"2016-2020 American Community Survey\",\n       caption = \"Source data from the U.S. 
Census Bureau\",\n       x = \"Categories\",\n       y = \"Values\") + \n  theme_minimal()\n```\n\n::: {.cell-output-display}\n![](cheatsheet-R_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\nMore usage of `ggplot2` can be found [here](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf)\n",
     "supporting": [
       "cheatsheet-R_files"
     ],

diff --git a/docs/projects/Cheatsheets/cheatsheet-R.html b/docs/projects/Cheatsheets/cheatsheet-R.html
@@ -152,7 +152,7 @@ <h1 class="title">Cheatsheet for R</h1>
       </div>
                           <div class="quarto-categories">
                 <div class="quarto-category">R</div>
-                <div class="quarto-category">Data analysis</div>
+                <div class="quarto-category">tidyverse</div>
               </div>
                   </div>
   </div>