-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-data-munging.qmd
95 lines (50 loc) · 7.63 KB
/
05-data-munging.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
# Data Munging -also called data wrangling
Data munging as defined here is the process of moving from [raw data](@sec-rawdata) to data in a processed form that is usable in analysis. Data munging consists of format standardization and Quality Assurance and Quality Control (QAQC).
Other references for what is included in data munging:
[Astera::What is data munging and why is it important?](https://www.astera.com/type/blog/what-is-data-munging/#:~:text=Data%20munging%20is%20the%20process,cleaning%2C%20enrichment%2C%20and%20validation)
- Data Discovery (this is probably not relevant to our purposes) \> Everything begins with a defined goal, and the data analysis journey isn't an exception. Data discovery is the first stage of data munging, where data analysts define data's purpose and how to achieve it through data analytics. The goal is to identify the potential uses and requirements of data.
- Data Structuring (QA - a lot of this happens in the Pre-Data stage if done well) \> Once the requirements are identified and outlined, the next stage is structuring raw data to make it machine-readable. Structured data has a well-defined schema and follows a consistent layout. Think of data neatly organized in rows and columns available in spreadsheets and relational databases.
- Data Cleansing (QC): \> Data cleansing lays the foundation for accurate and efficient data analysis. Several transformations --- like Remove, Replace, Find and Replace, etc. --- are applied to eliminate redundant text and null values as well as identify missing fields, misplaced entries, and typing errors that can distort the analysis.
- Data Enrichment (?): \> It typically involves aggregating multiple data sources. For instance, if an order ID is found within a system, a user can match that order ID against a different database to obtain further details like the account name, account balance, buying history, credit limit, etc. This additional data "enriches" the original ID with greater context.
NOTE: could metadata writing also be considered data enrichment?
- Data Validation (QC?): \> Data validation contrasts with data cleansing in that it rejects any data that don't comply with pre-defined rules or constraints. also checks for the correctness and meaningfulness of the information.
[Alteryx::Data Munging](https://www.alteryx.com/glossary/data-munging)
- Discovery: \> Also known as data profiling. Learn what's in your raw data sets to think ahead about the best approach for your analytic explorations. This step involves gathering data from data sources and forming a high-level picture of the distribution, type, and format of data values. It allows you to understand unique elements of the data such as outliers and value distribution to inform the analysis process.
- Enriching: \> Before you structure and cleanse your data, what else could you add to provide more value to your analysis? Enrichment is often about joins and complex derivations. For example, if you're looking at biking data, perhaps a weather dataset would be an important factor in your analysis.
- Structuring: \> This is a critical step because data can come in all shapes and sizes, and it is up to you to decide the best format to visualize and explore it. Separating, blending, and un-nesting are all important actions in this step.
- Cleaning: \> This step is essential to standardizing your data to ensure that all inconsistencies (such as null and misspelled values) are addressed. Other data may need to be standardized to a single format, such as state abbreviations.
- Validating: \> Verify if you've caught all the data quality and consistency issues and go back to address anything you may have missed. Data validation should be done on multiple dimensions.
- Publishing and orchestrating: \> This is where you can download and deliver the results of your wrangling effort to downstream analytics tools. Once you've published your data it's time to move onto the next step, analytics.
- Examples:
- Data aggregation: Combining data from multiple sources (e.g. spreadsheets, cloud databases, source systems, etc.) by importing, joining tables, and summarizing it based on specific criteria
- Correcting missing data: Inputting missing values, deleting rows or columns with a high percentage of missing data, and using interpolation to estimate missing values
- Converting data types: Changing strings to numeric values, converting datetime formats, and converting categorical data into numerical representations
- Filtering and sorting: Selecting specific rows or columns based on certain criteria or reordering data based on specific values
- Removing duplicates: Identifying and eliminating duplicate rows or records in the data set
- Data normalization: Standardizing or scaling data values to meet a specific range
- Feature engineering: Creating new features or variables from existing data, such as calculating the difference between two columns
- Outlier detection and handling: Seeking out outliers within the data and removing them, capping them, or transforming them if they could affect analysis results
- Text cleaning and processing: Removing unnecessary characters such as whitespace or punctuation, tokenizing text, converting text to lowercase, or stemming/lemmatizing words
- Data transformation: Applying mathematical or statistical transformations to the data, such as taking the logarithm, square root, or exponential of a variable
NOTE: ADD UNIQUE IDS!!!!! How to have IDs sample vs horizon vs pedon vs site ids
[Geeks for Geeks::Data Munging in R Programming](https://www.geeksforgeeks.org/data-munging-in-r-programming/)
[Samantha Oliver - USGS Water Data::Beyond Basic R - Data Munging](https://waterdata.usgs.gov/blog/data-munging/)
NOTE: Gapfilling, imputing, etc
[TJ Murphy::Reproducible Data Munging in R](https://tjmurphy.github.io/jabstb/jaxwest7.html)
[Daniel Dauber::R for Non-Programmers - Data Wranging](https://bookdown.org/daniel_dauber_io/r4np_book/data-wrangling.html) NOTE: this is VERY good
NOTE: you can do some data munging in the read.csv specific arguments: https://www.geeksforgeeks.org/read-contents-of-a-csv-file-in-r-programming-read-csv-function/
List of questions: Is the file in text format (separated values, preferrably csv)?
Does the first row contain column names?
Are there special characters, spaces or other wierdness in the column names? Renaming column names
Are there missing values in the dataset? If so, decide if/how they should be dealt with.
Specifying data types or column classes.
[Factors vs characters](https://datascience.stackexchange.com/questions/12018/when-to-choose-character-instead-of-factor-in-r)
[and](http://monashbioinformaticsplatform.github.io/2015-09-28-rbioinformatics-intro-r/01-supp-factors.html)
## Documenting Data Munging - Standards
As R is the main language of choice in our lab, the default data munging documentation standard involves an R project or series of R scripts. These should be stored in the **src** folder of your project directory and curated according to our lab \[R code and project style guidelines\]#todo.
Data munging documentation should also include metadata for this stage - and could be in the form of part of a project Quarto book published on GitHub Pages. This metadata should document all decisions made in the data munging process, the why and how. Due to the detailed nature of many of these decisions they are too much detail for a standard readme file or a project metadata file. However, a Quarto book with sections, chapters, and space to provide R code blocks would be an excellent way to document these decisions.
::: callout-tip
## Required Tools - Data Munging
- R/RStudio
- Google Drive
:::