Update start-simple.md #41

Open: wants to merge 1 commit into base: main
8 changes: 4 additions & 4 deletions texts/start-simple.md
@@ -3,8 +3,8 @@ How this looks in practice will depend on how exploratory you are in your approa

**A more ambitious but extremely useful approach to building models gradually is to work with fake data.** This is particularly apt when you are blinded to part of the data. You do this by assuming some "data generating process" (DGP), which is a function that generates datasets (typically with some noise!), and then programming your analysis to work with the DGP's outputs. This approach has many advantages, including:

-1. testing performance of your methods (e.g., bias, precision, statistical power)
-1. being able to program your analysis without waiting for data
-1. checking your understanding of how your chosen statistical methods work "under the hood"
+* testing performance of your methods (e.g., bias, precision, statistical power)
+* being able to program your analysis without waiting for data
+* checking your understanding of how your chosen statistical methods work "under the hood"
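
As an illustration of the first advantage, here is a minimal sketch of a DGP and a simulation loop that measures bias, precision, and power. Everything in it is an assumption made for the example, not part of the original text: the two-arm design, the function names (`dgp`, `estimate`, `simulate`), the effect size, the noise level, and the 1.96 cut-off for a two-sided test at roughly the 5% level.

```python
import random
import statistics

def dgp(n, effect, noise_sd, rng):
    """Hypothetical DGP: a two-arm trial with a known treatment effect."""
    treated = [i % 2 for i in range(n)]  # alternate assignment, half treated
    outcome = [effect * t + rng.gauss(0.0, noise_sd) for t in treated]
    return treated, outcome

def estimate(treated, outcome):
    """Difference in means between arms and its standard error."""
    y1 = [y for t, y in zip(treated, outcome) if t == 1]
    y0 = [y for t, y in zip(treated, outcome) if t == 0]
    diff = statistics.mean(y1) - statistics.mean(y0)
    se = (statistics.variance(y1) / len(y1)
          + statistics.variance(y0) / len(y0)) ** 0.5
    return diff, se

def simulate(n_sims=500, n=200, effect=0.5, noise_sd=1.0, seed=1):
    """Run the analysis on many fake datasets and summarise its performance."""
    rng = random.Random(seed)
    estimates = []
    rejections = 0
    for _ in range(n_sims):
        diff, se = estimate(*dgp(n, effect, noise_sd, rng))
        estimates.append(diff)
        rejections += abs(diff / se) > 1.96  # two-sided test, approx. 5% level
    return {
        "bias": statistics.mean(estimates) - effect,     # average error
        "precision_sd": statistics.stdev(estimates),     # spread of estimates
        "power": rejections / n_sims,                    # share of rejections
    }

results = simulate()
print(results)
```

Because the true effect is set inside the DGP, the estimates can be compared directly with it, which is not possible with real data.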

In cases where you know something about the problem, you can code a DGP without first looking at the data. Typically, however, you will use some existing data to design your DGP. For example, if studying an intervention meant to reduce child mortality in a country, you can typically find census data with geographical and temporal variation for that country. This information will allow you to simulate a realistic data structure (via resampling, adding noise, or just visually checking that your DGP matches census data) against which you can test your methods. Note as well that programming a DGP first will often help you spot problems in data from your research project. In our example, suppose the project-collected data for an indicator has a distribution which does not resemble that of the same indicator in the census (e.g., it has a much higher mean or much lower variation). Then you may need to ask yourself whether this indicator is defined consistently with other data sources and in what way your sample is representative, or to hypothesise some necessary model adjustments.
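
The resampling-plus-noise idea and the distribution check can be sketched as follows. This is only an illustration under loud assumptions: the `census` values are made up, the helper names `dgp_from_reference` and `compare` are hypothetical, and comparing means and standard deviations is just one crude way to check whether two distributions resemble each other.

```python
import random
import statistics

# Made-up reference values standing in for a census indicator
# (e.g. district-level child mortality per 1,000 live births).
census = [42.0, 55.5, 38.2, 61.0, 47.3, 50.1, 44.8, 58.7, 40.5, 52.9]

def dgp_from_reference(reference, n, noise_sd, rng):
    """Resample reference values with replacement and add Gaussian noise."""
    return [rng.choice(reference) + rng.gauss(0.0, noise_sd) for _ in range(n)]

def compare(sample, reference):
    """Crude consistency check: mean gap (in reference SDs) and SD ratio."""
    ref_sd = statistics.stdev(reference)
    mean_gap = statistics.mean(sample) - statistics.mean(reference)
    return {
        "mean_gap_in_sd": mean_gap / ref_sd,
        "sd_ratio": statistics.stdev(sample) / ref_sd,
    }

rng = random.Random(7)

# Simulated data built from the reference should resemble it closely.
simulated = dgp_from_reference(census, n=1000, noise_sd=2.0, rng=rng)
check_sim = compare(simulated, census)

# A hypothetical project sample with a much higher mean and much less variation.
project = [70.0 + rng.gauss(0.0, 1.0) for _ in range(200)]
check_project = compare(project, census)

print(check_sim)
print(check_project)
```

A large mean gap or an SD ratio far from 1, as in the hypothetical project sample, is the kind of mismatch that should prompt questions about indicator definitions or sample representativeness.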