Update description.md
apiraccini authored Jun 3, 2024
1 parent 70f884b commit a2e74d0
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion description.md
@@ -14,7 +14,7 @@ Approaching this prediction task, the first thing that came to our mind was this
This led to three main hypotheses regarding the features.
- It is very likely that for predicting fertility during 2021-2023, the most important information will come from surveys conducted around 2021, and similarly, features from ten or more years prior will probably be weak in terms of predictive capability. If this is confirmed, it might be beneficial to primarily use the information from surveys conducted in recent years, supplementing it with relevant information from previous years.
- Despite having hundreds of survey questions for each topic, the most important information will probably lie in a few features, and some survey topics might not be useful for prediction. Therefore, it will be crucial to understand which features to select to best utilize the information in the important variables, thereby reducing the data to the most relevant subset.
- We also expect that handling missing values will be relevant to our task: it's possible that the missing values are not caused by a purely random component but may, in some cases, contain information relevant to the study. Given that the data consists of survey questions and responses, information missing may be missing due to the branching nature of questions during the survey completion process, where some questions appear only if the respondent has answered a previous question in a certain way. For example, questions regarding existing children will be missing if the respondent has not had any children yet.
- We also expect that handling missing values will be relevant to our task: it's possible that those missing values are not caused by a purely random component but may, in some cases, contain information relevant to the study. Given that the data consists of survey questions and responses, some answers may be missing due to the branching nature of questions during the survey completion process, where some questions appear only if the respondent has answered a previous question in a certain way. For example, questions regarding existing children will be missing if the respondent has not had any children yet.
Furthermore, considering that the data comes from a longitudinal survey with yearly surveys, it is possible that missing information for some variables in a given year can be correctly and consistently obtained from previous years (if it makes sense to do so).
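
The idea of treating missingness as informative while recovering values from earlier survey years can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name, the `<variable>_<year>` key convention, and the `n_children` variable are hypothetical and do not come from the actual dataset. Carrying a value forward is only sensible for variables that change slowly over time.

```python
def add_missing_flags_and_carry_forward(row, variable, years):
    """Return a copy of `row` with a missingness flag per year and
    gaps filled from the most recent earlier year that has a value.

    Assumes (hypothetically) that keys look like "<variable>_<year>"
    and that None marks a missing answer.
    """
    out = dict(row)
    last_seen = None
    for year in sorted(years):
        key = f"{variable}_{year}"
        value = row.get(key)
        # Keep the fact that the answer was missing as its own feature.
        out[f"{key}_missing"] = value is None
        if value is None:
            # Fill from the latest earlier answer; stays None if there is none.
            out[key] = last_seen
        else:
            last_seen = value
    return out


# Hypothetical respondent with answers missing in 2019 and 2020.
respondent = {
    "n_children_2018": 1,
    "n_children_2019": None,
    "n_children_2020": None,
}
filled = add_missing_flags_and_carry_forward(
    respondent, "n_children", [2018, 2019, 2020]
)
```

Keeping the `_missing` flags next to the filled values lets a model use both the recovered answer and the fact that the question was skipped, which matters when a skip reflects survey branching rather than chance.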

On top of these simple considerations, it is worth adding that we expect very few features to be the most important predictors: common sense suggests that the main predictors of fertility should be age, relationship status and stability, household economic situation, and whether there are already children in the household. Nonetheless, thanks to the vastness of the data at hand, it will still be possible to identify non-obvious and intriguing relationships.