Update description.md
apiraccini authored Jun 3, 2024
1 parent 09f8d75 commit 70f884b
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions description.md
@@ -12,7 +12,7 @@ The primary training dataset comprises individuals aged 18-45 in 2020 who partic

Approaching this prediction task, the first thing that came to our mind was this basic principle: even though the datasets are curated and prepared for this competition, the panel wasn't conceived with the intent of predicting individual fertility.
This led to three main hypotheses regarding the features.
- It is very likely that for predicting fertility during 2021-2023, the most important information will come from surveys conducted around 2021, and similarly, features from ten or more years prior will probably be weak in terms of predictive capability. If this is confirmed, it might be beneficial to primarily use the information from surveys conducted in recent years, while supplementing it with relevant information from previous years.
- It is very likely that for predicting fertility during 2021-2023, the most important information will come from surveys conducted around 2021, and similarly, features from ten or more years prior will probably be weak in terms of predictive capability. If this is confirmed, it might be beneficial to primarily use the information from surveys conducted in recent years, supplementing it with relevant information from previous years.
- Despite having hundreds of survey questions for each topic, the most important information will probably lie in a few features, and some survey topics might not be useful for prediction. Therefore, it will be crucial to understand which features to select to best utilize the information in the important variables, thereby reducing the data to the most relevant subset.
- We also expect that handling missing values will be relevant to our task: it's possible that the missing values are not caused by a purely random component but may, in some cases, contain information relevant to the study. Given that the data consists of survey questions and responses, information may be missing due to the branching nature of questions during the survey completion process, where some questions appear only if the respondent has answered a previous question in a certain way. For example, questions regarding existing children will be missing if the respondent has not had any children yet.
Furthermore, considering that the data comes from a longitudinal survey with yearly surveys, it is possible that missing information for some variables in a given year can be correctly and consistently obtained from previous years (if it makes sense to do so).
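
The last hypothesis can be illustrated with a minimal sketch. It is not the preprocessing actually used in this submission: the function names `carry_forward` and `missing_flags` and the example respondent are hypothetical, and the sketch assumes one value per survey year, with `None` marking a missing answer. It keeps a missingness flag (since missingness may itself be informative) and fills gaps from the most recent earlier year.

```python
from typing import Dict, Optional


def carry_forward(yearly: Dict[int, Optional[float]]) -> Dict[int, Optional[float]]:
    """Fill each missing yearly value with the most recent earlier observation.

    Years before the first observed value stay None, because there is
    nothing earlier to copy from.
    """
    filled: Dict[int, Optional[float]] = {}
    last_seen: Optional[float] = None
    for year in sorted(yearly):
        value = yearly[year]
        if value is not None:
            last_seen = value
        filled[year] = last_seen
    return filled


def missing_flags(yearly: Dict[int, Optional[float]]) -> Dict[int, int]:
    """Return 1 for years where the original answer was missing, else 0."""
    return {year: int(value is None) for year, value in yearly.items()}


# Hypothetical respondent: number of children reported in each survey year.
n_children = {2017: None, 2018: 1.0, 2019: None, 2020: None}

flags = missing_flags(n_children)   # computed before filling, on the raw answers
filled = carry_forward(n_children)
```

Computing the flags on the raw answers first preserves the information that a question was skipped, for example because of survey branching, even after the value itself has been filled in. Whether carrying a value forward is sensible still has to be judged variable by variable.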
@@ -516,4 +516,4 @@ for epoch in range(num_epochs):
# evaluate on validation set
val_loss, val_acc = evaluate(model, device, data_loader)
print(f'\tval loss: {val_loss:.4f}, val acc: {val_acc:.4f}')
```
```
