Commit

Update description.md
apiraccini authored Jun 3, 2024
1 parent de75e7f commit 160724d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion description.md
@@ -31,7 +31,7 @@ In addition to the primary dataset, we also utilized the background dataset, whi
### Tree selection

We decided to focus on the train set first and only afterwards on the background survey.
- Our aim was to craft a procedure that could test our initial hypothesis. The idea was to develop a quick and automatic way to assess, with reasonable precision, the marginal predictive power of every available feature. We decided to evaluate the predictive performance of a univariate model on the task of predicting fertility, using stratified 5-fold cross-validation and iterating over every available feature. Our model chosen for this procedure was the decision tree, for several reasons: it can gracefully handle missing data, it is better suited than a linear method like logistic regression to categorical/ordinal features stored as numbers, and, last but not least, the optimized implementations available (together with the modest number of rows) allowed us to iterate across tens of thousands of features fairly quickly. The results are summarized below.
+ Our aim was to craft a procedure that could test our initial hypothesis. The idea was to develop a quick and automatic way to assess, with reasonable precision, the marginal predictive power of every available feature. We decided to evaluate the predictive performance of a univariate model on the task of predicting fertility, using stratified 5-fold cross-validation and iterating over every available feature. The chosen model for this procedure was a decision tree, for several reasons: it can gracefully handle missing data, it is better suited than a linear method like logistic regression to categorical/ordinal features stored as numbers, and, last but not least, the optimized implementations available (together with the modest number of rows) allowed us to iterate across tens of thousands of features fairly quickly. The results are summarized below.
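
The screening loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the DataFrame `df`, the target column name `fertility`, the tree depth, and the missing-value sentinel are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def screen_features(df: pd.DataFrame, target: str = "fertility") -> pd.Series:
    """Score each feature with a univariate decision tree via 5-fold stratified CV.

    Returns a Series of mean ROC-AUC scores, sorted best-first.
    """
    y = df[target]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = {}
    for col in df.columns.drop(target):
        # Fit a shallow tree on one feature at a time; the sentinel fill is a
        # simple stand-in for the tree's native missing-value handling, which
        # requires a recent scikit-learn version.
        X = df[[col]].fillna(-999)
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        scores[col] = cross_val_score(tree, X, y, cv=cv, scoring="roc_auc").mean()
    return pd.Series(scores).sort_values(ascending=False)
```

Ranking the resulting scores gives a cheap estimate of each feature's marginal predictive power, which is what the plot below summarizes.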

![gas](./saved/tree_selection_big_year.png)

