look at dataset balance
wassname committed Oct 21, 2020
1 parent 5eedb2b commit efbb0ce
Showing 2 changed files with 95 additions and 20 deletions.
104 changes: 84 additions & 20 deletions notebooks/b05_Supervised_Learning/supervised_part1.ipynb

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions notebooks/b05_Supervised_Learning/supervised_part1.py
@@ -149,6 +149,17 @@
y = geolink["LITHOLOGY_GEOLINK"]
X

# NOTE: our dataset is imbalanced. Class balance is always important to check when considering a problem.
#
# 1. Our baseline accuracy is 46.5%.
# 2. If we get poor performance, we may want to consider techniques for dealing with imbalanced data. However, we do not do this in the notebook.

# Check dataset label balance
counts = y.value_counts()
counts = counts[counts > 0] / counts.sum()  # drop empty classes, convert counts to fractions
counts.plot.bar()
counts

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=2020
)
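
The note in this hunk quotes a baseline accuracy of 46.5. Presumably this is the share of the most common lithology class, i.e. the accuracy a model would get by always predicting that class. The sketch below shows that calculation using only the standard library; the function name `majority_baseline` and the toy labels are hypothetical and are not part of the commit or the GEOLINK data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, fraction of samples with that label)."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Hypothetical toy labels, not the real LITHOLOGY_GEOLINK column.
toy_labels = ["Shale"] * 5 + ["Sandstone"] * 3 + ["Limestone"] * 2

label, share = majority_baseline(toy_labels)
print(f"Always predicting {label!r} gives {share:.1%} accuracy")
```

In the notebook itself the same number should be readable from the normalized `counts` series as its largest value, assuming `counts` holds the class fractions as computed above.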