-
Notifications
You must be signed in to change notification settings - Fork 10
WIP: Cross-validation support for multiple covariate matching #64
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
def stratified_crossvalidate( | ||
covariate_df, | ||
pipeline_estimator=None, | ||
cv_splitter=StratifiedShuffleSplit(n_splits=1, test_size=0.25), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might set this to None, and initialize below. Just to be on the safe side.
|
||
# Check covariate_df is a dataframe | ||
if not isinstance(covariate_df, pd.DataFrame): | ||
raise TypeError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise TypeError | |
raise TypeError("`covariate_df` input needs to be a DataFrame object") |
Is that crucial, by the way? Why not take arrays?
if not isinstance(covariate_df, pd.DataFrame): | ||
raise TypeError | ||
|
||
X = covariate_df.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you can check if it's a dataframe, and if not, skip this line
if pipeline_estimator is None: | ||
pipeline_estimator = make_pipeline( | ||
KNeighborsTransformer(n_neighbors=10, mode="distance"), | ||
Isomap(n_components=5, neighbors_algorithm="auto"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might expose n_neighbors
and n_components
as key-word arguments for the whole function, set to these values as default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not. If they want to customize, just pass your own pipeline_estimator
) | ||
|
||
X_reduced = pipeline_estimator.fit_transform(X) | ||
cluster_stratify = KMeans(n_clusters=5).fit_predict(X_reduced) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here : expose n_clusters
for foldno, (train_index, test_index) in enumerate( | ||
cv_splitter.split(X, cluster_stratify) | ||
): | ||
print("Fold No: {}".format(foldno)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use logging instead of print
?
A scikit-learn object for splitting the dataset into | ||
train/test set such that splits are stratified | ||
|
||
Yields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it Returns these, not Yields
print(train_clusters) | ||
print(train_freq) | ||
print(test_clusters) | ||
print(test_freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, logging?
@mnarayan : any progress on this one? Looks like the build failures are just some minor linting. Would you mind rebasing and adding tests? |
I believe this was superseded by #107 |
No description provided.