Selective: Feature Selection Library

Selective is a white-box feature selection library that supports supervised and unsupervised selection methods for classification and regression tasks.

The library provides:

  • Simple to complex selection methods: Variance, Correlation, Statistical, Linear, Tree-based, or Customized.
  • Text-based selection to maximize diversity in text embeddings and metadata coverage.
  • Interoperable with pandas data frames as input.
  • Automated task detection. No need to know what feature selection method works with what machine learning task.
  • Benchmarking multiple selectors using cross-validation with built-in parallelization.
  • Inspection of the results and feature importance.

Selective also provides optimized item selection based on diversity of text embeddings via TextWiser and coverage of binary labels via multi-objective optimization (AMAI'24, CPAIOR'21, DSO@IJCAI'22). This approach speeds up online experimentation and boosts recommender systems significantly, as presented at NVIDIA GTC'22.

Selective is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments.

Quick Start

# Imports
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

# Data
data, label = get_data_label(fetch_california_housing())

# Feature selectors from simple to more complex
selector = Selective(SelectionMethod.Variance(threshold=0.0))
selector = Selective(SelectionMethod.Correlation(threshold=0.5, method="pearson"))
selector = Selective(SelectionMethod.Statistical(num_features=3, method="anova"))
selector = Selective(SelectionMethod.Linear(num_features=3, regularization="none"))
selector = Selective(SelectionMethod.TreeBased(num_features=3))

# Feature reduction
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))
print("Scores:", list(selector.get_absolute_scores()))
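
If you need to reuse the same reduction on new data, the selector can be fit once and then applied again. This is a minimal sketch assuming Selective follows the scikit-learn fit/transform convention suggested by the fit_transform call above:

# Fit once, then reuse the fitted selector on other data
# (assumes scikit-learn style fit/transform, consistent with fit_transform above)
selector = Selective(SelectionMethod.TreeBased(num_features=3))
selector.fit(data, label)
subset = selector.transform(data)
print("Reduction:", list(subset.columns))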

Available Methods

  • Variance per Feature: threshold
  • Correlation pairwise Features: Pearson Correlation Coefficient, Kendall Rank Correlation Coefficient, Spearman's Rank Correlation Coefficient
  • Statistical Analysis: ANOVA F-test Classification, F-value Regression, Chi-Square, Mutual Information Classification, Variance Inflation Factor
  • Linear Methods: Linear Regression, Logistic Regression, Lasso Regularization, Ridge Regularization
  • Tree-based Methods: Decision Tree, Random Forest, Extra Trees Classifier, XGBoost, LightGBM, AdaBoost, CatBoost, Gradient Boosting Tree
  • Text-based Methods: featurization_method = TextWiser; optimization_method = ["exact", "greedy", "kmeans", "random"]; cost_metric = ["unicost", "diverse"]
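
The tree-based methods accept any of the estimators above via the estimator argument, as the Benchmarking section below also shows. A minimal sketch, reusing the California housing data and imports from the Quick Start:

# Tree-based selection with a custom estimator (see Benchmarking below)
from xgboost import XGBRegressor
selector = Selective(SelectionMethod.TreeBased(num_features=3,
                                               estimator=XGBRegressor(n_estimators=50, max_depth=5)))
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))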

Benchmarking

# Imports
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from xgboost import XGBClassifier, XGBRegressor
from feature.selector import SelectionMethod, benchmark, calculate_statistics

# Data
data, label = get_data_label(fetch_california_housing())

# Selectors
corr_threshold = 0.5
num_features = 3
tree_params = {"n_estimators": 50, "max_depth": 5, "random_state": 111, "n_jobs": 4}
selectors = {

  # Correlation methods
  "corr_pearson": SelectionMethod.Correlation(corr_threshold, method="pearson"),
  "corr_kendall": SelectionMethod.Correlation(corr_threshold, method="kendall"),
  "corr_spearman": SelectionMethod.Correlation(corr_threshold, method="spearman"),
  
  # Statistical methods
  "stat_anova": SelectionMethod.Statistical(num_features, method="anova"),
  "stat_chi_square": SelectionMethod.Statistical(num_features, method="chi_square"),
  "stat_mutual_info": SelectionMethod.Statistical(num_features, method="mutual_info"),
  
  # Linear methods
  "linear": SelectionMethod.Linear(num_features, regularization="none"),
  "lasso": SelectionMethod.Linear(num_features, regularization="lasso", alpha=1000),
  "ridge": SelectionMethod.Linear(num_features, regularization="ridge", alpha=1000),
  
  # Non-linear tree-based methods
  "random_forest": SelectionMethod.TreeBased(num_features),
  "xgboost_classif": SelectionMethod.TreeBased(num_features, estimator=XGBClassifier(**tree_params)),
  "xgboost_regress": SelectionMethod.TreeBased(num_features, estimator=XGBRegressor(**tree_params))
}

# Benchmark (sequential)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Benchmark (in parallel)
score_df, selected_df, runtime_df = benchmark(selectors, data, label, cv=5, n_jobs=4)
print(score_df, "\n\n", selected_df, "\n\n", runtime_df)

# Get benchmark statistics by feature
stats_df = calculate_statistics(score_df, selected_df)
print(stats_df)
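
The returned data frames can also be post-processed directly. The snippet below is a sketch rather than part of the library API: it assumes score_df holds one row per feature and one column per selector, normalizes each selector's scores, and averages them into a single ranking. Verify the orientation against your own output before relying on it.

# Hypothetical post-processing: average normalized scores across selectors
# (assumes score_df is indexed by feature with one column per selector)
normalized_df = score_df / score_df.abs().max()
print(normalized_df.mean(axis=1).sort_values(ascending=False))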

Text-based Selection

This example shows how to use text-based selection. In this scenario, we would like to select a subset of articles that is most diverse in the text embedding space and covers a range of topics.

# Import Selective and TextWiser
import pandas as pd
from feature.selector import Selective, SelectionMethod
from textwiser import TextWiser, Embedding, Transformation

# Data with the text content of each article
data = pd.DataFrame({"article_1": ["article text here"],
                     "article_2": ["article text here"],
                     "article_3": ["article text here"],
                     "article_4": ["article text here"],
                     "article_5": ["article text here"]})

# Labels to denote 0/1 coverage metadata for each article 
# across four labels, e.g., sports, international, entertainment, science    
labels = pd.DataFrame({"article_1": [1, 1, 0, 1],
                       "article_2": [0, 1, 0, 0],
                       "article_3": [0, 0, 1, 0],
                       "article_4": [0, 0, 1, 1],
                       "article_5": [1, 1, 1, 0]},
                      index=["label_1", "label_2", "label_3", "label_4"])

# TextWiser featurization method to create text embeddings
textwiser = TextWiser(Embedding.TfIdf(), Transformation.NMF(n_components=20))

# Text-based selection
# The goal is to select a subset of articles 
# that is most diverse in the text embedding space of articles
# and covers the most labels in each topic
selector = Selective(SelectionMethod.TextBased(num_features=2, featurization_method=textwiser))

# Feature reduction
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))
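
The selection strategy can be tuned with the optimization_method and cost_metric arguments listed in Available Methods above. For example, a greedy solver with the diversity-based cost metric:

# Same selection with an explicit solver and cost metric
# (values taken from the Available Methods table above)
selector = Selective(SelectionMethod.TextBased(num_features=2,
                                               featurization_method=textwiser,
                                               optimization_method="greedy",
                                               cost_metric="diverse"))
subset = selector.fit_transform(data, labels)
print("Reduction:", list(subset.columns))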

Visualization

import pandas as pd
from sklearn.datasets import fetch_california_housing
from feature.utils import get_data_label
from feature.selector import SelectionMethod, Selective, plot_importance

# Data
data, label = get_data_label(fetch_california_housing())

# Feature Selector
selector = Selective(SelectionMethod.Linear(num_features=8, regularization="none"))
subset = selector.fit_transform(data, label)

# Plot Feature Importance
df = pd.DataFrame(selector.get_absolute_scores(), index=data.columns)
plot_importance(df)

Installation

Selective requires Python 3.7+ and can be installed from PyPI using pip install selective.

Source

Alternatively, you can build a wheel package on your platform from scratch using the source code:

git clone https://github.com/fidelity/selective.git
cd selective
pip install setuptools wheel # if wheel is not installed
python setup.py sdist bdist_wheel
pip install dist/selective-X.X.X-py3-none-any.whl

Test your setup

cd selective
python -m unittest discover tests

Citation

If you use Selective in a publication, please cite it as:

    @article{kadioglu2024integrating,
      author  = {Kad\i{}o\u{g}lu, Serdar and Kleynhans, Bernard and Wang, Xin},
      title   = {Integrating optimized item selection with active learning for continuous exploration in recommender systems},
      journal = {Ann. Math. Artif. Intell.},
      year    = {2024},
      url     = {https://doi.org/10.1007/s10472-024-09941-x},
      doi     = {10.1007/s10472-024-09941-x}
    }

Support

Please submit bug reports and feature requests as Issues.

License

Selective is licensed under the Apache 2.0 License.