The git repository containing DSBox cleaning related primitives is here. The git repository for DSBox primitives related to featurization is located here.
This is a multi-purpose cleaning featurizer primitive. It requires metadata annotations from ISI's profiling primitive; see d3m.primitives.dsbox.Profiler below. The cleaning featurization operations supported include:
- Split date column into multiple columns, e.g. year, month, date, day
- Split US phone number into multiple columns.
- Split column with consistent alpha-numeric value patterns, e.g. '2days' into multiple columns.
- Split column with consistent punctuation value patterns, e.g. 'NY_US' into multiple columns.
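The four splitting operations listed above can be illustrated with a minimal standard-library sketch. This is not the DSBox implementation: the function names, the output keys, and the default date format are assumptions made for illustration only.

```python
import re
from datetime import datetime


def split_date(value, fmt="%Y-%m-%d"):
    """Split a date string into year, month, day of month, and weekday (Monday=0)."""
    d = datetime.strptime(value, fmt)
    return {"year": d.year, "month": d.month, "date": d.day, "day": d.weekday()}


def split_us_phone(value):
    """Split a US phone number into area code, exchange, and line number."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None  # not tokenizable as a US phone number
    return {"area_code": digits[:3], "exchange": digits[3:6], "line": digits[6:]}


def split_alnum(value):
    """Split a value such as '2days' into runs of digits and runs of letters."""
    return re.findall(r"\d+|[A-Za-z]+", value)


def split_punct(value):
    """Split a value such as 'NY_US' on any run of non-alphanumeric characters."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", value) if token]
```

Each token returned by these helpers would become one output column in the split table.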
Fold multiple columns into one column based on common column name prefix. For example, fold columns with names 'month-jan', 'month-feb', 'month-mar' and so on, into one column named 'month'.
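As a rough illustration of folding, the sketch below assumes the prefixed columns are 0/1 indicator columns and that the suffix of the first non-zero one becomes the folded value. The source does not say how values are combined, so both that rule and the function name `fold_columns` are assumptions.

```python
def fold_columns(rows, prefix, sep="-"):
    """Fold columns named '<prefix><sep><suffix>' into one column named <prefix>.

    rows is a list of dicts, one dict per table row.
    """
    head = prefix + sep
    folded = []
    for row in rows:
        # Keep every column that does not carry the common prefix
        kept = {k: v for k, v in row.items() if not k.startswith(head)}
        # Assumption: the suffix of the first truthy prefixed column is the value
        kept[prefix] = next(
            (k[len(head):] for k, v in row.items() if k.startswith(head) and v),
            None,
        )
        folded.append(kept)
    return folded
```

For example, a row `{'id': 1, 'month-jan': 1, 'month-feb': 0}` would fold to `{'id': 1, 'month': 'jan'}`.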
Performs one-hot encoding for categorical attributes. This encoder can handle missing values, and it allows the user to specify an upper limit, n_limit, on the number of columns generated per categorical attribute.
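One way to cap the number of generated columns is to keep only the n_limit most frequent categories and route everything else to shared columns. The sketch below follows that idea. The `is_other` and `is_missing` columns and the `is_<value>` naming scheme are assumptions, not documented behavior of the primitive.

```python
from collections import Counter


def one_hot(values, n_limit=10):
    """One-hot encode a categorical column, keeping at most n_limit categories.

    Missing values are represented as None.
    """
    counts = Counter(v for v in values if v is not None)
    keep = [v for v, _ in counts.most_common(n_limit)]
    columns = {f"is_{v}": [int(x == v) for x in values] for v in keep}
    # Categories beyond the limit share one column (assumed convention)
    columns["is_other"] = [int(x is not None and x not in keep) for x in values]
    # Missing values get their own indicator (assumed convention)
    columns["is_missing"] = [int(x is None) for x in values]
    return columns
```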
Performs unary encoding, which is useful for ordinal data.
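Unary (sometimes called thermometer) encoding represents the k-th ordinal level as k ones followed by zeros, so that neighboring levels differ in a single position. The following is a minimal sketch. The function name and the choice to encode the lowest level as all zeros are assumptions.

```python
def unary_encode(values, levels):
    """Encode ordinal values as unary vectors.

    levels lists the categories from lowest to highest.
    """
    rank = {level: i for i, level in enumerate(levels)}
    width = len(levels) - 1  # the lowest level is encoded as all zeros
    return [[1] * rank[v] + [0] * (width - rank[v]) for v in values]
```

With levels `['low', 'medium', 'high']`, the value 'medium' becomes `[1, 0]` and 'high' becomes `[1, 1]`.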
Performs mean missing value imputation for numerical columns, and mode imputation for categorical columns.
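A minimal sketch of the two fill rules, using None to mark missing values. The function names are illustrative and not part of the primitive's interface.

```python
from statistics import mean, mode


def impute_mean(values):
    """Replace missing entries of a numerical column with the column mean."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing entries of a categorical column with the most common value."""
    fill = mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]
```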
Performs missing value imputation by greedy search over simple imputation methods, i.e. mean, min, max, and zero.
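The greedy search can be sketched as follows: start with mean imputation everywhere, then visit each column once and keep whichever of the four strategies gives the best score while the other columns stay fixed. The `score` callable (higher is better) stands in for whatever evaluation the primitive uses. It, the visiting order, and the function names are assumptions.

```python
from statistics import mean

# The four simple strategies named above
STRATEGIES = {
    "mean": mean,
    "min": min,
    "max": max,
    "zero": lambda present: 0.0,
}


def fill(column, strategy):
    """Fill the missing entries (None) of one column with the given strategy."""
    present = [v for v in column if v is not None]
    value = STRATEGIES[strategy](present)
    return [value if v is None else v for v in column]


def greedy_impute(table, score):
    """Pick one strategy per column, one column at a time.

    table maps column names to lists; score rates a fully imputed table.
    """
    choice = {name: "mean" for name in table}
    for name in table:
        best_strategy, best_score = None, None
        for strategy in STRATEGIES:
            trial = {**choice, name: strategy}
            imputed = {c: fill(table[c], trial[c]) for c in table}
            s = score(imputed)
            if best_score is None or s > best_score:
                best_strategy, best_score = strategy, s
        choice[name] = best_strategy  # commit before moving to the next column
    return choice, {c: fill(table[c], choice[c]) for c in table}
```

Because each column is decided once and never revisited, the search costs one scoring call per column and strategy, at the price of possibly missing the best joint combination.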
Performs missing value imputation by regression, then improves the imputation by iterating over the columns with missing values.
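To make the iteration concrete, here is a deliberately reduced sketch with only two columns and a one-variable least-squares fit. Missing entries start at the column mean, then each column is re-predicted from the other for a fixed number of rounds. The real primitive presumably regresses on all other columns. The two-column restriction, the round count, and the function names are assumptions.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def iterative_impute(a, b, rounds=5):
    """Impute two numerical columns by regressing each on the other."""
    cols = [list(a), list(b)]
    missing = [[v is None for v in col] for col in cols]
    # Initial guess: column mean
    for col, miss in zip(cols, missing):
        start = mean([v for v in col if v is not None])
        for i, is_missing in enumerate(miss):
            if is_missing:
                col[i] = start
    for _ in range(rounds):
        for target in (0, 1):
            other = 1 - target
            # Fit only on rows where the target was actually observed
            rows = [i for i, m in enumerate(missing[target]) if not m]
            slope, intercept = fit_line(
                [cols[other][i] for i in rows], [cols[target][i] for i in rows]
            )
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = slope * cols[other][i] + intercept
    return cols[0], cols[1]
```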
This primitive generates metadata by examining the given data. The types of metadata include:
- Column contains values tokenizable as an American phone number
- Column contains values tokenizable by punctuation
- Column contains values tokenizable into numerical tokens and alpha tokens
- Column value tokenization features (most common tokens, number of distinct tokens, ratio of distinct tokens, and so on)
- Column value features (most common values, number of distinct values, ratio of distinct values, and so on)
- Column contains filename-like values
- Column contains missing values (number of missing values, ratio of missing values)
- Number of outlier values
- Correlation between columns (Pearson, Spearman)
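A few of the metadata items above are simple to compute directly. The sketch below covers the missing-value and distinct-value features and the Pearson correlation only. The metadata key names are hypothetical and do not reflect the keys the profiler actually emits.

```python
from collections import Counter
from math import sqrt


def profile_column(values, top=3):
    """Compute simple value features for one column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "num_missing": len(values) - len(present),
        "ratio_missing": (len(values) - len(present)) / len(values),
        "num_distinct": len(counts),
        "ratio_distinct": len(counts) / len(present) if present else 0.0,
        "most_common": counts.most_common(top),
    }


def pearson(xs, ys):
    """Pearson correlation between two complete numerical columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```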
Queries Datamart for available datasets. The JSON query specification is defined in the Datamart Query API. The primitive returns a list of dataset metadata.
Joins two dataframes into one dataframe. The primitive takes two dataframes, left_dataframe and right_dataframe, and two lists specifying the join columns, left_columns and right_columns.
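The role of the two column lists can be shown with a small hash join over lists of row dicts. The source does not state the join type, so the sketch assumes an inner join in which the left value wins when both sides share a column name. The function name `join` is illustrative.

```python
def join(left_dataframe, right_dataframe, left_columns, right_columns):
    """Inner-join two tables given as lists of row dicts.

    left_columns[i] is matched against right_columns[i].
    """
    # Index the right table by its join key
    index = {}
    for row in right_dataframe:
        key = tuple(row[c] for c in right_columns)
        index.setdefault(key, []).append(row)
    joined = []
    for row in left_dataframe:
        key = tuple(row[c] for c in left_columns)
        for match in index.get(key, []):
            merged = dict(match)
            merged.update(row)  # assumption: left values win on name clashes
            joined.append(merged)
    return joined
```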
The git repository for DSBox primitives related to featurization is located here. The git repository containing DSBox cleaning related primitives is here.
Generates features using a pre-trained ResNet50 deep neural network. Use the hyperparameter layer_index to select the network layer to use for featurization.
Generates features using a pre-trained VGG16 deep neural network. Use the hyperparameter layer_index to select the network layer to use for featurization.
Reads in image files and generates a tensor that is suitable as input to d3m.primitives.dsbox.ResNet50ImageFeature and d3m.primitives.dsbox.Vgg16ImageFeature.
Performs forecasting of one timeseries using a recurrent neural network.
Performs forecasting of one timeseries using AutoArima.
Performs forecasting of one timeseries using Group Up.
Generates features for multiple timeseries by randomly projecting the timeseries matrix into lower dimensions.
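As a sketch of the idea, the function below multiplies a matrix of equal-length timeseries (one series per row) by a random Gaussian matrix, so that each series is reduced to n_components features. The Gaussian choice, the scaling factor, the seed handling, and the function name are assumptions. The primitive may construct its projection matrix differently.

```python
import random


def random_projection(series, n_components, seed=0):
    """Project each timeseries (a row of equal length) onto n_components dimensions."""
    rng = random.Random(seed)
    length = len(series[0])
    scale = 1.0 / n_components ** 0.5
    # One random column per output feature, shared by all series
    projection = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(length)
    ]
    return [
        [sum(row[t] * projection[t][k] for t in range(length)) for k in range(n_components)]
        for row in series
    ]
```

Reusing the same seed reproduces the same projection matrix, which keeps training and test features comparable.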
Reads in timeseries CSV files and generates an output List that is suitable as input to d3m.primitives.dsbox.RandomProjectionTimeSeriesFeaturization.
Automatically detects foreign key relationships among multiple tables, and joins the tables into one table using aggregation.
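One simple heuristic for the detection step, shown here only as an illustration, treats a column as a candidate foreign key when all of its values occur in a column of another table whose values are unique. The source does not describe the detection rule the primitive uses, so this heuristic and the function name are assumptions. The aggregation and join steps are not covered.

```python
def find_foreign_keys(tables):
    """Return (child_table, child_column, parent_table, parent_column) candidates.

    tables maps table names to dicts of column name -> list of values.
    """
    found = []
    for parent_name, parent in tables.items():
        for parent_col, parent_vals in parent.items():
            if len(set(parent_vals)) != len(parent_vals):
                continue  # values repeat, so this cannot act as a primary key
            parent_set = set(parent_vals)
            for child_name, child in tables.items():
                if child_name == parent_name:
                    continue
                for child_col, child_vals in child.items():
                    # Every child value must exist in the parent column
                    if child_vals and set(child_vals) <= parent_set:
                        found.append((child_name, child_col, parent_name, parent_col))
    return found
```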
This is an identity function primitive that returns the input dataframe as output. It is useful for bypassing a step in a pipeline without having to modify the pipeline structure.