diff --git a/docs/build/.documenter-siteinfo.json b/docs/build/.documenter-siteinfo.json index a2cb195..ca7b15a 100644 --- a/docs/build/.documenter-siteinfo.json +++ b/docs/build/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-04-14T11:38:16","documenter_version":"1.3.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-04-14T16:54:15","documenter_version":"1.3.0"}} \ No newline at end of file diff --git a/docs/build/examples/example/index.html b/docs/build/examples/example/index.html index d4077cb..dbedd85 100644 --- a/docs/build/examples/example/index.html +++ b/docs/build/examples/example/index.html @@ -1,2 +1,2 @@ -Example · PartitionedLS.jl

We present here an analysis of a solution found by a Partitioned LS algorithm on the Ames House Prices dataset, which is publicly available via Kaggle.

The Julia notebook used to generate the results is available here.

This dataset has a relatively high number of columns (79 in total) each detailing one particular characteristic of housing properties in Ames, Iowa. The task is to predict the selling price of each house.

We propose a grouping of the features into 10 groups, each one representing a high-level characteristic of the property:

| Group | Features |
| --- | --- |
| LotDescritption | MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, LotConfig, LandSlope |
| BuildingPlacement | Utilities, Neighborhood, Condition1, Condition2 |
| BuildingAge | YearBuilt, YearRemodAdd |
| BuildingQuality | BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, Functional |
| Basement | BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF |
| PowerAndTemperature | Heating, HeatingQC, CentralAir, Electrical, Fireplaces, FireplaceQu |
| Sizes | 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea |
| Rooms | BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd |
| OutsideFacilities | GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence |
| Various | MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition |

As an example, we collect 6 columns referring to the availability and quality of air conditioning systems, electrical system, heating and fireplaces in a "Power and Temperature" group. Other feature groups refer to overall quality of the construction work and materials employed ("Building Quality"), external facilities such as garages or swimming pools ("Outside Facilities"). The $\beta$ values for the groups are as follows:

(Image: $\beta$ values as found by the `Opt` algorithm on the Ames House Prices dataset)

We note that the grouped solution enabled by the partitioned least squares formulation is able to give a high-level summary of the regression result. An analyst is therefore able to communicate easily to, e.g. an individual selling their house, that the price is mostly determined by the building quality and the attractiveness of the lot. A deeper analysis is of course possible by investigating the $\alpha$ values found by the algorithm. For instance, let us consider the contributions to the "Outside Facilities" group:

(Image: $\alpha$ values as found by the `Opt` algorithm on the Ames House Prices dataset for the "OutsideFacilities" group)

Here, one is able to notice that garage quality has the biggest impact on the property's price, which is potentially actionable knowledge.

We argue that the group- and feature-level analysis made possible by our contributions improves on the interpretability of ungrouped linear regression.

+Example · PartitionedLS.jl

We present here an analysis of a solution found by a Partitioned LS algorithm on the Ames House Prices dataset, which is publicly available via Kaggle.

The Julia notebook used to generate the results is available here.

This dataset has a relatively high number of columns (79 in total) each detailing one particular characteristic of housing properties in Ames, Iowa. The task is to predict the selling price of each house.

We propose a grouping of the features into 10 groups, each one representing a high-level characteristic of the property:

| Group | Features |
| --- | --- |
| LotDescritption | MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, LotConfig, LandSlope |
| BuildingPlacement | Utilities, Neighborhood, Condition1, Condition2 |
| BuildingAge | YearBuilt, YearRemodAdd |
| BuildingQuality | BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, Functional |
| Basement | BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF |
| PowerAndTemperature | Heating, HeatingQC, CentralAir, Electrical, Fireplaces, FireplaceQu |
| Sizes | 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea |
| Rooms | BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd |
| OutsideFacilities | GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence |
| Various | MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition |

As an example, we collect 6 columns referring to the availability and quality of air conditioning systems, electrical system, heating and fireplaces in a "Power and Temperature" group. Other feature groups refer to overall quality of the construction work and materials employed ("Building Quality"), external facilities such as garages or swimming pools ("Outside Facilities"). The $\beta$ values for the groups are as follows:

(Image: $\beta$ values as found by the `Opt` algorithm on the Ames House Prices dataset)

We note that the grouped solution enabled by the partitioned least squares formulation is able to give a high-level summary of the regression result. An analyst is therefore able to communicate easily to, e.g. an individual selling their house, that the price is mostly determined by the building quality and the attractiveness of the lot. A deeper analysis is of course possible by investigating the $\alpha$ values found by the algorithm. For instance, let us consider the contributions to the "Outside Facilities" group:

(Image: $\alpha$ values as found by the `Opt` algorithm on the Ames House Prices dataset for the "OutsideFacilities" group)

Here, one is able to notice that garage quality has the biggest impact on the property's price, which is potentially actionable knowledge.

We argue that the group- and feature-level analysis made possible by our contributions improves on the interpretability of ungrouped linear regression.

diff --git a/docs/build/index.html b/docs/build/index.html index 63e6d13..2439b3b 100644 --- a/docs/build/index.html +++ b/docs/build/index.html @@ -39,7 +39,8 @@ # Make predictions on the given data matrix. The function works # with results returned by anyone of the solvers. predict(result[1], X)

MLJ interface

The MLJ interface lets you use the library within the MLJ framework. It is defined by the PartLS model, which can be used in the same way as any other MLJ model.

A complete example:

using MLJ
-using PartitionedLS
+
+PartLS = @load PartLS pkg=PartitionedLS
 
 X = [[1. 2. 3.]; 
      [3. 3. 4.]; 
@@ -64,7 +65,7 @@
 fit!(mach)
 
 # Make predictions
-predict(mach, X)

API Documentation

PartitionedLS.PartLSType
PartLS

A model type for fitting a partitioned least squares model to data. Both an MLJ and native interface are provided.

MLJ Interface

From MLJ, the type can be imported using

PartLS = @load PartLS pkg=PartitionedLS

Construct an instance with default hyper-parameters using the syntax model = PartLS(). Provide keyword arguments to override hyper-parameter defaults, as in model = PartLS(P=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).

Train the machine using fit!(mach).

Hyper-parameters

  • Optimizer: the optimization algorithm to use. It can be Opt, Alt or BnB (names exported by PartitionedLS.jl).

  • P: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.

  • η: the regularization parameter. It controls the strength of the regularization.

  • ϵ: the tolerance parameter. It is used to determine when the Alt optimization algorithm has converged. Only used by the Alt algorithm.

  • T: the maximum number of iterations. It is used to stop the Alt optimization algorithm when it has not yet converged. Only used by the Alt algorithm.

  • rng: the random number generator to use.

    • If nothing, the global random number generator rand is used.

    • If an integer, the global random number generator rand is used after seeding it with the given integer.

    • If an object of type AbstractRNG, the given random number generator is used.

Operations

  • predict(mach, Xnew): return the predictions of the model on new data Xnew

Fitted parameters

The fields of fitted_params(mach) are:

  • α: the values of the α variables. For each partition k, the α variables satisfy $\sum_{i \in P_k} \alpha_i = 1$.
  • β: the values of the β variables. For each partition k, β_k is the coefficient that multiplies the features in the k-th partition.
  • t: the intercept term of the model.
  • P: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.

Examples

PartLS = @load PartLS pkg=PartitionedLS
+predict(mach, X)

API Documentation

PartitionedLS.PartLSType
PartLS

A model type for fitting a partitioned least squares model to data. Both an MLJ and native interface are provided.

MLJ Interface

From MLJ, the type can be imported using

PartLS = @load PartLS pkg=PartitionedLS

Construct an instance with default hyper-parameters using the syntax model = PartLS(). Provide keyword arguments to override hyper-parameter defaults, as in model = PartLS(P=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).

Train the machine using fit!(mach).

Hyper-parameters

  • Optimizer: the optimization algorithm to use. It can be Opt, Alt or BnB (names exported by PartitionedLS.jl).

  • P: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.

  • η: the regularization parameter. It controls the strength of the regularization.

  • ϵ: the tolerance parameter. It is used to determine when the Alt optimization algorithm has converged. Only used by the Alt algorithm.

  • T: the maximum number of iterations. It is used to stop the Alt optimization algorithm when it has not yet converged. Only used by the Alt algorithm.

  • rng: the random number generator to use.

    • If nothing, the global random number generator rand is used.

    • If an integer, the global random number generator rand is used after seeding it with the given integer.

    • If an object of type AbstractRNG, the given random number generator is used.

Operations

  • predict(mach, Xnew): return the predictions of the model on new data Xnew

Fitted parameters

The fields of fitted_params(mach) are:

  • α: the values of the α variables. For each partition k, the α variables satisfy $\sum_{i \in P_k} \alpha_i = 1$.
  • β: the values of the β variables. For each partition k, β_k is the coefficient that multiplies the features in the k-th partition.
  • t: the intercept term of the model.
  • P: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.
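
The relationship between P and the α constraint can be sketched in a few lines of plain Julia. This is only an illustration: it assumes the $M × K$ layout used by the native fit signatures (one row per feature, one column per partition), and `partition_matrix` is a hypothetical helper, not a function exported by PartitionedLS.jl.

```julia
# Hypothetical helper (not part of PartitionedLS.jl): build an M × K 0/1
# matrix from a vector that gives, for each feature, the index of its group.
function partition_matrix(groups::Vector{Int}, K::Int)
    P = zeros(Int, length(groups), K)
    for (i, k) in enumerate(groups)
        P[i, k] = 1
    end
    return P
end

groups = [1, 1, 2, 3, 3]          # feature i belongs to group groups[i]
P = partition_matrix(groups, 3)   # 5 features, 3 groups

# Example α values (made up): non-negative and summing to 1 within each group.
α = [0.25, 0.75, 1.0, 0.4, 0.6]

# P' * α adds up the α values group by group.
group_sums = P' * α
```

Each row of `P` contains exactly one 1, which is what makes it a partition of the features, and every entry of `group_sums` equals 1 when the constraint on α holds.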

Examples

PartLS = @load PartLS pkg=PartitionedLS
 
 X = [[1. 2. 3.];
      [3. 3. 4.];
@@ -105,7 +106,7 @@
 
 # fit using the optimal algorithm
 result = fit(Opt, X, y, P, η = 0.0)
-y_hat = predict(result.model, X)

For other fit keyword options, refer to the "Hyper-parameters" section for the MLJ interface.

source
PartitionedLS.PartLSFitResultType
struct PartLSFitResult

The PartLSFitResult struct represents the solution of the partitioned least squares problem. It contains the values of the α and β variables, the intercept t and the partition matrix P.

Fields

  • α::Vector{AbstractFloat}: The values of the α variables. For each partition $k$, the α variables satisfy $\sum_{i \in P_k} \alpha_i = 1$.
  • β::Vector{AbstractFloat}: The values of the β variables. For each partition $k$, $\beta_k$ is the coefficient that multiplies the features in the k-th partition.
  • t::AbstractFloat: The intercept term of the model.
  • P::Matrix{Int64}: The partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element $P_{i, k} = 1$ if feature $i$ belongs to partition $k$.
source
MLJModelInterface.fitFunction
fit(
+y_hat = predict(result.model, X)

For other fit keyword options, refer to the "Hyper-parameters" section for the MLJ interface.

source
PartitionedLS.PartLSFitResultType
struct PartLSFitResult

The PartLSFitResult struct represents the solution of the partitioned least squares problem. It contains the values of the α and β variables, the intercept t and the partition matrix P.

Fields

  • α::Vector{AbstractFloat}: The values of the α variables. For each partition $k$, the α variables satisfy $\sum_{i \in P_k} \alpha_i = 1$.
  • β::Vector{AbstractFloat}: The values of the β variables. For each partition $k$, $\beta_k$ is the coefficient that multiplies the features in the k-th partition.
  • t::AbstractFloat: The intercept term of the model.
  • P::Matrix{Int64}: The partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element $P_{i, k} = 1$ if feature $i$ belongs to partition $k$.
source
MLJModelInterface.fitFunction
fit(
     ::Type{Alt},
     X::Array{F<:AbstractFloat, 2},
     y::Array{F<:AbstractFloat, 1},
@@ -116,10 +117,10 @@
     nnlsalg,
     rng
 ) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt,), <:Tuple{Any}}}
-

Fits a PartitionedLS model by alternating the optimization of the α and β variables. This version uses an optimization strategy based on non-negative least squares solvers. This formulation is faster and more numerically stable with respect to fit(Alt, ...).

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions. Default is 0.0.
  • T: number of alternating loops to be performed. Default is 100.
  • ϵ: minimum relative improvement in the objective function before stopping the optimization. Default is 1e-6
  • nnlsalg: specific flavour of nnls algorithm to be used, possible values are :pivot, :nnls, :fnnls. Default is :nnls

Result

A Tuple with the following fields:

  1. a PartLSFitResult object containing the fitted model
  2. a nothing object
  3. a NamedTuple with a field opt containing the optimal value of the objective function
source
#(TYPEDSIGNATURES)

Fits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It uses a complete enumeration strategy which is exponential in K, but is guaranteed to find the optimal solution.

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions (default: 0.0)
  • returnAllSolutions: if true, an additional output containing all solutions found by the algorithm is appended to the resulting tuple.
  • nnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)

Example

X = rand(100, 5)
+

Fits a PartitionedLS model by alternating the optimization of the α and β variables. This version uses an optimization strategy based on non-negative least squares solvers. This formulation is faster and more numerically stable with respect to fit(Alt, ...).

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions. Default is 0.0.
  • T: number of alternating loops to be performed. Default is 100.
  • ϵ: minimum relative improvement in the objective function before stopping the optimization. Default is 1e-6
  • nnlsalg: specific flavour of nnls algorithm to be used, possible values are :pivot, :nnls, :fnnls. Default is :nnls

Result

A Tuple with the following fields:

  1. a PartLSFitResult object containing the fitted model
  2. a nothing object
  3. a NamedTuple with a field opt containing the optimal value of the objective function
source
#(TYPEDSIGNATURES)

Fits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It uses a complete enumeration strategy which is exponential in K, but is guaranteed to find the optimal solution.

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions (default: 0.0)
  • returnAllSolutions: if true, an additional output containing all solutions found by the algorithm is appended to the resulting tuple.
  • nnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)

Example

X = rand(100, 5)
 y = rand(100)
 P = [1 0 0; 0 1 0; 0 0 1; 1 0 0; 0 1 0]
-result = fit(Opt, X, y, P)
source
fit(
+result = fit(Opt, X, y, P)
source
fit(
     ::Type{BnB},
     X::Matrix{<:AbstractFloat},
     y::AbstractVector{<:AbstractFloat},
@@ -127,22 +128,22 @@
     η,
     nnlsalg
 ) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt, :nopen), <:Tuple{Any, Any}}}
-

Implements the Branch and Bound algorithm to fit a Partitioned Least Squares model.

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions (default: 0.0)
  • nnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)

Result

A tuple with the following fields:

  1. a PartLSFitResult object containing the fitted model
  2. a nothing object
  3. a NamedTuple with fields:
    • opt containing the optimal value of the objective function
    • nopen containing the number of open nodes in the branch and bound tree
source
fit(
+

Implements the Branch and Bound algorithm to fit a Partitioned Least Squares model.

Arguments

  • X: $N × M$ matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).
  • y: $N$ vector with Continuous element scitype. Check scitype with scitype(y).
  • P: $M × K$ Int matrix specifying how to partition the $M$ attributes into $K$ subsets. $P_{m,k}$ should be 1 if attribute number $m$ belongs to partition $k$.
  • η: regularization factor, higher values imply more regularized solutions (default: 0.0)
  • nnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)

Result

A tuple with the following fields:

  1. a PartLSFitResult object containing the fitted model
  2. a nothing object
  3. a NamedTuple with fields:
    • opt containing the optimal value of the objective function
    • nopen containing the number of open nodes in the branch and bound tree
source
fit(
     m::PartLS,
     verbosity,
     X,
     y
 ) -> Tuple{PartLSFitResult, Nothing, Any}
-

Fits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It conforms to the MLJ interface.

Arguments

  • m: A PartLS model to fit
  • verbosity: the verbosity level
  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).
source
MLJModelInterface.predictFunction
predict(
+

Fits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It conforms to the MLJ interface.

Arguments

  • m: A PartLS model to fit
  • verbosity: the verbosity level
  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).
source
MLJModelInterface.predictFunction
predict(
     α::AbstractVector{<:AbstractFloat},
     β::AbstractVector{<:AbstractFloat},
     t::AbstractFloat,
     P::Matrix{Int64},
     X::Matrix{<:AbstractFloat}
 ) -> Any
-

Result

the prediction for the partitioned least squares problem with solution α, β, t over the dataset X and partition matrix P

source
predict(
+

Result

the prediction for the partitioned least squares problem with solution α, β, t over the dataset X and partition matrix P

source
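
As a rough illustration of what this prediction amounts to, the sketch below evaluates the model formula $X × (P ∘ α) × β + t$ directly. It is a minimal sketch under the assumption that P is laid out as $M × K$; `predict_sketch` and the numbers are made up for illustration and are not the library's implementation.

```julia
# Illustrative re-implementation of the model formula (hypothetical name,
# not the library's code): P .* α scales each row of P by the α of that
# feature, and multiplying by β collapses the groups into per-feature weights.
predict_sketch(α, β, t, P, X) = X * (P .* α) * β .+ t

X = [1.0 2.0 3.0;
     3.0 3.0 4.0]        # 2 examples, 3 features
P = [1 0; 1 0; 0 1]      # features 1 and 2 in group 1, feature 3 in group 2
α = [0.5, 0.5, 1.0]      # sums to 1 within each group
β = [2.0, -1.0]          # one coefficient per group
t = 0.5                  # intercept

y_hat = predict_sketch(α, β, t, P, X)
```

Here the effective per-feature weights are `(P .* α) * β = [1.0, 1.0, -1.0]`, so the two predictions are `1 + 2 - 3 + 0.5` and `3 + 3 - 4 + 0.5`.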
predict(
     model::PartLSFitResult,
     X::Matrix{<:AbstractFloat}
 ) -> Any
-

Make predictions for the dataset X using the PartitionedLS model model.

Arguments

  • model: a PartLSFitResult
  • X: any matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).

Return

the predictions of the given model on examples in X.

source
predict(model::PartLS, fitresult, X) -> Any
-

Make predictions for the dataset X using the PartitionedLS model model. It conforms to the MLJ interface.

source
PartitionedLS.homogeneousCoordsFunction

Rewrites X and P in homogeneous coordinates. The result is a tuple (Xo, Po) where Xo is the homogeneous version of X and Po is the homogeneous version of P.

Arguments

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • P: the partition matrix

Return

  • Xo: the homogeneous version of X
  • Po: the homogeneous version of P
source
PartitionedLS.regularizeProblemFunction

Adds regularization terms to the problem. The regularization terms are added to the objective function as a sum of squares of the α variables. The regularization parameter η controls the strength of the regularization.

Arguments

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).
  • P: the partition matrix
  • η: the regularization parameter

Return

  • Xn: the new data matrix
  • yn: the new target vector

Main idea

K new rows are added to the data matrix X: row $k \in \{1 \dots K\}$ is a vector of zeros except for the components that correspond to features belonging to the k-th partition, which are set to sqrt(η). The target vector y is extended with K zeros.

The point of this change is that when the objective function is evaluated as $\|Xw - y\|^2$, the new part of the matrix contributes to the loss with a factor of $η \sum \|w_i\|^2$. This is equivalent to adding a regularization term to the objective function.

source
+

Make predictions for the dataset X using the PartitionedLS model model.

Arguments

Return

the predictions of the given model on examples in X.

source
predict(model::PartLS, fitresult, X) -> Any
+

Make predictions for the dataset X using the PartitionedLS model model. It conforms to the MLJ interface.

source
PartitionedLS.homogeneousCoordsFunction

Rewrites X and P in homogeneous coordinates. The result is a tuple (Xo, Po) where Xo is the homogeneous version of X and Po is the homogeneous version of P.

Arguments

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • P: the partition matrix

Return

  • Xo: the homogeneous version of X
  • Po: the homogeneous version of P
source
PartitionedLS.regularizeProblemFunction

Adds regularization terms to the problem. The regularization terms are added to the objective function as a sum of squares of the α variables. The regularization parameter η controls the strength of the regularization.

Arguments

  • X: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).
  • y: any vector with Continuous element scitype. Check scitype with scitype(y).
  • P: the partition matrix
  • η: the regularization parameter

Return

  • Xn: the new data matrix
  • yn: the new target vector

Main idea

K new rows are added to the data matrix X: row $k \in \{1 \dots K\}$ is a vector of zeros except for the components that correspond to features belonging to the k-th partition, which are set to sqrt(η). The target vector y is extended with K zeros.

The point of this change is that when the objective function is evaluated as $\|Xw - y\|^2$, the new part of the matrix contributes to the loss with a factor of $η \sum \|w_i\|^2$. This is equivalent to adding a regularization term to the objective function.

source
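
The row-augmentation idea described under "Main idea" can be checked numerically with a short sketch. It assumes the rows are built exactly as described (sqrt(η) on the features of group k, zeros elsewhere) and that P is $M × K$; `regularize_sketch` is a hypothetical stand-in written for illustration, not the library's function. Under this construction, each added row contributes η times the square of the summed weights of its group.

```julia
using LinearAlgebra

# Hypothetical stand-in mirroring the description above; not the library's code.
function regularize_sketch(X, y, P, η)
    K = size(P, 2)
    # Row k holds sqrt(η) on the features of group k and zeros elsewhere.
    extra_rows = sqrt(η) .* transpose(P)
    Xn = vcat(X, extra_rows)
    yn = vcat(y, zeros(K))
    return Xn, yn
end

X = [1.0 2.0 3.0;
     3.0 3.0 4.0]
y = [1.0, 2.0]
P = [1 0; 1 0; 0 1]
η = 0.5
w = [0.2, -0.4, 1.5]     # arbitrary weight vector used only for the check

Xn, yn = regularize_sketch(X, y, P, η)

loss_plain = norm(X * w - y)^2
loss_augmented = norm(Xn * w - yn)^2
# Each extra row adds η * (sum of the weights in its group)^2 to the loss.
penalty = η * sum((transpose(P) * w) .^ 2)
```

With these definitions `loss_augmented` equals `loss_plain + penalty` up to floating-point error, so an unmodified least squares solver applied to `(Xn, yn)` minimizes the regularized objective.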
diff --git a/docs/build/objects.inv b/docs/build/objects.inv index 32f7ccd..a11de86 100644 --- a/docs/build/objects.inv +++ b/docs/build/objects.inv @@ -1,5 +1,5 @@ # Sphinx inventory version 2 # Project: PartitionedLS.jl -# Version: 1.0.10 +# Version: 1.0.11 # The remainder of this file is compressed using zlib. xj0 -:RK_@ñ2ґ&ݝ}tFr Бz,<[OhJ>I:/πܠkHw0!a a$}Kcɹh$Z)diE]vEݳoZUiCd 102Sy;rcX*nK߲=o-IU{Л&z3bz\W^*N.A𧞮sf9Ql4ZEJSw[it)1CF \ No newline at end of file diff --git a/docs/build/search_index.js b/docs/build/search_index.js index 99a7343..5904258 100644 --- a/docs/build/search_index.js +++ b/docs/build/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"examples/example/","page":"Example","title":"Example","text":"We present here an analysis of a solution found by a Partitioned LS algorithm on the Ames House Prices dataset, which is publicly available via Kaggle.","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"The Julia notebook used to generate the results is available here.","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"This dataset has a relatively high number of columns (79 in total) each detailing one particular characteristic of housing properties in Ames, Iowa. The task is to predict the selling price of each house. 
","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We propose a grouping of the features into 10 groups, each one representing a high-level characteristic of the property:","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"Group Features\nLotDescritption MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, LotConfig, LandSlope\nBuildingPlacement Utilities, Neighborhood, Condition1, Condition2\nBuildingAge YearBuilt, YearRemodAdd\nBuildingQuality BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, Functional\nBasement BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF\nPowerAndTemperature Heating, HeatingQC, CentralAir, Electrical, Fireplaces, FireplaceQu\nSizes 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea\nRooms BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd\nOutsideFacilities GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence\nVarious MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"As an example, we collect 6 columns referring to the availability and quality of air conditioning systems, electrical system, heating and fireplaces in a \"Power and Temperature\" group. Other feature groups refer to overall quality of the construction work and materials employed (\"Building Quality\"), external facilities such as garages or swimming pools (\"Outside Facilities\"). 
The $\\beta$ values for the groups are as follows:","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"(Image: $\\beta$ values as found by the `Opt` algorithm on the Ames House Prices dataset)","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We note that the grouped solution enabled by the partitioned least squares formulation is able to give a high-level summary of the regression result. An analyst is therefore able to communicate easily to, e.g. an individual selling their house, that the price is mostly determined by the building quality and the attractiveness of the lot. A deeper analysis is of course possible by investigating the $\\alpha$ values found by the algorithm. For instance, let us consider the contributions to the \"Outside Facilities\" group:","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"(Image: $\\alpha$ values as found by the `Opt` algorithm on the Ames House Prices dataset for the \"OutsideFacilities\" group)","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"Here, one is able to notice that garage quality has the biggest impact on the property's price, which is potentially actionable knowledge. ","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We argue that the group- and feature-level analysis made possible by our contributions improves on the interpretability of ungrouped linear regression.","category":"page"},{"location":"#Partitioned-Least-Squares","page":"Documentation","title":"Partitioned Least Squares","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"Linear least squares is one of the most widely used regression methods among scientists in many fields. 
The simplicity of the model allows this method to be used when data is scarce and it is usually appealing to practitioners that need to gather some insight into the problem by inspecting the values of the learnt parameters. PartitionedLS is a variant of the linear least squares model allowing practitioners to partition the input features into groups of variables that they require to contribute similarly to the final result.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"An example of analysing a dataset using PartitionedLS is given here","category":"page"},{"location":"#The-model","page":"Documentation","title":"The model","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The Partitioned Least Squares model is formally defined as:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"begingather*\ntextminimize_mathbfalpha mathbfbeta mathbfX times (mathbfP circ mathbfalpha) times mathbfbeta - mathbfy _2^2 \nbeginaligned\nquad stquad mathbfalpha succeq 0\n mathbfP^T times mathbfalpha = mathbf1\nendaligned\nendgather*","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"where: ","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"mathbfX is a N M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\nmathbfy is a N vector with Continuous element scitype. Check scitype with scitype(y). \nmathbfP is a M K Int matrix specifying how to partition the M attributes into K subsets. P_mk should be 1 if attribute number m belongs to partition k.\nmathbfbeta is a vector weighting the importance of each set of attributes in the partition;\nmathbfalpha is a vector weighting the importance of each attribute within one of the sets in the partition. 
Note that the constraints imply that for each set in the partition the weights of the corresponding alpha variables are all non-negative and sum to 1.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"The PartitionedLS problem is non-convex and NP-complete. The library provides two algorithms to solve the problem anyway: an iterative algorithm based on the Alternating Least Squares approach, and an optimal algorithm that guarantees finding the optimal solution but requires time exponential in the cardinality of the partition (i.e., it is mainly useful when K is small).","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"More details can be found in the paper Partitioned Least Squares.","category":"page"},{"location":"#To-install-this-library","page":"Documentation","title":"To install this library","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"Just add it as a dependency to your Julia environment. 
Launch Julia from the main directory of your project and enter the following commands:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"# Opens the package manager REPL\n]\n\n# Activate your local environment (can be skipped if you want to install the library globally)\nactivate .\n\n# Adds the library to the environment\nadd PartitionedLS","category":"page"},{"location":"#To-use-this-library","page":"Documentation","title":"To use this library","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"You will need a matrix P describing the partitioning of your variables, e.g.:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"P = [[1 0]; \n [1 0]; \n [0 1]]","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"specifies that the first and second variables belong to the first partition, while the third variable belongs to the second.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"You can then choose between the standard interface and the MLJ interface. ","category":"page"},{"location":"#Standard-interface","page":"Documentation","title":"Standard interface","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The standard interface defines a fit function for each of the implemented algorithms. 
The function returns a tuple containing:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"a PartLSFitResult object containing the model and the parameters found by the algorithm;\nnothing (this is mandated by the MLJ interface, but it is not used in this case).\na NamedTuple containing some additional information.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"A complete example:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"\nusing PartitionedLS\n\nX = [[1. 2. 3.]; \n [3. 3. 4.]; \n [8. 1. 3.]; \n [5. 3. 1.]]\n\ny = [1.; \n 1.; \n 2.; \n 3.]\n\nP = [[1 0]; \n [1 0]; \n [0 1]]\n\n\n# fit using the optimal algorithm \nresult = fit(Opt, X, y, P, η = 0.0)\n\n\n# Make predictions on the given data matrix. The function works\n# with results returned by anyone of the solvers.\npredict(result[1], X)","category":"page"},{"location":"#MLJ-interface","page":"Documentation","title":"MLJ interface","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The MLJ interface is a allows you to use the library in a more MLJ-like fashion. The interface is defined by the PartLS model, which can be used in the MLJ framework. The model can be used in the same way as any other MLJ model.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"A complete example:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"using MLJ\nusing PartitionedLS\n\nX = [[1. 2. 3.]; \n [3. 3. 4.]; \n [8. 1. 3.]; \n [5. 3. 
1.]]\n\ny = [1.;\n 1.;\n 2.;\n 3.]\n\nP = [[1 0]; \n [1 0]; \n [0 1]]\n\n# Define the model\n\nmodel = PartLS(P=P, Optimizer=Opt, η=0.0)\n\n# Fit the model\nmach = machine(model, X, y)\nfit!(mach)\n\n# Make predictions\npredict(mach, X)","category":"page"},{"location":"#API-Documentation","page":"Documentation","title":"API Documentation","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"PartLS\nPartLSFitResult\nPartitionedLS.fit\nPartitionedLS.predict\nPartitionedLS.homogeneousCoords\nPartitionedLS.regularizeProblem","category":"page"},{"location":"#PartitionedLS.PartLS","page":"Documentation","title":"PartitionedLS.PartLS","text":"PartLS\n\nA model type for fitting a partitioned least squares model to data. Both an MLJ and native interface are provided.\n\nMLJ Interface\n\nFrom MLJ, the type can be imported using\n\nPartLS = @load PartLS pkg=PartitionedLS\n\nConstruct an instance with default hyper-parameters using the syntax model = PartLS(). Provide keyword arguments to override hyper-parameter defaults, as in model = PartLS(P=...).\n\nTraining data\n\nIn MLJ or MLJBase, bind an instance model to data with\n\nmach = machine(model, X, y)\n\nwhere\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). \n\nTrain the machine using fit!(mach).\n\nHyper-parameters\n\nOptimizer: the optimization algorithm to use. It can be Opt, Alt or BnB (names exported by PartitionedLS.jl).\nP: the partition matrix. It is a binary matrix where each row corresponds to a partition and each column corresponds to a feature. The element P_{k, i} = 1 if feature i belongs to partition k.\nη: the regularization parameter. It controls the strength of the regularization.\nϵ: the tolerance parameter. It is used to determine when the Alt optimization algorithm has converged. 
Only used by the Alt algorithm.\nT: the maximum number of iterations. It is used to determine when to stop the Alt optimization algorithm. Only used by the Alt algorithm.\nrng: the random number generator to use.\nIf nothing, the global random number generator rand is used.\nIf an integer, the global random number generator rand is used after seeding it with the given integer.\nIf an object of type AbstractRNG, the given random number generator is used.\n\nOperations\n\npredict(mach, Xnew): return the predictions of the model on new data Xnew\n\nFitted parameters\n\nThe fields of fitted_params(mach) are:\n\nα: the values of the α variables. For each partition k, the α variables are such that sum_i in P_k alpha_k = 1.\nβ: the values of the β variables. For each partition k, β_k is the coefficient that multiplies the features in the k-th partition.\nt: the intercept term of the model.\nP: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.\n\nExamples\n\nPartLS = @load PartLS pkg=PartitionedLS\n\nX = [[1. 2. 3.];\n [3. 3. 4.];\n [8. 1. 3.];\n [5. 3. 
1.]]\n\ny = [1.;\n 1.;\n 2.;\n 3.]\n\nP = [[1 0];\n [1 0];\n [0 1]]\n\n\n# fit using the optimal algorithm\nresult = fit(Opt, X, y, P, η = 0.0)\ny_hat = predict(result.model, X)\n\nFor other fit keyword options, refer to the \"Hyper-parameters\" section for the MLJ interface.\n\n\n\n\n\n","category":"type"},{"location":"#PartitionedLS.PartLSFitResult","page":"Documentation","title":"PartitionedLS.PartLSFitResult","text":"struct PartLSFitResult\n\nThe PartLSFitResult struct represents the solution of the partitioned least squares problem. It contains the values of the α and β variables, the intercept t and the partition matrix P.\n\nFields\n\nα::Vector{AbstractFloat}: The values of the α variables. For each partition k, it holds the values of the α variables are such that sum_i in P_k alpha_k = 1.\n\nβ::Vector{AbstractFloat}: The values of the β variables. For each partition k, beta_k is the coefficient that multiplies the features in the k-th partition.\n\nt::AbstractFloat: The intercept term of the model.\n\nP::Matrix{Int64}: The partition matrix. It is a binary matrix where each row corresponds to a partition and each column corresponds to a feature. The element P_k i = 1 if feature i belongs to partition k.\n\n\n\n\n\n","category":"type"},{"location":"#MLJModelInterface.fit","page":"Documentation","title":"MLJModelInterface.fit","text":"fit(\n ::Type{Alt},\n X::Array{F<:AbstractFloat, 2},\n y::Array{F<:AbstractFloat, 1},\n P::Matrix{Int64};\n η,\n ϵ,\n T,\n nnlsalg,\n rng\n) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt,), <:Tuple{Any}}}\n\n\nFits a PartitionedLS model by alternating the optimization of the α and β variables. This version uses an optimization strategy based on non-negative-least-squaes solvers. This formulation is faster and more numerically stable with respect to fit(Alt, ...)`.\n\nArguments\n\nX: N M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. 
Check column scitypes of a table X with schema(X).\ny: N vector with Continuous element scitype. Check scitype with scitype(y). \nP: M × K Int matrix specifying how to partition the M attributes into K subsets. P_mk should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values imply more regularized solutions. Default is 0.0.\nT: number of alternating loops to be performed. Default is 100.\nϵ: minimum relative improvement in the objective function before stopping the optimization. Default is 1e-6\nnnlsalg: specific flavour of nnls algorithm to be used, possible values are :pivot, :nnls, :fnnls. Default is :nnls\n\nResult\n\nA Tuple with the following fields:\n\na PartLSFitResult object containing the fitted model\na nothing object\na NamedTuple with a field opt containing the optimal value of the objective function\n\n\n\n\n\n#(TYPEDSIGNATURES)\n\nFits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It uses a complete enumeration strategy which is exponential in K, but guarantees to find the optimal solution.\n\nArguments\n\nX: N × M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\ny: N vector with Continuous element scitype. Check scitype with scitype(y). \nP: M × K Int matrix specifying how to partition the M attributes into K subsets. P_mk should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values imply more regularized solutions (default: 0.0)\nreturnAllSolutions: if true an additional output is appended to the resulting tuple containing all solutions found during the algorithm.\nnnlsalg: the kind of nnls algorithm to be used during solving. 
Possible values are :pivot, :nnls, :fnnls (default: :nnls)\n\nExample\n\nX = rand(100, 10)\ny = rand(100)\nP = [1 0 0; 0 1 0; 0 0 1; 1 1 0; 0 1 1]\nresult = fit(Opt, X, y, P)\n\n\n\n\n\nfit(\n ::Type{BnB},\n X::Matrix{<:AbstractFloat},\n y::AbstractVector{<:AbstractFloat},\n P::Matrix{Int64};\n η,\n nnlsalg\n) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt, :nopen), <:Tuple{Any, Any}}}\n\n\nImplements the Branch and Bound algorithm to fit a Partitioned Least Squres model.\n\nArguments\n\nX: N M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\ny: N vector with Continuous element scitype. Check scitype with scitype(y). \nP: M K Int matrix specifying how to partition the M attributes into K subsets. P_mk should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values implies more regularized solutions (default: 0.0)\nnnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)\n\nResult\n\nA tuple with the following fields:\n\na PartLSFitResult object containing the fitted model\na nothing object\na NamedTuple with fields: \nopt containing the optimal value of the objective function\nnopen containing the number of open nodes in the branch and bound tree\n\n\n\n\n\nfit(\n m::PartLS,\n verbosity,\n X,\n y\n) -> Tuple{PartLSFitResult, Nothing, Any}\n\n\nFits a PartitionedLS Regression model to the given data and resturns the learnt model (see the Result section). It conforms to the MLJ interface.\n\nArguments\n\nm: A PartLS model to fit\nverbosity: the verbosity level\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). 
\n\n\n\n\n\n","category":"function"},{"location":"#MLJModelInterface.predict","page":"Documentation","title":"MLJModelInterface.predict","text":"predict(\n α::AbstractVector{<:AbstractFloat},\n β::AbstractVector{<:AbstractFloat},\n t::AbstractFloat,\n P::Matrix{Int64},\n X::Matrix{<:AbstractFloat}\n) -> Any\n\n\nResult\n\nthe prediction for the partitioned least squares problem with solution α, β, t over the dataset X and partition matrix P\n\n\n\n\n\npredict(\n model::PartLSFitResult,\n X::Matrix{<:AbstractFloat}\n) -> Any\n\n\nMake predictions for the datataset X using the PartialLS model model.\n\nArguments\n\nmodel: a PartLSFitResult\nX: any matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\n\nReturn\n\nthe predictions of the given model on examples in X.\n\n\n\n\n\npredict(model::PartLS, fitresult, X) -> Any\n\n\nMake predictions for the datataset X using the PartitionedLS model model. It conforms to the MLJ interface.\n\n\n\n\n\n","category":"function"},{"location":"#PartitionedLS.homogeneousCoords","page":"Documentation","title":"PartitionedLS.homogeneousCoords","text":"Rewrites X and P in homogeneous coordinates. The result is a tuple (Xo, Po) where Xo is the homogeneous version of X and Po is the homogeneous version of P.\n\nArguments\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X). \nP: the partition matrix\n\nReturn\n\nXo: the homogeneous version of X\nPo: the homogeneous version of P\n\n\n\n\n\n","category":"function"},{"location":"#PartitionedLS.regularizeProblem","page":"Documentation","title":"PartitionedLS.regularizeProblem","text":"Adds regularization terms to the problem. The regularization terms are added to the objective function as a sum of squares of the α variables. 
The regularization parameter η controls the strength of the regularization.\n\nArguments\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). \nP: the partition matrix\nη: the regularization parameter\n\nReturn\n\nXn: the new data matrix\nyn: the new target vector\n\nMain idea\n\nK new rows are added to the data matrix X, row k in 1 dots K is a vector of zeros except for the components that corresponds to features belonging to the k-th partition, which is set to sqrt(η). The target vector y is extended with K zeros.\n\nThe point of this change is that when the objective function is evaluated as math Xw - y^2, the new part of the matrix contributes to the loss with a factor of η sum w_i^2 . This is equivalent to adding a regularization term to the objective function.\n\n\n\n\n\n","category":"function"}] +[{"location":"examples/example/","page":"Example","title":"Example","text":"We present here an analysis of a solution found by a Partitioned LS algorithm on the Ames House Prices dataset, which is publicly available via Kaggle.","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"The Julia notebook used to generate the results is available here.","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"This dataset has a relatively high number of columns (79 in total) each detailing one particular characteristic of housing properties in Ames, Iowa. The task is to predict the selling price of each house. 
","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We propose a grouping of the features into 10 groups, each one representing a high-level characteristic of the property:","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"Group Features\nLotDescritption MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, LotConfig, LandSlope\nBuildingPlacement Utilities, Neighborhood, Condition1, Condition2\nBuildingAge YearBuilt, YearRemodAdd\nBuildingQuality BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, Functional\nBasement BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF\nPowerAndTemperature Heating, HeatingQC, CentralAir, Electrical, Fireplaces, FireplaceQu\nSizes 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea\nRooms BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd\nOutsideFacilities GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence\nVarious MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"As an example, we collect 6 columns referring to the availability and quality of air conditioning systems, electrical system, heating and fireplaces in a \"Power and Temperature\" group. Other feature groups refer to overall quality of the construction work and materials employed (\"Building Quality\"), external facilities such as garages or swimming pools (\"Outside Facilities\"). 
The beta values for the groups are as follows:","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"(Image: $\\beta$ values as found by the `Opt` algorithm on the Ames House Prices dataset)","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We note that the grouped solution enabled by the partitioned least squares formulation is able to give a high-level summary of the regression result. An analyst is therefore able to communicate easily to, e.g. an individual selling their house, that the price is mostly determined by the building quality and the attractiveness of the lot. A deeper analysis is of course possible by investigating the alpha values found by the algorithm. For instance, let consider the contributions to the ``Outside Facilities'':","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"(Image: $\\alpha$ values as found by the `Opt` algorithm on the Ames House Prices dataset for the \"OutsideFacilities\" group)","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"Here, one is able to notice that garage quality has the biggest impact on the property's price, which is potentially actionable knowledge. ","category":"page"},{"location":"examples/example/","page":"Example","title":"Example","text":"We argue that the group- and feature-level analysis made possible by our contributions improves on the interpretability of ungrouped linear regression.","category":"page"},{"location":"#Partitioned-Least-Squares","page":"Documentation","title":"Partitioned Least Squares","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"Linear least squares is one of the most widely used regression methods among scientists in many fields. 
The simplicity of the model allows this method to be used when data is scarce and it is usually appealing to practitioners that need to gather some insight into the problem by inspecting the values of the learnt parameters. PartitionedLS is a variant of the linear least squares model allowing practitioners to partition the input features into groups of variables that they require to contribute similarly to the final result.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"An example of analysing a dataset using PartitionedLS is given here","category":"page"},{"location":"#The-model","page":"Documentation","title":"The model","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The Partitioned Least Squares model is formally defined as:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"begingather*\ntextminimize_mathbfalpha mathbfbeta mathbfX times (mathbfP circ mathbfalpha) times mathbfbeta - mathbfy _2^2 \nbeginaligned\nquad stquad mathbfalpha succeq 0\n mathbfP^T times mathbfalpha = mathbf1\nendaligned\nendgather*","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"where: ","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"mathbfX is a N M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\nmathbfy is a N vector with Continuous element scitype. Check scitype with scitype(y). \nmathbfP is a M K Int matrix specifying how to partition the M attributes into K subsets. P_mk should be 1 if attribute number m belongs to partition k.\nmathbfbeta is a vector weighting the importance of each set of attributes in the partition;\nmathbfalpha is a vector weighting the importance of each attribute within one of the sets in the partition. 
Note that the constraints imply that for each set in the partition the weights of the corresponding alpha variables are all non-negative and sum to 1.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"The PartitionedLS problem is non-convex and NP-complete. The library provides two algorithms to solve the problem anyway: an iterative algorithm based on the Alternating Least Squares approach, and an optimal algorithm that guarantees finding the optimal solution but requires time exponential in the cardinality of the partition (i.e., it is mainly useful when K is small).","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"More details can be found in the paper Partitioned Least Squares.","category":"page"},{"location":"#To-install-this-library","page":"Documentation","title":"To install this library","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"Just add it as a dependency to your Julia environment. 
Launch Julia from the main directory of your project and enter the following commands:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"# Opens the package manager REPL\n]\n\n# Activate your local environment (can be skipped if you want to install the library globally)\nactivate .\n\n# Adds the library to the environment\nadd PartitionedLS","category":"page"},{"location":"#To-use-this-library","page":"Documentation","title":"To use this library","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"You will need a matrix P describing the partitioning of your variables, e.g.:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"P = [[1 0]; \n [1 0]; \n [0 1]]","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"specifies that the first and second variables belong to the first partition, while the third variable belongs to the second.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"You can then choose between the standard interface and the MLJ interface. ","category":"page"},{"location":"#Standard-interface","page":"Documentation","title":"Standard interface","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The standard interface defines a fit function for each of the implemented algorithms. 
The function returns a tuple containing:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"a PartLSFitResult object containing the model and the parameters found by the algorithm;\nnothing (this is mandated by the MLJ interface, but it is not used in this case).\na NamedTuple containing some additional information.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"A complete example:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"\nusing PartitionedLS\n\nX = [[1. 2. 3.]; \n [3. 3. 4.]; \n [8. 1. 3.]; \n [5. 3. 1.]]\n\ny = [1.; \n 1.; \n 2.; \n 3.]\n\nP = [[1 0]; \n [1 0]; \n [0 1]]\n\n\n# fit using the optimal algorithm \nresult = fit(Opt, X, y, P, η = 0.0)\n\n\n# Make predictions on the given data matrix. The function works\n# with results returned by anyone of the solvers.\npredict(result[1], X)","category":"page"},{"location":"#MLJ-interface","page":"Documentation","title":"MLJ interface","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"The MLJ interface is a allows you to use the library in a more MLJ-like fashion. The interface is defined by the PartLS model, which can be used in the MLJ framework. The model can be used in the same way as any other MLJ model.","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"A complete example:","category":"page"},{"location":"","page":"Documentation","title":"Documentation","text":"using MLJ\n\nPartLS = @load PartLS, pkg=PartitionedLS\n\nX = [[1. 2. 3.]; \n [3. 3. 4.]; \n [8. 1. 3.]; \n [5. 3. 
1.]]\n\ny = [1.;\n 1.;\n 2.;\n 3.]\n\nP = [[1 0]; \n [1 0]; \n [0 1]]\n\n# Define the model\n\nmodel = PartLS(P=P, Optimizer=Opt, η=0.0)\n\n# Fit the model\nmach = machine(model, X, y)\nfit!(mach)\n\n# Make predictions\npredict(mach, X)","category":"page"},{"location":"#API-Documentation","page":"Documentation","title":"API Documentation","text":"","category":"section"},{"location":"","page":"Documentation","title":"Documentation","text":"PartLS\nPartLSFitResult\nPartitionedLS.fit\nPartitionedLS.predict\nPartitionedLS.homogeneousCoords\nPartitionedLS.regularizeProblem","category":"page"},{"location":"#PartitionedLS.PartLS","page":"Documentation","title":"PartitionedLS.PartLS","text":"PartLS\n\nA model type for fitting a partitioned least squares model to data. Both an MLJ and native interface are provided.\n\nMLJ Interface\n\nFrom MLJ, the type can be imported using\n\nPartLS = @load PartLS pkg=PartitionedLS\n\nConstruct an instance with default hyper-parameters using the syntax model = PartLS(). Provide keyword arguments to override hyper-parameter defaults, as in model = PartLS(P=...).\n\nTraining data\n\nIn MLJ or MLJBase, bind an instance model to data with\n\nmach = machine(model, X, y)\n\nwhere\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). \n\nTrain the machine using fit!(mach).\n\nHyper-parameters\n\nOptimizer: the optimization algorithm to use. It can be Opt, Alt or BnB (names exported by PartitionedLS.jl).\nP: the partition matrix. It is a binary matrix where each row corresponds to a partition and each column corresponds to a feature. The element P_{k, i} = 1 if feature i belongs to partition k.\nη: the regularization parameter. It controls the strength of the regularization.\nϵ: the tolerance parameter. It is used to determine when the Alt optimization algorithm has converged. 
Only used by the Alt algorithm.\nT: the maximum number of iterations. It is used to determine when to stop the Alt optimization algorithm. Only used by the Alt algorithm.\nrng: the random number generator to use.\nIf nothing, the global random number generator rand is used.\nIf an integer, the global random number generator rand is used after seeding it with the given integer.\nIf an object of type AbstractRNG, the given random number generator is used.\n\nOperations\n\npredict(mach, Xnew): return the predictions of the model on new data Xnew\n\nFitted parameters\n\nThe fields of fitted_params(mach) are:\n\nα: the values of the α variables. For each partition k, the α variables are such that sum_i in P_k alpha_k = 1.\nβ: the values of the β variables. For each partition k, β_k is the coefficient that multiplies the features in the k-th partition.\nt: the intercept term of the model.\nP: the partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.\n\nExamples\n\nPartLS = @load PartLS pkg=PartitionedLS\n\nX = [[1. 2. 3.];\n [3. 3. 4.];\n [8. 1. 3.];\n [5. 3. 
1.]]\n\ny = [1.;\n 1.;\n 2.;\n 3.]\n\nP = [[1 0];\n [1 0];\n [0 1]]\n\n\n# fit using the optimal algorithm; fit returns a tuple whose first element is the PartLSFitResult\nresult = fit(Opt, X, y, P, η = 0.0)\ny_hat = predict(result[1], X)\n\nFor other fit keyword options, refer to the \"Hyper-parameters\" section for the MLJ interface.\n\n\n\n\n\n","category":"type"},{"location":"#PartitionedLS.PartLSFitResult","page":"Documentation","title":"PartitionedLS.PartLSFitResult","text":"struct PartLSFitResult\n\nThe PartLSFitResult struct represents the solution of the partitioned least squares problem. It contains the values of the α and β variables, the intercept t and the partition matrix P.\n\nFields\n\nα::Vector{AbstractFloat}: The values of the α variables. For each partition k, the α variables satisfy sum_{i ∈ P_k} α_i = 1.\n\nβ::Vector{AbstractFloat}: The values of the β variables. For each partition k, β_k is the coefficient that multiplies the features in the k-th partition.\n\nt::AbstractFloat: The intercept term of the model.\n\nP::Matrix{Int64}: The partition matrix. It is a binary matrix where each row corresponds to a feature and each column corresponds to a partition. The element P_{i, k} = 1 if feature i belongs to partition k.\n\n\n\n\n\n","category":"type"},{"location":"#MLJModelInterface.fit","page":"Documentation","title":"MLJModelInterface.fit","text":"fit(\n ::Type{Alt},\n X::Array{F<:AbstractFloat, 2},\n y::Array{F<:AbstractFloat, 1},\n P::Matrix{Int64};\n η,\n ϵ,\n T,\n nnlsalg,\n rng\n) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt,), <:Tuple{Any}}}\n\n\nFits a PartitionedLS model by alternating the optimization of the α and β variables. This version uses an optimization strategy based on non-negative least squares solvers. This formulation is faster and more numerically stable with respect to fit(Alt, ...).\n\nArguments\n\nX: N × M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. 
Check column scitypes of a table X with schema(X).\ny: vector of length N with Continuous element scitype. Check scitype with scitype(y). \nP: M × K Int matrix specifying how to partition the M attributes into K subsets. P_{m, k} should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values imply more regularized solutions. Default is 0.0.\nT: number of alternating loops to be performed. Default is 100.\nϵ: minimum relative improvement in the objective function before stopping the optimization. Default is 1e-6\nnnlsalg: specific flavour of nnls algorithm to be used, possible values are :pivot, :nnls, :fnnls. Default is :nnls\n\nResult\n\nA Tuple with the following fields:\n\na PartLSFitResult object containing the fitted model\na nothing object\na NamedTuple with a field opt containing the optimal value of the objective function\n\n\n\n\n\n#(TYPEDSIGNATURES)\n\nFits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It uses a complete enumeration strategy which is exponential in K, but is guaranteed to find the optimal solution.\n\nArguments\n\nX: N × M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\ny: vector of length N with Continuous element scitype. Check scitype with scitype(y). \nP: M × K Int matrix specifying how to partition the M attributes into K subsets. P_{m, k} should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values imply more regularized solutions (default: 0.0)\nreturnAllSolutions: if true an additional output is appended to the resulting tuple containing all solutions found during the algorithm.\nnnlsalg: the kind of nnls algorithm to be used during solving. 
Possible values are :pivot, :nnls, :fnnls (default: :nnls)\n\nExample\n\nX = rand(100, 5)\ny = rand(100)\nP = [1 0 0; 0 1 0; 0 0 1; 1 0 0; 0 1 0]\nresult = fit(Opt, X, y, P)\n\n\n\n\n\nfit(\n ::Type{BnB},\n X::Matrix{<:AbstractFloat},\n y::AbstractVector{<:AbstractFloat},\n P::Matrix{Int64};\n η,\n nnlsalg\n) -> Tuple{PartLSFitResult, Nothing, NamedTuple{(:opt, :nopen), <:Tuple{Any, Any}}}\n\n\nImplements the Branch and Bound algorithm to fit a Partitioned Least Squares model.\n\nArguments\n\nX: N × M matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\ny: vector of length N with Continuous element scitype. Check scitype with scitype(y). \nP: M × K Int matrix specifying how to partition the M attributes into K subsets. P_{m, k} should be 1 if attribute number m belongs to partition k.\nη: regularization factor, higher values imply more regularized solutions (default: 0.0)\nnnlsalg: the kind of nnls algorithm to be used during solving. Possible values are :pivot, :nnls, :fnnls (default: :nnls)\n\nResult\n\nA tuple with the following fields:\n\na PartLSFitResult object containing the fitted model\na nothing object\na NamedTuple with fields: \nopt containing the optimal value of the objective function\nnopen containing the number of open nodes in the branch and bound tree\n\n\n\n\n\nfit(\n m::PartLS,\n verbosity,\n X,\n y\n) -> Tuple{PartLSFitResult, Nothing, Any}\n\n\nFits a PartitionedLS Regression model to the given data and returns the learnt model (see the Result section). It conforms to the MLJ interface.\n\nArguments\n\nm: A PartLS model to fit\nverbosity: the verbosity level\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). 
\n\n\n\n\n\n","category":"function"},{"location":"#MLJModelInterface.predict","page":"Documentation","title":"MLJModelInterface.predict","text":"predict(\n α::AbstractVector{<:AbstractFloat},\n β::AbstractVector{<:AbstractFloat},\n t::AbstractFloat,\n P::Matrix{Int64},\n X::Matrix{<:AbstractFloat}\n) -> Any\n\n\nResult\n\nthe prediction for the partitioned least squares problem with solution α, β, t over the dataset X and partition matrix P\n\n\n\n\n\npredict(\n model::PartLSFitResult,\n X::Matrix{<:AbstractFloat}\n) -> Any\n\n\nMake predictions for the dataset X using the PartitionedLS model model.\n\nArguments\n\nmodel: a PartLSFitResult\nX: any matrix or table with Continuous element scitype containing the examples for which the predictions are sought. Check column scitypes of a table X with schema(X).\n\nReturn\n\nthe predictions of the given model on examples in X.\n\n\n\n\n\npredict(model::PartLS, fitresult, X) -> Any\n\n\nMake predictions for the dataset X using the PartitionedLS model model. It conforms to the MLJ interface.\n\n\n\n\n\n","category":"function"},{"location":"#PartitionedLS.homogeneousCoords","page":"Documentation","title":"PartitionedLS.homogeneousCoords","text":"Rewrites X and P in homogeneous coordinates. The result is a tuple (Xo, Po) where Xo is the homogeneous version of X and Po is the homogeneous version of P.\n\nArguments\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X). \nP: the partition matrix\n\nReturn\n\nXo: the homogeneous version of X\nPo: the homogeneous version of P\n\n\n\n\n\n","category":"function"},{"location":"#PartitionedLS.regularizeProblem","page":"Documentation","title":"PartitionedLS.regularizeProblem","text":"Adds regularization terms to the problem. The regularization terms are added to the objective function as a sum of squares of the α variables. 
The regularization parameter η controls the strength of the regularization.\n\nArguments\n\nX: any matrix or table with Continuous element scitype. Check column scitypes of a table X with schema(X).\ny: any vector with Continuous element scitype. Check scitype with scitype(y). \nP: the partition matrix\nη: the regularization parameter\n\nReturn\n\nXn: the new data matrix\nyn: the new target vector\n\nMain idea\n\nK new rows are added to the data matrix X. Row k ∈ {1, …, K} is a vector of zeros except for the components that correspond to features belonging to the k-th partition, which are set to sqrt(η). The target vector y is extended with K zeros.\n\nThe point of this change is that when the objective function is evaluated as ||Xw - y||^2, the new part of the matrix contributes a term η sum_i w_i^2 to the loss. This is equivalent to adding a regularization term to the objective function.\n\n\n\n\n\n","category":"function"}] } diff --git a/docs/src/index.md b/docs/src/index.md index 47df012..5f7def2 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -106,7 +106,8 @@ A complete example: ```julia using MLJ -using PartitionedLS + +PartLS = @load PartLS, pkg=PartitionedLS X = [[1. 2. 3.]; [3. 3. 4.];