Dynamic Programming C++ implementation #9

Open: wants to merge 29 commits into main

Changes from 5 commits

Commits (29)
bb10165  added Dynp class (npkamath, Jan 2, 2024)
d13dfc7  reverted extra spaces (npkamath, Jan 2, 2024)
5bbbaf2  reverted spaces (npkamath, Jan 2, 2024)
930430a  removed read_input function (npkamath, Jan 2, 2024)
379897a  Merge remote-tracking branch 'refs/remotes/origin/main' (npkamath, Jan 2, 2024)
e35cff1  removed read_input function (npkamath, Jan 2, 2024)
e8ccb7c  Delete PULL_REQUEST_TEMPLATE.md (npkamath, Jan 2, 2024)
fccca35  refactor: Remove Dynp directory (b-butler, Jan 9, 2024)
0c64950  fixed namespace, naming, and cleaned up comments (npkamath, Jan 12, 2024)
fe6bac1  added dynp.py file, fixed constructors, added column_indices for fast… (npkamath, Jan 14, 2024)
245a92f  fixed index system (npkamath, Jan 15, 2024)
0af1b08  reorganized class variables and added dynp.py functions and fit (npkamath, Jan 15, 2024)
66e2104  fit function added with parameter input, removed whitespace with pre… (npkamath, Jan 19, 2024)
a4cff6f  Merge remote-tracking branch 'upstream/main' (b-butler, Jan 24, 2024)
9ccf1fe  build: Switch to scikit-build-core. (b-butler, Jan 24, 2024)
8da0fe0  ci: Add reusable workflow to install system packages (b-butler, Jan 24, 2024)
943a6c0  ci: Fix package installation by using composite action (b-butler, Jan 24, 2024)
98ee45c  ci: Correctly specify shell for custom action (b-butler, Jan 24, 2024)
744b928  ci: Fix action step name formatting (b-butler, Jan 24, 2024)
b8dd0f2  ci: Remove trailing ":" (b-butler, Jan 24, 2024)
c7e6920  ci: Update package manager caches before installing (b-butler, Jan 24, 2024)
7771157  ci: Fix apt-get package names (b-butler, Jan 24, 2024)
f7287f8  ci: Fix one last package name (b-butler, Jan 24, 2024)
47aedc0  ci: Yet another package name change (b-butler, Jan 24, 2024)
4fd3a58  documentation and renaming added (npkamath, Feb 26, 2024)
4dd27f8  upper triangular restructured and dynp restructured (npkamath, Feb 26, 2024)
f51ff1c  upper triangular restructured and dynp restructured (npkamath, Feb 26, 2024)
c0afe6f  conditionals (npkamath, Feb 26, 2024)
77e3e00  naming and other cpp changes (npkamath, Feb 26, 2024)
34 changes: 29 additions & 5 deletions dupin/detect/dynp.py
@@ -1,5 +1,5 @@
"""Implements dynamic programming class for optimal segementation algorithm."""
import _DynP
import _dupin
import numpy as np


@@ -18,13 +18,37 @@ class DynP:
min_size: int
Minimum size of a segment. Changing will not provide optimal
detection, but will reduce runtime.


Methods
-------
__init__(self, data: np.ndarray, num_bkps: int, jump: int, min_size: int)
Initializes the DynamicProgramming instance with the time series data
and parameters.
set_num_threads(self, num_threads: int)
Sets the number of threads to be used for parallel computation.
fit(self, num_bkps: int) -> list
Calculates the cost matrix and identifies the optimal breakpoints in
the time series data.
Comment on lines +23 to +32 (Member): These get documented by the method's docstring itself.


Example Usage
-------------
>>> import numpy as np
>>> from dynp import DynP
>>> data = np.random.rand(100, 1) # Simulated time series data
>>> num_bkps = 3 # Number of breakpoints to detect
>>> jump = 1 # Interval for checking potential breakpoints
>>> min_size = 3 # Minimum size of a segment
>>> model = DynP(data, num_bkps, jump, min_size)
>>> breakpoints = model.fit(num_bkps)
>>> print(breakpoints)
"""

def __init__(
self, data: np.ndarray, num_bkps: int, jump: int, min_size: int
):
"""Initialize the DynamicProgramming instance with given parameters."""
Member: If you document a class's __init__ method, you need to document its parameters here as well. The alternative is to document both parameters and attributes in the class's docstring.

self.dynp = _DynP.DynamicProgramming(data, num_bkps, jump, min_size)
self._dupin = _dupin.DynamicProgramming(data, num_bkps, jump, min_size)

def set_num_threads(self, num_threads: int):
Contributor: Long term, if additional C++ methods are added, we will most likely want num_threads to be controlled at the level of the whole module. Would the preprocessor be a good place for this? @b-butler, thoughts?

Member: Adding this as a _DynP module function and exporting it to dupin/util.py is probably the best solution, in my opinion.

Author: So I would just move this function to util.py, right? I would just have to import _DynP in util.py?

Member: Yes.
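A minimal sketch of the suggested layout, assuming pybind11 and the set_parallelization function already present in dupin.cpp; inside the existing PYBIND11_MODULE block in src/module.cpp one might add (binding name and docstring illustrative, not part of this PR):

    // Expose thread control at module level rather than per instance;
    // dupin/util.py could then simply re-export _dupin.set_parallelization.
    m.def("set_parallelization", &set_parallelization,
          "Set the number of threads used by the C++ backend.");

The Python side would then only need a thin re-export, e.g. from _dupin import set_parallelization in dupin/util.py.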

"""Set the number of threads for parallelization.
@@ -35,9 +59,9 @@ def set_num_threads(self, num_threads: int):
The number of threads to use during computation. Default
Member: This is not a parameter default. If you want to add a comment about the default parallelization, it should go above this section. Also, the message should state that we use all available cores unless set otherwise.

is determined automatically.
"""
self.dynp.set_threads(num_threads)
self._dupin.set_threads(num_threads)

def fit(self, num_bkps: int) -> list:
def fit(self, num_breakpoints: int) -> list[int]:
"""Calculate the cost matrix and return the breakpoints.

Parameters
@@ -49,4 +73,4 @@ def fit(self, num_bkps: int) -> list:
-------
list: A list of integers representing the breakpoints.
"""
return self.dynp.fit()
return self._dupin.fit(num_breakpoints)
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
@@ -1,4 +1,4 @@
list(APPEND dupin_cxx "dupininterface.cpp" "dupin.h" "dupin.cpp")
list(APPEND dupin_cxx "module.cpp" "dupin.h" "dupin.cpp")

pybind11_add_module(_dupin ${dupin_cxx})

89 changes: 47 additions & 42 deletions src/dupin.cpp
@@ -13,22 +13,22 @@ using namespace std;
using namespace Eigen;

DynamicProgramming::DynamicProgramming()
: num_bkps(1), num_parameters(0), num_timesteps(0), jump(1), min_size(3) {}
: num_features(0), num_timesteps(0), jump(1), min_size(3), cost_matrix(0) {}

DynamicProgramming::DynamicProgramming(const Eigen::MatrixXd &data, int num_bkps_,
int jump_, int min_size_)
: data(data), num_bkps(num_bkps_),
jump(jump_), min_size(min_size_) {

DynamicProgramming::DynamicProgramming(const Eigen::MatrixXd &data,
int jump_, int min_size_)
: data(data), jump(jump_), min_size(min_size_), cost_matrix(data.rows()) {
num_timesteps = data.rows();
num_parameters = data.cols();
num_features = data.cols();
}

void DynamicProgramming::scale_data() {
Eigen::VectorXd min_val = data.colwise().minCoeff();
Eigen::VectorXd max_val = data.colwise().maxCoeff();
Eigen::VectorXd range = max_val - min_val;

for (int j = 0; j < num_parameters; ++j) {
for (int j = 0; j < num_features; ++j) {
if (range(j) == 0.0) {
data.col(j).setZero();
} else {
@@ -42,51 +42,58 @@ void DynamicProgramming::regression_setup(linear_fit_struct &lfit) {
lfit.y = data;
}

Eigen::VectorXd DynamicProgramming::regression_line(int start, int end, int dim,
linear_fit_struct &lfit) {
int n = end - start;
Eigen::VectorXd x = lfit.x.segment(start, n);
Eigen::VectorXd y = lfit.y.col(dim).segment(start, n);
// work in progress: the rowwise/colwise broadcasting is misbehaving
Eigen::MatrixXd DynamicProgramming::regression_lines(int start, int end, linear_fit_struct &lfit) {
int n = end - start;
Eigen::VectorXd x = lfit.x.segment(start, n);
Eigen::MatrixXd y = lfit.y.block(start, 0, n, num_features);

// Ensure x is in a two-dimensional form for broadcasting
Eigen::MatrixXd x_matrix = x.replicate(1, num_features);

// Calculate means
double x_mean = x.mean();
Eigen::VectorXd y_mean = y.colwise().mean();

// Center the data around 0
Eigen::MatrixXd x_centered = x_matrix.colwise() - Eigen::VectorXd::Constant(n, x_mean);
Eigen::MatrixXd y_centered = y.rowwise() - y_mean.transpose();

double x_mean = x.mean();
double y_mean = y.mean();
// Calculate slopes for each feature
Eigen::VectorXd slope = (x_centered.array() * y_centered.array()).colwise().sum() / x_centered.array().square().sum();

Eigen::VectorXd x_centered = x.array() - x_mean;
Eigen::VectorXd y_centered = y.array() - y_mean;
// Calculate intercepts for each feature
Eigen::VectorXd intercept = y_mean.array() - slope.array() * x_mean;

double slope = x_centered.dot(y_centered) / x_centered.squaredNorm();
double intercept = y_mean - slope * x_mean;
// everything till this line is functioning fine; I might be overcomplicating it
Eigen::MatrixXd regression_lines = (x_matrix.array().colwise() - x_mean).colwise() * slope.array() + intercept.transpose().array();
Member: I think you just need to do something like x_matrix * slope.replicate(x.size(), 1) + intercept.replicate(x.size(), 1), though I am not too familiar with Eigen.
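For reference, a hedged Eigen sketch of that broadcast as a hypothetical standalone helper (not the PR's code), evaluating one fitted line per feature over a segment:

    // pred(i, j) = slope(j) * x(i) + intercept(j)
    // x: segment time steps (n); slope, intercept: one entry per feature (d).
    Eigen::MatrixXd evaluate_lines(const Eigen::VectorXd &x,
                                   const Eigen::VectorXd &slope,
                                   const Eigen::VectorXd &intercept) {
        Eigen::MatrixXd pred = x * slope.transpose(); // outer product, n x d
        pred.rowwise() += intercept.transpose();      // broadcast intercepts across rows
        return pred;
    }

This avoids building x_matrix entirely: the outer product handles the slope term, and rowwise() handles the intercept term.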


return x.unaryExpr(
[slope, intercept](double xi) { return slope * xi + intercept; });
return regression_lines;
}

double DynamicProgramming::l2_cost(Eigen::MatrixXd &predicted_y, int start, int end) {
Eigen::MatrixXd diff = predicted_y.block(start, 0, end - start, num_parameters) -
data.block(start, 0, end - start, num_parameters);
return std::sqrt(diff.array().square().sum());
double DynamicProgramming::l2_cost(const Eigen::MatrixXd &predicted_y, int start, int end) {
Eigen::MatrixXd diff = predicted_y.block(start, 0, end - start, num_features) -
data.block(start, 0, end - start, num_features);
return std::sqrt(diff.array().square().sum());
}

Eigen::MatrixXd DynamicProgramming::predicted(int start, int end,
linear_fit_struct &lfit) {
Eigen::MatrixXd predicted_y(num_timesteps, num_parameters);
for (int i = 0; i < num_parameters; ++i) {
predicted_y.block(start, i, end - start, 1) =
regression_line(start, end, i, lfit);
}
return predicted_y;
void DynamicProgramming::predicted(int start, int end, linear_fit_struct &lfit,
Eigen::MatrixXd &predicted_y) {
predicted_y.block(start, 0, end - start, num_features) = regression_lines(start, end, lfit);
}

double DynamicProgramming::cost_function(int start, int end) {
linear_fit_struct lfit;
regression_setup(lfit);
Eigen::MatrixXd predicted_y = predicted(start, end, lfit);

Eigen::MatrixXd predicted_y(num_timesteps, num_features);
predicted(start, end, lfit, predicted_y); // Fill the predicted_y matrix

return l2_cost(predicted_y, start, end);
}

void DynamicProgramming::initialize_cost_matrix() {
scale_data();
cost_matrix.initialize(num_timesteps);
tbb::parallel_for(tbb::blocked_range<int>(0, num_timesteps),
[&](const tbb::blocked_range<int> &r) {
for (int i = r.begin(); i < r.end(); ++i) {
@@ -112,7 +119,9 @@ std::pair<double, std::vector<int>> DynamicProgramming::seg(int start, int end,
std::pair<double, std::vector<int>> best = {std::numeric_limits<double>::infinity(), {}};

for (int bkp = start + min_size; bkp < end; bkp++) {
if ((bkp - start) >= min_size && (end - bkp) >= min_size) {
if ((bkp - start) < min_size || (end - bkp) < min_size) {
continue;
}
auto left = seg(start, bkp, num_bkps - 1);
auto right = seg(bkp, end, 0);
double cost = left.first + right.first;
@@ -130,21 +139,17 @@ std::pair<double, std::vector<int>> DynamicProgramming::seg(int start, int end,
return best;
}

std::vector<int> DynamicProgramming::compute_breakpoints() {
std::vector<int> DynamicProgramming::compute_breakpoints(int num_bkps) {
auto result = seg(0, num_timesteps - 1, num_bkps);
std::vector<int> breakpoints = result.second;
std::sort(breakpoints.begin(), breakpoints.end());
breakpoints.erase(std::unique(breakpoints.begin(), breakpoints.end()),
breakpoints.end());
return breakpoints;
}

std::vector<int> DynamicProgramming::fit(int num_bkps_in){
num_bkps = num_bkps_in;
std::vector<int> DynamicProgramming::fit(int num_bkps){
if (!cost_computed){
initialize_cost_matrix();
}
return compute_breakpoints();
return compute_breakpoints(num_bkps);
}

void set_parallelization(int num_threads) {
52 changes: 25 additions & 27 deletions src/dupin.h
@@ -19,25 +19,25 @@ class DynamicProgramming {
std::vector<int> row_indices;
int length;

int index(int row, int col) const {
return row_indices[row] + col - row;
// Helper function to compute the row_indices vector
void compute_row_indices() {
row_indices.resize(length);
for (int row = 0; row < length; ++row) {
row_indices[row] = row * (2 * length - row + 1) / 2;
}
}

public:
UpperTriangularMatrix() : length(0) {}

void initialize(int n) {
length = n;
matrix.resize(n * (n + 1) / 2, 0.0);
row_indices.resize(n);
for (int row = 0; row < n; ++row) {
row_indices[row] = row * (2 * length - row + 1) / 2;
}
// Constructor that initializes the matrix and row_indices
UpperTriangularMatrix(int n) : length(n), matrix(n * (n + 1) / 2, 0.0) {
compute_row_indices();
}

double &operator()(int row, int col) {
return matrix[index(row, col)];
int idx = row_indices[row] + col - row; // Use precomputed index
return matrix[idx];
}

int getSize() const { return length; }
};
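As a quick sanity check of the flattened indexing (illustrative, not part of the diff): row r starts at flat offset r * (2 * length - r + 1) / 2, and (row, col) maps to that offset plus col - row. For length = 4 the row starts are 0, 4, 7, 9:

    #include <cassert>

    // Standalone replica of the matrix's index arithmetic.
    int flat_index(int row, int col, int length) {
        int row_start = row * (2 * length - row + 1) / 2; // slots used by rows 0..row-1
        return row_start + (col - row);                   // columns run from row to length-1
    }

    int main() {
        assert(flat_index(0, 0, 4) == 0);
        assert(flat_index(1, 3, 4) == 6);
        assert(flat_index(3, 3, 4) == 9); // last of the 4 * 5 / 2 = 10 slots
    }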
// Struct for memoization key, combining start, end, and number of
@@ -69,8 +69,7 @@ class DynamicProgramming {
std::unordered_map<MemoKey, std::pair<double, std::vector<int>>, MemoKeyHash>
memo;

int num_bkps; // Number of breakpoints to detect.
int num_parameters; // Number of features in the dataset.
int num_features; // Number of features in the dataset.
int num_timesteps; // Number of data points (time steps).
b-butler marked this conversation as resolved.
int jump; // Interval for checking potential breakpoints.
int min_size; // Minimum size of a segment.
@@ -83,40 +82,39 @@
Eigen::MatrixXd y; // Dependent variable (labels).
Eigen::VectorXd x; // Independent variable (time steps).
};
// Scales the dataset using min-max normalization.
// Scales the dataset using min-max normalization.
void scale_data();

// Prepares data for linear regression.
// Prepares data for linear regression, setting up the independent variable 'x'.
void regression_setup(linear_fit_struct &lfit);

// Calculates the regression line for a given data segment.
Eigen::VectorXd regression_line(int start, int end, int dim,
linear_fit_struct &lfit);
// Computes regression parameters (slope and intercept) for all dimensions simultaneously.
Eigen::MatrixXd regression_lines(int start, int end, linear_fit_struct &lfit);

// Generates predicted values based on the linear regression model.
Eigen::MatrixXd predicted(int start, int end, linear_fit_struct &lfit);
// Generates predicted values based on the linear regression model for all features.
void predicted(int start, int end, linear_fit_struct &lfit, Eigen::MatrixXd &predicted_y);

// Calculates L2 cost (Euclidean distance) between predicted and actual data.
double l2_cost(Eigen::MatrixXd &predicted_y, int start, int end);
// Calculates L2 cost (Euclidean distance) between predicted and actual data for a given segment.
double l2_cost(const Eigen::MatrixXd &predicted_y, int start, int end);

// Computes the cost of a specific data segment using linear regression.
// Computes the cost of a specific data segment using linear regression and L2 cost.
double cost_function(int start, int end);

// Recursive function for dynamic programming segmentation.
// Recursive function for dynamic programming segmentation.
std::pair<double, std::vector<int>> seg(int start, int end, int num_bkps);

// Initializes and fills the cost matrix for all data segments.
void initialize_cost_matrix();

// Returns the optimal set of breakpoints after segmentation.
std::vector<int> compute_breakpoints();
std::vector<int> compute_breakpoints(int num_bkps);

public:
// Default constructor.
DynamicProgramming();

// Parameterized constructor.
DynamicProgramming(const Eigen::MatrixXd &data, int num_bkps_, int jump_,
DynamicProgramming(const Eigen::MatrixXd &data, int jump_,
int min_size_);

//Sets number of threads for parallelization
2 changes: 1 addition & 1 deletion src/dupininterface.cpp → src/module.cpp
@@ -5,7 +5,7 @@

namespace py = pybind11;

PYBIND11_MODULE(_DynP, m) {
PYBIND11_MODULE(_dupin, m) {
py::class_<DynamicProgramming>(m, "DynamicProgramming")
.def(py::init<>())
.def_property("cost_matrix", &DynamicProgramming::getCostMatrix,