Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Programming c++ implementation #9

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
bb10165
added Dynp class
npkamath Jan 2, 2024
d13dfc7
reverted extra spaces
npkamath Jan 2, 2024
5bbbaf2
reverted spaces
npkamath Jan 2, 2024
930430a
removed read_input function
npkamath Jan 2, 2024
379897a
Merge remote-tracking branch 'refs/remotes/origin/main'
npkamath Jan 2, 2024
e35cff1
removed read_input function
npkamath Jan 2, 2024
e8ccb7c
Delete PULL_REQUEST_TEMPLATE.md
npkamath Jan 2, 2024
fccca35
refactor: Remove Dynp directory
b-butler Jan 9, 2024
0c64950
fixed namespace, naming, and cleaned up comments
npkamath Jan 12, 2024
fe6bac1
added dynp.py file, fixed constructors, added column_indices for fast…
npkamath Jan 14, 2024
245a92f
fixed index system
npkamath Jan 15, 2024
0af1b08
reorganized class variables and added dynp.py functions and fit
npkamath Jan 15, 2024
66e2104
fit function added with parameter input, removed whitespace with pre…
npkamath Jan 19, 2024
a4cff6f
Merge remote-tracking branch 'upstream/main'
b-butler Jan 24, 2024
9ccf1fe
build: Switch to scikit-build-core.
b-butler Jan 24, 2024
8da0fe0
ci: Add reusable workflow to install system packages
b-butler Jan 24, 2024
943a6c0
ci: Fix package installation by using composite action
b-butler Jan 24, 2024
98ee45c
ci: Correctly specify shell for custom action
b-butler Jan 24, 2024
744b928
ci: Fix action step name formatting
b-butler Jan 24, 2024
b8dd0f2
ci: Remove trailing ":"
b-butler Jan 24, 2024
c7e6920
ci: Update package manager caches before installing
b-butler Jan 24, 2024
7771157
ci: Fix apt-get package names
b-butler Jan 24, 2024
f7287f8
ci: Fix one last package name
b-butler Jan 24, 2024
47aedc0
ci: Yet another package name change
b-butler Jan 24, 2024
4fd3a58
documentation and renaming added
npkamath Feb 26, 2024
4dd27f8
upper triangular restructured and dynp restructured
npkamath Feb 26, 2024
f51ff1c
upper triangular restructured and dynp restructured
npkamath Feb 26, 2024
c0afe6f
conditionals
npkamath Feb 26, 2024
77e3e00
naming and other cpp changes
npkamath Feb 26, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions .github/actions/install-nopython-dependencies/action.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
name: Install Non-Python Dependencies
description: Installs system packages needed to build dupin.
runs:
using: "composite"
steps:
- name: install-dependencies-linux
if: runner.os == 'Linux'
shell: bash
run: |
sudo apt-get update
sudo apt-get install libtbb12 libtbb-dev libeigen3-dev ninja-build
- name: install-dependencies-macos
if: runner.os == 'macOS'
shell: bash
run: |
brew update
brew install tbb eigen ninja
- name: install-dependencies-windows
if: runner.os == 'Windows'
shell: bash
run: choco install tbb eigen3 ninja
3 changes: 2 additions & 1 deletion .github/workflows/publish-packages.yml
Original file line number Diff line number Diff line change
Expand Up @@ -34,10 +34,11 @@ jobs:
-r requirements/requirements-test.txt \
-r requirements/requirements-jit.txt \
-r requirements/requirements-data.txt

- name: Install pypa/build
run:
python -m pip install build
- name: Install System Packages
uses: ./.github/actions/install-nopython-dependencies
- name: Build a binary wheel and a source tarball
run:
python -m build --outdir dist/ .
Expand Down
4 changes: 3 additions & 1 deletion .github/workflows/run-pytest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ jobs:
python-version: ${{ matrix.python }}
- name: Update pip/build packages
run: |
pip install setuptools --upgrade
pip install pip --upgrade
- name: Install newest dependencies
run: |
pip install -r requirements/requirements-test.txt
Expand All @@ -60,6 +60,8 @@ jobs:
run: |
pip install -r requirements/requirements-jit.txt
if: ${{ matrix.python != '3.12' }}
- name: Install system packages
uses: ./.github/actions/install-nopython-dependencies
- name: Install the package
run: |
pip install -e .
Expand Down
16 changes: 16 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
cmake_minimum_required(VERSION 3.8)
project(dupin VERSION 0.0.1 LANGUAGES CXX)

set(DEFAULT_BUILD_TYPE "Release")
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE ${DEFAULT_BUILD_TYPE})
endif()

find_package(Eigen3 REQUIRED)
find_package(TBB REQUIRED)
# Use modern method for Python binding
set(PYBIND11_NEWPYTHON ON)
find_package(pybind11 CONFIG REQUIRED)

add_subdirectory(src)
add_subdirectory(dupin)
2 changes: 2 additions & 0 deletions dupin/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# Defaults to site-packages so only need package name
install(DIRECTORY ./ DESTINATION dupin FILES_MATCHING PATTERN "*.py")
76 changes: 76 additions & 0 deletions dupin/detect/dynp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
"""Implements dynamic programming class for optimal segementation algorithm."""
import _dupin
import numpy as np


class DynP:
"""Detects the change points in a time series.

Comment on lines +6 to +8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this class is supposed to be used by users, Perhaps some additional documentation would be useful here alongside a usage example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something that goes into the comments of this class or in the documentation/tutorial for dupin? I can add this in comments if needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation here. A tutorial using DynP may be useful in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh, so check for example SweepDetector class in: https://github.com/glotzerlab/dupin/blob/main/dupin/detect/detect.py
You might want to write what this class is used for and what implementation does it use in 2-3 short sentences.

Additionally, if this class should be used differently to the most of the other classes you might want to add a very short example of how to use it. Not sure if thats the case? For example:

def compute(self, orientations):
        """Calculates the per-particle and global order parameter.

        Example::

            >>> orientations = np.array([[1, 0, 0, 0]] * 100)
            >>> director = np.array([1, 1, 0])
            >>> nematic = freud.order.Nematic(director)
            >>> nematic.compute(orientations)
            freud.order.Nematic(u=[...])

        Args:
            orientations (:math:`\left(N_{particles}, 4 \right)` :class:`numpy.ndarray`):
                Orientations to calculate the order parameter.
        """

Attributes
----------
data: np.ndarray
Matrix storing the time series data.
num_bkps: int
Number of change points to detect.
jump: int
Interval for checking potential change points. Changing will
not provide optimal detection, but will reduce runtime.
min_size: int
Minimum size of a segment. Changing will not provide optimal
detection, but will reduce runtime.


Methods
-------
__init__(self, data: np.ndarray, num_bkps: int, jump: int, min_size: int)
Initializes the DynamicProgramming instance with the time series data
and parameters.
set_num_threads(self, num_threads: int)
Sets the number of threads to be used for parallel computation.
fit(self, num_bkps: int) -> list
Calculates the cost matrix and identifies the optimal breakpoints in
the time series data.
Comment on lines +23 to +32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These get documented by the method's docstring itself.


Example Usage
-------------
>>> import numpy as np
>>> from dynp import DynP
>>> data = np.random.rand(100, 1) # Simulated time series data
>>> num_bkps = 3 # Number of breakpoints to detect
>>> jump = 1 # Interval for checking potential breakpoints
>>> min_size = 3 # Minimum size of a segment
>>> model = Dynp(data, num_bkps, jump, min_size)
>>> breakpoints = model.fit(num_bkps)
>>> print(breakpoints)
"""

def __init__(
self, data: np.ndarray, num_bkps: int, jump: int, min_size: int
):
"""Initialize the DynamicProgramming instance with given parameters."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you document a class's __init__.py method, you need to document its parameters here as well. The alternative is to document both parameters and attributes in the class's docstring.

self._dupin = _dupin.DynamicProgramming(data, num_bkps, jump, min_size)

def set_num_threads(self, num_threads: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term dev wise, if additional CPP methods will be added we will most likely want num_threads to be controlled on the level of the whole module? Would preprocessor be a good place to have this? @b-butler thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to as a _DynP module function and exporting it to dupin/util.py is probably the best solution in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would just move this function to util.py right? would just have to import _DynP in util.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

"""Set the number of threads for parallelization.

Parameters
----------
num_threads: int
The number of threads to use during computation. Default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a parameter default. If you want to add a comment about the default parallelization it should go above this section. Also, the message should state that we use all available cores unless set otherwise.

is determined automatically.
"""
self._dupin.set_threads(num_threads)

def fit(self, num_breakpoints: int) -> list[int]:
"""Calculate the cost matrix and return the breakpoints.

Parameters
----------
num_bkps: int
number of change points to detect.

Returns
-------
list: A list of integers representing the breakpoints.
"""
return self._dupin.fit(num_breakpoints)
4 changes: 2 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[build-system]
build-backend = "setuptools.build_meta"
requires = ["setuptools >= 64.0.0"]
requires = ["scikit-build-core>=0.7.0", "pybind11"]
build-backend = "scikit_build_core.build"

[project]
name = "dupin"
Expand Down
19 changes: 19 additions & 0 deletions src/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
list(APPEND dupin_cxx "module.cpp" "dupin.h" "dupin.cpp")

pybind11_add_module(_dupin ${dupin_cxx})

set_target_properties(_dupin PROPERTIES
CXX_STANDARD 17
CMAKE_CXX_STANDARD_REQUIRED True
)

target_include_directories(_dupin PRIVATE
${EIGEN3_INCLUDE_DIR}
${TBB_INCLUDE_DIRS}
)

target_link_libraries(_dupin PRIVATE TBB::tbb)
target_compile_definitions(_dupin PRIVATE VERSION_INFO=${PROJECT_VERSION})
target_compile_options(_dupin PRIVATE -O2 -march=native)
# Installs C++ extension into the root of the Python package
install(TARGETS _dupin LIBRARY DESTINATION dupin)
170 changes: 170 additions & 0 deletions src/dupin.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
#include <iostream>
#include <iomanip>
#include <limits>
#include <unordered_map>
#include <vector>
#include <Eigen/Dense>
#include <tbb/blocked_range2d.h>
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include "dupin.h"

using namespace std;
using namespace Eigen;
Comment on lines +12 to +13
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use the namespace, explicitly scope them. It makes the code more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace std;
using namespace Eigen;

If you use std:: and Eigen:: then you don't need these lines.


DynamicProgramming::DynamicProgramming()
: num_features(0), num_timesteps(0), jump(1), min_size(3), cost_matrix(0) {}


DynamicProgramming::DynamicProgramming(const Eigen::MatrixXd &data,
int jump_, int min_size_)
: data(data), jump(jump_), min_size(min_size_), cost_matrix(data.rows()) {
num_timesteps = data.rows();
num_features = data.cols();
}

void DynamicProgramming::scale_data() {
Eigen::VectorXd min_val = data.colwise().minCoeff();
Eigen::VectorXd max_val = data.colwise().maxCoeff();
Eigen::VectorXd range = max_val - min_val;

for (int j = 0; j <num_features; ++j) {
if (range(j) == 0.0) {
data.col(j).setZero();
} else {
data.col(j) = (data.col(j).array() - min_val(j)) / range(j);
}
}
}
void DynamicProgramming::regression_setup(linear_fit_struct &lfit) {
lfit.x = Eigen::VectorXd::LinSpaced(num_timesteps, 0, num_timesteps - 1) /
(num_timesteps - 1);
lfit.y = data;
}

//work in progress, the rowwise colwise is messing up
Eigen::MatrixXd DynamicProgramming::regression_lines(int start, int end, linear_fit_struct &lfit) {
int n = end - start;
Eigen::VectorXd x = lfit.x.segment(start, n);
Eigen::MatrixXd y = lfit.y.block(start, 0, n, num_features);

// Ensure x is in a two-dimensional form for broadcasting
Eigen::MatrixXd x_matrix = x.replicate(1, num_features);

// Calculate means
double x_mean = x.mean();
Eigen::VectorXd y_mean = y.colwise().mean();

// Center the data around 0
Eigen::MatrixXd x_centered = x_matrix.colwise() - Eigen::VectorXd::Constant(n, x_mean);
Eigen::MatrixXd y_centered = y.rowwise() - y_mean.transpose();

// Calculate slopes for each feature
Eigen::VectorXd slope = (x_centered.array() * y_centered.array()).colwise().sum() / x_centered.array().square().sum();

// Calculate intercepts for each feature
Eigen::VectorXd intercept = y_mean.array() - slope.array() * x_mean;

// everything till this line is functioning fine; I might be overcomplicating it
Eigen::MatrixXd regression_lines = (x_matrix.array().colwise() - x_mean).colwise() * slope.array() + intercept.transpose().array();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just need to do something like x_matrix * slope.replicate(x.size(), 1) + interept.replicate(x.size(), 1) though I am not too familiar with Eigen.


return regression_lines;
}

double DynamicProgramming::l2_cost(const Eigen::MatrixXd &predicted_y, int start, int end) {
Eigen::MatrixXd diff = predicted_y.block(start, 0, end - start, num_features) -
data.block(start, 0, end - start, num_features);
return std::sqrt(diff.array().square().sum());
}

void DynamicProgramming::predicted(int start, int end, linear_fit_struct &lfit,
Eigen::MatrixXd &predicted_y) {
predicted_y.block(start, 0, end - start, num_features) = regression_lines(start, end, lfit);
}

double DynamicProgramming::cost_function(int start, int end) {
linear_fit_struct lfit;
regression_setup(lfit);
Comment on lines +86 to +87
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the constructor for linear_fit_struct set up the values.


Eigen::MatrixXd predicted_y(num_timesteps, num_features);
predicted(start, end, lfit, predicted_y); // Fill the predicted_y matrix

return l2_cost(predicted_y, start, end);
}

void DynamicProgramming::initialize_cost_matrix() {
scale_data();
tbb::parallel_for(tbb::blocked_range<int>(0, num_timesteps),
[&](const tbb::blocked_range<int> &r) {
for (int i = r.begin(); i < r.end(); ++i) {
for (int j = i + min_size; j < num_timesteps; ++j) {
cost_matrix(i, j) = cost_function(i, j);
}
}
});
cost_computed = true;
}

std::pair<double, std::vector<int>> DynamicProgramming::seg(int start, int end,
int num_bkps) {
MemoKey key = {start, end, num_bkps};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MemoKey key = {start, end, num_bkps};
MemoKey key{start, end, num_bkps};

auto it = memo.find(key);
if (it != memo.end()) {
return it->second;
}
if (num_bkps == 0) {
return {cost_matrix(start, end), {end}};
}

std::pair<double, std::vector<int>> best = {std::numeric_limits<double>::infinity(), {}};

for (int bkp = start + min_size; bkp < end; bkp++) {
if ((bkp - start) < min_size || (end - bkp) < min_size) {
continue;
}
auto left = seg(start, bkp, num_bkps - 1);
auto right = seg(bkp, end, 0);
double cost = left.first + right.first;
if (cost < best.first) {
best.first = cost;
best.second = left.second;
best.second.push_back(bkp);
best.second.insert(best.second.end(), right.second.begin(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't right always be one element long? If so, we could just push back.

right.second.end());
}
}
}

memo[key] = best;
return best;
}

std::vector<int> DynamicProgramming::compute_breakpoints(int num_bkps) {
auto result = seg(0, num_timesteps - 1, num_bkps);
std::vector<int> breakpoints = result.second;
return breakpoints;
}

std::vector<int> DynamicProgramming::fit(int num_bkps){
if (!cost_computed){
initialize_cost_matrix();
}
return compute_breakpoints(num_bkps);
}

void set_parallelization(int num_threads) {
static tbb::global_control gc(tbb::global_control::max_allowed_parallelism,
num_threads);
}

DynamicProgramming::UpperTriangularMatrix &
DynamicProgramming::getCostMatrix() {
return cost_matrix;
}

void DynamicProgramming::setCostMatrix(
const DynamicProgramming::UpperTriangularMatrix &value) {
cost_matrix = value;
}
Comment on lines +165 to +168
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should anyone be setting the full cost matrix?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! honestly the setter functions aren't needed anymore. I can probably delete them and just set the getters as actual functions rather than properties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved but not deleted.


int main() { return 0; }
Loading
Loading