Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Programming c++ implementation #9

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

npkamath
Copy link

@npkamath npkamath commented Jan 2, 2024

Description

Added a c++ implementation of the dynamic programming class. Header, cpp, dependencies and requirements are all in the the Dynp folder.

Motivation

Optimizes the dynp class to use significantly less time and memory; optimized using c++ eigen and intel tbb libraries.

Do Changes Introduce Breaking Changes

Yes.

Have you (if appropriate)

  • Updated changelog
  • Updated Documentation
  • Add tests
  • Added name to contributors

@npkamath
Copy link
Author

npkamath commented Jan 2, 2024

I did not delete the pull_request file from .github; I misunderstood and edited it and added it to my my own root directory. The original file is unchancged.

Also format C++ files using clang-format.
Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add dupin/detect/dynp.py file that creates wrapper around DynamicProgramming class.

Sorry, I don't have a more complete review, but I will continue it later this week. I wanted to go ahead and give you tasks that you could start on.

Also, let me know if you have questions on the new structure of the package.

src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupin.cpp Outdated Show resolved Hide resolved
src/dupin.cpp Outdated Show resolved Hide resolved
Comment on lines +12 to +13
using namespace std;
using namespace Eigen;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use the namespace, explicitly scope them. It makes the code more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace std;
using namespace Eigen;

If you use std:: and Eigen:: then you don't need these lines.

src/dupin.cpp Outdated Show resolved Hide resolved
src/dupin.h Outdated Show resolved Hide resolved
src/dupin.h Outdated Show resolved Hide resolved
src/dupin.h Outdated
memo;

int num_bkps; // Number of breakpoints to detect.
int num_parameters; // Number of features in the dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be known from the passed matrix no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this was just for testing/setting back in ye old days of input.txt. I can just make it use data.size now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still to be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep this we should be more elaborate and use num_features and num_breakpoints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if possible, expand on the variable names throughout. It will help future developers work with the code.

src/dupin.h Outdated
Eigen::VectorXd x; // z Independent variable (time steps).
};

public:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these should be private or protected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of these methods either don't need to exist the n_timesteps n_parameters stuff or should be private/protected. The constructor, setters, getters, and anything exported to Python should remain here.

src/dupin.h Outdated Show resolved Hide resolved
@npkamath npkamath requested a review from a team January 12, 2024 12:56
@npkamath
Copy link
Author

still have the new function to add

Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments for now. Thanks for the quick work.

src/dupin.h Outdated
Comment on lines 34 to 41
void initialize(int n) {
length = n;
matrix.resize(n * (n + 1) / 2, 0.0);
row_indices.resize(n);
for (int row = 0; row < n; ++row) {
row_indices[row] = row * (2 * length - row + 1) / 2;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes because without a global cost_matrix variable, I will have to pass by reference the entire cost_matrix through the recursion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean can't we just initialize the matrix and discard the initialization function. Why do we create it and then size it?

src/dupin.h Outdated
memo;

int num_bkps; // Number of breakpoints to detect.
int num_parameters; // Number of features in the dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still to be done.

src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupin.h Outdated
Eigen::VectorXd x; // z Independent variable (time steps).
};

public:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of these methods either don't need to exist the n_timesteps n_parameters stuff or should be private/protected. The constructor, setters, getters, and anything exported to Python should remain here.

src/dupin.h Show resolved Hide resolved
Comment on lines +170 to +173
void DynamicProgramming::setCostMatrix(
const DynamicProgramming::UpperTriangularMatrix &value) {
cost_matrix = value;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should anyone be setting the full cost matrix?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! honestly the setter functions aren't needed anymore. I can probably delete them and just set the getters as actual functions rather than properties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved but not deleted.

src/dupin.cpp Outdated Show resolved Hide resolved
src/dupin.h Outdated Show resolved Hide resolved
Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments as you address my other ones (I you have resolved one of my points mark so on the comments page and it will be easier for my to know what you have handled.

CMakeLists.txt Outdated Show resolved Hide resolved
"""
self.dynp.set_threads(num_threads)

def return_breakpoints(self) -> list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unnecessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify which part

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant return_breakpoints, I removed it and just kept fit.

src/CMakeLists.txt Outdated Show resolved Hide resolved
@npkamath
Copy link
Author

New commit should address most of the changes you requested: let me know if the private/public structure works or else I can rewrite some of the functions

Copy link

codecov bot commented Jan 19, 2024

Codecov Report

Attention: Patch coverage is 0% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 0.00%. Comparing base (5b132c8) to head (f51ff1c).
Report is 5 commits behind head on main.

❗ Current head f51ff1c differs from pull request most recent head 77e3e00. Consider uploading reports for the commit 77e3e00 to get more accurate results

Files Patch % Lines
dupin/detect/dynp.py 0.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #9       +/-   ##
==========================================
- Coverage   76.08%   0.00%   -76.09%     
==========================================
  Files          18      19        +1     
  Lines        1104    1113        +9     
  Branches      234       0      -234     
==========================================
- Hits          840       0      -840     
- Misses        221    1113      +892     
+ Partials       43       0       -43     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +6 to +8
class DynP:
"""Detects the change points in a time series.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this class is supposed to be used by users, Perhaps some additional documentation would be useful here alongside a usage example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something that goes into the comments of this class or in the documentation/tutorial for dupin? I can add this in comments if needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation here. A tutorial using DynP may be useful in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh, so check for example SweepDetector class in: https://github.com/glotzerlab/dupin/blob/main/dupin/detect/detect.py
You might want to write what this class is used for and what implementation does it use in 2-3 short sentences.

Additionally, if this class should be used differently to the most of the other classes you might want to add a very short example of how to use it. Not sure if thats the case? For example:

def compute(self, orientations):
        """Calculates the per-particle and global order parameter.

        Example::

            >>> orientations = np.array([[1, 0, 0, 0]] * 100)
            >>> director = np.array([1, 1, 0])
            >>> nematic = freud.order.Nematic(director)
            >>> nematic.compute(orientations)
            freud.order.Nematic(u=[...])

        Args:
            orientations (:math:`\left(N_{particles}, 4 \right)` :class:`numpy.ndarray`):
                Orientations to calculate the order parameter.
        """

"""Initialize the DynamicProgramming instance with given parameters."""
self.dynp = _DynP.DynamicProgramming(data, num_bkps, jump, min_size)

def set_num_threads(self, num_threads: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term dev wise, if additional CPP methods will be added we will most likely want num_threads to be controlled on the level of the whole module? Would preprocessor be a good place to have this? @b-butler thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to as a _DynP module function and exporting it to dupin/util.py is probably the best solution in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would just move this function to util.py right? would just have to import _DynP in util.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is looking good. I tried to give a thorough review this time since I am going on parental leave tomorrow. No rush on getting this done, as I won't be looking at GitHub for a couple of weeks, likely.

Also, I switched the build system to https://scikit-build-core.readthedocs.io/en/latest/getting_started.html. If you have problems building the package look there.

We need tests. See example tests in tests/detect/test_detect.py.

Comment on lines +6 to +8
class DynP:
"""Detects the change points in a time series.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation here. A tutorial using DynP may be useful in the future.

dupin/detect/dynp.py Outdated Show resolved Hide resolved
src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupininterface.cpp Outdated Show resolved Hide resolved
src/dupin.h Outdated
Comment on lines 34 to 41
void initialize(int n) {
length = n;
matrix.resize(n * (n + 1) / 2, 0.0);
row_indices.resize(n);
for (int row = 0; row < n; ++row) {
row_indices[row] = row * (2 * length - row + 1) / 2;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean can't we just initialize the matrix and discard the initialization function. Why do we create it and then size it?

src/dupin.cpp Outdated Show resolved Hide resolved
src/dupin.h Outdated
memo;

int num_bkps; // Number of breakpoints to detect.
int num_parameters; // Number of features in the dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep this we should be more elaborate and use num_features and num_breakpoints.

src/dupin.h Outdated
memo;

int num_bkps; // Number of breakpoints to detect.
int num_parameters; // Number of features in the dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if possible, expand on the variable names throughout. It will help future developers work with the code.

dupin/detect/dynp.py Outdated Show resolved Hide resolved
dupin/detect/dynp.py Outdated Show resolved Hide resolved
Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npkamath Thank you for all the hard work. This is probably the last thorough review I will give. I am happy to give it a quick glance though if you'd like and feel free to reach out to me.

@DomFijan is taking over the day-to-day management of the package and can help you in the finishing touches. @joaander is also a great resource for specific questions on coding practices. If either of them advise you a different way than I did feel free to disregard my prior comments.

I am approving as to not block any future developments, but please don't merge until @DomFijan approves.

Comment on lines +23 to +32
Methods
-------
__init__(self, data: np.ndarray, num_bkps: int, jump: int, min_size: int)
Initializes the DynamicProgramming instance with the time series data
and parameters.
set_num_threads(self, num_threads: int)
Sets the number of threads to be used for parallel computation.
fit(self, num_bkps: int) -> list
Calculates the cost matrix and identifies the optimal breakpoints in
the time series data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These get documented by the method's docstring itself.

def __init__(
self, data: np.ndarray, num_bkps: int, jump: int, min_size: int
):
"""Initialize the DynamicProgramming instance with given parameters."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you document a class's __init__.py method, you need to document its parameters here as well. The alternative is to document both parameters and attributes in the class's docstring.

Parameters
----------
num_threads: int
The number of threads to use during computation. Default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a parameter default. If you want to add a comment about the default parallelization it should go above this section. Also, the message should state that we use all available cores unless set otherwise.

Comment on lines +12 to +13
using namespace std;
using namespace Eigen;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace std;
using namespace Eigen;

If you use std:: and Eigen:: then you don't need these lines.

double slope = x_centered.dot(y_centered) / x_centered.squaredNorm();
double intercept = y_mean - slope * x_mean;
// everything till this line is functioning fine; I might be overcomplicating it
Eigen::MatrixXd regression_lines = (x_matrix.array().colwise() - x_mean).colwise() * slope.array() + intercept.transpose().array();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just need to do something like x_matrix * slope.replicate(x.size(), 1) + interept.replicate(x.size(), 1) though I am not too familiar with Eigen.

Comment on lines +81 to +82
// Structure for storing linear regression parameters.
struct linear_fit_struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a good idea, and hopefully not too much work as the code is pretty modular already.

src/dupin.cpp Outdated
Comment on lines 51 to 52
double x_mean = x.mean();
double y_mean = y.mean();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were your thoughts @npkamath?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants