Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CyNetDiff Submission #175

Open
14 of 32 tasks
eliotwrobson opened this issue Apr 22, 2024 · 52 comments
Open
14 of 32 tasks

CyNetDiff Submission #175

eliotwrobson opened this issue Apr 22, 2024 · 52 comments

Comments

@eliotwrobson
Copy link

eliotwrobson commented Apr 22, 2024

Submitting Author: Name (@eliotwrobson)
All current maintainers: (@eliotwrobson, @abhishekumrawal)
Package Name: CyNetDiff
One-Line Description of Package: A performance-focused library implementing algorithms for simulating network diffusion processes, written in Cython.
Repository Link: https://github.com/eliotwrobson/CyNetDiff
Version submitted: v0.1.13
EIC: @Batalex
Editor: @sneakers-the-rat, @lwasser
Reviewer 1: @Kai-Striega
Reviewer 2: @jeremyzhangsq
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

Network diffusion processes aim to model the spread of information through social networks, represented using graphs. Experimental work involving these models usually involves simulating these processes many times over large graphs, which can be computationally very expensive. At the same time, being able to conduct experiments using a high-level language like Python is helpful to researchers, as this gives greater flexibility in developing research software. To address both of these concerns, CyNetDiff is a Cython module implementing the independent cascade and linear threshold models, two of the most popular network diffusion models. Development has been focused on performance, while still giving an intuitive, high-level interface to assist in research tasks.

Scope

  • Please indicate which category or categories.
    Check out our package scope page to learn more about our
    scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization1
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

Community Partnerships

If your package is associated with an
existing community please check below:

  • For all submissions, explain how the and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?

This is aimed at researchers working in areas related to network diffusion and influence maximization, and specifically at optimizing the most computationally expensive part of this process. This should enable researchers to conduct experiments on larger graphs than would be possible with a pure-Python package. For a recent work doing experiments that fit the use cases of this package, see https://arxiv.org/abs/2207.08937

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There is a previous package filling a similar use case called ndlib: https://github.com/GiulioRossetti/ndlib
Our package differs as it was developed with a focus on performance, and with lesser emphasis on visualization
and flexibility (for example, we do not have a way of defining custom models). Using code compiled with Cython
allows our package to handle much larger graphs than are possible with a pure-Python package like ndlib.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

#165, @Batalex

Continuing off of the discussion there, the main goals of the review for me are to finalize the interface and add related features / utility code to aid usability. In terms of concrete features, the algorithms already implemented are enough for a full release.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions CircleCI, and/or others.

Publication Options

Is it possible to add the paper submission after this review is completed? Would like the content of the paper to incorporate any functionality that gets added during the review process.

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Review issues

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.

@Batalex
Copy link
Contributor

Batalex commented Apr 26, 2024

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.

  • Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • The package imports properly into a standard Python environment import package.
  • Fit The package meets criteria for fit and overlap.
  • Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • User-facing documentation that overviews how to install and start using the package.
    • Short tutorials that help a user understand how to use the package and what it can do for them.
    • API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • Core GitHub repository Files
    • README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • License The package has an OSI approved license.
      NOTE: We prefer that you have development instructions in your documentation too.
  • Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • Repository The repository link resolves correctly.
  • Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • Initial onboarding survey was filled out
    We appreciate each maintainer of the package filling out this survey individually. 🙌
    Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌


Editor comments

I think the documentation is a little bare currently to get started with the review. I'd like to see the information you wrote in this submission issue in the project repository itself. For example:

Network diffusion processes aim to model the spread of information through social networks, represented using graphs. Experimental work involving these models usually involves simulating these processes many times over large graphs, which can be computationally very expensive. At the same time, being able to conduct experiments using a high-level language like Python is helpful to researchers, as this gives greater flexibility in developing research software. To address both of these concerns, CyNetDiff is a Cython module implementing the independent cascade and linear threshold models, two of the most popular network diffusion models. Development has been focused on performance, while still giving an intuitive, high-level interface to assist in research tasks.

This is a nice elevator pitch to rework and include in the README.

I'd like to see more content in the documentation as well. The API section is great, but the basic example lacks context and does not really show case what CyNetDiff is great for. I want to share with you this resource: Diátaxis framework.
I am not asking that you follow the full framework to the letter, and it would be a huge task to do so, but I think there is some virtue in thinking about the kind of documentation you aim to showcase.
The documentation is the first thing a potential user will take a look at. As such, it needs to quickly guide the reader to:

  • what does it do? (= should I try it?)
  • how does it work? (= ok, I think this is the tool I need, how do I proceed?)

@eliotwrobson
Copy link
Author

Thank you for the feedback! I think those changes are reasonable and will significantly enhance the package, I'll try and expand the documentation by Monday so we can get the ball rolling on the review before too long 🚀

@eliotwrobson
Copy link
Author

@Batalex I have just published a release with the following changes:

  • Updated the elevator pitch and example in the README
  • Added two longer extended examples to the documentation site

@Batalex
Copy link
Contributor

Batalex commented May 2, 2024

Thanks, this looks great. From what I can tell, we are good to go. I'll get started on finding the perfect editor for this review 🐈‍⬛

@eliotwrobson
Copy link
Author

@Batalex I know we don't have an editor yet, but is it too early to begin suggesting reviewers? We've already found someone that might be a good candidate for this.

@Batalex
Copy link
Contributor

Batalex commented May 11, 2024

@Batalex I know we don't have an editor yet, but is it too early to begin suggesting reviewers? We've already found someone that might be a good candidate for this.

That's fantastic!
Could you please orient them to our signups form? https://forms.gle/GHfxvmS47nQFDcBM6

@eliotwrobson
Copy link
Author

For sure! I'll be sure to email this person as well, but this is @jeremyzhangsq

@eliotwrobson
Copy link
Author

Shiqi has agreed to review and filled out the form!

@Batalex Batalex self-assigned this May 20, 2024
@Batalex Batalex moved this to seeking-editor in peer-review-status May 31, 2024
@eliotwrobson
Copy link
Author

@Batalex just wanted to check back in and see if there was anything we could do to help find a suitable editor for this. We're not in a huge rush or anything, but we want to submit this package to a conference before too long.

@Batalex
Copy link
Contributor

Batalex commented Jun 29, 2024

Hello @eliotwrobson,
Unfortunately, I do not have an editor at the moment. There are two reasons for that: first, we are so successful our current editors have their hands full currently (let me brag a little). Second, the domain is more niche than others, so we (or more specifically I) cannot rely on personal connections or previous involvements to find someone with a relevant expertise. What would be your target conference? I'll see what I can do to help accelerate the process.

@eliotwrobson
Copy link
Author

@Batalex Thank you for your detailed response. We don't quite have a target conference in mind, but we're starting more work on a longer version of what will be our paper for this. I mainly asked because we may be able to help search for an editor as well, depending on what the preferred qualifications are. For example, Shiqi has published papers related to this, and might be able to serve, depending on how heavy the workload is, and how difficult it would be to find two other reviewers.

Regarding this general area, anyone who has worked on projects related to influence maximization or even just general graph theory might know enough to get the ball rolling.

@sneakers-the-rat
Copy link
Collaborator

I can edit this, although I just was the editor for the automata package and maybe the authors want a break from me ;)

@eliotwrobson
Copy link
Author

@sneakers-the-rat we would greatly appreciate you serving as editor 🙂

@sneakers-the-rat
Copy link
Collaborator

sneakers-the-rat commented Jul 2, 2024

Happy to do it, i'll be able to get started searching for reviewers tmrw :). (i'll let the EiC do the formal assignment)

@sneakers-the-rat sneakers-the-rat self-assigned this Jul 2, 2024
@lwasser lwasser moved this from seeking-editor to under-review in peer-review-status Jul 3, 2024
@cmarmo
Copy link
Member

cmarmo commented Jul 3, 2024

i'll let the EiC do the formal assignment

Thanks a lot @sneakers-the-rat and thanks to @Batalex who did the hard work as EIC here!
Happy review to all!

@eliotwrobson
Copy link
Author

Awesome! @sneakers-the-rat just to fill you in, we've already identified a reviewer (@/jeremyzhangsq) who has papers published in this area. So the other reviewer doesn't necessarily need to be a subject area expert, and I think it would be appropriate to have someone with more knowledge of Cython packaging / best practices, since that was the part I was most unfamiliar with when writing this package.

@lwasser
Copy link
Member

lwasser commented Oct 1, 2024

hi there team! i'm just checking in on this review to see the status. it looks like there's been no activity for almost 3 months. Is there anything that i can do to support moving things forward?

@eliotwrobson
Copy link
Author

Thanks all for getting the ball rolling! I'm not sure how much @/jeremyzhangsq checks GitHub, so I'll go ahead and ping him again via email.

@Kai-Striega
Copy link

Kai-Striega commented Oct 20, 2024

A quick couple of questions @lwasser /@sneakers-the-rat:

  • as previously discussed, my review has focused primarily on the Cython aspects of the project. Do you require me to review the remaining points on the review template too? I'm happy to do this, but I need to be aware that it's required.
  • I've look through the package and written plenty of notes but haven't typed them into the review template yet. As this is my first time reviewing, would we be able to organise a mentoring session to ensure I do a great job?

@lwasser
Copy link
Member

lwasser commented Oct 20, 2024

@Kai-Striega 👋🏻 i just saw your discord dm's! So sorry it took me a few days to respond. I answered you there, please do go through the entire template. I'm happy to either provide a mentor session OR to get someone else in our community to do so as well. You're also in great hands here with @sneakers-the-rat so i don't want to step on any toes more than I already have!! so please just let me know how I can best support everyone here 🚀

@Kai-Striega
Copy link

Kai-Striega commented Oct 26, 2024

@eliotwrobson please see my review below. I'm happy to expand/talk about any of the points I have made.

@sneakers-the-rat / @lwasser I haven't received a mentor yet, but had the available time to do it today. Please let me know if I am required to change/update anything.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • Installation instructions: for the development version of the package and any non-standard dependencies in README.
    • Instructions provided, however I had difficulty following them. See comments below.
  • Vignette(s) demonstrating major functionality that runs successfully locally.
    • Vignette in README and examples/ directory
  • Function Documentation: for all user-facing functions.
  • Examples for all user-facing functions.
    • No examples for user facing documentation page
    • No examples in models.pyi file with rest of documentation
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for:
    • Continuous integration and test coverage,
    • Docs building (if you have a documentation website),
    • A repostatus.org badge,
    • Python versions supported,
    • Current package version (on PyPI / Conda).
      • PyPi supported and badged
      • conda not yet available, however this package seems appropriate for conda

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

  • Short description of package goals.
  • Package installation instructions
  • Any additional setup required to use the package (authentication tokens, etc.)
  • Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • Link to examples/ directory missing in README
  • Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
    • Vignettes/examples are sufficient
  • Link to your documentation website.
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
    • Not provided, have not done a lit review to check if this is required.
  • Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:

  • Package documentation is clear and easy to find and use.
  • The need for the package is clear
  • All functions have documentation and associated examples for use
    • Functions/methods have documentation, but not examples in the documentation.
  • The package is easy to install
    • See comments in reviewer comments about wheels and required C++ compiler

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
    • No performance claims made
    • Benchmarks appear fairly ad-hoc.
    • Benchmarks provided as part of the examples, should be own directory.
    • asv (or another benchmarking framework) could be used to produce more reliable benchmarks
  • Automated tests:
    • All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
    A few notable highlights to look at:
    • Package supports modern versions of Python and not End of life versions.
      • Missing CPython 3.13 support, however GitHub declares Requires: Python >=3.9
      • Missing free-threaded Python support, this is not unusual as the project heavily uses Cython, which does not officially support free-threaded Python yet
    • Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
      • Passes with ruff locally
      • Passes in CI for ruff and mypy
      • As this project utilizes Cython heavily a tool such as cython-lint is recommended

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 8


Review Comments

I have several review comments below. These are ordered by importance.

Requires a C++ compiler to build/lack of wheels

Wheels are provided only CPython 3.11 on Linux. This requires all other possible installers to compile the project from source. This is fairly simple, however it provides an extra step/requirements which is possible to avoid. My recommendation is to provide wheels for all officially supported Python versions on all major platforms.

Cython code generation

The project currently caches the file src/cynetdiff/models.cpp explicitly. I usually recommended not to ship the .c/.cpp files directly. There are several problems with shipping .c/.cpp files:

  • They are not guaranteed to be forward compatible with future CPython C-API or Cython versions. The CPython C-API seems to be changing fairly frequently at the moment.
  • It's possible that Cython generated code will not work on certain platforms e.g. arm if built for x86

This is echoed by the official Cython user guide

it is not recommended to include generated files in source distributions. Instead, require Cython at build-time to generate the C/C++ files

Pseudo random number generation

The package currently uses C's rand function. From models.pyx:

from libc.stdlib cimport rand, RAND_MAX
cdef double RAND_SCALE = 1.0 / RAND_MAX

cdef inline double next_rand() nogil:
    return rand() * RAND_SCALE

There are several issues with rand:

  1. There are no guarantees about the algorithm, and therefore the quality of the randomness, used to generate random numbers
  2. It uses global state which can lead to problems with multithreaded applications. While the current project doesn't use any parallelism, this could be significant if CyNetDiff-1 is implemented.

My recommendation is to adopt a more suitable random number generation engine. In the Cython world there are two ways I can recommend of doing this:

  1. C++'s <random> header via the Cython bindings
  2. Adopt NumPy's random number generation. This is officially documented and an example can be seen here

Note that (2) would also allow you to be compliant with SPEC-7.

Minor Cython related improvements

Note these points are minor and should not be considered blocking.

Use of array.array

The package uses array.array both for type hints and Cython typing in several places. Consider using Cython's memoryview instead. memoryviews provide the advantage of being available to anything that provides the Buffer Protocol, such as NumPy arrays, which I suspect are used significantly more frequently than array.arrays are by the packages target audience.

Missing potential bounds checks when running cythonize
[1/1] Cythonizing src/cynetdiff/models.pyx
warning: src/cynetdiff/models.pyx:97:51: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:107:35: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:135:39: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:137:38: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:141:34: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:261:39: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:263:45: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:264:34: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:268:38: Use boundscheck(False) for faster access
warning: src/cynetdiff/models.pyx:284:52: Use boundscheck(False) for faster access
  • It may be possible to improve performance by disabling bounds checking in these situations.
  • I have not verified if this actually improves performance in these particular circumstances
Review resizing array.array
    for i in range(m):
        influence_arr.append(1.0 / in_degrees[self.edges[i]])

This computations resizes influence_arr multiple times, leading to multiple memory allocations. It would be more efficient to use .extend rather than .append.

Use of name mangling

Starting a name with __ introduces name mangling in Python. In C/C++ these are reserved for the compilers internal keywords. Name mangling can make life difficult. CyNetDiff uses name mangling at several points (e.g. __activation_suceeds and __advance_model). I'd be careful with these.

@eliotwrobson
Copy link
Author

@Kai-Striega thank you for your detailed comments! I have a couple of questions, but I'll try to get started addressing some of these items soon.

Requires a C++ compiler to build/lack of wheels

Do you know of any documentation that has discussion of how to do this? I agree it is much more convenient to not need a compiler installed, but I wasn't sure how to set this up.

Cython code generation

This makes sense to me, I could swear I read the opposite somewhere but I must have been mistaken.

Pseudo random number generation

Thank you for the references! I had actually thought about whether there was a standard way to handle the random number generation. I think I will follow option (2) and add the SPEC-7 compliance.

Use of array.array

Is there a recommended type annotation that expresses what works with the Cython memoryview? It would be nice to add this annotation along with support for numpy arrays.

@Kai-Striega
Copy link

@Kai-Striega thank you for your detailed comments! I have a couple of questions, but I'll try to get started addressing some of these items soon.

You're welcome.

Feel free to message if you have any more questions. I'll try answer your questions below.

Do you know of any documentation that has discussion of how to do this? I agree it is much more convenient to not need a compiler installed, but I wasn't sure how to set this up.

AFAIK, there isn't much. You could look at how SciPy builds their wheels for nightly builds. This should include all the info you need. I'm happy to help out with this, but might be a bit slow at the moment.

Note that SciPy's CI is far more complex than I would expect for this project.

This makes sense to me, I could swear I read the opposite somewhere but I must have been mistaken.

This isn't cut and dry. In the past, it was advised to do the opposite. This has changed over the last decade or so. Some people still prefer to include the .c/cpp files. However, this isn't recommended anymore.

Thank you for the references! I had actually thought about whether there was a standard way to handle the random number generation. I think I will follow option (2) and add the SPEC-7 compliance.

You're welcome :)

Is there a recommended type annotation that expresses what works with the Cython memoryview? It would be nice to add this annotation along with support for numpy arrays.

Unfortunately, not that I've come across. The problem is that Cython memoryviews support anything that supports the Buffer Protocol this could include custom objects.

My usual tactic is to use ArrayLike with a Python level wrapper, check if the value is scalar and then convert it to an array. This isn't great.

An alternative option would be to create your own Protocol with all the buffer protocol methods.

@eliotwrobson
Copy link
Author

@Kai-Striega I've been able to implement most of the above changes without too many issues. I'm starting to take a look at the spec7 documentation and the examples you linked. Would it be better to provide client code with a wrapper class like the one you linked here that then gets passed to the model class that executes simulations? Or should client code directly pass in a numpy generator? From the timing code in the notebook you linked, it seems like it would be much faster to expose this wrapper class to client code, rather than having a numpy generator passed in.

@Kai-Striega
Copy link

I'm at work right now, and can't respond in the length I'd like. I'll edit this over the next few days to expand on my thoughts.

TLDR: Neither. SPEC-7 allows for a couple of different options.

Here is the officially recommended way

from collections.abc import Sequence
import numpy as np


SeedLike = int | np.integer | Sequence[int] | np.random.SeedSequence
RNGLike = np.random.Generator | np.random.BitGenerator


def my_func(*, rng: RNGLike | SeedLike | None = None):
    """My function summary.

    Parameters
    ----------
    rng : `numpy.random.Generator`, optional
        Pseudorandom number generator state. When `rng` is None, a new
        `numpy.random.Generator` is created using entropy from the
        operating system. Types other than `numpy.random.Generator` are
        passed to `numpy.random.default_rng` to instantiate a `Generator`.
    """
    rng = np.random.default_rng(rng)

    ...

I would use this as a wrapper for your Cython functions.

@eliotwrobson
Copy link
Author

Short update that I've addressed most of the items from the review comments in a new release (v0.1.14) besides the rng seeding that is still being discussed and adding more builds. Not quite ready for a re-review till we address the above. Also emailed shiqi, and looks like he is traveling now but will be back to review in early November 👍🏽

@sneakers-the-rat
Copy link
Collaborator

Catching up - here are the things i see raised in the review:

making a checklist mostly for tidyiness sake, not trying to list out requirements - tell me which of these are taken care of/still outstanding

  • docs: examples
  • statement of need: we don't have a hard statement of need requirement, and i don't like novelty tests so it's not that, but it would be nice to have some statement for the pyos site that puts this package in context with other packages - "similar to x except y, etc." kind of stuff
  • packaging: wheels for multiple platforms (looks ongoing)
  • packaging: versioning generated artifacts (looks like this is done?)
  • function: use of c's rand function (looks like this is ongoing)
  • function: name mangling

Thanks for the detailed review :) I learned some things about cython here too.

If it would be helpful, you might consider opening up the remaining items as issues that link back to this one just to keep track of what's left todo/use issues as a cheap form of threading.

Re: building for multiple platforms, this sounds like something we would want to have in the packaging guide, so if you figure something out, maybe we put instructions there?

@eliotwrobson
Copy link
Author

@sneakers-the-rat thanks for writing that out. I've removed all the instances of name mangling I could find, and you are correct that removing generated artifacts is done. For the statement of need, would you like something written that can go on the pyos site, or that also to be added to the library documentation? A simple description is that this package is a faster version of ndlib.

@Kai-Striega
Copy link

Kai-Striega commented Oct 29, 2024

Re: building for multiple platforms, this sounds like something we would want to have in the packaging guide, so if you figure something out, maybe we put instructions there?

While this is noble, I think a unified approach to building will be hard, especially when extension modules are involved. It will depend on which languages are used, which platforms you build for, how complicated you want to get, and so on...

In particular, I feel that Poetry doesn't handle extension modules particularly well. There is SPIN which much of the scientific Python ecosystem seems to be standardising on, however this would require a complete rewrite of CyNetDiff's infrastructure.

A simple description is that this package is a faster version of ndlib.

If you're going to make claims that it's faster, it would be great to show it. I think this should be required if you make performance claims.

@sneakers-the-rat
Copy link
Collaborator

While this is noble, I think a unified approach to building will be hard

Fair! i was thinking of it as a starting point for a more general guide, but let's leave that aside so we don't muddy this review :) Moved that over here: pyOpenSci/python-package-guide#422

I think this should be required if you make performance claims.

Agreed that we should be able to show a benchmark to claim this. @eliotwrobson to answer your earlier question, yes this would be for the pyos website, where ideally the goal is that we build out the website so that it's possible to see a package and its 'neighbors,' "the family of tools that do things like x," and much like how open reviews on eLife allow readers to know the context of a paper and how people in that field think of it, it would be great (but not required) to provide that same context for software. "why this and not that? etc." If benchmarks are not in reach at the moment, a simple "see also: ndlib" would be fine

@eliotwrobson
Copy link
Author

eliotwrobson commented Oct 30, 2024

For the benchmark, some light benchmarks are included in this example notebook: https://github.com/eliotwrobson/CyNetDiff/blob/main/examples/visualizations.ipynb

I'm not sure if there's a more standard way of doing this? We also include some discussion of these benchmarks in our paper: https://www.vldb.org/pvldb/vol17/p4409-umrawal.pdf

Regarding building, although changing / documenting something to work with Poetry is probably not the most productive given what others have said, I think having some example related to this in the packaging guide would be extremely useful. I had a very hard time even figuring out how to get started writing a package that used Cython, and this would be extremely useful to me on another project I want to get started on before too long.

@sneakers-the-rat
Copy link
Collaborator

If you have published benchmarks already, then that's good enough for me for a claim like that - we can just cite them :)

Also am i reading this right, that an identical operation took 14s with ndlib and 0.067s with cynetdiff? that's.... certainly a perf improvement. well done.

Also i forgot to say earlier, i was asking for a context statement for the website but i also generally (i.e. outside of the purposes of the review) think they are good to have in documentation. If you think you've improved on something, I don't think it's disrespectful to compare yourself to it esp on something objective like 'the time it takes to perform an identical operation,' and it's helpful to everyone to have better linking between what's out there. One example i love is how the first two sections of the dask dataframes 'best practices' page are "use pandas" (and then clarifications about when you would want to use dask)

@Kai-Striega
Copy link

Kai-Striega commented Nov 1, 2024

@eliotwrobson

For the benchmark, some light benchmarks are included in this example notebook: https://github.com/eliotwrobson/CyNetDiff/blob/main/examples/visualizations.ipynb

I'm happy with the benchmarks as-is, for the purpose of a formal review, I think there is room for improvement.

I'm not sure if there's a more standard way of doing this? We also include some discussion of these benchmarks in our paper: https://www.vldb.org/pvldb/vol17/p4409-umrawal.pdf

There are tools out there to do this for you. The one that seems to be favoured by NumPy/SciPy (where I do most of my FOSS stuff) is air speed velocity. There are several other tools also available, but I don't know too much about them.

Regarding building, although changing / documenting something to work with Poetry is probably not the most productive given what others have said, I think having some example related to this in the packaging guide would be extremely useful. I had a very hard time even figuring out how to get started writing a package that used Cython, and this would be extremely useful to me on another project I want to get started on before too long.

I've tried building Cython projects with Poetry and actually gave up. I found it very hard/confusing, so just getting your package to build should be considered an achievement. My go to for building Cython projects is now SPIN. This uses Meson which has an inbuilt Cython Module.

@eliotwrobson
Copy link
Author

I've tried building Cython projects with Poetry and actually gave up. I found it very hard/confusing, so just getting your package to build should be considered an achievement. My go to for building Cython projects is now SPIN. This uses Meson which has an inbuilt Cython Module.

@sneakers-the-rat I think that even this simple recommendation alone would be a useful thing to add to the packaging guide 👍🏽

Regarding the remaining review items, I need to refactor another feature for this package to implement something for a paper I'm working on. I think the other reviewer is going to be able to take a look soon, so I'll wait for his comments before attacking the rest of these items.

@sneakers-the-rat
Copy link
Collaborator

Great - if it would be possible to open issues for each of the requested changes linking back to this issue (either of you), then I (or you) can make a list in the OP of completed/outstanding work to do so that the second reviewer can know whats already been raised/is being addressed and they can focus their review elsewhere, or comment on those ongoing issues!

@eliotwrobson
Copy link
Author

@sneakers-the-rat just made associated issues and linked them back here (they are the top 4 in the current issues list).

@jeremyzhangsq
Copy link

@eliotwrobson check this out:D

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • Vignette(s) demonstrating major functionality that runs successfully locally.
  • Function Documentation: for all user-facing functions.
  • Examples for all user-facing functions.
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements
The package meets the readme requirements below:

The README should include, from top to bottom:

  • The package name
  • Badges for:
    • Continuous integration and test coverage,
    • Docs building (if you have a documentation website),
    • A [repostatus.org](https://www.repostatus.org/) badge,
    • Python versions supported,
    • Current package version (on PyPI / Conda).
      • Yes for PyPI

NOTE: If the README has many more badges, you might want to consider using a table for badges: [see this example](https://github.com/ropensci/drake). Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

  • Short description of package goals.
  • Package installation instructions
  • Any additional setup required to use the package (authentication tokens, etc.)
  • Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the [README.md](http://readme.md/) file.
    • Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • Link to your documentation website.
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
    • The authors have compared this package with NDlib, a wide-accepted library for network diffusion. This package is more efficient (See the PVLDB paper).
  • Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:

  • Package documentation is clear and easy to find and use.
  • The need for the package is clear
  • All functions have documentation and associated examples for use
  • The package is easy to install

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
    • Although the authors provided the benchmark codes, I suggest the authors give some key results in the README.
  • Automated tests:
    • All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • Packaging guidelines: The package conforms to the pyOpenSci [packaging guidelines](https://www.pyopensci.org/python-package-guide).
    A few notable highlights to look at:

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching [JOSS's requirements](http://joss.theoj.org/about#paper_structure) with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6 hours


Review Comments

I have to admit that this project really surprised me. Previously, the IM community mainly used C++ for implementation for the sake of efficiency. However, it may cause inconvenience for real-world development and deployment (from my own experience). Thus, I believe this project will help our community a lot. Thanks to other reviewers’ comments, I can quickly catch up with the progress. Overall, this project meets my standards except two minors to improve.

  • I suggest showing some key evaluation results in the README, such as how fast CyNetDiff can be compared to NDlib.
  • To adapt other variants of IC and LT models in the future, I suggest to provide args about initial node thresholds and edge weights in
def linear_threshold(
    G: t.Union[nx.Graph, nx.DiGraph], seeds: t.Iterable[int], steps: int = 0
) -> t.List[t.List[int]]:

An example of the LT variant that requires customized thresholds and weights is this:

  • Nicola Barbieri and Francesco Bonchi. 2014. Influence maximization with viral
    product design. In ICDM.

@eliotwrobson
Copy link
Author

eliotwrobson commented Nov 21, 2024

Thanks to both reviewers for the feedback! Have a few more urgent things to take care of but will try to get through some of these changes over the upcoming holiday 👍🏽

EDIT: A little behind schedule (whoops), but will get to everything mentioned here before too long.

@eliotwrobson
Copy link
Author

eliotwrobson commented Dec 14, 2024

Finally had a chance to go over things again, and I managed to get a PR for the most significant item so far (SPEC7 compliance): eliotwrobson/CyNetDiff#28

@Kai-Striega if you have time, could you glance at this PR briefly? No need to spend too much time looking through the internals, I just want to double check that we're doing everything as expected in the spec.

As for some of the other comments, I think that we have a sufficient amount of customization in the model creation for now (edge weights are accepted as arguments by both models, and adding custom thresholds for the linear threshold model won't give a performance boost over pure-python unless we adopt a custom model of randomly generating thresholds in each simulation). So the only remaining items are the following

  • More examples in the documentation.
  • ASV / something showing more current benchmarks.
  • Releasing wheels on multiple platforms (I probably won't do this because I can't figure out how).

@Kai-Striega
Copy link

@eliotwrobson I'll take a look, hopefully today.

Releasing wheels on multiple platforms (I probably won't do this because I can't figure out how).

I think having wheels for multiple platforms is important. If you don't supply them, the user has to figure out how to compile your project on their own. i.e. you're moving the burden of having to figure out how to compile it on their platform from yourself to each individual user.

I'm happy to help with this a bit if you need it, basically for NumPy/SciPy we build it as a CI job. You can see how NumPy does it by looking in the .github/workflows/ directory.

@eliotwrobson
Copy link
Author

I'm happy to help with this a bit if you need it, basically for NumPy/SciPy we build it as a CI job. You can see how NumPy does it by looking in the .github/workflows/ directory.

@Kai-Striega Thank you! This would be greatly appreciated 🙏🏽 🙏🏽 I'd be open to switching the build system as well if that is necessary, I'm just currently not sure how to set up automated publishing with respect to this.

@eliotwrobson
Copy link
Author

Update on the status of things here. I've made a pass and resolved all but one of the outstanding requested changes from this review (the remaining is the distribution of wheels for other platforms, which I'll be handling next). The latest released version (0.1.16) has the following changes:

  • Examples for all of the user facing functions in the API, in addition to the extended examples present already: https://eliotwrobson.github.io/CyNetDiff/api/
  • Benchmarks run through pytest-benchmark on GitHub actions for the main branch. The comparison here is done with NDlib and there is a link to the latest benchmarks run from the main branch in the README. Not the most sophisticated benchmarks in the world but appropriate given the size of the project (ASV was ultimately too much overhead when there are only really a couple of tasks we care about benchmarking here).
  • Support for randomized activation and other utility functions to support additional use cases.

The wheel distribution will take a little more time than everything else, but I'll get to that as soon as I can. However, I think everything else is in a good spot 👍🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: under-review
Development

No branches or pull requests

7 participants