Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataIntegralProblem #491

Merged
merged 15 commits into from
Sep 17, 2023
Merged

Conversation

IlianPihlajamaa
Copy link
Contributor

See this issue.

@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

Merging #491 (1e960d1) into master (03cdaed) will decrease coverage by 23.98%.
Report is 7 commits behind head on master.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master     #491       +/-   ##
===========================================
- Coverage   57.27%   33.30%   -23.98%     
===========================================
  Files          50       50               
  Lines        3703     3681       -22     
===========================================
- Hits         2121     1226      -895     
- Misses       1582     2455      +873     
Files Changed Coverage Δ
src/SciMLBase.jl 68.42% <ø> (-3.01%) ⬇️
src/problems/basic_problems.jl 87.71% <100.00%> (-0.75%) ⬇️

... and 32 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@IlianPihlajamaa IlianPihlajamaa changed the title add IntegralDataProblem add x and dim field to IntegralProblem Sep 13, 2023
@IlianPihlajamaa
Copy link
Contributor Author

Also added 2 constructors:

function IntegralProblem(f, x::AbstractVector{<:Number}, args...; kwargs...)
    IntegralProblem{isinplace(f, 3)}(f, x, args...; kwargs...)
end
function IntegralProblem(y::AbstractArray, x::AbstractVector{<:Number}, args...; dim::Int=1, kwargs...)
    IntegralProblem{false}(y, x, args...; dim=Val(dim), kwargs...)
end

The first for integrating a callable f over some set of sample points x, and the second for integrating an Array y over some dimension dim. I chose to put x second in both of these (contrary to what is usual in many packages/languages), because this way it is consistent with the IntegralProblem(f, lb, ub) ordering. Let me know if you agree.

@IlianPihlajamaa
Copy link
Contributor Author

Oops, there was a small mistake, test SciMLBase now passes locally, sorry!

@IlianPihlajamaa
Copy link
Contributor Author

There are still method ambiguities, but I don't know how to get rid of them...

@lxvm
Copy link
Contributor

lxvm commented Sep 16, 2023

If I may comment on this pr, I think adding a separate DataIntegralProblem would be preferable to modifying the existing IntegralProblem. I see two reasons for this:

  1. These are distinct mathematical problems since an IntegralProblem is continuous, i.e. we are given a function f and a domain [lb, ub] and we evaluate f wherever we please for any quadrature, and a DataIntegralProblem is discrete, i.e. the values of x and f.(x) are the inputs. The difference is that in the first case, choosing the integration points and weights together carefully can produce high-order accurate quadratures that are much more accurate than those available for unstructured grids (e.g. trapezoidal rule). This is due to mathematical connections between integration and interpolation.
  2. The algorithms currently used to solve IntegralProblems only accept functions with domains, not data, so none of the existing implementation of IntegralProblems could be reused for DataIntegralProblems. The current design of this pr, which extends IntegralProblems with more options could lead to more confusion, errors, and complicated autodiff implementation.

I hope this is useful feedback and I am happy to see a pr like this because I think this feature would be very useful.

@IlianPihlajamaa
Copy link
Contributor Author

Feedback is always welcome of course.
Personally, i am impartial to whether or not it should be separated in the API. Perhaps others could weigh in? (@ChrisRackauckas ) From an implementation point of view, separate structs are definitely easier.

@lxvm lxvm mentioned this pull request Sep 16, 2023
4 tasks
@ChrisRackauckas
Copy link
Member

Separate it for the reasons @lxvm mentioned. When I last reviewed this PR it was separated to have a separate IntegralDataProblem for those reasons. What happened and why was it changed?

Indeed overlapping them doesn't make sense because their dispatches need to be separate, along with most of their data, so it's not quite clear what would be gained by keeping them together. The AD passes indeed need to be very different too, so I do not understand why it would be part of IntegralProblem. But again, I thought they were already separate in 07e5698 ?

Copy link
Member

@ChrisRackauckas ChrisRackauckas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should split to a separate IntegralDataProblem like originally discussed.

@IlianPihlajamaa
Copy link
Contributor Author

The reason I merged them into the same struct was because, for the case

solve(IntegralProblem(f::Function, x::AbstractVector), Trapezoidal())

There needed to be an x-field in IntegralProblem anyway. And so I thought that it therefore wasnt necessary to split the two. I hadnt considered the AD angle.

In hindsight

  1. In the case of integrating a function, the specification of the precise grid (as opposed to a given number of points is probably never useful. So it isn't necessary to put this extra field in.
  2. For integrating data it is probably better abyway to have a separate api, as argued above.

So I will revert the changes (apart from the added tests) when I have time, reintroducing the DataIntegralProblem.

@ChrisRackauckas
Copy link
Member

Thanks for the explanation! I'll wait for the revert to re-review.

@IlianPihlajamaa IlianPihlajamaa changed the title add x and dim field to IntegralProblem Add DataIntegralProblem Sep 16, 2023
@IlianPihlajamaa
Copy link
Contributor Author

Do you prefer IntegralDataProblem or DataIntegralProblem? The last one sounds slightly better to me, but I'm not a native English speaker.

Anyway, it should be ready for review now.

@lxvm
Copy link
Contributor

lxvm commented Sep 16, 2023

Perhaps SampledIntegralProblem? Both https://docs.scipy.org/doc/scipy/tutorial/integrate.html#integrating-using-samples and https://github.com/dextorious/NumericalIntegration.jl use the word 'sampling' so it might be more familiar to newcomers, and 'sampled' implies that the integrand has already been evaluated

@IlianPihlajamaa
Copy link
Contributor Author

SampledIntegralProblem

I like it.

@ChrisRackauckas
Copy link
Member

👍 on SampledIntegralProblem

It is assumed that the values of `y` along dimension `dim`
correspond to the integrand evaluated at sampling points `x`
- x: Sampling points, must be a subtype of `AbstractVector`.
- dim: Dimension along which to integrate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to leave it up to the user? I think that just adds more work in the implementation without a tangible benefit. dimension should always be the last one as other wise you'll have a performance hit from column major, so we should just stick to that.

Copy link
Contributor Author

@IlianPihlajamaa IlianPihlajamaa Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, with both (1) that the last dim is always faster, and (2) even if it were that we should force all users to use that.

  1. Here is a simple example:
function trapz(y, dim)
          solve(SampledIntegralProblem(y, axes(y,dim); dim=dim), 
           TrapezoidalRule())
end
y = rand(5, 100000)
julia> @btime trapz(trapz($y,2), 1)
 1.165 ms (8 allocations: 400 bytes)
u: 200010.59503616963

julia> @btime trapz(trapz($y,1), 1)
 395.000 μs (9 allocations: 781.59 KiB)
u: 200010.59503616992

while there are better ways to do a two-dimensional trapezoidal rule of course.

  1. Doing numeric integration over some array isn't always the performance bottleneck of the algorithm. Typically, generating the data in the first place (by some experiment perhaps) takes more time. It would be annoying to force users to stick to a certain data layout to be able to do integration over it if it is just a small part of their workflow. This is also why other packages implement this feature, see SciPy or Trapz.jl for example.

Having said that, you are of course right that the last dimension is typically faster and should be the default, so I'll change that.

@ChrisRackauckas
Copy link
Member

I'm a bit weary on the dim part, but let's see how this goes.

@ChrisRackauckas ChrisRackauckas merged commit 690f6f0 into SciML:master Sep 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants