
Excessive copying in dal.adhoc.iter_datalinks #537

Open
msdemlei opened this issue Apr 9, 2024 · 4 comments

Comments

@msdemlei
Contributor

msdemlei commented Apr 9, 2024

I was somewhat surprised when code like the following showed unexpectedly high CPU usage (and hence runtime):

import pyvo
from astropy.coordinates import SkyCoord

# SSA service publishing the FEROS spectra
svc = pyvo.ssa.SSAService("http://dc.g-vo.org/feros/q/ssa/ssap.xml?")
matches = svc.search(
    SkyCoord.from_name("EI Eri"),
    radius=0.001,
    maxrec=200,
    format="votable")

# fetch the datalink document for each match and pull out
# the link pointing to a preview
for dl in matches.iter_datalinks():
    rec = next(dl.bysemantics("#preview"))
Profiling (p.sort_stats("cumtime").print_stats(10)) yielded something like this:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    698/1    0.008    0.000  160.721  160.721 {built-in method builtins.exec}
        1    0.013    0.013  160.721  160.721 datalink-previews.py:1(<module>)
      201    0.118    0.001  144.192    0.717 adhoc.py:174(iter_datalinks)
      200    0.495    0.002  142.322    0.712 adhoc.py:573(clone_byid)
30679931/259   53.656    0.000  118.169    0.456 copy.py:128(deepcopy)
1659037/237   14.289    0.000  118.158    0.499 copy.py:227(_deepcopy_dict)
1496762/562   11.641    0.000  118.157    0.210 copy.py:259(_reconstruct)
      200    0.576    0.003   23.373    0.117 adhoc.py:591(<listcomp>)
122425/121824    0.581    0.000   21.647    0.000 core.py:3205(__getitem__)
   120000    0.333    0.000   20.755    0.000 core.py:6315(__new__)
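(For reference, a profile like the above can be collected with cProfile and pstats; a minimal sketch, assuming the example is saved as datalink-previews.py — the exact invocation isn't given in this issue:

import cProfile
import pstats

# run the example under the profiler and keep the raw statistics
cProfile.run(open("datalink-previews.py").read(), "datalink-previews.prof")

# print the ten call sites with the highest cumulative time
p = pstats.Stats("datalink-previews.prof")
p.sort_stats("cumtime").print_stats(10)
)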

Hence, almost all the runtime is spent copying the votable object in clone_byid. That, in turn, has only become necessary because we try to retrieve multiple ids at a time as an "optimisation".

Let me bluntly confess that I've always considered the multi-id feature of datalink a particularly bad deal in terms of optimisation potential versus implementation complexity, but if we spend more time on managing multi-id (and apparently get quadratic runtime on top) than we could possibly save in terms of HTTP round trips, then we should do something.

Would anyone greatly object if I wrote an iter_datalinks with a trivial (one id at a time) implementation and we used the current multi-id implementation only on request (e.g., a use_multi_id=False keyword argument)?
I am fairly confident the straightforward implementation would be faster, not to mention a lot more robust.
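A minimal sketch of what the trivial implementation could look like, assuming pyvo's per-record getdatalink() method (this is an illustration, not a patch against adhoc.py):

def iter_datalinks_simple(results):
    # hypothetical one-id-at-a-time variant of iter_datalinks:
    # one datalink request per record, no clone_byid, and hence
    # no deep copies of the VOTable
    for record in results:
        yield record.getdatalink()

# drop-in for the example above:
# for dl in iter_datalinks_simple(matches):
#     rec = next(dl.bysemantics("#preview"))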

Or does anyone want to fix the current implementation to avoid the excessive copying?

@andamian
Contributor

I'll have a look. The deep copy seems wasteful, but it was convenient at the time. There should be a better way. I'm pretty sure the optimization came about as a necessity because of the overhead associated with multiple datalink round trips.

@msdemlei
Contributor Author

msdemlei commented Apr 10, 2024 via email

@andamian
Contributor

#601 - is it related? I haven't looked at the spec yet, but can we replace the GET with a single POST, as suggested there, so that no chunking is required?
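For illustration, a minimal sketch of such a request; the endpoint URL and identifiers below are placeholders, and whether a given service accepts POST here is exactly the spec question above:

import requests

# A DataLink {links} endpoint takes one or more ID parameters; putting
# them in a POST body sidesteps URL length limits, so a single request
# can carry all identifiers without chunking them into several GETs.
ids = ["ivo://example/spectrum1", "ivo://example/spectrum2"]  # placeholders
response = requests.post(
    "http://example.org/datalink/links",  # placeholder endpoint
    data={"ID": ids})
response.raise_for_status()
links_votable = response.content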

@msdemlei
Contributor Author

msdemlei commented Sep 24, 2024 via email
