Excessive copying in dal.adhoc.iter_datalinks #537
Comments
I'll have a look. The deep copy seems wasteful, but it was convenient at the time. There should be a better way. I'm pretty sure the optimization came as a necessity because of the overhead associated with multiple datalink roundtrips.
On Tue, Apr 09, 2024 at 05:05:51PM -0700, Adrian wrote:

> sure, the optimization came as a necessity because of the overhead
> associated with multiple datalink roundtrips.

Out of curiosity (and with a view to future standardisation): Did you measure that? Did you actually observe improvements?

I frankly have a hard time believing that with HTTP 1.1 persistent connections multi-ID makes a noticeable difference (more than a few percent) in actual workflows (where you will, in general, do requests for the linked data on top). Given the runtime behaviour of the implementation for the sort of id sets where runtime becomes an issue, I am pretty sure that a naive implementation would actually be faster, significantly so when you have 1000s of ids.
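To make the claim above concrete, here is a minimal sketch of the naive approach: one DataLink request per ID over a single persistent HTTP/1.1 connection. The endpoint URL is a placeholder and `fetch_links_one_by_one` is an invented name, not existing pyvo API.

```python
import requests

DATALINK_ENDPOINT = "https://example.org/datalink"  # placeholder URL


def fetch_links_one_by_one(ids):
    # A requests.Session reuses the underlying TCP connection between
    # calls (HTTP/1.1 keep-alive), so each extra request costs roughly
    # one round trip rather than a fresh connection handshake.
    with requests.Session() as session:
        for id_ in ids:
            resp = session.get(DATALINK_ENDPOINT, params={"ID": id_})
            resp.raise_for_status()
            yield resp.content  # one DataLink VOTable per id
```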
#601 - is it related? Haven't looked at the spec yet, but can we replace the `GET` with a single `POST` as it is suggested, so no chunking is required?
On Mon, Sep 23, 2024 at 04:54:17PM -0700, Adrian wrote:

> #601 - is it related? Haven't looked at the spec yet, but can we
> replace the `GET` with a single `POST` as it is suggested so no
> chunking is required?

No -- MAXREC is independent of the method. But I will give you that the fact that such massive 100s-of-IDs requests (I'd still be curious about the user story behind them) do exist suggests that my assessment that multi-ID datalink was a bad compromise between capability and simplicity may be wrong. But they must have a terrible run time in the current implementation, and so I'd say fixing this bug gains some urgency.

I *think* taking the ids out of the VOTable in one go and then batching from the resulting list ought to be a fairly straightforward and fast implementation.
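A rough sketch of that batching idea, assuming the ids can be pulled out of the result VOTable once up front; `iter_datalinks_batched` and `fetch_batch` are invented names, not existing pyvo API.

```python
from itertools import islice


def batched(iterable, size):
    # Yield successive lists of at most `size` items.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def iter_datalinks_batched(ids, fetch_batch, batch_size=50):
    # `ids` is the full id list taken out of the VOTable in one go;
    # `fetch_batch` is assumed to issue one multi-ID DataLink request
    # per chunk and yield the per-id results in order.
    for chunk in batched(ids, batch_size):
        yield from fetch_batch(chunk)
```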
I was somewhat surprised when something like the following showed very high CPU usage (and hence runtime):
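(The exact snippet is not preserved; what follows is a minimal sketch of the kind of loop in question, assuming a datalink-capable TAP service. The service URL and the query are placeholders.)

```python
import pyvo

svc = pyvo.dal.TAPService("https://example.org/tap")  # placeholder
results = svc.search("SELECT TOP 1000 * FROM ivoa.obscore")

# iter_datalinks yields one DatalinkResults per dataset; with many
# rows, this loop is where the runtime piles up.
for links in results.iter_datalinks():
    for row in links:
        print(row["access_url"])
```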
Profiling (`p.sort_stats("cumtime").print_stats(10)`) showed that almost all the runtime is spent copying the votable object in `clone_byid`. That, in turn, has only become necessary because we try to retrieve multiple ids at a time as an "optimisation".

Let me bluntly confess that I've always considered the multi-id feature of datalink a particularly bad deal in terms of optimisation potential versus implementation complexity. But if we spend more time on managing multi-id (and apparently get quadratic runtime on top) than we could possibly save in terms of HTTP round trips, then we should do something.
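As an aside, the quoted pstats call fits into the standard cProfile workflow roughly like this (profiling the loop from the sketch above):

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
for links in results.iter_datalinks():  # `results` as in the sketch above
    pass
pr.disable()

p = pstats.Stats(pr)
p.sort_stats("cumtime").print_stats(10)  # top 10 entries by cumulative time
```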
Would anyone greatly object if I wrote an `iter_datalinks` with a trivial (one id at a time) implementation and we used the current multi-id implementation only on request (e.g., a `use_multi_id=False` keyword argument)? I am fairly confident the straightforward implementation would be faster, not to mention a lot more robust.
Or does anyone want to fix the current implementation to avoid the excessive copying?
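For concreteness, a hypothetical shape for that proposal; only `iter_datalinks` and `use_multi_id` come from the text above, the helpers are invented names:

```python
def iter_datalinks(self, use_multi_id=False):
    # Keep the current batched behaviour available, but only on request.
    if use_multi_id:
        yield from self._iter_datalinks_multi_id()  # hypothetical helper
        return
    # Trivial implementation: one DataLink request per id, so there is
    # no clone_byid call and no repeated copying of the votable object.
    for id_ in self._datalink_ids():  # hypothetical helper
        yield self._fetch_datalink(id_)  # hypothetical helper
```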