Excessive copying in dal.adhoc.iter_datalinks #537
Comments
I'll have a look. The deep copy seems wasteful, but it was convenient at the time. There should be a better way. I'm pretty sure the optimization came as a necessity because of the overhead associated with multiple datalink roundtrips.
On Tue, Apr 09, 2024 at 05:05:51PM -0700, Adrian wrote:

> sure, the optimization came as a necessity because of the overhead
> associated with multiple datalink roundtrips.

Out of curiosity (and with a view to future standardisation): Did you measure that? Did you actually observe improvements?

I frankly have a hard time believing that with HTTP 1.1 persistent connections multi-ID makes a noticeable difference (more than a few percent) in actual workflows (where you will, in general, do requests for the linked data on top). Given the runtime behaviour of the implementation for the sort of id sets where runtime becomes an issue, I am pretty sure that a naive implementation would actually be faster, significantly so when you have 1000s of ids.
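To make the claim above concrete, here is a minimal sketch of the naive approach: one DataLink request per ID over a single persistent HTTP/1.1 connection. The endpoint URL is a placeholder and `fetch_links_one_by_one` is an invented name, not existing pyvo API.

```python
import requests

DATALINK_ENDPOINT = "https://example.org/datalink"  # placeholder URL


def fetch_links_one_by_one(ids):
    # A requests.Session reuses the underlying TCP connection between
    # calls (HTTP/1.1 keep-alive), so each extra request costs roughly
    # one round trip rather than a fresh connection handshake.
    with requests.Session() as session:
        for id_ in ids:
            resp = session.get(DATALINK_ENDPOINT, params={"ID": id_})
            resp.raise_for_status()
            yield resp.content  # one DataLink VOTable per id
```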
#601 - is it related? Haven't looked at the spec yet, but can we replace the `GET` with a single `POST` as it is suggested, so no chunking is required?
On Mon, Sep 23, 2024 at 04:54:17PM -0700, Adrian wrote:

> #601 - is it related? Haven't looked at the spec yet, but can we
> replace the `GET` with a single `POST` as it is suggested so no
> chunking is required?

No -- MAXREC is independent of the method. But I will give you that the fact that such massive 100s-of-IDs requests (I'd still be curious about the user story behind them) do exist suggests that my assessment that multi-ID datalink was a bad compromise between capability and simplicity may be wrong. But they must have a terrible run time in the current implementation, and so I'd say fixing this bug gains some urgency.

I *think* taking the ids out of the VOTable in one go and then batching from the resulting list ought to be a fairly straightforward and fast implementation.
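A rough sketch of that batching idea, assuming the ids can be pulled out of the result VOTable once up front; `iter_datalinks_batched` and `fetch_batch` are invented names, not existing pyvo API.

```python
from itertools import islice


def batched(iterable, size):
    # Yield successive lists of at most `size` items.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def iter_datalinks_batched(ids, fetch_batch, batch_size=50):
    # `ids` is the full id list taken out of the VOTable in one go;
    # `fetch_batch` is assumed to issue one multi-ID DataLink request
    # per chunk and yield the per-id results in order.
    for chunk in batched(ids, batch_size):
        yield from fetch_batch(chunk)
```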
I was somewhat surprised when something like the following showed very high CPU usage (and hence runtime):
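(The exact snippet is not preserved; what follows is a minimal sketch of the kind of loop in question, assuming a datalink-capable TAP service. The service URL and the query are placeholders.)

```python
import pyvo

svc = pyvo.dal.TAPService("https://example.org/tap")  # placeholder
results = svc.search("SELECT TOP 1000 * FROM ivoa.obscore")

# iter_datalinks yields one DatalinkResults per dataset; with many
# rows, this loop is where the runtime piles up.
for links in results.iter_datalinks():
    for row in links:
        print(row["access_url"])
```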
Profiling (`p.sort_stats("cumtime").print_stats(10)`) showed that almost all the runtime is spent copying the votable object in `clone_byid`. That, in turn, has only become necessary because we try to retrieve multiple ids at a time as an "optimisation".

Let me bluntly confess that I've always considered the multi-id feature of datalink a particularly bad deal in terms of optimisation potential versus implementation complexity. But if we spend more time on managing multi-id (and apparently get quadratic runtime on top) than we could possibly save in terms of HTTP round trips, then we should do something.
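As an aside, the quoted pstats call fits into the standard cProfile workflow roughly like this (profiling the loop from the sketch above):

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
for links in results.iter_datalinks():  # `results` as in the sketch above
    pass
pr.disable()

p = pstats.Stats(pr)
p.sort_stats("cumtime").print_stats(10)  # top 10 entries by cumulative time
```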
Would anyone greatly object if I wrote an `iter_datalinks` with a trivial (one id at a time) implementation and we used the current multi-id implementation only on request (e.g., a `use_multi_id=False` keyword argument)? I am fairly confident the straightforward implementation would be faster, not to mention a lot more robust.
Or does anyone want to fix the current implementation to avoid the excessive copying?
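For concreteness, a hypothetical shape for that proposal; only `iter_datalinks` and `use_multi_id` come from the text above, the helpers are invented names:

```python
def iter_datalinks(self, use_multi_id=False):
    # Keep the current batched behaviour available, but only on request.
    if use_multi_id:
        yield from self._iter_datalinks_multi_id()  # hypothetical helper
        return
    # Trivial implementation: one DataLink request per id, so there is
    # no clone_byid call and no repeated copying of the votable object.
    for id_ in self._datalink_ids():  # hypothetical helper
        yield self._fetch_datalink(id_)  # hypothetical helper
```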