Parallactic angle computation via measures exhibits non-determinisitic behaviour. #221

JSKenyon · 2022-12-07T12:30:56Z

Describe the bug
In a multi-threaded regime, the parallactic angles may be slightly wrong.

To Reproduce
Run QuartiCal with input_model.apply_p_jones=1 and several datasets. The model will not be the same on every run. The bug is intermittent and inconsistent, similar to a race or out of bounds error. However, it is absent when no parallactic angles are applied. Note that the error will be very, very small but seems to occasionally have knock on effects.

Expected behavior
The parallactic angle computation should be deterministic - the same for given inputs.

Version
main

The text was updated successfully, but these errors were encountered:

JSKenyon · 2022-12-08T09:17:26Z

I have managed to reproduce this (inconsistently) outside QC (still calling uncontroversial QC internals though):

from quartical.data_handling.angles import _make_parangles
from concurrent.futures import ThreadPoolExecutor, wait, ProcessPoolExecutor
from daskms import xds_from_storage_table, xds_from_storage_ms
import numpy as np


if __name__ == "__main__":

    ms_path = "~/reductions/3C147/msdir/C147_unflagged.MS"

    data_xds_list = xds_from_storage_ms(
        ms_path,
        group_cols=("FIELD_ID", "DATA_DESC_ID", "SCAN_NUMBER")
    )

    anttab = xds_from_storage_table(ms_path + "::ANTENNA")[0]
    feedtab = xds_from_storage_table(ms_path + "::FEED")[0]
    fieldtab = xds_from_storage_table(ms_path + "::FIELD")[0]

    # We do this eagerly to make life easier.
    feeds = feedtab.POLARIZATION_TYPE.values
    unique_feeds = np.unique(feeds)

    if np.all([feed in "XxYy" for feed in unique_feeds]):
        feed_type = "linear"
    elif np.all([feed in "LlRr" for feed in unique_feeds]):
        feed_type = "circular"
    else:
        raise ValueError("Unsupported feed type/configuration.")

    phase_dirs = fieldtab.PHASE_DIR.values

    field_centres = [phase_dirs[0, 0] for xds in data_xds_list]

    # TODO: This could be more complicated for arrays with multiple feeds.
    receptor_angles = feedtab.RECEPTOR_ANGLE.values

    ant_names = anttab.NAME.values

    ant_positions_ecef = anttab.POSITION.values  # ECEF coordinates.

    epoch = "J2000"  # TODO: Should this be configurable?

    futures = []

    with ThreadPoolExecutor(max_workers=6) as tp:
        for xds, field_centre in zip(data_xds_list[:10], field_centres[:10]):
            futures.append(
                tp.submit(
                    _make_parangles,
                    xds.TIME.values,
                    ant_names,
                    ant_positions_ecef,
                    receptor_angles,
                    field_centre,
                    epoch
                )
            )

        wait(futures)

    results1 = [f.result() for f in futures]

    futures = []

    with ThreadPoolExecutor(max_workers=6) as tp:
        for xds, field_centre in zip(data_xds_list[:10], field_centres[:10]):
            futures.append(
                tp.submit(
                    _make_parangles,
                    xds.TIME.values,
                    ant_names,
                    ant_positions_ecef,
                    receptor_angles,
                    field_centre,
                    epoch
                )
            )

        wait(futures)

    results2 = [f.result() for f in futures]

    for i in range(10):
        print(np.all(results1[i] == results2[i]))
        print(np.max(results1[i] - results2[i]))

Note that the above only gives incorrect results some of the time. Will dig a little further.

JSKenyon · 2022-12-08T09:37:53Z

The above problem occurs with a ProcessPoolExecutor too but seems to go away when the number of processes exceeds the number of tasks i.e. the problem seems to stem from multiple usages of measures within a process.

sjperkins · 2022-12-08T09:40:29Z

The above problem occurs with a ProcessPoolExecutor too but seems to go away when the number of processes exceeds the number of tasks i.e. the problem seems to stem from multiple usages of measures within a process.

What happens if you recreate the measures server in each function call, instead of re-using an existing measures server?

JSKenyon · 2022-12-08T09:43:25Z

What happens if you recreate the measures server in each function call, instead of re-using an existing measures server?

The problem persists. I tried initialising it in each call and then doing a del on it to try and ensure that it was gone.

bennahugo · 2022-12-08T11:11:23Z

Measures is definitely not threadsafe, but not sure why it is giving differences between runs inside a processpool. What is the level of the differences?

…

On Thu, 08 Dec 2022, 11:43 JSKenyon, ***@***.***> wrote: What happens if you recreate the measures server in each function call, instead of re-using an existing measures server? The problem persists. I tried initialising it in each call and then doing a del on it to try and ensure that it was gone. — Reply to this email directly, view it on GitHub <#221 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4RE6UOXLXDINSTAXU7UALWMGUUTANCNFSM6AAAAAASWYCIW4> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

bennahugo · 2022-12-08T11:14:36Z

I should add that typically you initialize one measures object and set it up to do multiple epochs but it should not make a difference in the end ( up to machine precision )

…

On Thu, 08 Dec 2022, 13:10 Benna Hugo, ***@***.***> wrote: Measures is definitely not threadsafe, but not sure why it is giving differences between runs inside a processpool. What is the level of the differences? On Thu, 08 Dec 2022, 11:43 JSKenyon, ***@***.***> wrote: > What happens if you recreate the measures server in each function call, > instead of re-using an existing measures server? > > The problem persists. I tried initialising it in each call and then doing > a del on it to try and ensure that it was gone. > > — > Reply to this email directly, view it on GitHub > <#221 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AB4RE6UOXLXDINSTAXU7UALWMGUUTANCNFSM6AAAAAASWYCIW4> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

JSKenyon · 2022-12-08T11:18:12Z

Measures is definitely not threadsafe, but not sure why it is giving differences between runs inside a processpool. What is the level of the differences?

Varies, which is part of my confusion. Sometimes 3e-9, sometimes 1e-15, sometimes zero. Point is, while these are not very significant errors they are non-deterministic i.e. they can cause all subsequent code to change behaviour slightly. This makes bug hunting really difficult because results may change between runs. This only affects cases where you apply the parallactic angle i.e. not most use cases but it is still very annoying and I don't really have a solution short of wrapping the measures server in some sort of proxy and forcing all computations to happen in a single thread (this approach gets complicated quickly).

bennahugo · 2022-12-08T12:31:07Z

Yea these are well below any sort of accuracy you will get from astrometrics. It could stem from the underlying astrometric routines - which is at best subarcsecond level accuracy using 32bit floats. I think this level of variation is fine though -- it might just be that you need to up the tolerances a bit then?

…

On Thu, Dec 8, 2022 at 1:18 PM JSKenyon ***@***.***> wrote: Measures is definitely not threadsafe, but not sure why it is giving differences between runs inside a processpool. What is the level of the differences? Varies, which is part of my confusion. Sometimes 3e-9, sometimes 1e-15, sometimes zero. Point is, while these are not very significant errors they are non-deterministic i.e. they can cause all subsequent code to change behaviour slightly. This makes bug hunting really difficult because results may change between runs. This only affects cases where you apply the parallactic angle i.e. not most use cases but it is still very annoying and I don't really have a solution short of wrapping the measures server in some sort of proxy and forcing all computations to happen in a single thread (this approach gets complicated quickly). — Reply to this email directly, view it on GitHub <#221 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4RE6TL2RRH3VUPSLCRYUTWMG7YBANCNFSM6AAAAAASWYCIW4> . You are receiving this because you commented.Message ID: ***@***.***>

-- -- Benjamin Hugo PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

JSKenyon · 2022-12-08T12:35:20Z

It isn't really about numerical precision - it is about non-deterministic results. I don't care about small errors - what I care about is that those errors are not consistent for identical inputs. I should stress that this only seems to happen in the multi-threaded/multi-processing regime. The threaded case I could understand - simply attribute it to thread safely issues in the underlying routines. The fact that it happens for processes is a little more worrying.

bennahugo · 2022-12-08T12:45:53Z

It is probably stems from an accumulator precision error inside the astrometry / timing adjustments. As I said though this is negligible levels - you don't even know the astrometry to that kind of precision.

…

On Thu, Dec 8, 2022 at 2:35 PM JSKenyon ***@***.***> wrote: It isn't really about numerical precision - it is about non-deterministic results. I don't care about small errors - what I care about is that those errors are not consistent for identical inputs. I should stress that this only seems to happen in the multi-threaded/multi-processing regime. The threaded case I could understand - simply attribute it to thread safely issues in the underlying routines. The fact that it happens for processes is a little more worrying. — Reply to this email directly, view it on GitHub <#221 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4RE6QJZGOUTN64IDQF473WMHIZHANCNFSM6AAAAAASWYCIW4> . You are receiving this because you commented.Message ID: ***@***.***>

-- -- Benjamin Hugo PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

bennahugo · 2022-12-08T12:47:07Z

(accumulators are not commutitive at this level so it sort of makes sense that you see differences after multiple calls in various orderings)

…

On Thu, Dec 8, 2022 at 2:45 PM Benna Hugo ***@***.***> wrote: It is probably stems from an accumulator precision error inside the astrometry / timing adjustments. As I said though this is negligible levels - you don't even know the astrometry to that kind of precision. On Thu, Dec 8, 2022 at 2:35 PM JSKenyon ***@***.***> wrote: > It isn't really about numerical precision - it is about non-deterministic > results. I don't care about small errors - what I care about is that those > errors are not consistent for identical inputs. I should stress that this > only seems to happen in the multi-threaded/multi-processing regime. The > threaded case I could understand - simply attribute it to thread safely > issues in the underlying routines. The fact that it happens for processes > is a little more worrying. > > — > Reply to this email directly, view it on GitHub > <#221 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AB4RE6QJZGOUTN64IDQF473WMHIZHANCNFSM6AAAAAASWYCIW4> > . > You are receiving this because you commented.Message ID: > ***@***.***> > -- -- Benjamin Hugo PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

-- -- Benjamin Hugo PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

JSKenyon · 2022-12-08T12:51:59Z

(accumulators are not commutitive at this level so it sort of makes sense that you see differences after multiple calls in various orderings)

I will defer to your knowledge but that still seems off. I appreciate that adding numbers in a different order will give different results due to round-off. What I don't understand is why those numbers would be accumulated in a different order given that each call should be independent. The only way this makes sense is if the measures server somehow remembers is previous usage in each process, despite the fact that each function call should create a new measures object. I don't pretend to know how the C++ layer works here though.

bennahugo · 2022-12-08T13:09:43Z

It has some inner shared state at class level (which I confess not to have dived deep into, but it is there -- hence threadsafety issues as noted in Daniel Muscat's thesis). I am guessing this shared state is what is causing a slight error in the accumulation somewhere

…

On Thu, Dec 8, 2022 at 2:52 PM JSKenyon ***@***.***> wrote: (accumulators are not commutitive at this level so it sort of makes sense that you see differences after multiple calls in various orderings) I will defer to your knowledge but that still seems off. I appreciate that adding numbers in a different order will give different results due to round-off. What I don't understand is why those numbers would be accumulated in a different order given that each call should be independent. The only way this makes sense is if the measures server somehow remembers is previous usage in each process, despite the fact that each function call should create a new measures object. I don't pretend to know how the C++ layer works here though. — Reply to this email directly, view it on GitHub <#221 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB4RE6TKISP4T7H52JNHGZ3WMHKXVANCNFSM6AAAAAASWYCIW4> . You are receiving this because you commented.Message ID: ***@***.***>

-- -- Benjamin Hugo PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

JSKenyon · 2022-12-08T13:13:37Z

I am guessing this shared state is what is causing a slight error in the accumulation somewhere

Agreed - but I am not sure what the path forward is. I have actually just implemented a somewhat nasty workaround that briefly launches a process pool whenever parallactic angles need to be computed. This guarantees a unique PID which seems to solve the problem. It is nasty though, as it falls into the threads creating processes regime which may be brittle at scale.

bennahugo · 2022-12-08T13:20:38Z

Yea I would just decrease the testing tolerances - launching processes to deal with a numerical issue that is below the achievable coordinate accuracy is a bit... too conservative ... on the numbers I think :)

Edit: I think a good tolerance to use is order 50 mas

JSKenyon · 2022-12-08T13:26:24Z

Again, this is not about numerical tolerances. It is about the fact that no two runs will have same results. That makes debugging almost impossible. It can even cause catastrophic differences. Imagine having a point which is flagged in one run but not in another due to this discrepancy (this is not hypothetical, I have observed this behaviour). This can have major knock on effects which can hinder end-users and me as a developer. Don't get me wrong, I know that the error is tiny but it is inconsistent and that is the problem.

bennahugo · 2022-12-08T13:42:41Z

You can probably use astropy or pyephem library for this tbh. The main thing to get correct I think is the epoch conversions.

The only thing I really use casacore for is uvw calculations - nothing else comes close

JSKenyon · 2023-01-03T08:50:23Z

Just checking in here although I am feeling pretty defeated by the problem. I have done a fair amount of work trying out both astropy and skyfield. They both produce results which differ from casacore.meaures. The first point of discrepancy is the antenna positions. astropy and skyfield give similar geodetic values but their latitude values always differ from casacore.measures by ~1e-3. This leads to discrepancies in the parallactic angle, but it is difficult to say if this is the only source of the discrepancies as I am not sure what corrections are applied under the hood in the various packages.

JSKenyon pinned this issue Dec 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallactic angle computation via measures exhibits non-determinisitic behaviour. #221

Parallactic angle computation via measures exhibits non-determinisitic behaviour. #221

JSKenyon commented Dec 7, 2022

JSKenyon commented Dec 8, 2022

JSKenyon commented Dec 8, 2022

sjperkins commented Dec 8, 2022

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 •

edited

Loading

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022

JSKenyon commented Jan 3, 2023

Parallactic angle computation via measures exhibits non-determinisitic behaviour. #221

Parallactic angle computation via measures exhibits non-determinisitic behaviour. #221

Comments

JSKenyon commented Dec 7, 2022

JSKenyon commented Dec 8, 2022

JSKenyon commented Dec 8, 2022

sjperkins commented Dec 8, 2022

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 via email

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022 • edited Loading

JSKenyon commented Dec 8, 2022

bennahugo commented Dec 8, 2022

JSKenyon commented Jan 3, 2023

bennahugo commented Dec 8, 2022 •

edited

Loading