dseupd_ randomly fails on mips64el arch #294

jgmbenoit · 2021-02-03T15:18:37Z

Expected behavior

dseupd_ always succeeds on any arch

Actual behavior

dseupd_ randomly fails on mips64el arch

Where/how to reproduce the problem

arpack-ng: release 3.8.0
OS: debian Sid (20210202)
compiler: gcc
environment:
configure:
input data:

Steps to reproduce the problem

compile bugreport-arpack-dseupd.c against the arpack library as distributed in Debain Sid on a mips64el computer.
enter the command-line while (./bugreport-arpack-dseupd) ; do : ; done | nl and wait until it stops.

Error message

Traces

Callstack

Notes, remarks

The bug is reported in Debian #981646

The text was updated successfully, but these errors were encountered:

sylvestre · 2021-02-03T15:30:21Z

Patches welcome ! I am not planing to spend time on this, sorry!

jgmbenoit · 2021-02-03T16:09:45Z

Do you have any hint ?

sylvestre · 2021-02-03T16:12:18Z

Nope, sorry :(

fghoussen · 2021-02-04T18:55:24Z

Can't afford spending time on that neither ?!...
Did you try ISO_C_BINDING : cmake -DICB=ON and replace old fashioned dseupd_ (I've seen some in your bug report) by type compliant (insured by compiler) dseupd_c

jgmbenoit · 2021-02-04T19:28:49Z

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

fghoussen · 2021-02-05T17:40:00Z

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

Yes. With ICB enabled, make install will install arpack.h. Include arpack.h, and, replace any of the [sdcz][sn][ae]upd_ by the associated [sdcz][sn][ae]upd_c. dseupd_c is just a "C/F90 gateway" which calls the original dseupd_ behind the scene but with correct types insured by the compiler according to Fortran 2003 standard (an example described here : https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_f/main_for/GUID-C0FA4080-4548-4045-9E6D-EFE57A0DE1A3.htm) so the compiler version you use must support F2003 standard (or configure should fail).

jgmbenoit · 2021-02-08T13:25:43Z

Thanks for the precisions. I transformed my test sample accordingly: bugreport-arpack-dseupd_c.c .
Unfortunately the error persists.

fghoussen · 2022-07-09T10:16:21Z

@jgmbenoit: howmny seems to be character*1

arpack-ng/SRC/dseupd.f

Line 54 in 872df83

c HOWMNY Character*1 (INPUT)

Did you try that?

- char *all="All";
+ char *all="A";

as pointed out in opencollab/arpack-ng#294 (comment)

szhorvat · 2022-07-09T10:43:12Z

It shouldn't make a different if the buffer has extra characters though. dseupd can choose to read just a single one.

I do wonder about the appropriate calling convention for this function. Doesn't gfortran require passing the string lengths as additional parameters at the end?

fghoussen · 2022-07-09T11:42:37Z

Doesn't gfortran require passing the string lengths as additional parameters at the end?

Don't think so

fghoussen · 2022-07-09T15:55:18Z

Could you compile bugreport-arpack-dseupd_c.c in debug mode on mips64el and report here a backtrace? May be helpful to get some clue

jgmbenoit · 2022-07-10T13:58:19Z

I am not so sure what I may read by ``in debug''. Can you specify the sequence of command to follow ?

fghoussen · 2022-07-10T14:37:32Z

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53. If so, why do you need such a strict tolerance? It is OK with just double tol=1.e-10;. Arpack is an iterative search/method so using very-very strict tolerance has no meaning (at least as I understand it: there is no way you get an exact result).

Are you sure you never hit ido == 2? (https://github.com/igraph/igraph/blob/89df4735b20737fdfcaa7457a594713ae9ac3e04/src/linalg/arpack.c#L989). In this case:

arpack-ng/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp

Line 797 in d280435

else if (ido == 2) {

not so sure what I may read by ``in debug'

compile with -g -O0 and report a backtrace if possible

szhorvat · 2022-07-10T15:03:55Z

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53.

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

fghoussen · 2022-07-10T16:52:07Z

Try to use a regular tolerance (1.e-10 or 1.e-12 - below this has no meaning and may even trigger numerical instabilities).
Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).
Trap unexpected ido to make sure you don't screw the RCI pattern

arpack-ng/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp

Line 802 in d280435

else if (ido != 99) {cerr << "Error: unexpected ido " << ido << " - KO" << endl; return 1;}

As far as I remember, if you don't use implicit shifts, you never hit ido == 3 but otherwise you must handle this case too

arpack-ng/SRC/dsaupd.f

Line 155 in d280435

c IPARAM(1) = ISHIFT: method for selecting the implicit shifts.

To me, the most simple solution is to always use exact shifts

arpack-ng/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp

Line 725 in d280435

iparamAupd[0] = 1; // Use exact shifts (=> we'll never have ido == 3).
Break the **aupd-while loop only if ido == 99

arpack-ng/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp

Line 807 in d280435

} while (ido != 99);

I had a hard time with arpack too... Hope this helps

szhorvat · 2022-07-10T18:20:03Z

Including @ntamas here regarding the gfortran calling convention.

ntamas · 2022-07-11T10:34:37Z

@szhorvat Re the calling convention: I think it depends on whether you are using gfortran or not, and whether you are using sibling-call optimizations. More details in the R blog, start reading from the "Strings in Fortran and LAPACK" section.

fghoussen · 2022-07-14T10:34:26Z

To me, looks more like an arpack miuse that an arpack bug.
@jgmbenoit: if so, when OK at your side, don't forget to close this issue at our side. Thanks!

jgmbenoit · 2022-07-14T10:42:11Z

I will a have a closer look this week-end.

szhorvat · 2022-07-14T11:40:45Z

2. Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).

bmat was set to 'I' for standard problems and there is no B matrix. Doesn't that mean that there will never be an ido == 2 value? That's my understanding from reading the docs here.

fghoussen · 2022-07-14T12:00:17Z

Doesn't that mean that there will never be an ido == 2 value?

Sounds likely, but, not sure: doc is not clear to me (when bmat = 'I' you have actually B = Identity so that ido == 2 is not a no-op: you replace Y with X before next iteration which re-use the modified Y... Right?). From here, your biggest problem looks like to be the tolerance: below tol = 1.e-12, tol has no meaning (arpack is all about float and double which max tol is 1.e-12: this means you can hope at best to have 12 significant digits [in debug mode, less in release mode], so that if you ask for example tol = 1.e-15 the 13th, 14th, 15th will anyway have no meaning... But will take more time to compute and may increase risks of numerical instabilities). So if you ask tol = 2^-53 ?!... You are risking even more!

May be wrong but that's my impression (had a hard time with arpack long time ago!)

fghoussen · 2022-07-14T12:35:13Z

@szhorvat: BTW, you may want/need to make some statistics with your baseline data/use-cases.

Say you ask for tol = 1.e-12 and maxit = 100, but that, you observe on your baseline data/use-cases that you get poor accuracy (say "bad" eigen vectors/values only 1.e-6 accurate): here, you may want/need to overkill maxit (maxit = 200 or maxit = 1000). tol is not the only stopping criterion for arpack: arpack will stop if tol is achieved or maxit is achieved. From my experience, keeping tol under control (w.r.t the need) but increasing maxit is a better way to get "accurate" results (w.r.t the need).

Note: with tol = 2^-53, it's likely you always stop at it = maxit as there is no way to reach such a tol! Check out how your baseline use-cases run

szhorvat · 2022-07-14T12:40:23Z

I believe the 2^-53 tolerance did not come from igraph (doesn't seem to be present in the source code). I check for IDO values within igraph now: if it's not a value that's handled by the code, it throws an error. The test suite did not reveal such a case.

fghoussen · 2022-07-14T16:58:03Z

If you don't have ido == 2, make sure you use a regular tolerance (1.e-12), check maxit and the quality of the results (are your vectors eigen up-to the tolerance you pass to arpack?).

arpack-ng provides arpackmm (https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/MATRIX_MARKET - cmake -DICBEXMM=ON ..) and pyarpack(https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/PYARPACK - cmake -DPYTHON3=ON .. based on Boost.python3) that may help you to check/compare results (if you have iparam[6] = 2/3, shifting and/or the kind of solver you use may have big impacts on results - this is easily checked/compared with arpackmm or pyarpack).

Hope this helps

fghoussen · 2022-07-14T20:14:43Z

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV: arpackmm / pyarpack let you play with these parameters too.

ntamas · 2022-07-15T11:57:25Z

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV

I think these are already satisfied in our code in igraph; igraph_arpack_rnsolve() is "just" a generic ARPACK wrapper, but it tries to set nbCV automatically if the caller specifies 0 there (which is always the case in all places where we use the solver from igraph). This is done by this function.

Anyhow, we've made some changes in igraph and now try to handle all cases of ido (even ido == 2); not sure if this will help with the original issue, but I think we should definitely test it before pursuing this further in arpack-ng. @jgmbenoit igraph 0.10 is just around the corner, so I was wondering what your preference would be:

wait until igraph 0.10 comes out and then test whether this bug still appears on mips64el, or
patch igraph 0.9.x according to igraph/igraph@a17105b and see if it solves the problem

jgmbenoit · 2022-07-15T13:06:24Z

@ntamas
It appears that the last igraph versions built well on all architecture (so on mips64el arch) that Debian officially supports (except on the s390x architecture because of a bug of mine in plfit). So I guess you can go ahead with igraph version 0.10 . I may package it this weekend, and I may have a closer look on this particular issue alongside. This bug is quite old, so a lot of things has changed in Debian since then. In particular, the GCC has been upgraded.

jgmbenoit · 2022-07-17T21:34:55Z

I went back to a mips64el arch box to check the current status of the issue. The check-code bugreport-arpack-dseupd_c.c is still relevant as it fails on the mips64el arch box but it works on my amd64 box. Now gcc is gcc version 11.3.0. I ran several times the tests (make -C _BUILD_SHARED check): the tests 174 (igraph_pagerank) and 340 (example::eigenvector_centrality) failed each time on the mips64el arch box but succeeded on my amd64 box. This was for igraph-0.9.9+ds-1 without extra patch.

jgmbenoit · 2022-07-17T22:04:19Z

It appears that patch a17105beb3e cannot be applied to version0.9.9.

I guess that number of changes to getting too big.
So I suggest to come back as soon as version 0.10 is out (and packaged).

jgmbenoit · 2022-07-17T22:15:44Z

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

Exactly. I used this format to be sure that the code plays with exactly the same doubles. We are debugging a weird issue here.

jgmbenoit · 2022-07-17T22:23:03Z

compile with -g -O0 and report a backtrace if possible

I can compile with -g -O0. But I cannot see how I can report a backtrace as the error is of numerical nature.

ntamas · 2022-07-18T10:35:09Z

It appears that patch a17105beb3e cannot be applied to version 0.9.9.

I've cherry-picked the four relevant commits into a separate hotfix/0.9.9-arpack branch, take a diff of that against 0.9.9 and you should have a clean patch.

jgmbenoit · 2022-07-18T12:39:00Z

I could applied these patches. However, the issue persists on the mips64el arch box: tests 174 and 340 failed.

fghoussen · 2022-07-18T17:16:57Z

tests 174 and 340 failed.

What's your tol? Do you always hit maxit?

ntamas · 2022-07-21T09:48:46Z

@jgmbenoit Can you send a link to the build log?

jgmbenoit · 2022-07-21T10:33:49Z

We can find the build log of the Debian packages at buildd.debian.org.
For the patched version, I have to save them. Do you want the build logs of the patched version ?

ntamas · 2022-07-22T11:09:47Z

Yes, I'd like to figure out which ones are tests 174 and 340 in this version so I can check what the tol and maxit values are.

jgmbenoit · 2022-07-22T11:14:56Z

Okay. As mentioned above, these tests are test::igraph_pagerank (174) and example::eigenvector_centrality (340).

fghoussen · 2022-07-23T08:44:00Z

Just striking me!... Is mips64el ILP64-based as the name may suggest?!...
If so you must set INTERFACE64=1 and provide ILP64-compliant BLAS/LAPACK (like MKL)

arpack-ng/.github/workflows/jobs.yml

Line 156 in d280435

    
                     ./configure --enable-mpi --enable-icb --with-blas=mkl_gf_ilp64 --with-lapack=mkl_gf_ilp64

fghoussen · 2022-10-23T10:50:59Z

Does using a rand initialized v0 helps? JuliaLinearAlgebra/Arpack.jl#138 (comment)

jgmbenoit mentioned this issue Feb 3, 2021

Test suite error on mips64el architecture igraph/igraph#1469

Open

2 tasks

szhorvat added a commit to igraph/igraph that referenced this issue Jul 9, 2022

fix: HOWMNY argument of dseupd()/dneupd() must be single-character

f618cba

as pointed out in opencollab/arpack-ng#294 (comment)

dseupd_ randomly fails on mips64el arch #294

dseupd_ randomly fails on mips64el arch #294

Comments

jgmbenoit commented Feb 3, 2021

Expected behavior

Actual behavior

Where/how to reproduce the problem

Steps to reproduce the problem

Error message

Traces

Callstack

Notes, remarks

sylvestre commented Feb 3, 2021

jgmbenoit commented Feb 3, 2021

sylvestre commented Feb 3, 2021

fghoussen commented Feb 4, 2021

jgmbenoit commented Feb 4, 2021

fghoussen commented Feb 5, 2021

jgmbenoit commented Feb 8, 2021 • edited Loading

fghoussen commented Jul 9, 2022

szhorvat commented Jul 9, 2022

fghoussen commented Jul 9, 2022

fghoussen commented Jul 9, 2022

jgmbenoit commented Jul 10, 2022

fghoussen commented Jul 10, 2022 • edited Loading

szhorvat commented Jul 10, 2022

fghoussen commented Jul 10, 2022

szhorvat commented Jul 10, 2022

ntamas commented Jul 11, 2022

fghoussen commented Jul 14, 2022

jgmbenoit commented Jul 14, 2022

szhorvat commented Jul 14, 2022

fghoussen commented Jul 14, 2022 • edited Loading

fghoussen commented Jul 14, 2022 • edited Loading

szhorvat commented Jul 14, 2022

fghoussen commented Jul 14, 2022 • edited Loading

fghoussen commented Jul 14, 2022

ntamas commented Jul 15, 2022

jgmbenoit commented Jul 15, 2022

jgmbenoit commented Jul 17, 2022

jgmbenoit commented Jul 17, 2022 • edited Loading

jgmbenoit commented Jul 17, 2022 • edited Loading

jgmbenoit commented Jul 17, 2022 • edited Loading

ntamas commented Jul 18, 2022

jgmbenoit commented Jul 18, 2022

fghoussen commented Jul 18, 2022

ntamas commented Jul 21, 2022

jgmbenoit commented Jul 21, 2022 • edited Loading

ntamas commented Jul 22, 2022

jgmbenoit commented Jul 22, 2022

fghoussen commented Jul 23, 2022

fghoussen commented Oct 23, 2022

jgmbenoit commented Feb 8, 2021 •

edited

Loading

fghoussen commented Jul 10, 2022 •

edited

Loading

fghoussen commented Jul 14, 2022 •

edited

Loading

fghoussen commented Jul 14, 2022 •

edited

Loading

fghoussen commented Jul 14, 2022 •

edited

Loading

jgmbenoit commented Jul 17, 2022 •

edited

Loading

jgmbenoit commented Jul 17, 2022 •

edited

Loading

jgmbenoit commented Jul 17, 2022 •

edited

Loading

jgmbenoit commented Jul 21, 2022 •

edited

Loading