Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dseupd_ randomly fails on mips64el arch #294

Open
jgmbenoit opened this issue Feb 3, 2021 · 40 comments
Open

dseupd_ randomly fails on mips64el arch #294

jgmbenoit opened this issue Feb 3, 2021 · 40 comments

Comments

@jgmbenoit
Copy link

Expected behavior

dseupd_ always succeeds on any arch

Actual behavior

dseupd_ randomly fails on mips64el arch

Where/how to reproduce the problem

  • arpack-ng: release 3.8.0
  • OS: debian Sid (20210202)
  • compiler: gcc
  • environment:
  • configure:
  • input data:

Steps to reproduce the problem

  1. compile bugreport-arpack-dseupd.c against the arpack library as distributed in Debain Sid on a mips64el computer.
  2. enter the command-line while (./bugreport-arpack-dseupd) ; do : ; done | nl and wait until it stops.

Error message

Traces

Callstack

Notes, remarks

The bug is reported in Debian #981646

@sylvestre
Copy link
Contributor

Patches welcome ! I am not planing to spend time on this, sorry!

@jgmbenoit
Copy link
Author

Do you have any hint ?

@sylvestre
Copy link
Contributor

Nope, sorry :(

@fghoussen
Copy link
Collaborator

Can't afford spending time on that neither ?!...
Did you try ISO_C_BINDING : cmake -DICB=ON and replace old fashioned dseupd_ (I've seen some in your bug report) by type compliant (insured by compiler) dseupd_c

@jgmbenoit
Copy link
Author

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

@fghoussen
Copy link
Collaborator

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

Yes. With ICB enabled, make install will install arpack.h. Include arpack.h, and, replace any of the [sdcz][sn][ae]upd_ by the associated [sdcz][sn][ae]upd_c. dseupd_c is just a "C/F90 gateway" which calls the original dseupd_ behind the scene but with correct types insured by the compiler according to Fortran 2003 standard (an example described here : https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_f/main_for/GUID-C0FA4080-4548-4045-9E6D-EFE57A0DE1A3.htm) so the compiler version you use must support F2003 standard (or configure should fail).

@jgmbenoit
Copy link
Author

jgmbenoit commented Feb 8, 2021

Thanks for the precisions. I transformed my test sample accordingly: bugreport-arpack-dseupd_c.c .
Unfortunately the error persists.

@fghoussen
Copy link
Collaborator

@jgmbenoit: howmny seems to be character*1

c HOWMNY Character*1 (INPUT)

Did you try that?

- char *all="All";
+ char *all="A";

@szhorvat
Copy link
Contributor

szhorvat commented Jul 9, 2022

It shouldn't make a different if the buffer has extra characters though. dseupd can choose to read just a single one.

I do wonder about the appropriate calling convention for this function. Doesn't gfortran require passing the string lengths as additional parameters at the end?

@fghoussen
Copy link
Collaborator

Doesn't gfortran require passing the string lengths as additional parameters at the end?

Don't think so

@fghoussen
Copy link
Collaborator

Could you compile bugreport-arpack-dseupd_c.c in debug mode on mips64el and report here a backtrace? May be helpful to get some clue

@jgmbenoit
Copy link
Author

I am not so sure what I may read by ``in debug''. Can you specify the sequence of command to follow ?

@fghoussen
Copy link
Collaborator

fghoussen commented Jul 10, 2022

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53. If so, why do you need such a strict tolerance? It is OK with just double tol=1.e-10;. Arpack is an iterative search/method so using very-very strict tolerance has no meaning (at least as I understand it: there is no way you get an exact result).

Are you sure you never hit ido == 2? (https://github.com/igraph/igraph/blob/89df4735b20737fdfcaa7457a594713ae9ac3e04/src/linalg/arpack.c#L989). In this case:

not so sure what I may read by ``in debug'

compile with -g -O0 and report a backtrace if possible

@szhorvat
Copy link
Contributor

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53.

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

@fghoussen
Copy link
Collaborator

  1. Try to use a regular tolerance (1.e-10 or 1.e-12 - below this has no meaning and may even trigger numerical instabilities).
  2. Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).
  3. Trap unexpected ido to make sure you don't screw the RCI pattern
    else if (ido != 99) {cerr << "Error: unexpected ido " << ido << " - KO" << endl; return 1;}
    As far as I remember, if you don't use implicit shifts, you never hit ido == 3 but otherwise you must handle this case too
    c IPARAM(1) = ISHIFT: method for selecting the implicit shifts.
    To me, the most simple solution is to always use exact shifts
    iparamAupd[0] = 1; // Use exact shifts (=> we'll never have ido == 3).
  4. Break the **aupd-while loop only if ido == 99

I had a hard time with arpack too... Hope this helps

@szhorvat
Copy link
Contributor

Including @ntamas here regarding the gfortran calling convention.

@ntamas
Copy link

ntamas commented Jul 11, 2022

@szhorvat Re the calling convention: I think it depends on whether you are using gfortran or not, and whether you are using sibling-call optimizations. More details in the R blog, start reading from the "Strings in Fortran and LAPACK" section.

@fghoussen
Copy link
Collaborator

To me, looks more like an arpack miuse that an arpack bug.
@jgmbenoit: if so, when OK at your side, don't forget to close this issue at our side. Thanks!

@jgmbenoit
Copy link
Author

I will a have a closer look this week-end.

@szhorvat
Copy link
Contributor

2. Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).

bmat was set to 'I' for standard problems and there is no B matrix. Doesn't that mean that there will never be an ido == 2 value? That's my understanding from reading the docs here.

@fghoussen
Copy link
Collaborator

fghoussen commented Jul 14, 2022

Doesn't that mean that there will never be an ido == 2 value?

Sounds likely, but, not sure: doc is not clear to me (when bmat = 'I' you have actually B = Identity so that ido == 2 is not a no-op: you replace Y with X before next iteration which re-use the modified Y... Right?). From here, your biggest problem looks like to be the tolerance: below tol = 1.e-12, tol has no meaning (arpack is all about float and double which max tol is 1.e-12: this means you can hope at best to have 12 significant digits [in debug mode, less in release mode], so that if you ask for example tol = 1.e-15 the 13th, 14th, 15th will anyway have no meaning... But will take more time to compute and may increase risks of numerical instabilities). So if you ask tol = 2^-53 ?!... You are risking even more!

May be wrong but that's my impression (had a hard time with arpack long time ago!)

@fghoussen
Copy link
Collaborator

fghoussen commented Jul 14, 2022

@szhorvat: BTW, you may want/need to make some statistics with your baseline data/use-cases.

Say you ask for tol = 1.e-12 and maxit = 100, but that, you observe on your baseline data/use-cases that you get poor accuracy (say "bad" eigen vectors/values only 1.e-6 accurate): here, you may want/need to overkill maxit (maxit = 200 or maxit = 1000). tol is not the only stopping criterion for arpack: arpack will stop if tol is achieved or maxit is achieved. From my experience, keeping tol under control (w.r.t the need) but increasing maxit is a better way to get "accurate" results (w.r.t the need).

Note: with tol = 2^-53, it's likely you always stop at it = maxit as there is no way to reach such a tol! Check out how your baseline use-cases run

@szhorvat
Copy link
Contributor

I believe the 2^-53 tolerance did not come from igraph (doesn't seem to be present in the source code). I check for IDO values within igraph now: if it's not a value that's handled by the code, it throws an error. The test suite did not reveal such a case.

@fghoussen
Copy link
Collaborator

fghoussen commented Jul 14, 2022

If you don't have ido == 2, make sure you use a regular tolerance (1.e-12), check maxit and the quality of the results (are your vectors eigen up-to the tolerance you pass to arpack?).

arpack-ng provides arpackmm (https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/MATRIX_MARKET - cmake -DICBEXMM=ON ..) and pyarpack(https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/PYARPACK - cmake -DPYTHON3=ON .. based on Boost.python3) that may help you to check/compare results (if you have iparam[6] = 2/3, shifting and/or the kind of solver you use may have big impacts on results - this is easily checked/compared with arpackmm or pyarpack).

Hope this helps

@fghoussen
Copy link
Collaborator

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV: arpackmm / pyarpack let you play with these parameters too.

@ntamas
Copy link

ntamas commented Jul 15, 2022

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV

I think these are already satisfied in our code in igraph; igraph_arpack_rnsolve() is "just" a generic ARPACK wrapper, but it tries to set nbCV automatically if the caller specifies 0 there (which is always the case in all places where we use the solver from igraph). This is done by this function.

Anyhow, we've made some changes in igraph and now try to handle all cases of ido (even ido == 2); not sure if this will help with the original issue, but I think we should definitely test it before pursuing this further in arpack-ng. @jgmbenoit igraph 0.10 is just around the corner, so I was wondering what your preference would be:

  • wait until igraph 0.10 comes out and then test whether this bug still appears on mips64el, or
  • patch igraph 0.9.x according to igraph/igraph@a17105b and see if it solves the problem

@jgmbenoit
Copy link
Author

@ntamas
It appears that the last igraph versions built well on all architecture (so on mips64el arch) that Debian officially supports (except on the s390x architecture because of a bug of mine in plfit). So I guess you can go ahead with igraph version 0.10 . I may package it this weekend, and I may have a closer look on this particular issue alongside. This bug is quite old, so a lot of things has changed in Debian since then. In particular, the GCC has been upgraded.

@jgmbenoit
Copy link
Author

I went back to a mips64el arch box to check the current status of the issue. The check-code bugreport-arpack-dseupd_c.c is still relevant as it fails on the mips64el arch box but it works on my amd64 box. Now gcc is gcc version 11.3.0. I ran several times the tests (make -C _BUILD_SHARED check): the tests 174 (igraph_pagerank) and 340 (example::eigenvector_centrality) failed each time on the mips64el arch box but succeeded on my amd64 box. This was for igraph-0.9.9+ds-1 without extra patch.

@jgmbenoit
Copy link
Author

jgmbenoit commented Jul 17, 2022

It appears that patch a17105beb3e cannot be applied to version0.9.9.

I guess that number of changes to getting too big.
So I suggest to come back as soon as version 0.10 is out (and packaged).

@jgmbenoit
Copy link
Author

jgmbenoit commented Jul 17, 2022

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

Exactly. I used this format to be sure that the code plays with exactly the same doubles. We are debugging a weird issue here.

@jgmbenoit
Copy link
Author

jgmbenoit commented Jul 17, 2022

compile with -g -O0 and report a backtrace if possible

I can compile with -g -O0. But I cannot see how I can report a backtrace as the error is of numerical nature.

@ntamas
Copy link

ntamas commented Jul 18, 2022

It appears that patch a17105beb3e cannot be applied to version 0.9.9.

I've cherry-picked the four relevant commits into a separate hotfix/0.9.9-arpack branch, take a diff of that against 0.9.9 and you should have a clean patch.

@jgmbenoit
Copy link
Author

I could applied these patches. However, the issue persists on the mips64el arch box: tests 174 and 340 failed.

@fghoussen
Copy link
Collaborator

tests 174 and 340 failed.

What's your tol? Do you always hit maxit?

@ntamas
Copy link

ntamas commented Jul 21, 2022

@jgmbenoit Can you send a link to the build log?

@jgmbenoit
Copy link
Author

jgmbenoit commented Jul 21, 2022

We can find the build log of the Debian packages at buildd.debian.org.
For the patched version, I have to save them. Do you want the build logs of the patched version ?

@ntamas
Copy link

ntamas commented Jul 22, 2022

Yes, I'd like to figure out which ones are tests 174 and 340 in this version so I can check what the tol and maxit values are.

@jgmbenoit
Copy link
Author

Okay. As mentioned above, these tests are test::igraph_pagerank (174) and example::eigenvector_centrality (340).

@fghoussen
Copy link
Collaborator

Just striking me!... Is mips64el ILP64-based as the name may suggest?!...
If so you must set INTERFACE64=1 and provide ILP64-compliant BLAS/LAPACK (like MKL)

./configure --enable-mpi --enable-icb --with-blas=mkl_gf_ilp64 --with-lapack=mkl_gf_ilp64

@fghoussen
Copy link
Collaborator

Does using a rand initialized v0 helps? JuliaLinearAlgebra/Arpack.jl#138 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants