ValueError: Input X contains NaN #17

Closed
3 tasks done
hdante opened this issue Jun 10, 2024 · 5 comments
Labels
bug Something isn't working

hdante commented Jun 10, 2024

Hello, when estimating with GPZ, if some magnitude error columns contain NaN, estimation stops with a ValueError exception.

Note: tried with nondetect_val=np.nan and without setting nondetect_val.

(base) [henrique.almeida@loginapl01 henrique.almeida]$ rail-estimate -a gpz t0-input/hdf5/ref/objectTable_tract_4852_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_27_20220306T050001Z-part1.hdf5 debug1.hdf5
Estimator algorithm: gpz
Configuration file: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Starting setup.
Loading all program modules...
Configuring estimator...
Loading input file...
Setup done.
Starting estimate.
Inserting handle into data store.  model: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl, estimate
Process 0 running estimator on chunk 0 - 10000
Process 0 estimating GPz PZ PDF for rows 0 - 10,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Inserting handle into data store.  output_estimate: inprogress_debug1.hdf5, estimate
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating GPz PZ PDF for rows 10,000 - 20,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Process 0 running estimator on chunk 20000 - 30000
Process 0 estimating GPz PZ PDF for rows 20,000 - 30,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Process 0 running estimator on chunk 30000 - 40000
Process 0 estimating GPz PZ PDF for rows 30,000 - 40,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Process 0 running estimator on chunk 40000 - 50000
Process 0 estimating GPz PZ PDF for rows 40,000 - 50,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Process 0 running estimator on chunk 50000 - 60000
Process 0 estimating GPz PZ PDF for rows 50,000 - 60,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Traceback (most recent call last):
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-estimate", line 269, in <module>
    main()
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-estimate", line 265, in main
    estimate(cfg, ctx)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/bin/rail-estimate", line 257, in estimate
    ctx.estimator.estimate(ctx.input)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/estimator.py", line 97, in estimate
    self.run()
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/estimator.py", line 110, in run
    self._process_chunk(s, e, test_data, first)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py", line 145, in _process_chunk
    mu, totalV, modelV, noiseV, _ = self.model.predict(test_array)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/_gpz_util.py", line 234, in predict
    Xt = self.pca.transform(X)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/decomposition/_base.py", line 145, in transform
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1049, in check_array
    _assert_all_finite(
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
(base) [henrique.almeida@loginapl01 henrique.almeida]$ 

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
@hdante hdante added the bug Something isn't working label Jun 10, 2024
@sschmidt23
Collaborator

Are any of the magnitude errors negative or zero? The first message in the output above is a warning rather than the final error:

RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])

so the problem arises when the code takes the log of the magnitude errors. There doesn't appear to be a check for unphysical magnitude errors in the code, so bad values could cause the error. We can add a check and perhaps set a minimum magnitude error value to eliminate the errors.
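
As a minimal sketch of the behaviour described above, the snippet below applies `np.log` to a made-up array of magnitude errors (not the DC2 data from this issue). It only illustrates what NumPy returns for negative, zero, and NaN inputs; how gpz.py handles those values afterwards is not shown in this thread.

```python
import numpy as np

# Hypothetical magnitude errors: valid, negative, zero, and NaN.
magerr = np.array([0.05, -0.02, 0.0, np.nan])

# Silence the warnings so that only the returned values are inspected here.
# Without this, a negative input triggers
# "RuntimeWarning: invalid value encountered in log".
with np.errstate(invalid="ignore", divide="ignore"):
    log_err = np.log(magerr)

# Negative -> nan, zero -> -inf, NaN -> nan (NaN passes through silently).
bad = ~np.isfinite(log_err)
print(log_err)
print(bad)
```

Note that a NaN input produces no warning at all, which may be why the NaN-only run later in this thread fails without the `RuntimeWarning` lines.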

In general, though, it's good to make sure the input data is cleaned of such unphysical values before running any photo-z code.

@hdante
Author

hdante commented Jun 21, 2024

Sam, yes, there are negative magnitude errors. I've split the original test file in two: one containing the rows with negative magnitude errors and another containing the rows with NaN magnitude errors. The output of estimating each file follows. Do you plan to add these checks to the GPZ code? Thanks,

Negative magnitude errors:

(base) [henrique.almeida@loginapl01 henrique.almeida]$ rail-estimate -a gpz nonpositive.hdf5 debug1.hdf5
Estimator algorithm: gpz
Configuration file: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Starting setup.
Loading all program modules...
Configuring estimator...
Loading input file...
Setup done.
Starting estimate.
Inserting handle into data store.  model: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl, estimate
Process 0 running estimator on chunk 0 - 10000
Process 0 estimating GPz PZ PDF for rows 0 - 10,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Inserting handle into data store.  output_estimate: inprogress_debug1.hdf5, estimate
Process 0 running estimator on chunk 10000 - 20000
Process 0 estimating GPz PZ PDF for rows 10,000 - 20,000
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Process 0 running estimator on chunk 20000 - 27656
Process 0 estimating GPz PZ PDF for rows 20,000 - 27,656
/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py:29: RuntimeWarning: invalid value encountered in log
  data[:, numbands + i] = np.log(data_dict[eband])
Estimate done.

NaN magnitude errors:

(base) [henrique.almeida@loginapl01 henrique.almeida]$ rail-estimate -a gpz nan.hdf5 debug1.hdf5
Estimator algorithm: gpz
Configuration file: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Starting setup.
Loading all program modules...
Configuring estimator...
Loading input file...
Setup done.
Starting estimate.
Inserting handle into data store.  model: /lustre/t1/cl/lsst/tmp/henrique.almeida/slurm-home/share/rail_scripts/estimator_gpz.pkl, estimate
Process 0 running estimator on chunk 0 - 1
Process 0 estimating GPz PZ PDF for rows 0 - 1
Traceback (most recent call last):
  File "/lustre/t0/scratch/users/app.photoz/slurm-home/bin/rail-estimate", line 269, in <module>
    main()
  File "/lustre/t0/scratch/users/app.photoz/slurm-home/bin/rail-estimate", line 265, in main
    estimate(cfg, ctx)
  File "/lustre/t0/scratch/users/app.photoz/slurm-home/bin/rail-estimate", line 257, in estimate
    ctx.estimator.estimate(ctx.input)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/estimator.py", line 97, in estimate
    self.run()
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/estimator.py", line 110, in run
    self._process_chunk(s, e, test_data, first)
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/gpz.py", line 145, in _process_chunk
    mu, totalV, modelV, noiseV, _ = self.model.predict(test_array)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/rail/estimation/algos/_gpz_util.py", line 234, in predict
    Xt = self.pca.transform(X)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/decomposition/_base.py", line 145, in transform
    X = self._validate_data(
        ^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/base.py", line 633, in _validate_data
    out = check_array(X, input_name="X", **check_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1049, in check_array
    _assert_all_finite(
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
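
For anyone wanting to reproduce a split like the one above, here is a rough sketch of how rows could be classified with boolean masks. The helper name `classify_rows` and the toy table are hypothetical. The column names follow the `magerr_{band}` template shown in the logs, and reading or writing the HDF5 files is left out.

```python
import numpy as np

def classify_rows(data_dict, err_cols):
    """Return two boolean row masks: (has_nan, has_nonpositive)."""
    errs = np.column_stack(
        [np.asarray(data_dict[col], dtype=float) for col in err_cols]
    )
    has_nan = np.isnan(errs).any(axis=1)
    # Comparisons against NaN evaluate to False, so NaN entries are never
    # counted as non-positive here.
    with np.errstate(invalid="ignore"):
        has_nonpositive = (errs <= 0).any(axis=1)
    return has_nan, has_nonpositive

# Toy table with made-up values, one row per object.
table = {
    "magerr_g": np.array([0.03, -0.01, np.nan, 0.02]),
    "magerr_r": np.array([0.04, 0.05, 0.06, 0.0]),
}
has_nan, has_nonpositive = classify_rows(table, ["magerr_g", "magerr_r"])
print(has_nan)          # rows that would go into a "nan" file
print(has_nonpositive)  # rows that would go into a "nonpositive" file
```

A row can in principle appear in both masks if different bands carry different kinds of bad value, so the two sets are not guaranteed to be disjoint.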

@sschmidt23
Collaborator

I can put in a check and replacement value for the negative and np.nan magnitude uncertainties, along with a new config param that sets the value to replace them with (as an ad hoc default I'll use something like 0.1, a large but not outrageously large magnitude error). This may or may not eventually be replaced if we decide to do more standardized input sanitization in RAIL, since I'm guessing issues like this will pop up more often once we start working with data that has more varied input values. I'll see if I can get a PR in fairly quickly with the fix.

@sschmidt23
Collaborator

@hdante I think the pull request that I just merged should fix this issue. Note that there is now a new configuration parameter, replace_error_vals, which is a list of the same length as the number of bands and err_bands. It gives the values that replace magnitude errors that are less than 0 or NaN (they default to 0.1 for all six of the LSST bands). I just created a new tagged version of the code, v1.0.1; if you pip install that, hopefully this error will be resolved.
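
The replacement rule described in this comment might look roughly like the sketch below. This is not the code from the merged pull request: the function `replace_bad_errors` and the toy table are hypothetical, and only the parameter name `replace_error_vals`, the 0.1 default, and the "less than 0 or NaN" condition come from the comment itself.

```python
import numpy as np

def replace_bad_errors(data_dict, err_bands, replace_error_vals):
    """Return a copy of data_dict with negative or NaN errors replaced."""
    if len(err_bands) != len(replace_error_vals):
        raise ValueError("replace_error_vals must have one entry per error band")
    cleaned = dict(data_dict)
    for band, fill in zip(err_bands, replace_error_vals):
        err = np.asarray(data_dict[band], dtype=float)
        with np.errstate(invalid="ignore"):
            # Condition as stated in the comment: less than 0, or NaN.
            # Exact zeros are left alone in this sketch.
            bad = np.isnan(err) | (err < 0)
        cleaned[band] = np.where(bad, fill, err)
    return cleaned

# Six LSST bands, each with one valid, one negative, and one NaN error.
err_bands = [f"magerr_{b}" for b in "ugrizy"]
table = {band: np.array([0.05, -1.0, np.nan]) for band in err_bands}

cleaned = replace_bad_errors(table, err_bands, [0.1] * len(err_bands))
print(cleaned["magerr_u"])
print(np.log(cleaned["magerr_u"]))  # finite for every row after replacement
```

Returning a new dictionary rather than editing the arrays in place keeps the original catalogue values available for later inspection.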

@hdante
Author

hdante commented Jun 26, 2024

Nice, thank you Sam, I'll test it.
