Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding statistics to Raster #638

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

vschaffn
Copy link

@vschaffn vschaffn commented Dec 3, 2024

Resolves #602 GlacioHack/xdem#660

Description:

This pull request introduces enhancements to the statistics calculation methods in the project. The goal is to provide users with more flexible and comprehensive statistics functionality for raster data. The get_stats method has been created to:

  • Return a list of statistics in a dict containing mean, median, max, mean, sum, sum of squares, 90th percentile, nmad, rmse, std.
  • Support single statistic: User can request one statistic.
  • Support multiple statistics: Users can request multiple statistics in one call.
  • Support custom callables: Users can pass any callable function to calculate custom statistics.

A list of aliases has been drawn up in order to be as flexible as possible when users request statistics.

Changes:

  • NMAD in geoutils.stats
    note : nmad can't be called directly from a raster because of circular imports, you have to go through an array like.
  • _statistics method: Calculates common statistics (mean, median, max, min, sum, RMSE, etc.) for a specified band of raster data.
  • get_stats method: Retrieves requested statistics by their name (either from a predefined list or user-defined), with support for passing a callable function.
  • _get_single_stat method: Handles fetching a single statistic by matching it with its alias or function name. Prints a warning and return NaN if the statistic required is not in the list
  • Support for callable statistics: Users can now provide any custom function that reduces the raster data (such as custom percentiles).

Usage Examples:

  1. Get all statistics:

    stats = raster.get_stats()
  2. Get a single statistic (e.g., 'mean'):

    mean_value = raster.get_stats("mean")
  3. Get multiple statistics:

    selected_stats = raster.get_stats(["mean", "max", "rmse"])
  4. Using a custom callable statistic:

    def custom_stat(data):
        return np.nansum(data > 100)  # Count the number of pixels above 100
    custom_value = raster.get_stats(custom_stat)

Tests

The following tests ensure that the new functionality in the get_stats method works as expected:

  1. Full Statistics:

    • The test retrieves all available statistics (mean, median, max, min, sum, etc.) for a raster and ensures each statistic is present and not None.
  2. Single Statistic:

    • A test is included to fetch a single statistic, "Average" (which is an alias for "Mean"), and verifies that the returned value is a float.
  3. Selected Statistics:

    • The test checks the retrieval of a selected set of statistics: mean, maximum, and std. Additionally, it includes the use of a custom statistic function (percentile_95), which computes the 95th percentile of the data.
    • This validates that both built-in and user-defined statistics can be fetched successfully.
  4. Unrecognized Statistics:

    • Ensure that get_stats return NaN and a warning is raised if the statistic requested by the users is not in the list of aliases
  5. NMAD tests in test_stats:

    • Ensure the good behavior of the scale factor
    • Check if the returned nmad is the same than scipy.stats.median_abs_deviation

@vschaffn vschaffn force-pushed the 660-add_missing_stats branch from 31c857d to 2311ee6 Compare December 4, 2024 10:35
geoutils/raster/raster.py Outdated Show resolved Hide resolved
geoutils/raster/raster.py Outdated Show resolved Hide resolved
actual_name = stats_aliases[normalized_name]
return stats_dict[actual_name]
else:
logging.warning("Statistic name '%s' is not recognized", stat_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or raising an error here? Or is the reasoning for warning instead to accept potential typos in the statistic list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not raised an error here in order to obtain results if stats_name is a list and only one stat is not recognized. But it could be passed as a raise if needed

geoutils/raster/raster.py Outdated Show resolved Hide resolved
@rhugonnet
Copy link
Member

Amazing! 😄
I only have small comments on structure and documenting the alias names above.
Also had to approve the test run (as this is your first PR on this repo, I think), all passing!

Copy link

@adebardo adebardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you add a note to the documentation?
  • Can you add a test with an inconsistent stats name and check the log returned?
  • Did you manage to get it running on xdem?

"sum": "Sum",
"sumofsquares": "Sum of squares",
"sum2": "Sum of squares",
"percentile": "90th percentile",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind to consider the user, but it does make the documentation a bit more complex. We'll see if we leave that much flexibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[POC] add metrics from demcompare
3 participants