Functions for reading metrics remotely from Azure #16

Merged 6 commits on Apr 26, 2024

@stuartlynn commented on Apr 26, 2024

This PR adds a function that takes a list of MetricRequests and efficiently fetches the data from cloud storage using HTTP range requests.

let metrics = [
    MetricRequest {
        file: "https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(),
        column: "B17021_E006".into(),
    },
];
let df = get_metrics(&metrics, None);

This produces the following result:

test metadata::tests::test_loading_metadata_from_url ... ok
Ok(
    shape: (74_001, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 0           ┆ 1400000US01001020200 │
    │ 11          ┆ 1400000US01001020300 │
    │ 10          ┆ 1400000US01001020400 │
    │ 0           ┆ 1400000US01001020500 │
    │ …           ┆ …                    │
    │ 12          ┆ 1400000US72153750501 │
    │ 0           ┆ 1400000US72153750502 │
    │ 19          ┆ 1400000US72153750503 │
    │ 0           ┆ 1400000US72153750601 │
    │ 0           ┆ 1400000US72153750602 │
    └─────────────┴──────────────────────┘,
)
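
As background on why range requests help here: a Parquet file keeps its metadata in a footer at the end of the file, followed by a 4-byte little-endian footer length and the magic bytes PAR1. A reader can fetch the last 8 bytes, learn the footer length, and then request only the footer and the column chunks it needs. The sketch below only computes the byte ranges and makes no network calls; the names footer_range and range_header are illustrative and are not part of this PR.

```rust
/// Parquet files end with: [footer metadata][4-byte LE footer length]["PAR1"].
const MAGIC: &[u8; 4] = b"PAR1";

/// Given the total file size and the final 8 bytes of the file, return the
/// byte range (start, end_exclusive) that holds the footer metadata.
fn footer_range(file_size: u64, tail: [u8; 8]) -> Result<(u64, u64), String> {
    if tail[4..] != MAGIC[..] {
        return Err("missing PAR1 magic; not a Parquet file".to_string());
    }
    let footer_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    // The footer sits immediately before the 8 trailing bytes.
    let end = file_size
        .checked_sub(8)
        .ok_or_else(|| "file too small to be Parquet".to_string())?;
    if footer_len == 0 || footer_len > end {
        return Err("footer length is empty or larger than the file".to_string());
    }
    Ok((end - footer_len, end))
}

/// Format a half-open byte range as an HTTP Range header value,
/// which uses an inclusive end offset.
fn range_header(start: u64, end_exclusive: u64) -> String {
    format!("bytes={}-{}", start, end_exclusive - 1)
}

fn main() {
    // Pretend these are the last 8 bytes of a 1,000,000-byte file
    // whose footer is 1,000 bytes long (made-up numbers).
    let mut tail = [0u8; 8];
    tail[..4].copy_from_slice(&1000u32.to_le_bytes());
    tail[4..].copy_from_slice(MAGIC);

    match footer_range(1_000_000, tail) {
        Ok((start, end)) => println!("Range: {}", range_header(start, end)),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

With the footer in hand, a reader knows the offsets of each column chunk, so fetching two columns out of thousands only needs a handful of further range requests.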

Results can also be filtered by GEO_ID, like so:

let metrics = [
    MetricRequest {
        file: "https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet".into(),
        column: "B17021_E006".into(),
    },
];
let df = get_metrics(
    &metrics,
    Some(&["1400000US01001020100".into(), "1400000US01001020300".into()]),
);

which gives the result

Ok(
    shape: (2, 2)
    ┌─────────────┬──────────────────────┐
    │ B17021_E006 ┆ GEO_ID               │
    │ ---         ┆ ---                  │
    │ i64         ┆ str                  │
    ╞═════════════╪══════════════════════╡
    │ 0           ┆ 1400000US01001020100 │
    │ 11          ┆ 1400000US01001020300 │
    └─────────────┴──────────────────────┘,
)
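
To make the filtering semantics concrete, here is a minimal standard-library sketch of what the optional GEO_ID argument does to the rows. It is not how the PR implements it (the query plans in this thread show a polars is_in filter); the Row type and filter_by_geo_ids function are hypothetical stand-ins, and the sample rows are the first three from the output above.

```rust
use std::collections::HashSet;

/// One row of the two-column result shown above (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    value: i64,
    geo_id: String,
}

/// Keep only rows whose GEO_ID is in `wanted`; `None` means no filtering.
fn filter_by_geo_ids(rows: &[Row], wanted: Option<&[&str]>) -> Vec<Row> {
    match wanted {
        None => rows.to_vec(),
        Some(ids) => {
            // Build the lookup set once so each row check is O(1) on average.
            let set: HashSet<&str> = ids.iter().copied().collect();
            rows.iter()
                .filter(|r| set.contains(r.geo_id.as_str()))
                .cloned()
                .collect()
        }
    }
}

/// The first three rows of the unfiltered output above.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { value: 0, geo_id: "1400000US01001020100".to_string() },
        Row { value: 0, geo_id: "1400000US01001020200".to_string() },
        Row { value: 11, geo_id: "1400000US01001020300".to_string() },
    ]
}

fn main() {
    let rows = sample_rows();
    let ids: &[&str] = &["1400000US01001020100", "1400000US01001020300"];
    for row in filter_by_geo_ids(&rows, Some(ids)) {
        println!("{} {}", row.value, row.geo_id);
    }
}
```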

TODO

  • There is one place in the code where we use an unwrap. This could in principle lead to a panic, but I am about 99% sure the preceding code makes that impossible. It's probably still worth removing the unwrap and handling the result explicitly.
  • It looks like the geo-filtering version of the code is slower than the non-filtering version. This is counterintuitive, so I want to benchmark it properly to confirm it and figure out why. Opened a ticket to follow up: "Explore issues with performance when using geo filtering with metrics" (#17).
  • Some code ergonomics could probably be improved. There are lots of .into() calls for Strings, which I can probably make go away.
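
Two of the TODO items above can be sketched briefly. The MetricRequest below is a hypothetical stand-in with the same two fields used in this PR's examples; the new constructor, the MetricError type, and the first_file function are assumptions made for illustration, since the thread does not show where the real unwrap lives.

```rust
use std::fmt;

/// Hypothetical stand-in for the PR's MetricRequest (same two fields).
#[derive(Debug, Clone, PartialEq)]
struct MetricRequest {
    file: String,
    column: String,
}

impl MetricRequest {
    /// Accept anything convertible to String, so call sites need no `.into()`.
    fn new(file: impl Into<String>, column: impl Into<String>) -> Self {
        Self { file: file.into(), column: column.into() }
    }
}

/// Illustrative error type; the real code may already have its own.
#[derive(Debug, PartialEq)]
enum MetricError {
    NoRequests,
}

impl fmt::Display for MetricError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MetricError::NoRequests => write!(f, "no metric requests were supplied"),
        }
    }
}

impl std::error::Error for MetricError {}

/// Instead of `requests.first().unwrap()`, surface the empty case as an error
/// that the caller can handle or propagate with `?`.
fn first_file(requests: &[MetricRequest]) -> Result<&str, MetricError> {
    requests
        .first()
        .map(|r| r.file.as_str())
        .ok_or(MetricError::NoRequests)
}

fn main() {
    let requests = [MetricRequest::new("tracts.parquet", "B17021_E006")];
    println!("column: {}", requests[0].column);
    match first_file(&requests) {
        Ok(file) => println!("first file: {file}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

Even when a panic looks unreachable, returning a Result keeps that guarantee from silently depending on code elsewhere staying unchanged.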


stuartlynn commented Apr 26, 2024

Some benchmarks

Without geo filtering

Query plan

 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      3.164 s ±  0.284 s    [User: 0.407 s, System: 0.159 s]
  Range (min … max):    2.684 s …  3.447 s    10 runs

With geo filtering

Query plan

FILTER col("GEO_ID").is_in([Series[geo_ids]]) FROM
 SELECT [col("B17021_E006"), col("GEO_ID")] FROM

    Parquet SCAN https://popgetter.blob.core.windows.net/popgetter-cli-test/tracts_2019_fiveYear.parquet
    PROJECT */25318 COLUMNS

Benchmark 1: ./target/release/popgetter_cli
  Time (mean ± σ):      7.296 s ±  0.312 s    [User: 4.364 s, System: 0.182 s]
  Range (min … max):    6.866 s …  8.064 s    10 runs

This is a bit weird. I wonder whether the issue is the large header for this file (which has about 7000 columns). It may be worth revisiting this once the data is split into multiple smaller parquet files.
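
A quick calculation on the timings above may help narrow things down. The sketch below only divides and subtracts the reported means; the slowdown helper is an illustrative name. It gives a wall-clock slowdown of roughly 2.3x and a user CPU increase of roughly 10.7x, and the extra user CPU time (about 3.96 s) accounts for most of the extra wall-clock time (about 4.13 s). That would be consistent with the additional cost being CPU-bound rather than network-bound, though these numbers alone do not prove it.

```rust
/// How many times larger `other` is than `base`.
fn slowdown(base: f64, other: f64) -> f64 {
    other / base
}

fn main() {
    // Mean wall-clock and user CPU times (seconds) from the benchmarks above.
    let (plain_wall, plain_user) = (3.164_f64, 0.407_f64);
    let (geo_wall, geo_user) = (7.296_f64, 4.364_f64);

    println!("wall-clock slowdown: {:.2}x", slowdown(plain_wall, geo_wall));
    println!("user CPU increase:   {:.1}x", slowdown(plain_user, geo_user));
    println!("extra wall-clock:    {:.3} s", geo_wall - plain_wall);
    println!("extra user CPU:      {:.3} s", geo_user - plain_user);
}
```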

@stuartlynn changed the title from "WIP: Functions for reading metrics remotely from Azure" to "Functions for reading metrics remotely from Azure" on Apr 26, 2024
Co-authored-by: Jonathan Yong <[email protected]>
@yongrenjie (Contributor) commented:

Broadly: I'm wondering about use cases. Is there a situation where we want to get the same metric for different geometries (e.g. maybe different countries)? In that case, would it be fair to say that it is the user's responsibility to call get_metrics() once per geometry and concatenate the tables themselves?

@stuartlynn merged commit a1b40cf into main on Apr 26, 2024
1 check passed