Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate parcels in iasworld.pardat missing shapes in spatial.parcel #729

Open
jeancochrane opened this issue Feb 7, 2025 · 0 comments
Assignees

Comments

@jeancochrane
Copy link
Contributor

jeancochrane commented Feb 7, 2025

Background

We generate the spatial.parcel table based on a County geoDB that is supposed to have geometries for every parcel. There are two scripts that perform this extraction and transformation from the geo DB, one to pull the raw data and another to clean it up and save it to spatial.parcel.

A comparison between iasworld.pardat (our source of truth for parcel data) and spatial.parcel reveals that they do not always contain an identical set of PIN10s. Run this query in the Athena console to see the counts over time:

with pins_in_pardat_not_county_geodb as (
    select
        par.taxyr as year,
        count(*) as pins_in_pardat_not_county_geodb
    from iasworld.pardat as par
    left join spatial.parcel as sp
        on substr(par.parid, 1, 10) = sp.pin10
        and par.taxyr = sp.year
    where par.cur = 'Y'
        and par.deactivat is null
        and sp.pin10 is null
        -- Exclude years that we know are missing from County geoDB
        and par.taxyr between '2000' and '2024'
    group by par.taxyr
),

pins_in_county_geodb_not_pardat as (
    select
        sp.year,
        count(*) as pins_in_county_geodb_not_pardat
    from spatial.parcel as sp
    left join iasworld.pardat as par
        on sp.pin10 = substr(par.parid, 1, 10)
        and sp.year = par.taxyr
        and par.cur = 'Y'
        and par.deactivat is null
    where par.parid is null
    group by sp.year
)

select *
from pins_in_pardat_not_county_geodb as par
full outer join pins_in_county_geodb_not_pardat as geodb
    using(year)
order by year

It's possible that these discrepancies are the result of irreducible messiness in one or both of the raw data sources, in which case we'll need to seek help from the stakeholders who own these raw data sources or geocode the missing parcels ourselves (see #720). However, both of those paths forward will require a lot of work, and before we pursue either of them I want to be confident that we're not causing the discrepancies in our own transformation code. The results of the query above make me feel suspicious that there may be a bug in our code, since the variance in pins_in_pardat_not_county_geodb is much higher than the other way around (pins_in_county_geodb_not_pardat).

Deliverable

Let's double-check the transformation script to confirm that we are not accidentally removing rows from the county geoDB that map to parcels in pardat. I'm imagining this will look like:

  • Grab the list of parcel files by year
  • Query iasworld.pardat to get all of the PIN10s in 2016, the most recent year with the highest count of missing shapes (this isn't a step in the transformation script, you'll have to write it yourself)
  • Step through the code in the process_parcel_file function for 2016 only
  • Check each step where we perform a transformation on the county data to make sure we're not dropping parcels

Once we're confident our code looks good, we can plan for a follow-up solution like #720.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants