
Iss418 - Pull request for Digital Access - Version 2025 #444

Open
ridhi96 wants to merge 6 commits into version2025

Conversation


@ridhi96 commented Jan 10, 2025

Mobility metric pull request template

Please include the following points in your PR:

  1. A link to the issue that this PR relates to: Reconfigure digital access metric and update data to 2023 #418

  2. A description of the content in this pull request.

  • What was changed?
    • Re-created the digital access metric using data from IPUMS (5-yr ACS).
  • What should the reviewer be focusing on?
    • The methodology used to re-create the variable and the data quality variable calculations
  • Is there a logical order to review the files in?
    • No
  3. Detail on any issues or flags that the metric reviewer/data-team should be aware of.
  • Still working on getting the evaluate_final_data function to run.

@ridhi96 marked this pull request as draft January 10, 2025 23:30
@cdsolari requested a review from kmartinchek January 16, 2025 20:06
@kmartinchek (Collaborator) commented:

  1. Please add a note to users at the beginning of the script to (1) use here::here() to check the root folder and (2) use the R project file so that paths do not need to be manually set.
  2. Add a note that users need to uncomment the code in lines 76-114 (if they run as-is without doing so it wouldn’t work).
  3. Could be useful to flag which lines of code need to be manually adjusted at the top of the file (e.g., the extract number).
  4. After performing specific functions, please add unit tests to confirm they worked. For example, confirm that only GQ codes 1, 2, and 5 remain in the data after line 185, and confirm the race and income subgroup variables were coded correctly using a table of old against new codes after lines 258 and 276 (see the first sketch after this list).
  5. More context as to what you are testing in lines 328-338 would be helpful. Also recommend using an assert to confirm that results are as expected.
  6. Should the testing for CIHISPEED be removed from this script? I think it would be easier to say that CIHISPEED takes on this value when these conditions are met and then test directly for those conditions. That would slim down the codebase.
  7. It is a bit hard to know if your calculation is doing what you expect it to, because I am not sure exactly what you want to capture. I would add these details as a description, for instance: “here, we calculate a household-level measure of whether households have access to a laptop or tablet at home AND have access to broadband internet at home”. This definition seems reasonable to me, and is what you generate. I see that this is on line 548; I would move it up. (A rough sketch of this definition appears after this list.)
  8. Lines 418 to 425 should either be removed or turned into an assert-like statement (e.g., assert that this variable does not take these values), rather than left in their current format.
  9. A broader comment is that I think that the code used here to query IPUMS and merge in the xwalk is easier to parse than some other code used in the UMF codebase—so I really appreciate that.
  10. For the merges, it would be helpful to use library("tidylog", warn.conflicts = FALSE) so you can see the merge information. Regardless, I recommend adding context for what you expect as output from the merge (e.g., how many matches, characteristics of matched/unmatched data), and then creating a unit test to assess this (see the tidylog sketch after this list).
  11. Why do you adjust HHWT for afact for places but not counties? Is this what you intend? I would describe why you do this on line 502.
  12. Why do you drop_na HHWT in line 519? Explain and also do a check of these NA values before you drop them to make sure the drop is not a mistake. It looks like when I run the count code on line 516, all observations report this as false. Same considerations with respect to the place code block that follows.
  13. For the effective sample code lines 575 to 621, I was surprised to see that all counties had an effective sample of at least 30. It does in some ways make sense if you’re using the 5 year estimates vs. 1 year -- is that the extract you want to use? The HH income metric is from 1 year ACS data, so just wanted to double-check.
  14. When you calculate share digital access, recommend adding an assert that these values fall in the bounded range between 0 and 100 percent – I see now this happens on line 777, so I would note that you assess quality there. I would also assert that if effective sample is less than 30, the value is NA. It is helpful to have these tests presented after the calculations (see the assert sketch after this list).
  15. Great job on the data visualization functions.
  16. More detail on the decision for line 931 would be helpful in the file.
  17. Add a note on what sum_products is, why you test it to be unique, and why you use it for data quality. I can’t find any documentation on it—I know it’s a derivative of the geocorr xwalk, but it would be good to reference this documentation where the variable is discussed, because otherwise the user may think it is from IPUMS and not be able to track down the definition.
  18. Also, wouldn’t data quality level two need to be “(effective_sample_count >= 30 & sum_products >= 0.35 & sum_products < 0.75) ~ 2”, otherwise quality 1 would be overwritten? See line 1000.
  19. Do you need to make any adjustments based on AFACT? I have seen this done in other microdata UMF files (like for HH income calculations), but didn’t catch it here.
  20. Confirm that the distinct numbers in lines 1044 to 1061 are what you expect; otherwise, users won’t know whether what their console returns is what is expected.
  21. Really interesting function for geographic equivalence – but it could use slightly more detailed text on what exactly you are testing for and why.
  22. For some of these checks at the end of the file, it may be better to format them as asserts instead of users needing to check the output and hand verify.
  23. When you surface the missing counties for ethnicity, (line 1153 to 1160), I would clarify why they are missing and why it makes sense to add them in. If they erroneously disappeared earlier, adding them back in this manner would not address the underlying issue.
  24. You mentioned the final evaluation function wasn’t working for you, so I did not check this. Also, the CIs were not calculated, so no checking related to this happened in this review.
  25. Could be useful when generating estimates from microdata with CIs: https://github.com/UI-Research/mobility-from-poverty/blob/main/01_financial-well-being/R/calc_income_quantiles.R
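A few R sketches of what some of these suggestions could look like follow; all object and column names other than the IPUMS variables are placeholders for whatever the script actually uses. For point 4, the post-step checks might be as simple as:

```r
library(dplyr)

# Confirm the group-quarters filter kept only household-type GQ codes (1, 2, and 5).
remaining_gq <- household_data |>
  distinct(GQ) |>
  pull(GQ)
stopifnot(all(remaining_gq %in% c(1, 2, 5)))

# Spot-check a recode: cross-tabulating the original and derived subgroup
# variables makes any miscoded combination visible as an unexpected cell.
# `subgroup_race` is a placeholder for the derived variable's actual name.
household_data |>
  count(RACE, subgroup_race)
```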
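For point 7, one reading of that household-level definition is sketched below; the IPUMS value codes shown (1 = "yes" for the device variables, 10 = "yes" for CIHISPEED) are assumptions to verify against the extract's codebook, and the actual script may treat the N/A codes and the universe differently.

```r
library(dplyr)

# Illustrative only: flag households that have a laptop/desktop or tablet at
# home AND broadband (high-speed) internet at home.
households <- households |>
  mutate(
    digital_access = as.integer(
      (CILAPTOP == 1 | CITABLET == 1) & CIHISPEED == 10
    )
  )
```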
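For point 10, the tidylog pattern might look like the following; the data frame names and join keys here are illustrative, not the script's actual ones.

```r
# tidylog wraps the dplyr verbs and prints a summary of each join
# (matched and unmatched rows), which can be compared against the
# expectation documented in the comments.
library("tidylog", warn.conflicts = FALSE)

merged <- left_join(acs_households, puma_place_xwalk, by = c("statefip", "puma"))

# Better still, turn the expectation into a test rather than a console
# read-out, e.g. if every household is expected to match the crosswalk:
stopifnot(!anyNA(merged$afact))
```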
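And for point 14, the bounded-range and suppression asserts could be written roughly as below, assuming the share is stored as a 0-1 proportion in a column called share_digital_access (adjust the bounds if it is on a 0-100 scale).

```r
shares <- final_data$share_digital_access
low_n  <- which(final_data$effective_sample_count < 30)

stopifnot(
  all(shares >= 0 & shares <= 1, na.rm = TRUE),  # share stays within its bounds
  all(is.na(shares[low_n]))                      # suppressed to NA where effective sample < 30
)
```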

@ridhi96 (Author) commented Feb 13, 2025

Hi @kmartinchek! Please find below my responses to your comments. I have tried to provide more context in the code file. Please let me know if anything is unclear and I'd be happy to add more information. Thank you!

  • Please add a note to users at the beginning of the script to (1) use here::here() to check the root folder and (2) use the R project file so that paths do not need to be manually set.
  • Add a note that users need to uncomment the code in lines 76-114 (if they run as-is without doing so it wouldn’t work).
  • Could be useful to flag which lines of code need to be manually adjusted at the top of the file (e.g., the extract number).
  • After performing specific functions, please add unit tests to confirm they worked. For example, confirm that only GQ codes 1, 2, and 5 remain in the data after line 185 and confirm the race and income subgroup variables were coded correctly using a table between old and new codes after line 258 and line 276.
  • More context as to what you are testing in lines 328-338 would be helpful. Also recommend using an assert to confirm that results are as expected.
  • Should the testing for CIHISPEED be removed from this script? I think it would be easier to say that CIHISPEED takes on this value when these conditions are met and then test directly for those conditions. That would slim down the codebase.
    • I added more context around this in the code book. We need to keep this because IPUMS is not clear on why some observations would still carry the N/A (GQ) value for CIHISPEED after group quarters are removed (the label suggests N/A is assigned to observations categorized as group quarters). We decided to use two separate variables to define the universe of observations (the denominator) and the households that have digital access (the numerator). Please let me know if I can provide more context on this.
  • It is a bit hard to know if your calculation is doing what you expect it to because I am not sure exactly what you want to capture. I would add these details as a description—for instance, “here, we calculate a household-level measure of whether households have access to a laptop or tablet at home AND have access to broadband internet at home”. This definition seems reasonable to me, and is what you generate. I see that this is on line 548—I would move this up.
  • Lines 418 to 425 should be removed or transformed into an assert-like statement, like assert that this variable does not have these values or otherwise remove, instead of how it is currently formatted.
  • A broader comment is that I think that the code used here to query IPUMS and merge in the xwalk is easier to parse than some other code used in the UMF codebase—so I really appreciate that.
  • For the merges, it would be helpful to use library("tidylog", warn.conflicts = FALSE) so you can see the merge information. Regardless, I recommend adding context for what you expect as output from the merge (e.g., how many matches, characteristics of matched/unmatched data), and then create a unit test to assess this.
  • Why do you adjust HHWT for afact for places but not counties? Is this what you intend? I would describe why you do this on line 502.
    • I added more context around this.
  • Why do you drop_na HHWT in line 519? Explain and also do a check of these NA values before you drop them to make sure the drop is not a mistake. It looks like when I run the count code on line 516, all observations report this as false. Same considerations with respect to the place code block that follows.
  • For the effective sample code lines 575 to 621, I was surprised to see that all counties had an effective sample of at least 30. It does in some ways make sense if you’re using the 5 year estimates vs. 1 year -- is that the extract you want to use? The HH income metric is from 1 year ACS data, so just wanted to double-check.
    • Yes, I am using 5 year estimates, as mentioned at the beginning of the code book. The effective sample does fall below 30 for some subgroups, though, and I have detailed this through tests in the code book.
  • When you calculate share digital access, recommend adding an assert that these values fall in the bounded range between 0 and 100 percent – I see now this happens on line 777, so I would note that you assess quality there. I would also assert that if effective sample is less than 30, the value is NA. It is helpful to have these tests presented after the calculations.
  • Great job on the data visualization functions.
  • More detail on the decision for line 931 would be helpful in the file.
    • Added earlier in the code, where creating the metric is discussed.
  • Add a note on what sum_products is, why you test it to be unique, and why you use it for data quality. I can’t find any documentation on it—I know it’s a derivative of the geocorr xwalk, but it would be good to reference this documentation where the variable is discussed, because otherwise the user may think it is from IPUMS and not be able to track down the definition.
    • Provided more detail and the crosswalk file used to create this.
  • Also, wouldn’t data quality level two need to be “(effective_sample_count >= 30 & sum_products >= 0.35 & sum_products < 0.75) ~ 2”, otherwise quality 1 would be overwritten? See line 1000.
    • No, we don't need to add sum_products < 0.75 because case_when() evaluates its conditions in order, so rows that already matched the quality-1 condition are never reassigned by the quality-2 line (see the case_when sketch after this list). I have also added some new tests for the data quality ranges to provide more assurance.
  • Do you need to make any adjustments based on AFACT? I have seen this done in other microdata UMF files (like for HH income calculations), but didn’t catch it here.
    • This is done when I adjust HHWT by afact at an earlier point in the metric calculation (see the weighting sketch after this list).
  • Confirm that the distinct numbers in lines 1044 to 1061 are what you expect; otherwise, users won’t know whether what their console returns is what is expected.
  • Really interesting function for geographic equivalence – but it could use slightly more detailed text on what exactly you are testing for and why.
  • For some of these checks at the end of the file, it may be better to format them as asserts instead of users needing to check the output and hand verify.
  • When you surface the missing counties for ethnicity, (line 1153 to 1160), I would clarify why they are missing and why it makes sense to add them in. If they erroneously disappeared earlier, adding them back in this manner would not address the underlying issue.
  • You mentioned the final evaluation function wasn’t working for you, so I did not check this. Also, the CIs were not calculated, so no checking related to this happened in this review.
    • Added the final evaluation function.
  • Could be useful when generating estimates from microdata with CIs: https://github.com/UI-Research/mobility-from-poverty/blob/main/01_financial-well-being/R/calc_income_quantiles.R
    • I didn't really understand what I need to do for this comment. Please let me know and I can update.
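As referenced above, here is a small sketch of the case_when() ordering behaviour behind the data-quality response; the cut-offs mirror the ones quoted in the review, though the script's exact tiers may differ.

```r
library(dplyr)

quality <- tibble(
  effective_sample_count = c(50, 50, 10),
  sum_products           = c(0.80, 0.50, 0.80)
) |>
  mutate(
    data_quality = case_when(
      effective_sample_count >= 30 & sum_products >= 0.75 ~ 1,
      effective_sample_count >= 30 & sum_products >= 0.35 ~ 2,
      TRUE                                                ~ 3
    )
  )

# The first row satisfies both the quality-1 and quality-2 conditions but is
# coded 1: case_when() returns the value of the first matching condition, so
# an explicit "sum_products < 0.75" upper bound on the second line is not needed.
quality
```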
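And, as referenced in the HHWT response above, the place-level adjustment amounts to scaling the household weight by the geocorr allocation factor; this is only an illustration of the general pattern, with assumed column names.

```r
# afact is the geocorr allocation factor: the share of a PUMA's population
# that falls within a given place. Scaling HHWT by it splits a household's
# weight across the places its PUMA overlaps before aggregating to place level.
place_households <- place_households |>
  dplyr::mutate(HHWT_place = HHWT * afact)
```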

@ridhi96 marked this pull request as ready for review February 13, 2025 20:03
@cdsolari requested review from cdsolari and jwalsh28 March 4, 2025 15:39
@ridhi96 marked this pull request as draft March 5, 2025 19:13
@ridhi96 marked this pull request as ready for review March 7, 2025 20:58