Scripts to preproc Scotland GP data and calculate score by IZ added, … #5
Conversation
…requirements updated, data files added.
Answering 1 and 2. The key thing to note is that this is a spatial interpolation approach. So we use the input data to fit the model, but the model predictions are then made on a lattice/grid. We're ignoring weights that are zeros because you can't take the inverse of zero. In practice this occurs when/if a candidate point in the grid happens to exactly coincide with an original data point; in this situation the weight is set to zero. I think we originally used this inverse distance weighting approach because it is the simplest deterministic approach there is to creating an interpolated surface of values. What we wanted to do is effectively redistrict values from point locations to areas while making some account of the spatial dependency in the data. The IDW approach allowed us to a) make simple estimates for any arbitrary position in space, and b) aggregate estimates to a new geography (e.g. a zonal geography). The key thing to note is that there are a number of ways of doing this, and IDW is one of the simplest. It may be worthwhile to explore whether a geostatistical approach like kriging (Gaussian process regression) offers you a better solution.
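For concreteness, here is a minimal sketch of the k-nearest-neighbour IDW step described above. This is not the exact code in scotland_idw_2022; the array names, the scipy-based neighbour search, and the power-of-2 weighting are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_predict(gp_xy, gp_scores, grid_xy, k=20, power=2):
    """Estimate a score at each grid cell centre from its k nearest GP points."""
    tree = cKDTree(gp_xy)
    dist, idx = tree.query(grid_xy, k=k)             # (n_cells, k) distances / indices
    weights = np.zeros_like(dist)
    nonzero = dist > 0                               # a grid point coinciding with a GP
    weights[nonzero] = 1.0 / dist[nonzero] ** power  # keeps a weight of 0, as noted above
    # Inverse-distance-weighted average of the k neighbouring scores per grid cell
    return (weights * gp_scores[idx]).sum(axis=1) / weights.sum(axis=1)
```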
Answering 3. There's nothing inherently suspicious about k being 20, and there's no reason to think that 5 is a good default for this application. You can think of k as a smoothing parameter - the higher it is, the more points participate in the estimation of an average for a given location. The heterogeneous nature of spatial data tends to mean that in some areas k=20 will oversmooth, while in others it will undersmooth. The solution to this is to take a spatially adaptive approach to k, but this is much harder to do than to assert a fixed k. The process you suggest sounds reasonable in terms of finding a good value for k, and heuristically 20 feels fine to me.
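If it helps, one hedged way to probe sensitivity to k is a leave-one-out error at the GP points for a few candidate values. This is a sketch only, assuming the gp_xy and gp_scores arrays from the sketch above and no duplicate GP coordinates; it is not code from the repo.

```python
import numpy as np
from scipy.spatial import cKDTree

def loo_rmse(gp_xy, gp_scores, k, power=2):
    """Leave-one-out RMSE of the IDW estimate at the GP points themselves."""
    tree = cKDTree(gp_xy)
    dist, idx = tree.query(gp_xy, k=k + 1)  # k+1 so the zero-distance self-match can be dropped
    dist, idx = dist[:, 1:], idx[:, 1:]
    w = 1.0 / dist ** power
    pred = (w * gp_scores[idx]).sum(axis=1) / w.sum(axis=1)
    return np.sqrt(np.mean((pred - gp_scores) ** 2))

for k in (5, 10, 20, 40):
    print(k, loo_rmse(gp_xy, gp_scores, k))
```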
Answering 4. This cell size is the grid spacing, so working on the British National Grid implies that each cell covers a real-world area of 250m x 250m; this is essentially the resolution of the interpolated surface. Because this is currently an in-memory operation, you need to be able to store the entire grid (array) in memory at the required datatype. Every time you halve the cell size you quadruple the storage required. In this sense, the 250m is a convenience, compromising between estimating values on a very fine grid and having enough RAM to store and work with the output. It's likely that you can increase the resolution if you have a decent machine/VM with access to a reasonable amount of RAM. A harder alternative would be to use dask or another out-of-memory computation framework to compute a finer-scale raster. The effect of going to a smaller cell size should be to introduce more variance into the estimated values, and vice versa a larger cell size should suppress variance - again like smoothing. The choice of cell size is likely related to the scale of your analysis and the spatial process that you're interpolating over, but these may be to some extent unobservable. A simple approach might be to do a sensitivity analysis checking whether the results are consistent for smaller cell sizes (e.g. 100m).
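As a rough illustration of the memory point (the extent values below are an assumed BNG bounding box for Scotland, for illustration only, not taken from the scripts), each halving of the cell size roughly quadruples the array size:

```python
import numpy as np

# Assumed area-of-interest bounds in BNG metres, illustrative only
xmin, xmax, ymin, ymax = 0, 470_000, 530_000, 1_220_000

for cellsize in (500, 250, 125, 100):
    ncols = int(np.ceil((xmax - xmin) / cellsize))
    nrows = int(np.ceil((ymax - ymin) / cellsize))
    mb = nrows * ncols * np.dtype("float64").itemsize / 1e6
    print(f"{cellsize:>4} m cells: {nrows} x {ncols} grid, ~{mb:,.0f} MB as float64")
```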
Answering 5. The transformation effectively maps from array coordinates to British National Grid (BNG) coordinates. Think of a numpy array as having a cell size of 1 and an origin at (0,0); then any value in the array maps spatially to a projected coordinate system with an origin at (0,0) and a spatial unit of 1. Obviously, this isn't actually the case for us, so we need to construct an 'affine' transform that can map from array space to space in an arbitrary projected coordinate system (in this case, BNG). This is what the transform does. The values of 250 (and -250, due to upper vs. lower left origins in the array vs. BNG) reflect the cell size as described in answer 4. The xmin and ymax values represent the amount you need to shift an origin of (0,0) to make the array line up with our area of interest, and the +/- 125 is simply padding (notably half of 250, as cells are measured from their middles); this means that cells are centred over the starting boundaries of the area of interest. Obviously then, if you wanted to change the cell size, you should also update these values to reflect the change (e.g. a cell size of 100m would need a padding of 50m, etc.).
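To keep the padding in step with the cell size, the transform can be parameterised rather than hardcoded. This is a sketch assuming rst is rasterio (as in the question) and that xmin/ymax are the bounds of the area of interest:

```python
import rasterio as rst

cellsize = 250
pad = cellsize / 2  # half a cell, so cell centres sit on the area-of-interest boundary

# GDAL geotransform order: (top-left x, x cell size, 0, top-left y, 0, -y cell size)
trans = rst.Affine.from_gdal(xmin - pad, cellsize, 0, ymax + pad, 0, -cellsize)
```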
LGTM |
…ed csv to take into account updated medication
…into scotland_gp_2022
…into scotland_gp_2022
The two main scripts for review are:

scotland_prescription_preproc_2022 - cleans the GP prescription data. Outputs loneliness score per GP by postcode.

scotland_idw_2022 - uses output from above to create an inverse distance weighting model to predict the scores of all coordinates within an IZ. Mean score per IZ is calculated, IZs then ranked and put into deciles. Outputs a csv of loneliness score by IZ.

Notebooks not for review - used for exploration.
Questions / assumptions to check concerning the inverse distance weighting model:
cellsize is used to ensure this, hardcoded as 250. Where does this number come from?

trans = rst.Affine.from_gdal(xmin-125,250,0,ymax+125,0,-250) is used. How are the hardcoded values of 250 and 125 derived?