Scripts to preproc Scotland GP data and calculate score by IZ added, … #5
Conversation
…requirements updated, data files added.
Answering 1 and 2. The key thing to note is that this is a spatial interpolation approach. So we use the input data to fit the model, but the model predictions are then made on a lattice/grid. We're ignoring weights that are zeros because you can't take the inverse of zero. In practice this occurs when/if a candidate point in the grid happens to exactly coincide with an original data point; in this situation the weight is set to zero. I think we originally used this inverse distance weighting approach because it is the simplest deterministic approach there is to creating an interpolated surface of values. What we wanted to do is effectively redistrict values from point locations to areas while making some account of the spatial dependency in the data. The IDW approach allowed us to a) make simple estimates for any arbitrary position in space, and b) aggregate estimates to a new geography (e.g. a zonal geography). The key thing to note is that there are a number of ways of doing this, and IDW is one of the simplest. It may be worthwhile to explore whether a geostatistical approach like kriging (Gaussian process regression) offers you a better solution.
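For concreteness, here is a minimal sketch of the k-nearest-neighbour IDW step described above. This is not the exact code in scotland_idw_2022; the array names, the scipy-based neighbour search, and the power-of-2 weighting are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_predict(gp_xy, gp_scores, grid_xy, k=20, power=2):
    """Estimate a score at each grid cell centre from its k nearest GP points."""
    tree = cKDTree(gp_xy)
    dist, idx = tree.query(grid_xy, k=k)             # (n_cells, k) distances / indices
    weights = np.zeros_like(dist)
    nonzero = dist > 0                               # a grid point coinciding with a GP
    weights[nonzero] = 1.0 / dist[nonzero] ** power  # keeps a weight of 0, as noted above
    # Inverse-distance-weighted average of the k neighbouring scores per grid cell
    return (weights * gp_scores[idx]).sum(axis=1) / weights.sum(axis=1)
```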
Answering 3. There's nothing inherently suspicious about k being 20, and there's no reason to think that 5 is a good default for this application. You can think of k as a smoothing parameter - the higher it is, the more points participate in the estimation of an average for a given location. The heterogeneous nature of spatial data tends to mean that in some areas k=20 will oversmooth, while in others it will undersmooth. The solution to this is to take a spatially adaptive approach to k, but this is much harder to do than to assert a fixed k. The process you suggest sounds reasonable in terms of finding a good value for k, and heuristically 20 feels fine to me.
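If it helps, one hedged way to probe sensitivity to k is a leave-one-out error at the GP points for a few candidate values. This is a sketch only, assuming the gp_xy and gp_scores arrays from the sketch above and no duplicate GP coordinates; it is not code from the repo.

```python
import numpy as np
from scipy.spatial import cKDTree

def loo_rmse(gp_xy, gp_scores, k, power=2):
    """Leave-one-out RMSE of the IDW estimate at the GP points themselves."""
    tree = cKDTree(gp_xy)
    dist, idx = tree.query(gp_xy, k=k + 1)  # k+1 so the zero-distance self-match can be dropped
    dist, idx = dist[:, 1:], idx[:, 1:]
    w = 1.0 / dist ** power
    pred = (w * gp_scores[idx]).sum(axis=1) / w.sum(axis=1)
    return np.sqrt(np.mean((pred - gp_scores) ** 2))

for k in (5, 10, 20, 40):
    print(k, loo_rmse(gp_xy, gp_scores, k))
```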
Answering 4. This cell size is the grid spacing, so working on the British National Grid implies that each cell covers a real-world area of 250m x 250m; this is essentially the resolution of the interpolated surface. Because this is currently an in-memory operation, you need to be able to store the entire grid (array) in memory at the required datatype. Every time you halve the cell size you quadruple the storage required. In this sense, the 250m is a convenience, compromising between estimating values on a very fine grid and having enough RAM to store and work with the output. It's likely that you can increase the resolution if you have a decent machine/VM with access to a reasonable amount of RAM. A harder alternative would be to use dask or another out-of-memory computation framework to compute a finer-scale raster. The effect of going to a smaller cell size should be to introduce more variance into the estimated values, and vice versa a larger cell size should suppress variance - again like smoothing. The choice of cell size is likely related to the scale of your analysis and the spatial process that you're interpolating over, but these may be to some extent unobservable. A simple approach might be to do a sensitivity analysis checking whether the results are consistent for smaller cell sizes (e.g. 100m).
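As a rough illustration of the memory point (the extent values below are an assumed BNG bounding box for Scotland, for illustration only, not taken from the scripts), each halving of the cell size roughly quadruples the array size:

```python
import numpy as np

# Assumed area-of-interest bounds in BNG metres, illustrative only
xmin, xmax, ymin, ymax = 0, 470_000, 530_000, 1_220_000

for cellsize in (500, 250, 125, 100):
    ncols = int(np.ceil((xmax - xmin) / cellsize))
    nrows = int(np.ceil((ymax - ymin) / cellsize))
    mb = nrows * ncols * np.dtype("float64").itemsize / 1e6
    print(f"{cellsize:>4} m cells: {nrows} x {ncols} grid, ~{mb:,.0f} MB as float64")
```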
Answering 5. The transformation effectively maps from array coordinates to British National Grid (BNG) coordinates. Think of a numpy array as having a cell size of 1 and an origin at (0,0); then any value in the array maps spatially to a projected coordinate system with an origin at (0,0) and a spatial unit of 1. Obviously, this isn't actually the case for us, so we need to construct an 'affine' transform that can map from array space to space in an arbitrary projected coordinate system (in this case, BNG). This is what the transform does. The values of 250 (and -250, due to upper vs. lower left origins in the array vs. BNG) reflect the cell size as described in answer 4. The xmin and ymax values represent the amount you need to shift an origin of (0,0) to make the array line up with our area of interest, and the +/- 125 is simply padding (notably half of 250, as cells are measured from their middles); this means that cells are centred over the starting boundaries of the area of interest. Obviously then, if you wanted to change the cell size, you should also update these values to reflect the change (e.g. a cell size of 100m would need a padding of 50m, etc.).
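To keep the padding in step with the cell size, the transform can be parameterised rather than hardcoded. This is a sketch assuming rst is rasterio (as in the question) and that xmin/ymax are the bounds of the area of interest:

```python
import rasterio as rst

cellsize = 250
pad = cellsize / 2  # half a cell, so cell centres sit on the area-of-interest boundary

# GDAL geotransform order: (top-left x, x cell size, 0, top-left y, 0, -y cell size)
trans = rst.Affine.from_gdal(xmin - pad, cellsize, 0, ymax + pad, 0, -cellsize)
```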
LGTM |
…ed csv to take into account updated medication
…into scotland_gp_2022
…into scotland_gp_2022
The two main scripts for review are:

scotland_prescription_preproc_2022 - cleans the GP prescription data. Outputs loneliness score per GP by postcode.

scotland_idw_2022 - uses output from above to create an inverse distance weighting model to predict the scores of all coordinates within an IZ. Mean score per IZ is calculated, IZs then ranked and put into deciles. Outputs a csv of loneliness score by IZ.

Notebooks not for review - used for exploration.
Questions / assumptions to check concerning the inverse distance weighting model:
cellsize is used to ensure this, hardcoded as 250. Where does this number come from?

trans = rst.Affine.from_gdal(xmin-125,250,0,ymax+125,0,-250) is used. How are the hardcoded values of 250 and 125 derived?