This is a quick guide for data scientists and analysts on working with geospatial data. Geographical information in a dataset can be used to generate features and can provide statistically important signal. This is called spatial feature engineering. In this guide I mainly focus on geomarketing tasks, but the concepts can be used in any field.
Spatial weights represent the spatial structure of the data. They define neighbouring relations between objects, so they can be treated as a weighted graph and stored as an adjacency matrix or an adjacency list. If no relation is considered between objects i and j, the corresponding weight w_ij = 0. A spatial weight matrix is required to build models and compute most geospatial indices. The spatial relationships encoded in the weights can be defined in different ways. For a better understanding of the methods I recommend reading this article by ArcGIS.
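As a minimal sketch of the idea, the snippet below builds k-nearest-neighbour weights as an adjacency list, row-standardises them, and uses them to compute a spatial lag (a neighbour-weighted average of a feature). The function names, the shop coordinates and the revenue values are made up for illustration, and plain Euclidean distance is used, so the coordinates are assumed to be projected. In practice a library such as PySAL would normally build the weights.

```python
import math


def knn_weights(points, k):
    """Return row-standardised k-nearest-neighbour weights as an adjacency list.

    weights[i][j] holds w_ij; pairs missing from the dict have w_ij = 0.
    Distances are Euclidean, so coordinates are assumed to be projected.
    """
    weights = {}
    for i, (xi, yi) in enumerate(points):
        # Distance from object i to every other object, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Row standardisation: the weights of each object sum to 1.
        weights[i] = {j: 1.0 / len(neighbours) for j in neighbours}
    return weights


def spatial_lag(weights, values):
    """Neighbour-weighted average of `values` for every object."""
    return [
        sum(w_ij * values[j] for j, w_ij in weights[i].items())
        for i in range(len(values))
    ]


# Hypothetical shop locations (projected metres) and monthly revenue.
points = [(0, 0), (1, 0), (0, 1), (10, 10)]
revenue = [100.0, 200.0, 300.0, 50.0]

w = knn_weights(points, k=2)
lag = spatial_lag(w, revenue)
print(lag)
```

Row standardisation is only one of several conventions. It is convenient here because it turns the lag into a plain average over the neighbours.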
- Counting nearby objects (e.g. competitors) - buffer + clip
- Distance to key objects (e.g. feature center, central feature or feature hotspot)
  - Haversine distance
  - Manhattan distance with haversine formula
- Distance to closest object (e.g. storage, shopping center or bus stop)
- Spatial lag: the neighbour-weighted average of a feature
- Building areas, perimeters and volumes (momepy)
- Building alignment, adjacency, shared walls (momepy)
- Building intensity
- Neighbouring-based diversity indices (momepy)
- Elevation (altitude)
- Local spatial autocorrelation (cluster and outlier analysis)
  - Local Moran
  - Local G
  - Local Geary
  - Local join counts (for binary features)
- Shape measures and urban shape measures for polygonal objects
- Spatial accessibility metrics
- Clustering
  - AZP (automatic zoning procedure)
  - Bottom-up agglomerative
  - Regional K-Means
  - A-DBSCAN
- Multivariate
- Location set covering problem (LSCP)
- Silhouette samples
- Distance-preserving dimensionality reduction
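Several of the distance-based features in the list above need nothing beyond the standard library. The sketch below shows the haversine distance, a Manhattan-style distance built from two haversine legs, the distance to the closest object and a count of objects within a radius. All function names and coordinates are hypothetical. The Manhattan variant measures the east-west leg at the latitude of the first point, which is one common convention rather than the only one.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_haversine_km(lat1, lon1, lat2, lon2):
    """North-south leg plus east-west leg, each measured with haversine."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat1, lon1, lat1, lon2)  # leg taken at lat1
    return north_south + east_west


def distance_to_closest_km(lat, lon, objects):
    """Distance from a point to the nearest object in `objects`."""
    return min(haversine_km(lat, lon, o_lat, o_lon) for o_lat, o_lon in objects)


def count_within_km(lat, lon, objects, radius_km):
    """Number of objects no farther than `radius_km` from the point."""
    return sum(
        haversine_km(lat, lon, o_lat, o_lon) <= radius_km
        for o_lat, o_lon in objects
    )


# Hypothetical store and competitor coordinates (lat, lon).
store = (0.0, 0.0)
competitors = [(0.0, 0.5), (0.0, 1.0), (0.0, 3.0)]

nearest = distance_to_closest_km(*store, competitors)
nearby = count_within_km(*store, competitors, radius_km=120)
print(nearest, nearby)
```

The brute-force loops are fine for a few thousand objects. For larger datasets a spatial index is the usual choice.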
In urban areas each object has a corresponding part of the road graph (defined by a point and a radius). It can be used to generate features with osmnx.stats.basic_stats.
- Average circuity (circuity_avg)
- Total edge length per km^2 (edge_density_km)
- Intersection count per km^2 (intersection_density_km)
- Self-loop proportion (self_loop_proportion)
- Average number of streets per node (streets_per_node_avg)
- Node count per km^2 (node_density_km)
- Average street length (street_length_avg)
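To make a few of these metrics concrete, here is a hand-computed sketch on a toy street network. It is not osmnx's implementation: the node coordinates, street lengths and buffer area are invented, the coordinates are assumed to be projected metres (so straight-line distance stands in for great-circle distance), and streets are treated as undirected segments, whereas osmnx distinguishes directed edges from undirected streets.

```python
import math

# Hypothetical toy street network in projected coordinates (metres).
nodes = {
    "a": (0.0, 0.0),
    "b": (300.0, 0.0),
    "c": (300.0, 400.0),
}
# Undirected street segments: (from, to, length along the street in metres).
streets = [
    ("a", "b", 300.0),
    ("b", "c", 400.0),
    ("a", "c", 600.0),  # a curved street, longer than the 500 m straight line
]
area_km2 = 0.5  # area of the buffer around the object, assumed known


def straight_line(u, v):
    """Straight-line distance between two nodes in metres."""
    (x1, y1), (x2, y2) = nodes[u], nodes[v]
    return math.hypot(x2 - x1, y2 - y1)


total_length = sum(length for _, _, length in streets)
total_straight = sum(straight_line(u, v) for u, v, _ in streets)

features = {
    "node_density_km": len(nodes) / area_km2,          # nodes per km^2
    "edge_density_km": total_length / area_km2,        # metres of street per km^2
    "street_length_avg": total_length / len(streets),  # metres
    "circuity_avg": total_length / total_straight,     # 1.0 means perfectly straight
}
print(features)
```

For real data the graph around an object would typically come from something like osmnx.graph_from_point, and the resulting dictionary of statistics can be flattened directly into feature columns.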
The PySAL spaghetti module also provides various network statistics:
- Moran’s I (API)
- Point snap distance (API)
- Network weights (Network w_network attribute) used to find distances to network hotspots / K nearest clusters / ...
In some cases reverse geocoding can provide useful features. It can be done using GeoPy or reverse geocoder. A physical address can be used to generate categorical features (country, city, district, street) or can be treated as text data.
Another categorical feature that represents spatial proximity is the geohash, a unique identifier of a specific region on the Earth. It can be computed using the geohash library.
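To show what a geohash encodes, here is a minimal sketch of the public geohash algorithm written by hand, so that no third-party package is assumed. The function name and the default precision are my own choices. The coordinate range is halved repeatedly, longitude and latitude bits alternate, and every five bits become one base-32 character.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


code = geohash_encode(57.64911, 10.40744, precision=5)
print(code)
```

Because each extra character only refines the same cell, truncating a geohash gives a coarser region. This makes prefixes of different lengths handy as categorical features at several spatial scales.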
Text data can also be used for feature engineering. This has nothing to do with spatial data analysis, but I want to cover the majority of topics in this guide. One way to utilize text data is to generate categorical data by parsing text using separators or regular expressions. A different approach uses word2vec embeddings for a single word. If you have a phrase of multiple words, it is worth taking the average of the word embeddings or (more precisely) the average of the word embeddings weighted by their TF-IDF scores. A completely different approach uses BERT text embeddings as features.
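The TF-IDF weighted average can be sketched in a few lines. Everything below is illustrative: the three-dimensional vectors stand in for real word2vec embeddings, the tiny corpus is invented, and the smoothed IDF formula is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors; real ones would come from a trained model.
embeddings = {
    "coffee": [1.0, 0.0, 0.0],
    "shop": [0.0, 1.0, 0.0],
    "book": [0.0, 0.0, 1.0],
}

# Hypothetical tokenised corpus used to estimate document frequencies.
corpus = [
    ["coffee", "shop"],
    ["book", "shop"],
    ["coffee", "coffee", "shop"],
]


def idf(term, documents):
    """Smoothed inverse document frequency (one common variant)."""
    df = sum(term in doc for doc in documents)
    return math.log((1 + len(documents)) / (1 + df)) + 1


def phrase_vector(tokens, documents):
    """TF-IDF weighted average of the embeddings of the known tokens."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    weight_sum = 0.0
    for term, count in counts.items():
        weight = (count / total) * idf(term, documents)
        weight_sum += weight
        for d in range(dim):
            vec[d] += weight * embeddings[term][d]
    # Normalise by the total weight; an all-unknown phrase stays a zero vector.
    return [v / weight_sum for v in vec] if weight_sum else vec


v = phrase_vector(["coffee", "shop"], corpus)
print(v)
```

The rarer word ("coffee") pulls the phrase vector towards itself, which is the point of weighting by TF-IDF instead of taking a plain mean.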
There are machine learning models that do not require any spatial feature engineering at all. They take spatial relationships into account along with attribute information and can serve as a pipeline for quick metric estimation, a good starting point for hypothesis testing, or an initial overview of the data. Since most of these models are linear, it is worth applying feature transformations (log, power).
- Geographically weighted regression (GWR)
- Multiscale GWR (MGWR)
- Ordinary least squares
- Seemingly unrelated regression
- Fixed and random effects panels
- Forest-based classification and regression
- Tests of homoskedasticity, normality, spatial randomness etc.
- ArcGIS documentation
- GeoDa documentation
- PySAL library
- Geographic Data Science with Python