peerindex/geocoder

geocoder

A quick and dirty library that extracts a location from short free text. It has no dependency on any geocoding service and is 100% open source, including the gazetteer data. It makes no network calls and does no disk I/O: all necessary data is contained in the jar and held in memory (the jar is around 8 MB; the code uses roughly 600-700 MB of memory at runtime).

How to use

Code:

// Geocoder object is expensive to create & thread-safe.
// Share a single instance per application!
Geocoder geocoder = new Geocoder();
Location ranchoCordova = geocoder.resolve("Rancho Cordova, US");
// Russian input, roughly "Moscow is amazing"
Location moscow        = geocoder.resolve("Москва является удивительным");

Output:

(Rancho Cordova, California, US)
{
  "geonameId" : 5385941,
  "featureCodeCategory" : "SUBADM",
  "defaultName" : "Rancho Cordova",
  "featureCode" : "PPL",
  "codes" : {
    "ADM2" : "067",
    "ADM1" : "CA",
    "PCL" : "US"
  },
  "names" : [ "Rancho Kordova", "Ранчо Кордова", ...],
  "population" : 64776,
  "lat" : 38.58907,
  "lng" : -121.30273
}

(Moscow, RU)
{
  "geonameId" : 524901,
  "featureCodeCategory" : "SUBADM",
  "defaultName" : "Moscow",
  "featureCode" : "PPLC",
  "codes" : {
    "ADM1" : "48",
    "ADM2" : "562331",
    "PCL" : "RU"
  },
  "names" : [ "mwskw", "Mosco"...],
  "population" : 10381222,
  "lat" : 55.75222,
  "lng" : 37.61556
}

Notes

#### Input

  • Primarily meant for locations entered as short free text. It doesn't work well with longer texts (such as articles).
  • Resolves only down to town level (no street addresses), and only for towns with at least 1,000 inhabitants.
  • Language agnostic (but your mileage may vary for non-English texts).
  • Best suited to web content such as Twitter, because it uses population data adjusted for online activity.

#### Notes

  • DO NOT create more than one Geocoder object. It is very expensive to create, and it is thread-safe, so a single instance can be shared.
  • DO NOT modify the Location object. It's a mutable object so that it plays nicely with serialization libraries, and the Geocoder object doesn't return a defensive copy for performance reasons (maybe I should, really).
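
The two rules above can be followed with a lazily initialized shared instance and a caller-side copy. The sketch below is only an illustration: `StubGeocoder`, `StubLocation` and `GeocoderHolder` are hypothetical stand-ins invented here so the example compiles without the library on the classpath, and the copy constructor is an assumption (the source does not say whether `Location` offers one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the library's Location (only two fields shown).
class StubLocation {
    String defaultName;
    List<String> names;

    StubLocation(String defaultName, List<String> names) {
        this.defaultName = defaultName;
        this.names = new ArrayList<>(names);
    }

    // Copy constructor: duplicates the fields, including a fresh names list.
    StubLocation(StubLocation other) {
        this(other.defaultName, other.names);
    }
}

// Hypothetical stand-in for the library's Geocoder.
class StubGeocoder {
    static final AtomicInteger CREATED = new AtomicInteger();
    private final StubLocation shared;

    StubGeocoder() {
        CREATED.incrementAndGet(); // counts constructions for the demo
        List<String> names = new ArrayList<>();
        names.add("Rancho Kordova");
        shared = new StubLocation("Rancho Cordova", names);
    }

    // Returns the shared, mutable object (no defensive copy), as the note describes.
    StubLocation resolve(String text) {
        return shared;
    }
}

// Initialization-on-demand holder: the JVM creates INSTANCE once, lazily and thread-safely.
final class GeocoderHolder {
    private GeocoderHolder() {}

    private static final class Lazy {
        static final StubGeocoder INSTANCE = new StubGeocoder();
    }

    static StubGeocoder get() {
        return Lazy.INSTANCE;
    }

    public static void main(String[] args) {
        StubLocation original = get().resolve("Rancho Cordova, US");
        StubLocation mine = new StubLocation(original); // edit the copy, not the original
        mine.names.add("my alias");
        System.out.println("instances created: " + StubGeocoder.CREATED.get());
        System.out.println("names in shared object: " + original.names.size());
    }
}
```

With the real library, the holder would wrap `new Geocoder()` instead of the stub, and any code that needs to change a result would work on its own copy.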

#### Output

  • Here is an example output, annotated with comments:
{
  // The Geoname ID of this location (see http://www.geonames.org/)
  "geonameId" : 5385941,

  // The type of this location. Roughly matches Geonames' classification
  "featureCodeCategory" : "SUBADM",
  
  // The default, English name of this location
  "defaultName" : "Rancho Cordova",
  
  // The Geoname feature code of this location
  "featureCode" : "PPL",
  
  // GeoNames administrative area codes
  "codes" : {
    // This stands for Sacramento county
    "ADM2" : "067",
    // This stands for the state of California
    "ADM1" : "CA",
    // This stands for USA
    "PCL" : "US"
  },
  
  // Alternative names this location is known by
  "names" : [ "Rancho Kordova", "Ранчо Кордова", ...],
  
  // Population of this location
  "population" : 64776,
  
  // Latitude & Longitude
  "lat" : 38.58907,
  "lng" : -121.30273
}

#### Performance & accuracy

  • Performance (on my 4-core MacBook Pro)
    • Avg. response time: 0.01 ms
    • Throughput: 300K calls / sec
  • Accuracy
    • Both precision and recall were > 0.95, but the test was done with a VERY limited dataset
    • It will depend heavily on your data
#### Configurations

  • The bundled dictionary applies the following population thresholds (population.threshold.txt) to cut down memory requirements and increase accuracy. Locations that don't meet the threshold are not indexed. You can change these parameters, but you then have to regenerate compressed.gazetteer.txt using CompressGazetteer.
PCL,500000
ADM1,30000
ADM2,1000
ADM3,1000
ADM4,1000
SUBADM,1000
  • It also uses the config file online.activity.share.txt to adjust population data. Each number stands for a country's share of online activity, so the sum of all numbers should be <= 1. This config can be changed WITHOUT regenerating compressed.gazetteer.txt.
US,0.5088     <- 50.88% of online activity
BR,0.0879
GB,0.0720
CA,0.0435
DE,0.0249
ID,0.0241
AU,0.0239
NL,0.0239
IN,0.0239
MX,0.0239
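
Both config files share the same `KEY,NUMBER` line format, so edited values can be checked before use. The sketch below is an illustration only: `GazetteerConfigSketch` and its methods are hypothetical and are not the library's own loader. It also assumes that "meeting" a threshold means population >= threshold, that categories without an entry are not filtered, and that lines carry no trailing annotation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GazetteerConfigSketch {
    // Parses "KEY,NUMBER" lines into an ordered map, skipping blank lines.
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> values = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            String[] parts = trimmed.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected KEY,NUMBER but got: " + line);
            }
            values.put(parts[0].trim(), Double.parseDouble(parts[1].trim()));
        }
        return values;
    }

    // Assumption: a location is kept when its population is >= the category threshold.
    static boolean meetsThreshold(Map<String, Double> thresholds, String category, long population) {
        Double min = thresholds.get(category);
        return min == null || population >= min;
    }

    static double totalShare(Map<String, Double> shares) {
        double sum = 0.0;
        for (double share : shares.values()) sum += share;
        return sum;
    }

    // The shares should sum to at most 1 (with a small tolerance for rounding).
    static boolean sharesAreValid(Map<String, Double> shares) {
        return totalShare(shares) <= 1.0 + 1e-9;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = parse(Arrays.asList(
            "PCL,500000", "ADM1,30000", "ADM2,1000", "ADM3,1000", "ADM4,1000", "SUBADM,1000"));
        Map<String, Double> shares = parse(Arrays.asList(
            "US,0.5088", "BR,0.0879", "GB,0.0720", "CA,0.0435", "DE,0.0249",
            "ID,0.0241", "AU,0.0239", "NL,0.0239", "IN,0.0239", "MX,0.0239"));
        System.out.println("Rancho Cordova indexed: " + meetsThreshold(thresholds, "SUBADM", 64776));
        System.out.println("shares valid: " + sharesAreValid(shares));
    }
}
```

The ten shares listed above add up to 0.8568, so they satisfy the <= 1 constraint.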

#### Acknowledgment
This library uses the Gazetteer by GeoNames (http://www.geonames.org/), licensed under a Creative Commons Attribution 3.0 License.
