Cleaning Addresses in OpenRefine

This is an experiment, to look at how addresses can be cleaned using OpenRefine.

A more robust method might be to use a geocoding web service, which provides a match return, run on the two sets of addresses that you are trying to match. However, that may be time consuming if you have a lot of service calls to make.

This approach might help shorten that.

Note that I am not saying "standardized" as that would mean adherence to a standard, such as the US postal standard. This isn't as sophisticated as that - I'm basically looking to match addresses across disparate datasets. I have tens of thousands of records to go through, and this may happen on a recurring basis, so I want an easy way to do it... so enter OpenRefine to help do part of the heavy lift.

Turn to upper case

toUppercase(value)

Trim potential extraneous whitespace

Use trim() to do this

trim(toUppercase(value))

Remove extraneous punctuation and other characters

I'm not clever in the ways of regex... regex is pure romulan to me. There are probably other far more sophisticated ways do this than what I'm presenting, I'm open to suggestions - but this built-up method of doing it brute-force is working for the moment.

Use replace() to do this - append .replace("thing_to_be_replaced","replacement_value")

From my initial analysis of this dataset, I think I want to get rid of " . # / , ( ) and /t (tab) characters

Trying single quotes for getting rid of the doublequote value seems to work...

trim(toUppercase(value)).replace('"','').replace(".","").replace("#","").replace(",","").replace("(","").replace(")","")

I want to replace the "/" with a " "

trim(toUppercase(value)).replace('"','').replace(".","").replace("#","").replace(",","").replace("(","").replace(")","").replace("/"," ")

Building GREL in succession

To each piece, I will add on a new set of GREL directives

Directionals

I also want to have consistent directionals, so I'm going to append:

.replace("EAST ","E ") .replace("WEST ","W ") .replace("NORTH ","N ") .replace("SOUTH ","S ").replace(" SOU "," S ").replace(" SO "," S ")

Street Names

.replace(" FIRST "," 1ST ") .replace(" SECOND "," 2ND ") .replace(" THIRD "," 3RD ") .replace(" FOURTH "," 4TH ") .replace(" FIFTH "," 5TH ") .replace(" SIXTH "," 6TH ") .replace(" SEVENTH "," 7TH ") .replace(" EIGHTH "," 8TH ") .replace(" NINTH "," 9TH ")

Road Types

Via a preliminary analysis I came up with a few observations

"Street" had some occasional misspellings - "STEET", "SREET" et cetera - likewise, so did "Highway" - I will take the variants that I find, and plug them in as well.

.replace(" AVENUE"," AVE") .replace(" BUILDING"," BLDG").replace(" BYPASS"," BYP") .replace(" BOULEVARD"," BLVD") .replace(" CIRCLE"," CIR") .replace(" COURT"," CT") .replace(" COUNTY"," CTY") .replace(" CTH"," CTY HWY") .replace(" DRIVE"," DR").replace(" DIRVE"," DR") .replace(" EXTENSION"," EXT") .replace(" HIGHWAY"," HWY").replace(" HGWY"," HWY").replace(" HIGWAY"," HWY").replace(" HYW"," HWY") .replace(" INDUSTRIAL"," INDL") .replace(" INTERSECTION"," INT").replace(" INTEX"," INT").replace(" INTX"," INT") .replace(" JUNCTION"," JCT") .replace(" LANE"," LN") .replace(" MM"," MI").replace(" MILE"," MI").replace(" MILES"," MI") .replace(" ROAD"," RD").replace(" ROADS"," RD") .replace(" ROUTE"," RT").replace(" RTE"," RT") .replace(" STREET"," ST").replace(" SREET"," ST").replace(" STEET"," ST") .replace(" USHWY"," US HWY")

Also, to handle a few other things in a consistent ways:

.replace("&"," & ").replace("-"," - ").replace(" "," ")

Putting it all together:

trim(toUppercase(value)).replace('"','').replace(".","").replace("#","").replace(",","").replace("(","").replace(")","").replace("/"," ").replace("EAST ","E ").replace("WEST ","W ").replace("NORTH ","N ").replace("SOUTH ","S ").replace(" SOU "," S ").replace(" SO "," S ").replace(" FIRST "," 1ST ").replace(" SECOND "," 2ND ").replace(" THIRD "," 3RD ").replace(" FOURTH "," 4TH ").replace(" FIFTH "," 5TH ").replace(" SIXTH "," 6TH ").replace(" SEVENTH "," 7TH ").replace(" EIGHTH "," 8TH ").replace(" NINTH "," 9TH ").replace(" AVENUE"," AVE").replace(" BUILDING"," BLDG").replace(" BYPASS"," BYP").replace(" BOULEVARD"," BLVD").replace(" CIRCLE"," CIR").replace(" COURT"," CT").replace(" COUNTY"," CTY").replace(" CTH"," CTY HWY").replace(" DRIVE"," DR").replace(" DIRVE"," DR").replace(" EXTENSION"," EXT").replace(" HIGHWAY"," HWY").replace(" HGWY"," HWY").replace(" HIGWAY"," HWY").replace(" HYW"," HWY").replace(" INDUSTRIAL"," INDL").replace(" INTERSECTION"," INT").replace(" INTEX"," INT").replace(" INTX"," INT").replace(" JUNCTION"," JCT").replace(" LANE"," LN").replace(" MM"," MI").replace(" MILE"," MI").replace(" MILES"," MI").replace(" ROAD"," RD").replace(" ROADS"," RD").replace(" ROUTE"," RT").replace(" RTE"," RT").replace(" STREET"," ST").replace(" SREET"," ST").replace(" STEET"," ST").replace(" TRAIL"," TRL").replace(" USHWY"," US HWY").replace("&"," & ").replace("-"," - ").replace(" "," ")

Next, you can compare data between columns:

if(cells["A"].value == cells["B"].value, "similar", "different")

This just saved me from manually looking through thousands of records

As a future thing, I may try and look at more sophisticated methods for comparing, for example fuzzy matches using metaphones or other approaches - involving parsing and detecting specific address components and actually standardizing, but that's a bit beyond my scope and timeframe for getting things done on my current project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clean_address.md

clean_address.md

Cleaning Addresses in OpenRefine

Files

clean_address.md

Latest commit

History

clean_address.md

File metadata and controls

Cleaning Addresses in OpenRefine