Junjay's Blog

Thoughts about professional things that interest me, mostly software and data analysis-related

Best Python Data Cleaning Libraries For People Data

September 1, 2015 By Junjay Tan

In the past year I’ve had to do various projects that needed data cleanup of typical contact info fields like emails and phone numbers. You’d think people would have solved these problems by now, and they have (kind of). If you use Python to do data cleaning, here are some libraries you will find useful:

Validating Emails

Josh Tauberer, a well-known developer in the open government data movement, has developed an email validation library.

see: https://github.com/JoshData/python-email-validator
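As a rough illustration of how this might fit into a cleaning script, here is a minimal sketch. The helper name clean_email is my own invention, and the code assumes version 2.0 or later of the library, where the result object exposes a `normalized` attribute. It only checks syntax, since `check_deliverability=False` turns off the DNS lookup.

```python
# Requires the third-party package: pip install email-validator
try:
    from email_validator import validate_email, EmailNotValidError
except ImportError:  # lets the file load even where the package is missing
    validate_email = None


def clean_email(raw):
    """Hypothetical helper: return a normalized address, or None if the syntax is invalid."""
    if validate_email is None:
        raise RuntimeError("email-validator is not installed")
    try:
        # check_deliverability=False skips the DNS lookup, so this works offline.
        result = validate_email(raw.strip(), check_deliverability=False)
    except EmailNotValidError:
        return None
    # Assumption: version 2.0+, where the cleaned address lives in `.normalized`.
    return result.normalized


if validate_email is not None:
    for raw in ["jane.doe@example.org", "not-an-email"]:
        print(repr(raw), "->", clean_email(raw))
```

Returning None for bad rows (instead of raising) makes it easy to apply the helper over a whole column and filter out the failures afterwards.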

Validating Phone Numbers

Google has a great Java-based library for parsing, formatting, storing, and validating international phone numbers. David Drysdale has ported it to Python!

see: https://pypi.python.org/pypi/phonenumbers
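Here is a small sketch of how the library could be used to standardize a phone column. The helper clean_phone and its default region of "US" are my own assumptions for illustration. The calls it relies on (`parse`, `is_valid_number`, `format_number`) are part of the library's public API.

```python
# Requires the third-party package: pip install phonenumbers
try:
    import phonenumbers
except ImportError:  # lets the file load even where the package is missing
    phonenumbers = None


def clean_phone(raw, default_region="US"):
    """Hypothetical helper: return the number in E.164 form, or None.

    default_region is only used when the input has no leading "+" country code.
    """
    if phonenumbers is None:
        raise RuntimeError("phonenumbers is not installed")
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None  # not recognizable as a phone number at all
    if not phonenumbers.is_valid_number(parsed):
        return None  # parses, but is not a valid number for its region
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)


if phonenumbers is not None:
    for raw, region in [("020 8366 1177", "GB"), ("not a phone", "US")]:
        print(repr(raw), "->", clean_phone(raw, region))
```

Storing everything in one canonical format such as E.164 also makes it much easier to spot duplicate contacts later.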

Parsing Unstructured Addresses into Components

The Atlanta Journal Constitution, in collaboration with DataMade, a “civic technology company,” has released a Python library for parsing unstructured USA address strings into address components. This library only splits addresses into components; it does not check that a given address actually exists. But from deep firsthand experience having to use unstructured addresses, this alone is a powerful feature!

see: https://github.com/datamade/usaddress
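To give a feel for what the parsed output looks like, here is a minimal sketch built on the library's `tag` function. The wrapper parse_address is my own hypothetical helper, and the sample address is made up. The parser is a statistical model, so the labels it assigns to a messy string are a best guess and worth spot-checking.

```python
# Requires the third-party package: pip install usaddress
try:
    import usaddress
except ImportError:  # lets the file load even where the package is missing
    usaddress = None


def parse_address(raw):
    """Hypothetical helper: return (components, address_type), or None if ambiguous."""
    if usaddress is None:
        raise RuntimeError("usaddress is not installed")
    try:
        # tag() merges consecutive tokens with the same label into one component.
        components, address_type = usaddress.tag(raw)
    except usaddress.RepeatedLabelError:
        # Raised when the same label shows up in two separate places,
        # which usually means the string is not a single clean address.
        return None
    return dict(components), address_type


if usaddress is not None:
    result = parse_address("123 Main St. Suite 100 Chicago, IL")
    if result is not None:
        components, address_type = result
        print(address_type)
        for label, value in components.items():
            print(f"  {label}: {value}")
```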

To check that an address actually exists, you will unfortunately most likely need to connect to some proprietary service like ArcGIS’s API, Informatica Address Doctor, or Google Maps Enterprise. Alternatively, you might be able to connect to USPS’s address validation APIs if you only have a small number of addresses (https://www.usps.com/business/web-tools-apis/address-information.htm).

Parsing Unstructured Names into Components

Instead of having to write your own name parser using regular expressions, you could use the nameparser library (early stage!).

See: https://pypi.python.org/pypi/nameparser
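For example, a thin wrapper around the library's `HumanName` class might look like the sketch below. The helper split_name and the dictionary keys it returns are my own choices for illustration, and the sample name is the kind of input you would see in a contact list.

```python
# Requires the third-party package: pip install nameparser
try:
    from nameparser import HumanName
except ImportError:  # lets the file load even where the package is missing
    HumanName = None


def split_name(raw):
    """Hypothetical helper: split a free-text name into its parts."""
    if HumanName is None:
        raise RuntimeError("nameparser is not installed")
    name = HumanName(raw)
    # Parts that are not present in the input come back as empty strings.
    return {
        "title": name.title,
        "first": name.first,
        "middle": name.middle,
        "last": name.last,
        "suffix": name.suffix,
    }


if HumanName is not None:
    for label, value in split_name("Dr. Juan Q. Xavier de la Vega III").items():
        print(f"{label}: {value!r}")
```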

What other data fields do you often find yourself having to validate or clean? And what tools have you found useful? Please let me know in the comments section!

Copyright © 2019 · Junjay Tan · Powered by WordPress