TY - GEN
T1 - Effective Scalable and Integrative Geocoding for Massive Address Datasets
AU - Rashidian, Sina
AU - Dong, Xinyu
AU - Avadhani, Amogh
AU - Poddar, Prachi
AU - Wang, Fusheng
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/11/7
Y1 - 2017/11/7
N2 - With increased accessibility of large-scale open data, public health studies are able to take advantage of integrative spatial big data to increase the spatial resolution to community or neighborhood level. One critical piece of information for such studies is the large number of patient addresses, which are private and highly sensitive. Geocoding such massive private addresses poses major challenges for public health researchers. Many geocoders provide only Web APIs that require sending private addresses over the Internet, which is not feasible. Commercial geocoders require high licensing fees and often have limitations on daily usage, which become a major hurdle for researchers. Scalability is another major challenge for large-scale address datasets. In this paper, we present EaserGeocoder, a novel open-source geocoder for effectively geocoding massive address datasets. EaserGeocoder takes an integrative approach by using multiple references based on open address data sources contributed by governments or communities. It takes a machine learning approach to automatically find the best answer from candidates produced by multiple references. The system provides high scalability through parallel processing. Our comparative studies demonstrate that EaserGeocoder outperforms open-source geocoders and is comparable to commercial ones in terms of both accuracy and error. It provides a cost-effective and feasible solution for large-scale public health studies.
KW - Geocoding
KW - Geographic Information System
KW - Text Searching
UR - https://www.scopus.com/pages/publications/85040980426
U2 - 10.1145/3139958.3139986
DO - 10.1145/3139958.3139986
M3 - Conference contribution
AN - SCOPUS:85040980426
SN - 9781450354905
T3 - GIS: Proceedings of the ACM International Symposium on Advances in Geographic Information Systems
BT - GIS
A2 - Ravada, Siva
A2 - Hoel, Erik
A2 - Tamassia, Roberto
A2 - Newsam, Shawn
A2 - Trajcevski, Goce
PB - Association for Computing Machinery
T2 - 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL GIS 2017
Y2 - 7 November 2017 through 10 November 2017
ER -