TY - GEN
T1 - Understanding the Intrinsic Characteristics of Spatial Partitioning in Distributed Spatial Join
AU - Xu, Zhuohan
AU - Teng, Dejun
AU - Peng, Zhaohui
AU - Wang, Fusheng
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Spatial join has become a frequently used yet resource-intensive operation in geospatial applications, driven by the increasing volume and complexity of geospatial data. With Hadoop and Spark becoming the de facto standard platforms for distributed computing, scalable spatial data processing is primarily achieved by partitioning the input space to form parallel units on these platforms. Effective spatial data partitioning is critical for task parallelization and load balancing, but it faces significant challenges due to data skew and the geometric and topological complexity of spatial objects, particularly in supporting spatial joins. This paper examines the interplay among query performance, spatial data partitioning, query types, data, and system characteristics. We qualitatively and quantitatively analyze the features of representative partitioning algorithms that impact overall query performance. Along with these analyses, we propose a data sampling-based approach for selecting optimized partitioning strategies. Extensive experiments on large and complex datasets using MapReduce frameworks are conducted to validate the correctness of our analysis and the effectiveness of our optimization approach.
KW - distributed processing
KW - spatial partitioning
UR - https://www.scopus.com/pages/publications/85218005834
U2 - 10.1109/BigData62323.2024.10825846
DO - 10.1109/BigData62323.2024.10825846
M3 - Conference contribution
AN - SCOPUS:85218005834
T3 - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
SP - 403
EP - 412
BT - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
A2 - Ding, Wei
A2 - Lu, Chang-Tien
A2 - Wang, Fusheng
A2 - Di, Liping
A2 - Wu, Kesheng
A2 - Huan, Jun
A2 - Nambiar, Raghu
A2 - Li, Jundong
A2 - Ilievski, Filip
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Big Data, BigData 2024
Y2 - 15 December 2024 through 18 December 2024
ER -