TY - GEN
T1 - Using space and attribute partitioned partial replicas for data subsetting and aggregation queries
AU - Weng, Li
AU - Catalyurek, Umit
AU - Kurc, Tahsin
AU - Agrawal, Gagan
AU - Saltz, Joel
PY - 2006
Y1 - 2006
N2 - Partial replication is one type of optimization to speed up execution of queries submitted to large datasets. In partial replication, a portion of the dataset is extracted, re-organized, and re-distributed across the storage system. In this paper we investigate methods for efficient execution of queries when replicas of a dataset exist; we assume the replicas have already been created and do not target the replica creation problem. We propose a cost model and algorithm for combined use of space partitioned and attribute partitioned replicas for executing data subsetting range queries. We extend the cost model and propose a greedy algorithm to address range queries with aggregation operations. The extended replica selection algorithm allows uneven partitioning of replicas across storage nodes. Different replicas can be partitioned across different subsets of storage nodes. We have implemented these techniques as part of an automatic data virtualization system and have evaluated the benefits of our techniques using this system. We demonstrate the efficacy of the algorithms on parallel machines using queries on datasets from oil reservoir simulation studies and satellite data processing applications.
UR - https://www.scopus.com/pages/publications/34547419572
U2 - 10.1109/ICPP.2006.73
DO - 10.1109/ICPP.2006.73
M3 - Conference contribution
AN - SCOPUS:34547419572
SN - 0769526365
SN - 9780769526362
T3 - Proceedings of the International Conference on Parallel Processing
SP - 271
EP - 278
BT - ICPP 2006
T2 - ICPP 2006: 2006 International Conference on Parallel Processing
Y2 - 14 August 2006 through 18 August 2006
ER -