TY - GEN
T1 - Toward terabyte pattern mining
T2 - 2007 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'07
AU - Buehrer, Gregory
AU - Parthasarathy, Srinivasan
AU - Tatikonda, Shirish
AU - Kurc, Tahsin
AU - Saltz, Joel
PY - 2007
Y1 - 2007
N2 - We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the processor, the memory hierarchy and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory and I/O utilization using pruning and tiling techniques, and smart data placement strategies are also employed. We leverage the extended memory space and computational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its metastructures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy results in near-linear scaleup on up to 48 nodes.
AB - We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the processor, the memory hierarchy and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory and I/O utilization using pruning and tiling techniques, and smart data placement strategies are also employed. We leverage the extended memory space and computational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its metastructures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy results in near-linear scaleup on up to 48 nodes.
KW - Data mining
KW - Itemset mining
KW - Out of core
KW - Parallel
UR - https://www.scopus.com/pages/publications/34748916970
U2 - 10.1145/1229428.1229432
DO - 10.1145/1229428.1229432
M3 - Conference contribution
AN - SCOPUS:34748916970
SN - 1595936025
SN - 9781595936028
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 2
EP - 12
BT - Proceedings of the 2007 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'07
Y2 - 14 March 2007 through 17 March 2007
ER -