TY - GEN
T1 - Improving parallelism of recursive stencil computations without sacrificing cache performance
AU - Tang, Yuan
AU - You, Ronghui
AU - Kan, Haibin
AU - Tithi, Jesmin Jahan
AU - Ganapathi, Pramod
AU - Chowdhury, Rezaul A.
PY - 2014/10/20
Y1 - 2014/10/20
N2 - The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, and thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.
AB - The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, and thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.
KW - Atomic operation
KW - Cacheoblivious wave front
KW - Cilk
KW - Multi-core
KW - Nested parallel computation
KW - Parallel cache-oblivious algorithm
KW - Stencil
UR - https://www.scopus.com/pages/publications/84961360270
U2 - 10.1145/2686745.2686752
DO - 10.1145/2686745.2686752
M3 - Conference contribution
AN - SCOPUS:84961360270
T3 - WOSC 2014 - Proceedings of the 2014 ACM SIGPLAN Workshop on Stencil Computations, Part of SPLASH 2014
SP - 1
EP - 7
BT - WOSC 2014 - Proceedings of the 2014 ACM SIGPLAN Workshop on Stencil Computations, Part of SPLASH 2014
PB - Association for Computing Machinery, Inc
T2 - 2014 ACM SIGPLAN 2nd Workshop on Stencil Computations, WOSC 2014
Y2 - 20 October 2014 through 20 October 2014
ER -