hepaccelerate: Fast Analysis of Columnar Collider Data

Creators: Pata, J.; Spiropulu, M.

Abstract

At HEP experiments, processing terabytes of structured numerical event data to a few statistical summaries is a common task. This step involves selecting events and objects within the event, reconstructing high-level variables, evaluating multivariate classifiers with up to hundreds of variations and creating thousands of low-dimensional histograms. Currently, this is done using multi-step workflows and batch jobs. Based on the CMS search for H(μμ), we demonstrate that it is possible to carry out significant parts of a real collider analysis at a rate of up to a million events per second on a single multicore server with optional GPU acceleration. This is achieved by representing HEP event data as memory-mappable sparse arrays, and by expressing common analysis operations as kernels that can be parallelized across the data using multithreading. We find that only a small number of relatively simple kernels are needed to implement significant parts of this Higgs analysis. Therefore, analysis of real collider datasets of billions events could be done within minutes to a few hours using simple multithreaded codes, reducing the need for managing distributed workflows in the exploratory phase. This approach could speed up the cycle for delivering physics results at HEP experiments. We release the hepaccelerate prototype library as a demonstrator of such accelerated computational kernels. We look forward to discussion, further development and use of efficient and easy-to-use software for terabyte-scale high-level data analysis in the physical sciences.

Additional Information

We would like to thank Jim Pivarski and Lindsey Gray for helpful feedback at the start of this project. We are grateful to Nan Lu and Irene Dutta for providing a reference implementation of the H(μμ) analysis that could be adapted to vectorized code. We would like to thank Christina Reissel for being an independent early tester of these approaches and for helpful feedback on this report. The availability of the excellent Python libraries uproot, awkward, coffea, Numba, cupy and numpy was imperative for this project and we are grateful to the developers of those projects. Part of this work was conducted at "iBanks", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "iBanks".

Attached Files

Submitted - 1906.06242.pdf

Files

1906.06242.pdf

Files (287.0 kB)

Name	Size	Download all
1906.06242.pdf md5:cbf5b0f166c9f30aa28f2ebe403c6292	287.0 kB	Preview Download

Additional details

	All versions	This version
Views	0	0
Downloads	0	0
Data volume	0 Bytes	0 Bytes