论文标题
Pylspack:用于草图,列子集选择,回归和利用分数的并行算法和数据结构
pylspack: Parallel algorithms and data structures for sketching, column subset selection, regression and leverage scores
论文作者
论文摘要
我们介绍了数值线性代数中三个基本操作的平行算法和数据结构:(i)高斯和近计数随机投影及其组合,(ii)革兰氏矩阵的计算以及(iii)计算两个矩阵产品的平方行规范的计算,并专注于“ tall-spinesny”,“ tall and tall inse ny Matrices ny”我们提供了无处不在的Countsketch变换的详细分析及其与高斯随机预测的结合,计算内存需求,计算复杂性和工作负载平衡。我们还展示了这些结果如何应用于列子集选择,最小二乘回归并利用分数计算。这些工具已在PYLSPACK中实现,PYLSPACK是一种公开可用的Python软件包(https://github.com/ibm/pylspack),其核心用C ++编写并与OpenMP并行,并且与标准矩阵数据结构兼容Scipy和Numpy的标准矩阵数据结构。广泛的数值实验表明,所提出的算法尺寸很好,并且在高质矩阵的现有库中明显优于现有的库。
We present parallel algorithms and data structures for three fundamental operations in Numerical Linear Algebra: (i) Gaussian and CountSketch random projections and their combination, (ii) computation of the Gram matrix and (iii) computation of the squared row norms of the product of two matrices, with a special focus on "tall-and-skinny" matrices, which arise in many applications. We provide a detailed analysis of the ubiquitous CountSketch transform and its combination with Gaussian random projections, accounting for memory requirements, computational complexity and workload balancing. We also demonstrate how these results can be applied to column subset selection, least squares regression and leverage scores computation. These tools have been implemented in pylspack, a publicly available Python package (https://github.com/IBM/pylspack) whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. Extensive numerical experiments indicate that the proposed algorithms scale well and significantly outperform existing libraries for tall-and-skinny matrices.
