Paper title
Efficient Estimation for Generalized Linear Models on a Distributed System with Nonrandomly Distributed Data
Paper authors
Abstract
Distributed systems are widely used in practice to carry out data analysis tasks at very large scale. In this work, we address the estimation of generalized linear models on a distributed system with nonrandomly distributed data. We develop a pseudo-Newton-Raphson algorithm for efficient estimation. In this algorithm, we first obtain a pilot estimator from a small random sample collected from the different Workers. We then conduct a one-step update based on the derivatives of the log-likelihood function computed on each Worker at the pilot estimator. The final one-step estimator is proved to be as statistically efficient as the global estimator, even with nonrandomly distributed data. In addition, it is computationally efficient in terms of both communication cost and storage usage. Based on the one-step estimator, we also develop a likelihood ratio test for hypothesis testing. The theoretical properties of the one-step estimator and the corresponding likelihood ratio test are investigated. Finite-sample performance is assessed through simulations. Finally, an American airline dataset is analyzed on a Spark cluster for illustration.
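The two-stage procedure described in the abstract (a pilot estimate from a small random sample, followed by a single update using derivatives computed on each Worker) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the model is taken to be logistic regression, the update is the textbook Newton step using gradients and Hessians summed across Workers, the Workers are simulated in memory with NumPy instead of running on Spark, and the function names (`local_derivatives`, `one_step_estimator`, `simulate_workers`) are hypothetical. The paper's actual pseudo-Newton-Raphson update may differ in its details.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_derivatives(X, y, theta):
    """Gradient and Hessian of the logistic log-likelihood on one worker's data."""
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p)
    hess = -(X * (p * (1.0 - p))[:, None]).T @ X
    return grad, hess


def newton_mle(X, y, n_iter=25):
    """Full Newton-Raphson MLE on a single in-memory dataset."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad, hess = local_derivatives(X, y, theta)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def one_step_estimator(workers, pilot_size_per_worker, rng):
    """Return (pilot, one_step) estimates from a list of (X, y) worker shards."""
    # Stage 1: draw a small random sample from every worker and fit the pilot.
    pilot_X, pilot_y = [], []
    for X, y in workers:
        size = min(pilot_size_per_worker, len(y))
        idx = rng.choice(len(y), size=size, replace=False)
        pilot_X.append(X[idx])
        pilot_y.append(y[idx])
    theta_pilot = newton_mle(np.vstack(pilot_X), np.concatenate(pilot_y))

    # Stage 2: each worker reports its derivatives at the pilot estimate;
    # the master sums them and takes a single Newton step.
    dim = len(theta_pilot)
    grad, hess = np.zeros(dim), np.zeros((dim, dim))
    for X, y in workers:
        g, h = local_derivatives(X, y, theta_pilot)
        grad += g
        hess += h
    theta_one_step = theta_pilot - np.linalg.solve(hess, grad)
    return theta_pilot, theta_one_step


def simulate_workers(theta_true, n_workers=4, n_per_worker=5000, seed=0):
    """Simulate logistic data and split it NONRANDOMLY (sorted by a covariate)."""
    rng = np.random.default_rng(seed)
    n = n_workers * n_per_worker
    X = np.column_stack([np.ones(n), rng.normal(size=(n, len(theta_true) - 1))])
    y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)
    order = np.argsort(X[:, 1])  # each worker sees a different covariate range
    return [(X[i], y[i]) for i in np.array_split(order, n_workers)]


if __name__ == "__main__":
    theta_true = np.array([0.5, -1.0, 1.0])
    workers = simulate_workers(theta_true)
    pilot, one_step = one_step_estimator(workers, 200, np.random.default_rng(1))
    X_all = np.vstack([X for X, _ in workers])
    y_all = np.concatenate([y for _, y in workers])
    print("pilot   :", pilot)
    print("one-step:", one_step)
    print("global  :", newton_mle(X_all, y_all))
```

In this sketch each Worker sends only a gradient vector and a Hessian matrix to the master, so the communication cost depends on the number of parameters rather than on the sample size. The pilot sample is drawn from every Worker because a single shard of a nonrandom partition is generally not representative of the full data.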
