Communication-Efficient Pilot Estimation for Non-Randomly Distributed Data in Diverging Dimensions

Published: 2024-10-18 10:11

Distributed learning has become an indispensable tool for handling massive or distributed datasets. As an important and popular distributed learning method, the communication-efficient surrogate likelihood (CSL; Jordan et al., 2019, JASA) framework has received much attention from the distributed machine learning community. Most works based on the CSL framework share two common treatments: (i) choosing the first machine as the central machine and solving an optimization problem using only the data stored there; and (ii) assuming that the dimension is fixed when deriving statistical properties. However, treatment (i) may not be appropriate when the data are stored in a non-random manner or are heterogeneously distributed across machines, which is common in practice; and treatment (ii) largely limits the applicability of CSL to diverging- or high-dimensional datasets, especially when the goal is to make inference on parameters of interest. To address the challenges posed by (i) and (ii), we develop a communication-efficient pilot (CEP) estimation strategy. Specifically, we first implement pilot sampling on each machine to obtain a pilot sample dataset, and then use a new pilot-sample-based surrogate loss function to approximate the global loss; its minimizer is called the CEP estimator. Second, we rigorously investigate the theoretical properties of the CEP estimator, including its convergence rate, which can attain the global rate sqrt(P_n/N), and its asymptotic normality when the dimension P_n diverges with the pilot sample size r and P_n < n. Furthermore, we extend the CEP method to the high-dimensional case, i.e., P_n > n, and propose a regularized version of CEP (CERP). We establish non-asymptotic error bounds for an l_1-regularized CERP estimator (CERP-Lasso) and provide the convergence rate and asymptotic normality of a weighted l_1-regularized CERP estimator (CERP-aLasso) under generalized linear models. Finally, extensive synthetic and real datasets are employed to illustrate the superiority of the proposed approaches.
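
The abstract does not spell out the surrogate loss, so the following is only a minimal sketch of the CEP idea under an assumed logistic-regression working model. The surrogate form below follows the standard CSL construction (Jordan et al., 2019) with the first-machine data replaced by a pooled pilot sample drawn from every machine; this reading, as well as the function names (cep_estimate, neg_loglik) and the choice of a BFGS solver, are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a CEP-style estimator for logistic regression.
# Assumption: surrogate loss = pilot loss - <pilot_grad - global_grad, theta>,
# in the spirit of CSL with the central-machine data replaced by pilot samples.
import numpy as np
from scipy.optimize import minimize


def neg_loglik(theta, X, y):
    """Average negative log-likelihood of a logistic regression model."""
    eta = X @ theta
    return np.mean(np.log1p(np.exp(eta)) - y * eta)


def grad_neg_loglik(theta, X, y):
    """Gradient of the average negative log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y)


def cep_estimate(machines, r, theta_bar, rng=np.random.default_rng(0)):
    """machines: list of (X_k, y_k) local datasets (possibly non-randomly split);
    r: pilot sample size per machine; theta_bar: an initial estimator."""
    # Step 1: pilot sampling on each machine, pooled at the central machine.
    pilot = []
    for X, y in machines:
        idx = rng.choice(len(y), size=min(r, len(y)), replace=False)
        pilot.append((X[idx], y[idx]))
    Xp = np.vstack([X for X, _ in pilot])
    yp = np.concatenate([y for _, y in pilot])

    # Step 2: one communication round -- each machine reports its local
    # gradient at theta_bar; the central machine averages them.
    global_grad = np.mean(
        [grad_neg_loglik(theta_bar, X, y) for X, y in machines], axis=0
    )
    pilot_grad = grad_neg_loglik(theta_bar, Xp, yp)

    # Step 3: minimize the pilot-sample surrogate loss on the central machine.
    def surrogate(theta):
        return neg_loglik(theta, Xp, yp) - (pilot_grad - global_grad) @ theta

    return minimize(surrogate, theta_bar, method="BFGS").x
```

Because only one gradient vector per machine is communicated, the cost per round stays at O(P_n) regardless of the local sample sizes, which is the communication-efficiency property the abstract refers to; for the regularized CERP variants one would add an l_1 (or weighted l_1) penalty to the surrogate objective above.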