Hashing-Based Distributed Clustering for Massive High-Dimensional Data

Yifeng Xiao, Jiang Xue, Deyu Meng

CoRR (2023)

Abstract
Clustering analysis is central to data mining. The scale of big data calls for more efficient and economical distributed clustering methods. Existing distributed clustering methods, however, focus mainly on the number of samples and overlook the problems caused by high data dimension. To address this gap, we propose a new distributed algorithm, Hashing-Based Distributed Clustering (HBDC). Motivated by the strong performance of hashing methods in nearest-neighbor search, HBDC applies the learning-to-hash technique to the clustering problem, which offers clear advantages for data storage, transmission, and computation. Following a global-sub-site paradigm, HBDC consists of two stages: distributed training of a hashing network across the sub-sites, and spectral clustering of the hash codes at the global site. Each sub-site uses the learned network as a hash function to convert its massive high-dimensional data into a small number of hash codes, then sends them to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hash network. We also analyze the transmission cost of HBDC and derive an upper bound on it. Experiments on synthetic and real datasets show that HBDC outperforms existing state-of-the-art algorithms.
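The global-sub-site flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learned hashing network is replaced here by a fixed random-projection sign hash, the spectral step is simplified to a two-cluster split on the Fiedler vector, and all names (`hash_at_subsite`, `hamming_affinity`, `spectral_bipartition`), the synthetic data, and the float32 storage assumption are hypothetical choices made for illustration only.

```python
import numpy as np

def hash_at_subsite(x, projection):
    """Stand-in hash function (assumption): sign of a random projection.
    HBDC uses a learned hashing network here instead."""
    return (x @ projection > 0).astype(np.uint8)

def hamming_affinity(codes):
    """Fraction of matching bits between every pair of binary codes."""
    c = codes.astype(np.int32)
    matches = c @ c.T + (1 - c) @ (1 - c).T
    return matches / c.shape[1]

def spectral_bipartition(affinity):
    """Two-cluster spectral split via the normalized graph Laplacian.
    A simplification: general spectral clustering uses k eigenvectors
    followed by k-means."""
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt     # second-smallest eigenvector
    return (fiedler > 0).astype(int)

rng = np.random.default_rng(0)
dim, n_bits, per_cluster = 64, 16, 50
projection = rng.standard_normal((dim, n_bits))  # shared by all sub-sites

def make_site():
    """Synthetic local data: two well-separated Gaussian clusters."""
    a = rng.standard_normal((per_cluster, dim)) + 5.0
    b = rng.standard_normal((per_cluster, dim)) - 5.0
    return np.vstack([a, b]), np.repeat([0, 1], per_cluster)

# Each sub-site hashes its own data; only the binary codes are transmitted.
sites = [make_site() for _ in range(2)]
codes = np.vstack([hash_at_subsite(x, projection) for x, _ in sites])
truth = np.concatenate([y for _, y in sites])

# Transmission cost in bits: raw float32 features versus hash codes.
raw_bits = sum(x.size for x, _ in sites) * 32
code_bits = codes.size

# The global site clusters the received codes.
labels = spectral_bipartition(hamming_affinity(codes))
accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"compression ratio: {raw_bits // code_bits}x, accuracy: {accuracy:.2f}")
```

In this toy setting each 64-dimensional float32 sample (2048 bits) is sent as a 16-bit code, so the payload shrinks by a factor of 128. The figure depends entirely on the assumed dimension and code length and says nothing about the transmission bound derived in the paper.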
Keywords
clustering, hashing-based, high-dimensional