Hierarchical clustering of text documents

Authors: L. S. Lomakina; V. B. Rodionov; A. S. Surkova. Journal: Automation and Remote Control


ISSN 0005-1179, Automation and Remote Control, 2014, Vol. 75, No. 7, pp. 1309–1315. © Pleiades Publishing, Ltd., 2014.
Original Russian Text © L.S. Lomakina, V.B. Rodionov, A.S. Surkova, 2012, published in Sistemy Upravleniya i Informatsionnye Tekhnologii, 2012, No. 3, pp. 39–44.

CONTROL SYSTEMS AND INFORMATION TECHNOLOGIES

Hierarchical Clustering of Text Documents

L. S. Lomakina, V. B. Rodionov, and A. S. Surkova
Alexeev Nizhni Novgorod State Technical University, Nizhni Novgorod, Russia
e-mail: llomakina@list.ru
Received April 18, 2012

Abstract: We consider the possibility of using compression algorithms to compute similarity distances in order to solve the clustering problem. We propose an actual hierarchical clustering machine that constructs a binary tree of object dependencies similar to a taxonomy.

DOI: 10.1134/S000511791407011X

1. INTRODUCTION

Methods for improving the efficiency of computer processing of large volumes of textual data, in fields such as electronic libraries, information retrieval, news categorization, etc., are currently attracting a lot of attention. In this work, we address this problem by using clustering to impose a corresponding structure on the text data. Since prior knowledge about the structure of a text is usually absent, the problem arises of specifying the criterion for classifying text data. Therefore, in this work we concentrate on the following objectives:
• automatic identification of clustering criteria;
• constructing a tr