Evaluation of Sequential Data Quality Using Spark

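Authors: HAN Chao; DUAN Lei; DENG Song; WANG Huifeng; TANG Changjie. Journal: Journal of Frontiers of Computer Science and Technology (计算机科学与探索). Uploaded by: Ren Xiaojing (任晓静)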

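[Abstract] As sequential data become widely used in practice, evaluating their quality has become an active research problem in academia, industry, and many other fields. The current mainstream approach evaluates data quality with a probabilistic suffix tree model, but this approach struggles to process large-scale data. To address this problem, the paper proposes STALK (sequential data quality evaluation with Spark), a Spark-based algorithm for sequential data quality evaluation, and adopts improved pruning strategies to raise its efficiency. Specifically, on the Spark platform, the algorithm efficiently builds a generative model from large-scale sequential data and then uses that model to quickly evaluate the data quality of a query sequence. Finally, experiments on real sequential data sets verify the effectiveness, execution efficiency, and scalability of STALK.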
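The two-stage idea in the abstract (build a generative model from training sequences, then score a query sequence against it) can be illustrated with a small sketch. This is not STALK and does not use Spark: it is a simplified variable-order Markov model standing in for the probabilistic suffix tree, written in plain Python. The function names, the `max_depth` and `min_count` parameters, the count-threshold pruning rule, and the `floor` probability are all assumptions made for illustration. The per-sequence counting followed by a merge by key is the part that would plausibly map onto a distributed map-and-reduce step.

```python
from collections import Counter
from math import log


def count_contexts(sequence, max_depth):
    """Count (context, next_symbol) pairs for contexts of length 0..max_depth."""
    counts = Counter()
    for i, symbol in enumerate(sequence):
        for k in range(max_depth + 1):
            if i - k < 0:
                break  # not enough history for a context of length k
            counts[(sequence[i - k:i], symbol)] += 1
    return counts


def build_model(sequences, max_depth=2, min_count=2):
    """Build next-symbol distributions per context (hypothetical helper)."""
    # "Map" step: count per sequence. "Reduce" step: merge counts by key.
    total = Counter()
    for seq in sequences:
        total.update(count_contexts(seq, max_depth))

    context_totals = Counter()
    for (ctx, _sym), c in total.items():
        context_totals[ctx] += c

    # Assumed pruning rule: drop rarely seen contexts, always keep the empty one.
    kept = {ctx for ctx, c in context_totals.items() if c >= min_count or ctx == ""}

    model = {}
    for (ctx, sym), c in total.items():
        if ctx in kept:
            model.setdefault(ctx, {})[sym] = c / context_totals[ctx]
    return model


def score(model, query, max_depth=2, floor=1e-6):
    """Average log-probability per symbol; higher suggests better fit to the model."""
    if not query:
        raise ValueError("query sequence must be non-empty")
    total = 0.0
    for i, symbol in enumerate(query):
        p = floor  # used when no context matches or the symbol is unseen
        # Use the longest suffix of the history that survived pruning.
        for k in range(min(i, max_depth), -1, -1):
            ctx = query[i - k:i]
            if ctx in model:
                p = model[ctx].get(symbol, floor)
                break
        total += log(p)
    return total / len(query)
```

For example, after `model = build_model(["abababab", "abababab", "babababa"])`, the call `score(model, "ababab")` returns a higher value than `score(model, "aabbba")`, since the second query contains transitions that never occur in the training data. Sequences are plain strings here for brevity; a real probabilistic suffix tree would also back off to shorter contexts when a symbol is unseen, rather than falling straight to a fixed floor.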

Full text

Evaluation of Sequential Data Quality Using Spark*

HAN Chao1, DUAN Lei1,2+, DENG Song3, WANG Huifeng1, TANG Changjie1

1. School of Computer Science, Sichuan University, Chengdu 610065, China
2. West China School of Public Health, Sichuan University, Chengdu 610041, China
3. Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

+ Corresponding author: E-mail: leiduan@scu.edu.cn

HAN Chao, DUAN Lei, DENG Song, et al. Evaluation of sequential data quality using Spark. Journal of Frontiers of Computer Science and Technology, 2017, 11(6): 897-907.

Abstract: Sequential data are prevalent in many real-world applications. Quality evaluation of sequential data, which attracts attention from both academia and industry, is an important prerequisite for extracting knowledge from such data. Recently, a method using the probabilistic suffix tree has been proposed for evaluating sequential data quality. However, this method cannot handle large-scale data sets. To overcome this limitation, this paper proposes a Spark-based algorithm, called STALK (sequential data quality evaluation with Spark), for evaluating the quality of large-scale sequential data. Moreover, this paper uses novel pruning strategies to improve the efficiency of STALK. Specifically, on the Spark platf
