Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering

Arho Suominen, S. Carley, Hannes Toivanen, A. Porter

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Professional

Abstract

Methods that can be applied directly to text, such as topic modelling and string kernels, have shown promise as tools for text mining. Studies show that text-based clustering methods can differentiate between document groups with high accuracy. Recently, Yau et al. showed that topic modelling algorithms achieve excellent precision and recall for scientific documents, although the results depend on the method used. In this study we extend the work of Yau et al. to kernel-based spectral clustering, implemented in the R statistical software, and use it to classify a selected set of scientific papers. The sample covers seven technologies merged into a single corpus (N = 1254), and the algorithm was used to distinguish between the documents on the different technologies. The analysis was run at three levels of pre-processing of increasing intensity. The algorithms were run on each of the three pre-processed corpora, and classification accuracy was calculated as precision, recall and F-score. The results show that kernel-based spectral clustering can classify documents with high accuracy: the highest average F-score across the seven technologies was 0.721. The F-scores vary considerably between technologies, however, from a high of 0.874 to a low of 0.217. The results also suggest that increasing pre-processing intensity lowers the algorithm's ability to distinguish between the technologies: the average F-score falls from 0.721 with minimal pre-processing to 0.606 as pre-processing is increased.
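As an illustration of the evaluation described in the abstract, the following minimal Python sketch computes per-class precision, recall and F-score from cluster assignments and averages the F-scores across classes. It is not the authors' implementation (the study used R). The toy labels, the function names, and the rule that maps each cluster to its most common true label are all assumptions made for this example, since the abstract does not state how clusters were matched to technologies.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(true_labels, cluster_ids):
    """Map each cluster to its most common true label.

    Assumption for this sketch only: the abstract does not say how
    clusters were matched to technologies.
    """
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)
    return {c: Counter(labels).most_common(1)[0][0]
            for c, labels in members.items()}


def per_class_scores(true_labels, predicted_labels):
    """Return {class: (precision, recall, f_score)} for every true class."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cls in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F-score here is the harmonic mean of precision and recall (F1).
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[cls] = (precision, recall, f_score)
    return scores


def average_f_score(scores):
    """Unweighted mean of the per-class F-scores."""
    return sum(f for _, _, f in scores.values()) / len(scores)


# Hypothetical toy data: six documents from three technologies (A, B, C)
# and the cluster each document was assigned to.
true_labels = ["A", "A", "A", "B", "B", "C"]
cluster_ids = [0, 0, 1, 1, 1, 2]

cluster_to_label = map_clusters_to_labels(true_labels, cluster_ids)
predicted = [cluster_to_label[c] for c in cluster_ids]
scores = per_class_scores(true_labels, predicted)

for cls, (p, r, f) in scores.items():
    print(f"{cls}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
print(f"average F-score: {average_f_score(scores):.3f}")
```

An unweighted average treats every technology equally regardless of how many documents it contributes, so one poorly separated class can pull the average down noticeably even when the others score well.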
Original language: English
Title of host publication: Global TechMining Conference 2013
Publication status: Published - 2013
MoE publication type: Not Eligible
Event: 3rd Global TechMining Conference - Atlanta, GA, United States
Duration: 25 Sep 2013 → …

Conference

Conference: 3rd Global TechMining Conference
Country: United States
City: Atlanta, GA
Period: 25/09/13 → …

Keywords

  • topic modeling
  • Kernel-based spectral clustering
  • text mining

Cite this

Suominen, A., Carley, S., Toivanen, H., & Porter, A. (2013). Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering. In Global TechMining Conference 2013
@inproceedings{d23a8e71321842c7aea9d8a8a5b89228,
title = "Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering",
keywords = "topic modeling, Kernel-based spectral clustering, text mining",
author = "Arho Suominen and S. Carley and Hannes Toivanen and A. Porter",
note = "Project code: 75913",
year = "2013",
language = "English",
booktitle = "Global TechMining Conference 2013",

}
