Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering

Arho Suominen, S. Carley, Hannes Toivanen, A. Porter

    Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Professional

    Abstract

    Methods that can be applied directly to text, such as topic modelling and string kernels, have shown promise as tools for text mining. Studies show that text-based clustering methods can differentiate between document groups with high accuracy. Recently, Yau et al. showed that topic modelling algorithms, although the results depend on the method, achieved excellent precision and recall values for scientific documents. In this study we extend the work of Yau et al. to consider kernel-based spectral clustering, implemented in the R statistical software, in classifying a set of selected scientific papers. The sample consists of documents on seven technologies that were merged into a single corpus (N = 1254), in which the algorithm was used to distinguish between the documents on different technologies. The analysis was done at three levels of pre-processing of increasing intensity. The algorithms were run on each of the three pre-processed corpora, and classification accuracy was calculated for each as precision, recall and F-score. The results show that kernel-based spectral clustering is able to classify documents with high accuracy, the highest average F-score for the seven technologies being 0.721. The variation between the F-scores of the technologies is, however, considerable, from a high of 0.874 to a low of 0.217. The results also suggest that increasing pre-processing intensity lowers the algorithm's capability to distinguish between the technologies: the average F-score diminishes from 0.721 with minimal pre-processing to 0.606 as pre-processing is increased.
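
    The evaluation described in the abstract (precision, recall and F-score per technology, then an average over technologies) can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract says was done in R. Two points are assumptions made only for this example: each cluster is mapped to the technology that is most frequent among its members, and the "F-score average" is taken as the unweighted mean over technologies. The function names and the toy labels ("solar", "led") are hypothetical.

```python
from collections import Counter, defaultdict


def per_class_scores(true_labels, cluster_ids):
    """Return {label: (precision, recall, f_score)} for a clustering.

    Assumption: each cluster is assigned to the label that occurs most
    often among its members (majority vote). The abstract does not say
    how clusters were matched to technologies.
    """
    # Group the true labels of the documents by cluster.
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)

    # Predicted label of a document = majority label of its cluster.
    cluster_to_label = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }
    predicted = [cluster_to_label[c] for c in cluster_ids]

    scores = {}
    pairs = list(zip(true_labels, predicted))
    for label in sorted(set(true_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[label] = (precision, recall, f_score)
    return scores


def macro_f_score(scores):
    """Unweighted mean of the per-class F-scores (an assumed averaging)."""
    return sum(f for _, _, f in scores.values()) / len(scores)


# Toy data: six documents, two hypothetical technologies, two clusters.
truth = ["solar", "solar", "solar", "led", "led", "led"]
clusters = [0, 0, 1, 1, 1, 1]
toy_scores = per_class_scores(truth, clusters)
toy_average = macro_f_score(toy_scores)
```

    In the toy data one "solar" document falls into the cluster dominated by "led", so "solar" keeps perfect precision but loses recall, while "led" keeps perfect recall but loses precision. A wide spread between per-technology F-scores, as reported in the abstract, arises in the same way when some technologies are split or absorbed by other clusters.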
    Original language: English
    Title of host publication: Global TechMining Conference 2013
    Publication status: Published - 2013
    MoE publication type: Not Eligible
    Event: 3rd Global TechMining Conference - Atlanta, GA, United States
    Duration: 25 Sep 2013 → …

    Conference

    Conference: 3rd Global TechMining Conference
    Country: United States
    City: Atlanta, GA
    Period: 25/09/13 → …

    Keywords

    • topic modeling
    • Kernel-based spectral clustering
    • text mining
