Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering

Arho Suominen, S. Carley, Hannes Toivanen, A. Porter

    Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Professional

    Abstract

    Methods that can be applied directly to text, such as topic modelling and string kernels, have shown promise as tools for text mining. Studies show that text-based clustering methods can differentiate between document groups with high accuracy. Recently, Yau et al. showed that topic modelling algorithms achieve excellent precision and recall on scientific documents, although the results depend on the method. In this study we extend the work of Yau et al. to kernel-based spectral clustering, implemented in the R statistical software, and use it to classify a set of selected scientific papers. The sample consists of documents on seven technologies, merged into a single corpus (N = 1254), and the algorithm was used to distinguish between the documents on the different technologies. The analysis was done at three levels of increasing pre-processing intensity. The algorithm was run on each of the three pre-processed corpora, and classification accuracy was calculated for each as precision, recall and F-score. The results show that kernel-based spectral clustering is able to classify documents with high accuracy: the highest average F-score across the seven technologies was 0.721. The variation between the F-scores of the individual technologies is, however, considerable, ranging from a high of 0.874 to a low of 0.217. The results also suggest that increasing pre-processing intensity lowers the algorithm's capability to distinguish between the technologies. The average F-score falls from 0.721 with minimal pre-processing to 0.606 as pre-processing is increased.
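The evaluation described in the abstract (precision, recall and F-score per technology, then an average over the technologies) can be illustrated with a minimal sketch. This is not the authors' R implementation. It assumes that each cluster has already been mapped to a technology label, that the reported average is an unweighted mean over technologies, and it uses two hypothetical technology names and toy labels purely for illustration.

```python
from collections import Counter


def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f_score)} for every true label."""
    pairs = Counter(zip(true_labels, predicted_labels))
    true_totals = Counter(true_labels)
    predicted_totals = Counter(predicted_labels)
    scores = {}
    for label in sorted(true_totals):
        true_positives = pairs[(label, label)]
        # Precision: share of documents assigned to this label that belong to it.
        precision = (
            true_positives / predicted_totals[label]
            if predicted_totals[label]
            else 0.0
        )
        # Recall: share of this label's documents that were assigned to it.
        recall = true_positives / true_totals[label]
        # F-score: harmonic mean of precision and recall.
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f_score)
    return scores


def average_f_score(scores):
    """Unweighted mean of the per-label F-scores (an assumed averaging rule)."""
    return sum(f for _, _, f in scores.values()) / len(scores)


# Toy data with hypothetical technology names, not taken from the study.
true_labels = ["solar"] * 4 + ["battery"] * 4
predicted_labels = [
    "solar", "solar", "solar", "battery",
    "battery", "battery", "solar", "solar",
]

scores = per_class_scores(true_labels, predicted_labels)
for label, (precision, recall, f_score) in scores.items():
    print(f"{label}: P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
print(f"average F-score: {average_f_score(scores):.3f}")
```

An unweighted mean treats every technology equally regardless of how many documents it contributes, which is why a single poorly separated technology can pull the average down noticeably.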
    Original language: English
    Title of host publication: Global TechMining Conference 2013
    Publication status: Published - 2013
    MoE publication type: Not Eligible
    Event: 3rd Global TechMining Conference - Atlanta, GA, United States
    Duration: 25 Sep 2013 → …

    Conference

    Conference: 3rd Global TechMining Conference
    Country: United States
    City: Atlanta, GA
    Period: 25/09/13 → …

    Keywords

    • topic modeling
    • Kernel-based spectral clustering
    • text mining

    Cite this

    Suominen, A., Carley, S., Toivanen, H., & Porter, A. (2013). Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering. In Global TechMining Conference 2013
    @inproceedings{d23a8e71321842c7aea9d8a8a5b89228,
    title = "Precision and recall in classifying scientific literature: Comparing topic modelling to Kernel-based spectral clustering",
    keywords = "topic modeling, Kernel-based spectral clustering, text mining",
    author = "Arho Suominen and S. Carley and Hannes Toivanen and A. Porter",
    note = "Project code: 75913",
    year = "2013",
    language = "English",
    booktitle = "Global TechMining Conference 2013",

    }
