Abstract
Methods that can be applied directly
to text, such as topic modelling and string kernels,
have shown promise as tools for text mining. Studies
show that text-based clustering methods can
differentiate between document groups with high accuracy.
Recently, Yau et al. showed that topic modelling
algorithms, although dependent on the method, achieved
excellent precision and recall values for scientific
documents. In this study we extend the work of Yau et
al. to consider kernel-based spectral clustering,
implemented in the R statistical software, in classifying
a set of selected scientific papers. The sample
consists of documents on seven technologies, merged into a
single corpus (N = 1254), in which the algorithm was used
to distinguish between the documents on different
technologies. The analysis was done with three levels of
pre-processing of increasing intensity.
The algorithm was run on each of the three
pre-processed corpora, and the classification
accuracy was calculated as precision, recall and F-score.
The results show that kernel-based spectral clustering is
able to classify documents with high accuracy, the highest
average F-score for the seven technologies being 0.721.
The F-scores vary considerably between technologies,
however, from a high of 0.874 to a low of
0.217. The results also suggest that increasing
pre-processing intensity lowers the algorithm's capability
to distinguish between the technologies. The average
F-score diminishes from 0.721 with minimal
pre-processing to 0.606 as pre-processing is increased.
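The evaluation described above (precision, recall and F-score per technology, then an average over technologies) can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation. The abstract does not say how clusters were matched to technologies, so the majority-label mapping used here is an assumption, and the function names `per_class_scores` and `macro_f_score` are hypothetical.

```python
from collections import Counter


def per_class_scores(true_labels, cluster_ids):
    """Return {label: (precision, recall, f_score)} for a clustering.

    Assumption (not stated in the abstract): each cluster is assigned the
    most common true label among its members before scoring.
    """
    # Group the true labels of the documents by the cluster they fell into.
    members = {}
    for label, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, []).append(label)

    # Map every cluster to its majority label.
    cluster_to_label = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }
    predicted = [cluster_to_label[c] for c in cluster_ids]

    scores = {}
    for label in sorted(set(true_labels)):
        pairs = list(zip(true_labels, predicted))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


def macro_f_score(scores):
    """Unweighted mean of the per-class F-scores."""
    return sum(f for _, _, f in scores.values()) / len(scores)
```

Averaging the per-class F-scores without weighting treats every technology equally regardless of how many documents it has, which is one plausible reading of "F-score average for the seven technologies"; a document-weighted average would be another.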
Original language | English |
---|---|
Title of host publication | Global TechMining Conference 2013 |
Number of pages | 2 |
Publication status | Published - 2013 |
MoE publication type | D3 Professional conference proceedings |
Event | 3rd Global TechMining Conference - Atlanta, GA, United States Duration: 25 Sept 2013 → … |
Conference
Conference | 3rd Global TechMining Conference |
---|---|
Country/Territory | United States |
City | Atlanta, GA |
Period | 25/09/13 → … |
Keywords
- topic modeling
- Kernel-based spectral clustering
- text mining