Clustering scientific documents with topic modeling

C-K Yau, A Porter, N Newman, Arho Suominen (Corresponding Author)

Research output: Contribution to journal › Article › Scientific › peer-review

65 Citations (Scopus)

Abstract

Topic modeling is a type of statistical model for discovering the latent "topics" that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contains academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities.
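
The evaluation idea in the abstract (checking whether papers from the same field land in the same cluster) can be illustrated with a small sketch. The abstract does not say which measure the authors used, so cluster purity is an assumption chosen here for illustration. The function name `cluster_purity` and the toy cluster assignments and field labels below are hypothetical, not data from the paper.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, field_labels):
    """Return the share of documents that belong to the majority field of their cluster.

    cluster_ids:  one cluster assignment per document (e.g. the dominant LDA topic).
    field_labels: the known field of each document, in the same order.
    """
    if len(cluster_ids) != len(field_labels):
        raise ValueError("cluster_ids and field_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each field occurs inside each cluster.
    fields_by_cluster = defaultdict(Counter)
    for cluster, field in zip(cluster_ids, field_labels):
        fields_by_cluster[cluster][field] += 1

    # Sum the size of the majority field in every cluster.
    majority_total = sum(max(counts.values()) for counts in fields_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy example: eight papers, three clusters, three fields.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
fields = ["physics", "physics", "biology", "biology", "biology",
          "economics", "economics", "physics"]
purity = cluster_purity(clusters, fields)
print(f"purity = {purity:.2f}")  # prints "purity = 0.75"
```

A purity of 1.0 would mean every cluster holds papers from a single field. Purity rises as the number of clusters grows, so it is most informative when the number of clusters is close to the number of fields.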
Original language: English
Pages (from-to): 767-786
Journal: Scientometrics
Volume: 100
Issue number: 3
DOIs: 10.1007/s11192-014-1321-8
Publication status: Published - 2014
MoE publication type: A1 Journal article-refereed

Fingerprint

Learning systems
text analysis
learning
Statistical Models

Keywords

  • Topic modeling
  • text analysis
  • latent Dirichlet allocation

Cite this

Yau, C-K ; Porter, A ; Newman, N ; Suominen, Arho. / Clustering scientific documents with topic modeling. In: Scientometrics. 2014 ; Vol. 100, No. 3. pp. 767-786.
@article{1ea5651f249f4aef966499238412fd13,
title = "Clustering scientific documents with topic modeling",
abstract = "Topic modeling is a type of statistical model for discovering the latent {"}topics{"} that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contain academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities",
keywords = "Topic modeling, text analysis, latent dirichlet allocation",
author = "C-K Yau and A Porter and N Newman and Arho Suominen",
note = "Project code: CEK",
year = "2014",
doi = "10.1007/s11192-014-1321-8",
language = "English",
volume = "100",
pages = "767--786",
journal = "Scientometrics",
issn = "0138-9130",
publisher = "Springer",
number = "3",

}

TY - JOUR
T1 - Clustering scientific documents with topic modeling
AU - Yau, C-K
AU - Porter, A
AU - Newman, N
AU - Suominen, Arho
N1 - Project code: CEK
PY - 2014
Y1 - 2014
N2 - Topic modeling is a type of statistical model for discovering the latent "topics" that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contain academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities
AB - Topic modeling is a type of statistical model for discovering the latent "topics" that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contain academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities
KW - Topic modeling
KW - text analysis
KW - latent dirichlet allocation
U2 - 10.1007/s11192-014-1321-8
DO - 10.1007/s11192-014-1321-8
M3 - Article
VL - 100
SP - 767
EP - 786
JO - Scientometrics
JF - Scientometrics
SN - 0138-9130
IS - 3
ER -