Abstract
With the rise of user-created content on the Internet,
the focus of text mining has shifted. Twitter messages
and product descriptions are examples of new corpora
available for text mining. Keyword extraction, user
modeling and text categorization all seek to
exploit this new data. However, as the
documents within these corpora are considerably shorter
than in the traditional cases, such as news articles,
there are also new challenges. In this paper, we focus on
keyword extraction from documents such as event and
product descriptions, and movie plot lines that often
hold 30 to 60 words. We propose a novel unsupervised
keyword extraction approach called Informativeness-based
Keyword Extraction (IKE) that uses clustering and
three levels of word evaluation to address the challenges
of short documents. We evaluate the performance
of our approach by using manually tagged test sets and
compare the results against other keyword extraction
methods, such as CollabRank, KeyGraph, Chi-squared,
and TF-IDF. We also evaluate the precision and
effectiveness of the extracted keywords for user modeling
and recommendation and report the results of all
approaches. In all of the experiments, IKE outperforms
the competition.
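
As a point of reference for the baselines named in the abstract, TF-IDF keyword scoring can be sketched as follows. This is a minimal illustration only, not the IKE method and not the experimental setup of the paper; the tokenizer, the function names `tokenize` and `tfidf_keywords`, and the three sample event descriptions are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical, deliberately simple tokenizer: lowercase, letters only.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {
            w: (tf[w] / len(tokens)) * math.log(n_docs / df[w])
            for w in tf
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results


# Made-up short descriptions, far below the 30 to 60 words mentioned above.
docs = [
    "A jazz concert in the park with a live jazz band",
    "A rock concert in the city arena",
    "A cooking class in the city for beginners",
]
keywords = tfidf_keywords(docs, top_k=3)
for doc, kws in zip(docs, keywords):
    print(kws, "<-", doc)
```

In this toy example most words occur only once per description, so the term-frequency factor separates few candidates and the ranking rests largely on document frequency.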
Original language | English |
---|---|
Title of host publication | Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval |
Subtitle of host publication | SSTM 2012 |
Publisher | SciTePress |
Pages | 411-421 |
ISBN (Print) | 978-989-8565-29-7 |
Publication status | Published - 2012 |
MoE publication type | A4 Article in a conference publication |
Event | 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012 - Barcelona, Spain. Duration: 4 Oct 2012 → 7 Oct 2012 |
Conference
Conference | 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012 |
---|---|
Country/Territory | Spain |
City | Barcelona |
Period | 4/10/12 → 7/10/12 |
Keywords
- Keyword extraction
- machine learning
- short documents
- term weighting
- text mining