Informativeness-based keyword extraction from short documents

Mika Timonen, Timo Toivanen, Y. Teng, C. Cheng, L. He

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

5 Citations (Scopus)

Abstract

With the rise of user-created content on the Internet, the focus of text mining has shifted. Twitter messages and product descriptions are examples of new corpora available for text mining. Keyword extraction, user modeling and text categorization are all areas that are focusing on utilizing this new data. However, as the documents within these corpora are considerably shorter than in the traditional cases, such as news articles, there are also new challenges. In this paper, we focus on keyword extraction from documents such as event and product descriptions, and movie plot lines that often hold 30 to 60 words. We propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and three levels of word evaluation to address the challenges of short documents. We evaluate the performance of our approach by using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments, IKE outperforms the competition.
Original language: English
Title of host publication: Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval
Subtitle of host publication: SSTM 2012
Pages: 411-421
DOIs: https://doi.org/10.5220/0004130704110421
Publication status: Published - 2012
MoE publication type: A4 Article in a conference publication
Event: 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012 - Barcelona, Spain
Duration: 4 Oct 2012 to 7 Oct 2012

Conference

Conference: 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012
Country: Spain
City: Barcelona
Period: 4/10/12 to 7/10/12

Keywords

  • Keyword extraction
  • machine learning
  • short documents
  • term weighting
  • text mining

Cite this

Timonen, M., Toivanen, T., Teng, Y., Cheng, C., & He, L. (2012). Informativeness-based keyword extraction from short documents. In Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval: SSTM 2012 (pp. 411-421). https://doi.org/10.5220/0004130704110421
@inproceedings{9c82c4be94f642fda28e87ecb091e12e,
title = "Informativeness-based keyword extraction from short documents",
abstract = "With the rise of user-created content on the Internet, the focus of text mining has shifted. Twitter messages and product descriptions are examples of new corpora available for text mining. Keyword extraction, user modeling and text categorization are all areas that are focusing on utilizing this new data. However, as the documents within these corpora are considerably shorter than in the traditional cases, such as news articles, there are also new challenges. In this paper, we focus on keyword extraction from documents such as event and product descriptions, and movie plot lines that often hold 30 to 60 words. We propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and three levels of word evaluation to address the challenges of short documents. We evaluate the performance of our approach by using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments, IKE outperforms the competition.",
keywords = "Keyword extraction, machine learning, short documents, term weighting, text mining",
author = "Mika Timonen and Timo Toivanen and Y. Teng and C. Cheng and L. He",
note = "Project code: 73137",
year = "2012",
doi = "10.5220/0004130704110421",
language = "English",
isbn = "978-989-8565-29-7",
pages = "411--421",
booktitle = "Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval",

}

TY - GEN

T1 - Informativeness-based keyword extraction from short documents

AU - Timonen, Mika

AU - Toivanen, Timo

AU - Teng, Y.

AU - Cheng, C.

AU - He, L.

N1 - Project code: 73137

PY - 2012

Y1 - 2012

AB - With the rise of user-created content on the Internet, the focus of text mining has shifted. Twitter messages and product descriptions are examples of new corpora available for text mining. Keyword extraction, user modeling and text categorization are all areas that are focusing on utilizing this new data. However, as the documents within these corpora are considerably shorter than in the traditional cases, such as news articles, there are also new challenges. In this paper, we focus on keyword extraction from documents such as event and product descriptions, and movie plot lines that often hold 30 to 60 words. We propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and three levels of word evaluation to address the challenges of short documents. We evaluate the performance of our approach by using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments, IKE outperforms the competition.

KW - Keyword extraction

KW - machine learning

KW - short documents

KW - term weighting

KW - text mining

U2 - 10.5220/0004130704110421

DO - 10.5220/0004130704110421

M3 - Conference article in proceedings

SN - 978-989-8565-29-7

SP - 411

EP - 421

BT - Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval

ER -