Statistical Approach for Term Weighting in Very Short Documents for Text Categorization

Mika Timonen, M. Kasari

Research output: Chapter in Book/Report/Conference proceeding > Conference article in proceedings > Scientific > peer-review

Abstract

In this paper, we propose a novel approach for term weighting in very short documents that is used with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average length of a document is below twenty words. As the documents are short, each word usually occurs only once within a document. Such a word is known as a hapax legomenon, and in our previous work we called this the Term Frequency=1 challenge. For this reason, the traditional term weighting approaches become less effective with short documents. In this paper we propose a novel approach for term weighting that does not use term frequency within a document but substitutes it with other word statistics. In the experimental evaluation and comparison against several other term weighting approaches, the proposed method produced promising results by outperforming the competition.
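The Term Frequency=1 effect mentioned in the abstract can be illustrated with a minimal sketch. This is not the weighting method proposed in the paper, whose formulas are not given in this record; it only shows, on a hypothetical toy corpus, that when every word occurs once per document the classic tf * log(N / df) weight carries no within-document information and reduces to the inverse document frequency alone. The corpus and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus of very short documents (hypothetical, not from the paper).
DOCS = [
    "great battery life",
    "battery died quickly",
    "great screen quality",
    "screen cracked quickly",
]

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def document_frequency(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

def tf_idf(doc, df, n_docs):
    """Classic tf * log(N / df) weights for one document."""
    tf = Counter(tokenize(doc))
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def idf_only(doc, df, n_docs):
    """The same weights with the term-frequency factor dropped."""
    return {t: math.log(n_docs / df[t]) for t in set(tokenize(doc))}

DF = document_frequency(DOCS)

# Every word occurs exactly once in its document (the TF=1 situation).
ALL_TF_ONE = all(c == 1 for d in DOCS for c in Counter(tokenize(d)).values())

# Consequently tf-idf and plain idf assign identical weights.
SAME = all(tf_idf(d, DF, len(DOCS)) == idf_only(d, DF, len(DOCS)) for d in DOCS)

print(ALL_TF_ONE, SAME)
```

In this setting the term-frequency factor is a constant 1, which is why a weighting scheme for such documents has to draw on other word statistics.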
Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Subtitle of host publication: IC3K 2012
Editors: A. Fred, J.L.G. Dietz, K. Liu, J. Filipe
Publisher: Springer
Pages: 3-18
ISBN (Electronic): 978-3-642-54105-6
ISBN (Print): 978-3-642-54104-9
DOIs: https://doi.org/10.1007/978-3-642-54105-6_1
Publication status: Published - 2013
MoE publication type: Not Eligible
Event: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012 - Barcelona, Spain
Duration: 4 Oct 2012 to 7 Oct 2012

Publication series

Series: Communications in Computer and Information Science
Volume: 415
ISSN: 1865-0929

Conference

Conference: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012
Abbreviated title: IC3K 2012
Country: Spain
City: Barcelona
Period: 4/10/12 to 7/10/12

Fingerprint

Support vector machines
Classifiers
Statistics

Keywords

  • feature weighting
  • hapax legomenon
  • short document categorization
  • support vector machine
  • text categorization

Cite this

Timonen, M., & Kasari, M. (2013). Statistical Approach for Term Weighting in Very Short Documents for Text Categorization. In A. Fred, J. L. G. Dietz, K. Liu, & J. Filipe (Eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management: IC3K 2012 (pp. 3-18). Springer. Communications in Computer and Information Science, Vol. 415. https://doi.org/10.1007/978-3-642-54105-6_1
@inproceedings{4d9ef449512b4ea9a684781441a4482a,
title = "Statistical Approach for Term Weighting in Very Short Documents for Text Categorization",
abstract = "In this paper, we propose a novel approach for term weighting in very short documents that is used with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average length of a document is below twenty words. As the documents are short, each word usually occurs only once within a document. Such a word is known as a hapax legomenon, and in our previous work we called this the Term Frequency=1 challenge. For this reason, the traditional term weighting approaches become less effective with short documents. In this paper we propose a novel approach for term weighting that does not use term frequency within a document but substitutes it with other word statistics. In the experimental evaluation and comparison against several other term weighting approaches, the proposed method produced promising results by outperforming the competition.",
keywords = "feature weighting, hapax legomenon, short document categorization, support vector machine, text categorization",
author = "Mika Timonen and M. Kasari",
year = "2013",
doi = "10.1007/978-3-642-54105-6_1",
language = "English",
isbn = "978-3-642-54104-9",
series = "Communications in Computer and Information Science",
publisher = "Springer",
pages = "3--18",
editor = "A. Fred and J.L.G. Dietz and K. Liu and J. Filipe",
booktitle = "Knowledge Discovery, Knowledge Engineering and Knowledge Management",
address = "Germany",

}
