TY - JOUR
T1 - On entropy-based term weighting schemes for text categorization
AU - Wang, Tao
AU - Cai, Yi
AU - Leung, Ho-fung
AU - Lau, Raymond Y. K.
AU - Xie, Haoran
AU - Li, Qing
N1 - Funding Information:
This work was supported by the Fundamental Research Funds for the Central Universities, SCUT (No. D2200150, D2201300), the Science and Technology Programs of Guangzhou (Nos. 201704030076, 201802010027, 201902010046), National Natural Science Foundation of China (No. 62076100) and National Key Research and Development Program of China (Standard knowledge graph for epidemic prevention and production recovering intelligent service platform and its applications).
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2021/9
Y1 - 2021/9
AB - In text categorization, the Vector Space Model (VSM) has been widely used to represent documents, in which a document is represented as a vector of terms. Since different terms contribute to a document’s semantics to varying degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across different text categorization tasks, yet the mechanism underlying this variability remains unclear. Moreover, existing schemes often weight a term with respect to a single category locally, without considering the global distribution of the term’s occurrences across all categories in a corpus. In this paper, we first systematically examine the pros and cons of existing term weighting schemes in text categorization and explore why some schemes with sound theoretical bases, such as the chi-square test and information gain, perform poorly in empirical evaluations. By measuring how concentrated a term’s occurrences are across all categories in a corpus, we then propose a series of entropy-based term weighting schemes that quantify the distinguishing power of a term in text categorization. In extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
UR - http://www.scopus.com/inward/record.url?scp=85109603998&partnerID=8YFLogxK
U2 - 10.1007/s10115-021-01581-5
DO - 10.1007/s10115-021-01581-5
M3 - Article
SN - 0219-1377
VL - 63
SP - 2313
EP - 2346
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 9
ER -