An Efficient Document Categorization Approach for Turkish Based Texts

Sevinç İlhan Omurca; Semih Baş; Ekin Ekinci

Authors

Sevinç İlhan Omurca, Kocaeli University
Semih Baş, IBTECH
Ekin Ekinci, Kocaeli University

Keywords:

Document categorization, SVM, TF-IDF, User-dependent term selection, Hash table

Abstract

Because the rapid, uncontrolled growth of textual data makes it infeasible to classify all documents by human effort, automatic methods have been adopted to organize the data. In this study, a support vector machine (SVM) classifier is used for text categorization. In text categorization applications, the text representation step can require substantial computation time because a very large number of terms must be weighted. The solution commonly used in the literature has been lexicons that contain fewer terms; however, such solutions have been observed to reduce classification accuracy. In this paper, the term-document matrix is constructed in a user-dependent manner, according to the purpose of the classification. Because the number of terms remains relatively large, a hash table is used to search terms efficiently. The result is an efficient and fast TF-IDF method for constructing a weight matrix that represents term-document relations, applied here in a study on classifying Turkish news documents and Turkish columnists. The proposed approach substantially reduces the computation time required for term weighting, and achieves 99% accuracy in determining news categories and 98% accuracy in identifying columnists.
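The abstract does not spell out the implementation, so the following is only a minimal sketch of the general idea: TF-IDF weights computed over a user-selected term list, with a hash table (a Python `dict`) used for term lookup. The function names, the sample tokens, the selected terms, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration, not the authors' exact method.

```python
import math
from collections import Counter


def build_term_index(selected_terms):
    """Map each user-selected term to a column index (dict acts as the hash table)."""
    return {term: col for col, term in enumerate(selected_terms)}


def tfidf_matrix(documents, term_index):
    """Return a dense TF-IDF weight matrix with one row per tokenized document."""
    n_docs = len(documents)
    n_terms = len(term_index)
    counts = []                  # per-document term counts, keyed by column
    doc_freq = [0] * n_terms     # number of documents containing each term

    for tokens in documents:
        row = Counter()
        for tok in tokens:
            col = term_index.get(tok)  # average O(1) lookup; None if not selected
            if col is not None:
                row[col] += 1
        counts.append(row)
        for col in row:
            doc_freq[col] += 1

    matrix = []
    for row in counts:
        weights = [0.0] * n_terms
        for col, tf in row.items():
            # Assumed weighting: raw term frequency times log inverse document frequency.
            weights[col] = tf * math.log(n_docs / doc_freq[col])
        matrix.append(weights)
    return matrix


# Hypothetical, already-tokenized documents and a hypothetical selected term list.
docs = [
    ["ekonomi", "borsa", "dolar", "borsa"],
    ["futbol", "maç", "gol"],
    ["ekonomi", "futbol", "transfer"],
]
index = build_term_index(["borsa", "ekonomi", "futbol", "gol"])
weights = tfidf_matrix(docs, index)
```

Each row of `weights` could then serve as a feature vector for a classifier such as an SVM. Tokens that are not in the selected term list ("dolar", "maç", "transfer" here) are skipped at lookup time, which is where the hash table avoids a linear scan over the term list.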


