Text categorization in Indian languages using machine learning approaches
Date
2007-12-01
Authors
Raghuveer, K.
Murthy, Kavi Narayana
Abstract
In this paper we present our work on automatic text categorization in Indian languages. We use purely corpus-based machine learning techniques. The methods we present are completely language independent: no language-specific knowledge is used. We describe our experiments on ten of the major Indian languages: Assamese, Bengali (Bangla), Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu. We have conducted our experiments on the DoE-CIIL corpora. We have also worked on the newspaper corpus forming part of the LERC-UoH Telugu corpus developed by us. We have used several machine learning techniques, including the naive Bayes classifier, the k-Nearest-Neighbor classifier and SVMs. We have used one-vs.-all SVMs for multi-class classification, with 3-fold cross-validation in all cases. We see that SVMs outperform the other classifiers. We describe our experiments with soft-margin linear SVMs as well as kernel-based SVMs using polynomial and Radial Basis Function kernels. Kernel-based SVMs have not performed significantly better than linear SVMs. Little work has been done on text categorization in Indian languages. The task is challenging because Indian languages are very rich in morphology, giving rise to a very large number of word forms and hence very large feature spaces. We show how Mutual Information between features and categories can be used to achieve a substantial reduction in the dimensionality of the feature space without reducing performance. In fact, many terms act as noise, and we show that pruning such terms from the feature space enhances performance. The paper is written in tutorial style, and adequate background material is included on text categorization as well as on the machine learning techniques used, for the benefit of readers who may not be familiar with these. Copyright © 2007 IICAI.
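The feature-pruning idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how Mutual Information was estimated, so the code below assumes each term is treated as a binary present/absent variable per document and scored by its mutual information with the category label, in bits. The function names mutual_information and select_features are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs, labels):
    """Return {term: I(T; C)} in bits.

    Assumption (not from the paper): T is the binary event "term occurs
    in the document" and C is the document's category.
    """
    n = len(docs)
    class_count = Counter(labels)
    # term_class[t][c] = number of documents of category c containing term t
    term_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[term][label] += 1

    scores = {}
    for term, per_class in term_class.items():
        n_term = sum(per_class.values())  # documents containing the term
        mi = 0.0
        for c, n_c in class_count.items():
            present = per_class.get(c, 0)
            absent = n_c - present
            # Sum P(t, c) * log2(P(t, c) / (P(t) * P(c))) over t in {1, 0};
            # cells with zero joint count contribute nothing.
            for joint, marginal in ((present, n_term), (absent, n - n_term)):
                if joint > 0:
                    mi += (joint / n) * math.log2(joint * n / (marginal * n_c))
        scores[term] = mi
    return scores


def select_features(docs, labels, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    scores = mutual_information(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Each document is passed as a list of tokens and each label as a category name. A term that occurs in every document, or equally often in every category, scores zero and is pruned first, which matches the observation that many terms act as noise. The retained vocabulary would then feed whichever classifier is being trained.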
Keywords
Indian languages,
KNN,
Mutual information,
Naive Bayes,
SVM,
Text categorization
Citation
Proceedings of the 3rd Indian International Conference on Artificial Intelligence, IICAI 2007