UHTelPCC: A Dataset for Telugu Printed Character Recognition
Date
2019-01-01
Authors
Kummari, Rakesh
Bhagvati, Chakravarthy
Abstract
This paper describes the creation and characteristics of UHTelPCC, a dataset for Telugu printed character recognition. The dataset is built from characters extracted from images of printed Telugu texts dating from 1950 to 1990, so it is hoped that it provides a basis for developing practical Telugu OCR systems. UHTelPCC is intended to provide a standard benchmark for comparing Telugu OCR algorithms and to support research and development of Telugu OCR systems. It contains 70K samples across 325 classes, divided into training, validation, and test sets of 50K, 10K, and 10K samples respectively. It is hoped that UHTelPCC will serve Telugu printed character recognition as MNIST serves handwritten digit recognition. The baseline performances on the test set using KNN, MLP, and CNN are 98.85%, 99.52%, and 99.68% respectively. UHTelPCC is available at http://scis.uohyd.ac.in/~chakcs/UHTelPCC.html.
Keywords
OCR,
OCR dataset,
Optical Character Recognition,
Printed Telugu OCR,
Telugu character dataset,
Telugu dataset,
UHTelPCC
Citation
Communications in Computer and Information Science. v.1037