UHTelPCC: A Dataset for Telugu Printed Character Recognition

Kummari, Rakesh; Bhagvati, Chakravarthy

Date

2019-01-01

Authors

Kummari, Rakesh

Bhagvati, Chakravarthy

Abstract

This paper describes the creation and characteristics of UHTelPCC, a dataset for Telugu printed character recognition. The dataset is built from characters extracted from images of printed Telugu texts from the period 1950–1990, so it is hoped that it provides a basis for developing practical Telugu OCR systems. UHTelPCC is intended to provide a standard benchmark for comparing different Telugu OCR algorithms and to support research and development of Telugu OCR systems. It contains 70K samples from 325 classes, divided into training, validation, and test sets of 50K, 10K, and 10K samples respectively. It is hoped that UHTelPCC will serve Telugu printed character recognition in the way that MNIST serves handwritten digit recognition. The baseline performance on the test set using KNN, MLP, and CNN is 98.85%, 99.52%, and 99.68% respectively. UHTelPCC is available at http://scis.uohyd.ac.in/~chakcs/UHTelPCC.html.
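
The split sizes stated in the abstract can be illustrated with a minimal sketch. The partitioning method is an assumption: the abstract gives only the sizes (50K, 10K, 10K), not how samples were assigned, so the seeded random shuffle and the helper name `split_indices` below are hypothetical and do not reproduce the dataset's actual split.

```python
import random

# Split sizes as stated in the abstract (70K samples in total).
SIZES = {"train": 50_000, "val": 10_000, "test": 10_000}


def split_indices(sizes, seed=0):
    """Partition sample indices 0..N-1 into disjoint named subsets.

    Assumption: a seeded random shuffle. The abstract does not say how
    the published split was produced.
    """
    total = sum(sizes.values())
    indices = list(range(total))
    random.Random(seed).shuffle(indices)

    parts = {}
    start = 0
    for name, size in sizes.items():
        parts[name] = indices[start:start + size]
        start += size
    return parts


parts = split_indices(SIZES)
train_idx, val_idx, test_idx = parts["train"], parts["val"], parts["test"]
```

Fixing the seed keeps the partition reproducible, which matters when a dataset is meant to serve as a shared benchmark.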

Keywords

OCR, OCR dataset, Optical Character Recognition, Printed Telugu OCR, Telugu character dataset, Telugu dataset, UHTelPCC

Citation

Communications in Computer and Information Science. v.1037

URI

10.1007/978-981-13-9187-3_3
http://link.springer.com/10.1007/978-981-13-9187-3_3
https://dspace.uohyd.ac.in/handle/1/8703

Collections

Computer and Information Sciences - Publications
