Efficient Use of Resources for Statistical Machine Translation

  • Karunesh Kumar Arora, Centre for Development of Advanced Computing
  • Shyam Sunder Agrawal, KIIT Gurugram
Keywords: Statistical machine translation, Normalization, WordNet

Abstract

Machine translation has great potential to expand the audience for ever-increasing digital collections. The success of data-driven machine translation systems is governed by the volume of parallel data on which these systems are modelled. For languages that lack such resources in large quantity, optimum utilisation can be assured only through the quality of the data that is available. A morphologically rich language like Hindi poses a further challenge, because a given word has a larger number of orthographic inflections and the corpus contains non-standard spellings. This increases the likelihood of encountering words that are unseen in the training corpus. In this paper, the objective is to reduce redundancy in the available corpus and to utilise other resources as well, so as to make the best use of what is available. The number of words unseen by the translation model is reduced through text noise removal, spelling normalisation and the use of English WordNet (EWN). The test case presented here is for the English-Hindi language pair. The results achieved are promising and set an example for other morphologically rich languages of how to optimise resources to improve the performance of a translation system.
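
The effect of spelling normalisation on unseen words can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the normalisation rules used, so the character mappings below (chandrabindu to anusvara, dropping the nukta and zero-width joiners) are assumptions chosen as commonly cited sources of Hindi spelling variation, and the function names `normalise` and `oov_rate` are hypothetical.

```python
import unicodedata

# Assumed normalisation rules (not taken from the paper): each maps one
# spelling variant onto a single canonical form.
_CHAR_MAP = {
    "\u0901": "\u0902",  # chandrabindu -> anusvara
    "\u093c": None,      # drop nukta
    "\u200c": None,      # drop zero-width non-joiner
    "\u200d": None,      # drop zero-width joiner
}
_TABLE = str.maketrans(_CHAR_MAP)


def normalise(token):
    """Return a canonical spelling of a Devanagari token."""
    # NFD splits precomposed nukta letters (e.g. U+095B) into base + nukta,
    # so the nukta can be removed by the translation table.
    decomposed = unicodedata.normalize("NFD", token)
    return unicodedata.normalize("NFC", decomposed.translate(_TABLE))


def oov_rate(train_tokens, test_tokens, normaliser=None):
    """Fraction of test tokens that never occur in the training tokens."""
    f = normaliser if normaliser is not None else (lambda t: t)
    vocab = {f(t) for t in train_tokens}
    test = [f(t) for t in test_tokens]
    if not test:
        return 0.0
    unseen = sum(1 for t in test if t not in vocab)
    return unseen / len(test)


if __name__ == "__main__":
    # Toy data: the same two words spelled differently in train and test.
    train = ["\u091a\u093e\u0902\u0926", "\u091c\u0930\u0942\u0930"]
    test = ["\u091a\u093e\u0901\u0926", "\u095b\u0930\u0942\u0930"]
    print(oov_rate(train, test))             # without normalisation
    print(oov_rate(train, test, normalise))  # with normalisation
```

On the toy data, both test tokens count as unseen until the two sides are normalised to the same spelling. Dropping the nukta is lossy, since it merges words that differ only by that mark, so whether such a rule is appropriate depends on the corpus.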

Author Biographies

Karunesh Kumar Arora, Centre for Development of Advanced Computing

Mr Karunesh Kumar Arora is presently working as Joint Director with the Centre for Development of Advanced Computing (CDAC), Noida. He has almost 20 years of experience in the field of natural language processing. He has authored 25 research papers and contributed 4 chapters to a book.

This paper presents the results and observations of experiments performed in statistical machine translation.

Shyam Sunder Agrawal, KIIT Gurugram

Dr Shyam Sunder Agrawal obtained his PhD from Aligarh Muslim University, India, in 1970. He is currently working as Director General of KIIT Group of College, Gurugram. He has about 45 years of research experience at CEERI, Pilani, and subsequently as Emeritus Scientist of CSIR and Advisor to CDAC, Noida. He has published more than 250 research papers.

The experiments presented in this paper have been guided by him. 

Published
2017-10-23
How to Cite
Arora, K., & Agrawal, S. (2017). Efficient Use of Resources for Statistical Machine Translation. DESIDOC Journal of Library & Information Technology, 37(5), 307-312. https://doi.org/10.14429/djlit.37.5.11420
Section
Papers