Rule based Text Extraction from a Bibliographic Database

Veena Makhija; Swapnil Ahuja

doi:10.14429/djlit.38.1.12307

Authors

Veena Makhija DRDO-Solid State Physics Laboratory, Delhi - 110 054
Swapnil Ahuja University of Southern California

DOI:

https://doi.org/10.14429/djlit.38.1.12307

Keywords:

Text extraction, Rule based information extraction, Knowledge domain, Semiconductors, Controlled vocabulary, Metadata extraction

Abstract

The emergent concept of ‘Big Data’ has shifted the paradigm from information retrieval to information extraction. Information extraction techniques enable corpus analysis from which useful interpretations can be drawn. Selection of an appropriate information extraction technique depends upon the type of data being dealt with and its possible applications. In an R&D environment, published information is considered an authenticated benchmark for studying and analysing the growth pattern in a given field, whether science, medicine, or business. This paper proposes a rule based information extraction process applied to selected data extracted from a bibliographic database of published R&D papers. The aim of the study is to build up a database on relevant concepts, clean the retrieved data, and automate the process of information retrieval in the local database. For this purpose, concept based ‘subject profiles’ in the area of advanced semiconductors, as well as rules for text extraction from metadata retrieved from the bibliographic database, were developed. The resulting subset was used as an input to the knowledge domain to support R&D in the area of ‘advanced semiconductor materials and devices’ and to provide information services on the Intranet. The study found that concept based pattern matching on the downloaded datasets yielded better results than using the controlled vocabulary of the source database.
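As an illustration of the concept based pattern matching described in the abstract, the following is a minimal sketch only, not the authors' implementation. The subject profile, its patterns, the record field names (`title`, `abstract`), and the helper names are all hypothetical and were chosen purely for the example.

```python
import re

# Hypothetical subject profile: each concept maps to a list of regex rules.
# Abbreviations are matched case-sensitively; full names ignore case via (?i).
SUBJECT_PROFILE = {
    "gallium nitride": [r"\bGaN\b", r"(?i)\bgallium\s+nitride\b"],
    "silicon carbide": [r"\bSiC\b", r"(?i)\bsilicon\s+carbide\b"],
}

# Compile every rule once so that matching many records stays cheap.
COMPILED = {
    concept: [re.compile(pattern) for pattern in patterns]
    for concept, patterns in SUBJECT_PROFILE.items()
}


def clean(text):
    """Strip simple markup remnants and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def match_concepts(record, fields=("title", "abstract")):
    """Return the sorted concepts whose rules match the chosen metadata fields."""
    text = " ".join(clean(record.get(field, "")) for field in fields)
    return sorted(
        concept
        for concept, rules in COMPILED.items()
        if any(rule.search(text) for rule in rules)
    )


def extract(records):
    """Keep only records matching at least one concept, tagged with the matches."""
    selected = []
    for record in records:
        concepts = match_concepts(record)
        if concepts:
            selected.append({**record, "concepts": concepts})
    return selected


# Assumed sample records standing in for metadata downloaded from a database.
records = [
    {"title": "Growth of GaN  on <i>SiC</i> substrates", "abstract": ""},
    {"title": "Library automation survey", "abstract": "No materials here."},
]
subset = extract(records)
```

Here `subset` holds only the first record, tagged with both concepts. Matching on free-text fields such as title and abstract, rather than on the source database's assigned index terms, is the design choice the abstract contrasts with controlled vocabulary searching.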

Author Biographies

Veena Makhija, DRDO-Solid State Physics Laboratory, Delhi - 110 054

Ms Veena Makhija completed her MSc (Physics), with specialisation in ‘Solid State Physics’, from Delhi University, India in 1986 and an Associateship in Information Science from NISCAIR, CSIR in 1989. She is presently working as Scientist ‘F’ and Head, Technical Information Resource Center in DRDO-Solid State Physics Laboratory, Delhi. Her current research interests include digital reference services, knowledge management systems, organisational knowledge, and open access resources.

Swapnil Ahuja, University of Southern California

Mr Swapnil Ahuja received his Bachelor’s in Information Technology from GGSIP, Delhi, India and is currently pursuing a Masters in Computer Science at the University of Southern California, specialising in data science. He has a keen interest in software development and machine learning.
