Automated Knowledge Organization AI ML based Subject Indexing System for Libraries

Keywords: Semi-automated subject indexing, Annif, NDCG, OpenRefine, TF-IDF, Omikuji, Snowball analyzer

Abstract

This study explores the feasibility of an AI/ML-based semi-automated subject indexing system that lets a library handle large volumes of documents. An open source automated indexing tool, Annif, is installed and configured in a Python virtual environment, and the Linked Open Data (LOD) dataset of the Library of Congress Subject Headings (LCSH) is loaded into it as the standard Knowledge Organization System (KOS). The framework uses the Turtle serialization of LCSH after cleaning the file with Skosify, applies TF-IDF as the backend algorithm, and uses Snowball as the text analyzer. Annif was trained on a large set of bibliographic records whose subject descriptors (MARC tag 650$a) had been assigned by trained LIS professionals. The training dataset is first processed in MarcEdit to export it in a format suitable for OpenRefine, and is then transformed in OpenRefine through a series of steps into a record set suitable for training Annif. After training, the framework was tested against a bibliographic dataset to measure its indexing efficiency. Finally, the automated indexing framework was integrated with the data wrangling software OpenRefine to produce suggested subject headings on a mass scale. The entire framework is based on open source software, open datasets, and open standards.
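
To make the TF-IDF step concrete, the following is a minimal Python sketch of how subject headings might be suggested from records that already carry headings. It is not Annif's implementation and does not use Annif's API. The function names (`tokenize`, `train`, `weight`, `suggest`) and the three toy records are hypothetical, and plain lowercase tokenization stands in for Snowball stemming.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase word tokens only; a real setup would also stem (e.g. Snowball).
    return re.findall(r"[a-z]+", text.lower())


def weight(tokens, idf):
    # Build an L2-normalized TF-IDF vector, ignoring terms unseen in training.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train(records):
    # records: list of (text, [subject headings]) pairs.
    # Pool the text of all records sharing a heading into one pseudo-document.
    subject_tokens = defaultdict(list)
    for text, subjects in records:
        for subject in subjects:
            subject_tokens[subject].extend(tokenize(text))
    n = len(subject_tokens)
    df = Counter()
    for tokens in subject_tokens.values():
        df.update(set(tokens))
    # Smoothed inverse document frequency over the subject pseudo-documents.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {s: weight(tokens, idf) for s, tokens in subject_tokens.items()}
    return vectors, idf


def suggest(text, vectors, idf, limit=3):
    # Rank headings by cosine similarity between the query and each subject vector.
    query = weight(tokenize(text), idf)
    scores = {
        s: sum(w * vec.get(t, 0.0) for t, w in query.items())
        for s, vec in vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(s, round(score, 3)) for s, score in ranked[:limit] if score > 0]


# Hypothetical toy records standing in for titles with MARC 650$a headings.
records = [
    ("Introduction to machine learning algorithms", ["Machine learning"]),
    ("Cataloguing and classification in academic libraries", ["Cataloging"]),
    ("Deep learning for text classification", ["Machine learning", "Text processing"]),
]
vectors, idf = train(records)
print(suggest("machine learning methods", vectors, idf))
```

In this sketch each heading is represented by the pooled vocabulary of the records it was assigned to, so the quality of suggestions depends directly on how consistently the training records were indexed.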

Published
2023-03-31
How to Cite
Ahmed, M., Mukhopadhyay, M., & Mukhopadhyay, P. (2023). Automated Knowledge Organization AI ML based Subject Indexing System for Libraries. DESIDOC Journal of Library & Information Technology, 43(01), 45-54. https://doi.org/10.14429/djlit.43.01.18619