Design and Development of an Intelligent Framework for Human Resource Case Document Processing: Integrating Image Processing, OCR, NLP, Sentiment Analysis and Artificial Intelligence

Abstract

This project develops and demonstrates an automated Human Resource (HR) case processing system that extracts meaningful data from scanned HR case documents, then stores and analyses it to support decision making. It is intended for governmental or otherwise bureaucratic HR processes. The governmental context suits Optical Character Recognition (OCR) for text extraction and the subsequent derivation of key aspects, because these documents keep a standard structured form with rare, minor deviations. Manual, paper-based processing of HR case documents remains commonplace in many organisations. Such outdated systems hinder efficiency and are prone to document deterioration, misplacement, and inaccuracies, leading to delays in decision making, duplication of work, errors in case resolution, and reduced operational efficiency. To address these challenges, an intelligent framework is proposed that incorporates image pre-processing for high-accuracy scanning, using computer vision for region of interest (ROI) detection and the Tesseract OCR engine for text identification and extraction. In addition, Sentiment Analysis is applied to the extracted text, based on keywords identified through regular expressions (RegEx), to determine cases and extract data for structured storage in an SQL database, preserving records for future reference and supporting unbiased decision making. An OCR accuracy of 95% was achieved, well within standard benchmarks of OCR accuracy for the scanning of printed text. Furthermore, the framework incorporates a modular AI-based classification component for predictive assessment of HR promotion cases.
Multiple models, including Logistic Regression, Random Forest, and their augmented variants using SMOTE, ENN, and Balanced Bagging techniques, were trained and evaluated on an imbalanced institutional dataset. The final selected model, Random Forest with Balanced Bagging and SMOTEENN, achieved an F2-Score of 0.97, a Macro F1 of 0.67, and a minority class precision of 0.31, demonstrating strong minority class sensitivity and overall balanced performance. Processing time was significantly reduced: end-to-end execution took approximately 9.4 seconds for a standard two-page case batch, and the developed AI model took 3.3 seconds for a case batch of 1000 records, in contrast to the average 3-day manual processing period. The proposed framework offers a novel approach to improving workflow efficiency in potentially many sectors where similar document processing contexts are present.
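The step from OCR output to structured storage can be illustrated with a minimal sketch. It is not the dissertation's implementation: the field labels, the regular expressions, the sample text, and the hr_cases table are all hypothetical, and an in-memory SQLite database stands in for whichever SQL system the framework uses. It only shows the general idea of matching keywords with RegEx and writing the result to a table.

```python
import re
import sqlite3

# Hypothetical field patterns: the labels and formats below are assumptions,
# not the form layout used in the dissertation.
FIELD_PATTERNS = {
    "case_id": re.compile(r"Case\s*No\.?\s*:\s*([A-Z0-9/-]+)", re.IGNORECASE),
    "employee": re.compile(r"Employee\s*Name\s*:\s*([^\n]+)", re.IGNORECASE),
    "case_type": re.compile(r"\b(promotion|transfer|disciplinary)\b", re.IGNORECASE),
}


def extract_fields(ocr_text):
    """Return the first match for each pattern, or None when a field is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def store_case(conn, fields):
    """Insert one extracted case into a hypothetical hr_cases table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hr_cases ("
        "case_id TEXT PRIMARY KEY, employee TEXT, case_type TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO hr_cases VALUES (:case_id, :employee, :case_type)",
        fields,
    )
    conn.commit()


# Stand-in for text returned by the OCR stage (invented sample).
sample_text = (
    "Case No: HR/2024/0173\n"
    "Employee Name: Jane Doe\n"
    "Subject: Request for promotion to Grade 7\n"
)

conn = sqlite3.connect(":memory:")
record = extract_fields(sample_text)
store_case(conn, record)
rows = conn.execute("SELECT case_id, employee, case_type FROM hr_cases").fetchall()
```

Keeping the patterns in one dictionary means a change to the form layout only requires editing the expressions, not the extraction or storage functions.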

Description

DISSERTATION

Citation

HARVARD REFERENCING
