A Machine Learning Based Ensemble Technique for Effective Clustering Of Registered Documents

Authors

  • K. Neelima
  • S. Vasundra

Keywords:

Clustering, K-Means, Density, Machine Learning, Ensemble, DBScan, Outlier, Effective, Distance, Evaluation, Training, Testing, Cluster Center, Normalization, Statistics, Measures, Prediction, Registered document, Documents, ASCII Formats

Abstract

Data mining and machine learning techniques are widely used to discover and predict useful patterns in data. Many business applications require data to be prepared in a structured format, which eases data validation, improves quality and performance, and helps handle exceptional data such as null values, duplicates, and unexpected values. Companies maintain applications whose critical data elements must be presented in a required format by applying several business rules. This work experimentally evaluates how to identify an appropriate business rule engine for transforming one such critical element, the Document Number in land registration documents, using different data preprocessing techniques: label encoding, one-hot encoding, and binary encoding. It also applies K-Means clustering to group the documents into buckets and classify them under appropriate labels. The distance measures Euclidean, Manhattan, Maximum, Binary, Minkowski, and Canberra are used to compute inter-cluster and intra-cluster distances. The appropriate clustering is selected using the elbow curve plot, the silhouette coefficient, and ground truth labels. The clustering results are compared using a common metric, the Adjusted Rand Index (ARI). Principal Component Analysis (PCA) is also applied to confirm that the selected features are optimal. The proposed ensemble technique is trained and evaluated for effective derivation of clusters for registered document numbers or a similar data set containing mixed document number formats. The final objective of this work is to propose an unsupervised hybrid classification and clustering technique that enables users to automatically identify and classify the appropriate business rules for any given data.
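As a rough illustration of the pipeline the abstract describes, the following Python sketch implements the named distance measures, a plain K-Means loop, and the Adjusted Rand Index using only the standard library. It is not the authors' implementation: the sample document numbers, the character-count `encode` function (a stand-in for the label, one-hot, and binary encodings mentioned above), the choice of initial centers, and the ground truth labels are all assumptions made for the example. The Binary distance follows one common definition (the share of positions where exactly one vector is non-zero among positions where at least one is).

```python
from collections import Counter
from math import comb, sqrt

# --- Distance measures named in the abstract ---
def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def canberra(a, b):
    # Terms where both coordinates are zero are skipped (0/0).
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b) if x or y)

def binary(a, b):
    # Assumed definition: mismatching "on" positions / positions with any "on".
    either = sum(1 for x, y in zip(a, b) if x or y)
    only_one = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return only_one / either if either else 0.0

# --- Hypothetical feature encoding of a document number ---
def encode(doc_number):
    digits = sum(ch.isdigit() for ch in doc_number)
    letters = sum(ch.isalpha() for ch in doc_number)
    others = len(doc_number) - digits - letters
    return (digits, letters, others)

# --- Plain K-Means (Lloyd's algorithm) with caller-supplied initial centers ---
def kmeans(points, initial_centers, max_iter=20):
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# --- Adjusted Rand Index from pair counts ---
def adjusted_rand_index(labels_true, labels_pred):
    total_pairs = comb(len(labels_true), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0
    return (same_both - expected) / (max_index - expected)

# Hypothetical document numbers in two mixed formats.
docs = ["1234/2019", "567/2020", "89/2021",
        "AB-1234-2019", "CD-88-2021", "EF-7-2022"]
truth = [0, 0, 0, 1, 1, 1]  # assumed ground truth format labels

points = [encode(d) for d in docs]
labels, centers = kmeans(points, [points[0], points[3]])
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

In this toy setting the two formats separate cleanly because the letter and separator counts differ between them. On real registered document numbers, the number of clusters would be chosen with the elbow curve or silhouette coefficient, as the abstract describes, rather than fixed in advance.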

Published

15.10.2022

How to Cite

[1]
K. Neelima and S. Vasundra, “A Machine Learning Based Ensemble Technique for Effective Clustering Of Registered Documents”, Int J Intell Syst Appl Eng, vol. 10, no. 1s, pp. 289 –, Oct. 2022.