A Machine Learning Based Ensemble Technique for Effective Clustering Of Registered Documents
Keywords: Clustering, K-Means, Density, Machine Learning, Ensemble, DBSCAN, Outlier, Effective, Distance, Evaluation, Training, Testing, Cluster Center, Normalization, Statistics, Measures, Prediction, Registered document, Documents, ASCII Formats
Data mining and machine learning techniques are widely used to predict useful patterns across many applications. Many business applications require data to be prepared in a structured format, which eases data validation, improves quality and performance, and helps handle exceptional data such as null values, duplicates, and unexpected values. Companies maintain applications with various critical data elements, and several mechanisms are needed to present these elements in the required format by applying business rules. This work experimentally evaluates how to identify an appropriate business rule engine for transforming one such critical element, the Document Number in land registered documents, by applying different data preprocessing techniques: label encoding, one-hot encoding, and binary encoding. It also applies a clustering technique, K-Means, to group the documents into buckets and classify them under appropriate labels. The Euclidean, Manhattan, Maximum, Binary, Minkowski, and Canberra distance measures are used to compute inter-cluster and intra-cluster distances. The appropriate clustering is selected using the elbow curve plot, the silhouette coefficient, and ground truth labels. The clustering results are compared using a common metric, the Adjusted Rand Index (ARI). This work also applies Principal Component Analysis (PCA) to confirm that the selected features are optimal. The proposed ensemble technique is trained and evaluated for effective derivation of clusters for registered document numbers, or for a similar data set that contains mixed document number formats. The final objective of this work is to propose an unsupervised hybrid classification and clustering technique that enables users to automatically identify and classify the appropriate business rules for any given data.
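As an illustration of two building blocks named in the abstract, the following minimal sketch implements several of the listed distance measures and the Adjusted Rand Index using only the Python standard library. It is not the authors' implementation: the function names and the sample inputs are assumptions chosen for illustration, and the Binary distance is omitted.

```python
from collections import Counter
from math import comb


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Sum of absolute per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance, i.e. Minkowski distance with p=2."""
    return minkowski(a, b, 2)


def maximum(a, b):
    """Maximum (Chebyshev) distance: the largest per-coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Canberra distance; terms where both coordinates are zero are skipped."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree
    # Pairs of items that share a cluster in both labelings.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs of items that share a cluster in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)
```

Because the ARI compares pair co-membership rather than label names, a clustering that recovers the ground truth groups under different cluster identifiers still scores 1.0, which is what makes it usable for comparing K-Means output against ground truth labels.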
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.