A Machine Learning Based Ensemble Technique for Effective Clustering Of Registered Documents


  • K. Neelima
  • S. Vasundra


Clustering, K-Means, Density, Machine Learning, Ensemble, DBScan, Outlier, Effective, Distance, Evaluation, Training, Testing, Cluster Center, Normalization, Statistics, Measures, Prediction, Registered document, Documents, ASCII Formats


Data mining and machine learning techniques are widely used across applications to predict useful patterns. Many business applications require data to be prepared in a structured format, which eases data validation, improves quality and performance, and helps handle exceptional data such as null values, duplicates, and unexpected values. Companies maintain applications with various critical data elements, and presenting these elements in the required format involves applying several business rules. This work experimentally evaluates the identification of an appropriate business rule engine for transforming the critical element Document Number in land registered documents, applying data preprocessing techniques such as label encoding, one-hot encoding, and binary encoding. It also applies K-Means clustering to group the documents into buckets and classify them under appropriate labels. The Euclidean, Manhattan, Maximum, Binary, Minkowski, and Canberra distance measures are used to calculate inter-cluster and intra-cluster distances. The appropriate clustering is derived using statistical techniques, namely the Elbow curve plot, the Silhouette coefficient, and ground truth labels. The clustering results are compared using a common metric, the Adjusted Rand Index (ARI). Principal Component Analysis (PCA) is also applied to confirm that the selected features are optimal. The proposed ensemble technique is trained and evaluated for effective derivation of clusters for registered document numbers or similar data sets that contain mixed document number formats. The final objective of this work is to propose an unsupervised hybrid classification and clustering technique that enables users to automatically identify and classify the appropriate business rules for any given data.
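As an illustration of the comparison metric named in the abstract, the following is a minimal sketch of the Adjusted Rand Index in plain Python. It is not the authors' implementation: the function name `adjusted_rand_index` and the sample labels are assumptions made for this example, and the formula is the standard pair-counting definition of ARI.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Contingency counts: how many items share each (true, predicted) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together in both, in truth, in prediction.
    index = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2

    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (index - expected) / (max_index - expected)


# Assumed example: format labels for six document numbers (made-up data).
ground_truth = ["numeric", "numeric", "alnum", "alnum", "dated", "dated"]
cluster_ids = [2, 2, 0, 0, 1, 1]  # same grouping under different names
print(adjusted_rand_index(ground_truth, cluster_ids))
```

Because ARI depends only on which items are grouped together, a clustering that matches the ground truth up to a renaming of clusters scores 1, while chance-level agreement scores close to 0.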




Proposed Ensemble Methodology




K. Neelima and S. Vasundra, “A Machine Learning Based Ensemble Technique for Effective Clustering Of Registered Documents”, Int J Intell Syst Appl Eng, vol. 10, no. 1s, pp. 289 –, Oct. 2022.