A Novel Pipeline Model for Anomaly Detection in High Dimensional Data Sets

Authors

  • Upasana Gupta, Research Scholar, Department of Computer Science & Engineering, Maharishi University of Information Technology, Lucknow (U.P.)
  • Vaishali Singh, Assistant Professor, Department of Computer Science & Engineering, Maharishi University of Information Technology, Lucknow (U.P.)

Keywords:

High-Dimensional Data, Data Pre-processing and Visualization, Dimensionality Reduction, Reconstruction Error, Anomaly Detection, Healthcare, Multi-Layer Perceptron, Autoencoder, R Programming Language

Abstract

This paper presents a comprehensive method for dimensionality reduction and anomaly detection in high-dimensional data, applied to healthcare datasets and implemented in R. Because traditional linear methods such as Principal Component Analysis (PCA) often fail to capture the non-linear manifold structure of the data, our approach exploits iterative learning, on the assumption that high-dimensional data lie largely on a low-dimensional manifold. The methodology starts by preparing the data using R libraries such as Keras, dplyr, and ggplot2, handling missing values and visualizing meaningful information. Using the Mahalanobis distance, the paper identifies and removes country-specific outliers. The pipelined model integrates PCA for data transformation and combines an Autoencoder with t-SNE for dimensionality reduction. The refined dataset is then used to train a Multi-Layer Perceptron (MLP) artificial neural network, which supports anomaly detection based on reconstruction errors, illustrated with a point cloud. Additionally, the paper explores metric multidimensional scaling using artificial neural networks, tests the approach on large datasets such as healthcare and wine data, and compares the results with those of conventional techniques. The study highlights the effectiveness of integrating pre-processing, visualization, and artificial neural network strategies in R for anomaly detection.
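
The Mahalanobis-distance outlier step mentioned in the abstract can be illustrated with a minimal sketch. It is written in plain Python for illustration only, whereas the paper itself works in R, and it is not the authors' implementation. The function names, the sample data, and the cutoff of 2.5 are all assumptions; in practice the cutoff is often taken from a chi-square quantile, and the abstract does not say which rule the paper uses.

```python
import math

def column_means(rows):
    """Mean of each feature (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, means):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(rows), len(means)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, means)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j]
    return [[v / (n - 1) for v in r] for r in cov]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mahalanobis_distances(rows):
    """Distance of every row from the sample mean: sqrt(diff^T S^-1 diff)."""
    means = column_means(rows)
    cov = covariance(rows, means)
    distances = []
    for row in rows:
        diff = [x - m for x, m in zip(row, means)]
        z = solve(cov, diff)  # z = S^-1 diff, without forming the inverse
        distances.append(math.sqrt(sum(d * v for d, v in zip(diff, z))))
    return distances

def flag_outliers(rows, threshold):
    """True for rows whose Mahalanobis distance exceeds the threshold."""
    return [d > threshold for d in mahalanobis_distances(rows)]

# Hypothetical data: nine points on a small grid plus one distant point.
sample = [(1, 2), (2, 1), (2, 3), (3, 2), (2, 2),
          (1, 1), (3, 3), (1, 3), (3, 1), (10, 10)]
flags = flag_outliers(sample, threshold=2.5)  # 2.5 is an illustrative cutoff
print([i for i, is_outlier in enumerate(flags) if is_outlier])  # → [9]
```

Rows flagged this way would be dropped before the later stages of the pipeline. The mean and covariance here are estimated from data that still contain the outlier, so on small samples the largest attainable distance is bounded and the cutoff has to be chosen with that in mind.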

Published

07.02.2024

How to Cite

Gupta, U., & Singh, V. (2024). A Novel Pipeline Model for Anomaly Detection in High Dimensional Data Sets. International Journal of Intelligent Systems and Applications in Engineering, 12(15s), 299–308. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4749

Section

Research Article