A Novel Pipeline Model for Anomaly Detection in High Dimensional Data Sets
Keywords:
High-Dimensional Data, Data Pre-processing and Visualization, Dimensionality Reduction, Reconstruction Error, Anomaly Detection, Healthcare, Multi-Layer Perceptron, Autoencoder, R Programming Language
Abstract
This paper presents a comprehensive pipeline for dimensionality reduction and anomaly detection in high-dimensional data, demonstrated on healthcare datasets and implemented in R. Because traditional linear methods such as Principal Component Analysis (PCA) often fail to capture the non-linear structure of the data, our approach relies on iterative learning under the assumption that high-dimensional data largely lies on a low-dimensional manifold. The methodology begins with data preparation using R packages such as keras, dplyr, and ggplot2, handling missing values and visualizing meaningful structure. Using the Mahalanobis distance, the paper identifies and removes country-specific outliers. The pipelined model integrates PCA for data transformation and combines an autoencoder with t-SNE for dimensionality reduction. The refined dataset is then used to train a Multi-Layer Perceptron (MLP) artificial neural network, enabling anomaly detection based on reconstruction errors, which are illustrated with a point-cloud visualization. Additionally, the paper explores metric multidimensional scaling using artificial neural networks, evaluates the approach on large datasets such as the healthcare and wine data, and compares the results against conventional techniques. This study highlights the effectiveness of integrating pre-processing, visualization, and artificial neural network strategies in R for effective anomaly detection.
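As a rough illustration of the pipeline described above, the following R sketch chains Mahalanobis-distance outlier screening, a PCA transformation, an autoencoder, reconstruction-error scoring, and a t-SNE embedding of the bottleneck codes. It is a minimal sketch, not the authors' exact architecture: the placeholder data matrix, layer sizes, the chi-square cutoff, and the 95th-percentile anomaly threshold are illustrative assumptions, and it relies on the keras and Rtsne packages.

```r
# Minimal sketch of the described pipeline (illustrative, not the authors' exact setup).
library(keras)   # autoencoder / MLP
library(Rtsne)   # t-SNE embedding

set.seed(42)
X <- scale(as.matrix(iris[, 1:4]))            # placeholder numeric matrix; swap in the healthcare data

## 1. Mahalanobis-distance outlier screening
md      <- mahalanobis(X, colMeans(X), cov(X))
cutoff  <- qchisq(0.975, df = ncol(X))        # chi-square cutoff, a common heuristic
X_clean <- X[md <= cutoff, , drop = FALSE]

## 2. PCA transformation
pca   <- prcomp(X_clean, center = TRUE, scale. = TRUE)
X_pca <- pca$x                                # rotated scores fed to the autoencoder

## 3. Autoencoder for non-linear dimensionality reduction
input_dim <- ncol(X_pca)
ae <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = input_dim) %>%
  layer_dense(units = 2, activation = "relu", name = "bottleneck") %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = input_dim, activation = "linear")
ae %>% compile(optimizer = "adam", loss = "mse")
ae %>% fit(X_pca, X_pca, epochs = 50, batch_size = 16, verbose = 0)

## 4. Reconstruction error as the anomaly score
recon     <- predict(ae, X_pca)
score     <- rowMeans((X_pca - recon)^2)
anomalous <- score > quantile(score, 0.95)    # flag the top 5% as anomalies (illustrative threshold)

## 5. t-SNE on the bottleneck codes for a 2-D view of the point cloud
encoder <- keras_model(inputs  = ae$input,
                       outputs = get_layer(ae, "bottleneck")$output)
codes   <- predict(encoder, X_pca)
emb     <- Rtsne(codes, dims = 2, perplexity = 30, check_duplicates = FALSE)$Y
```

The reconstruction-error scores from step 4 are the quantity visualized as a point cloud in the paper; the t-SNE embedding in step 5 serves only as a low-dimensional view for inspecting flagged observations.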
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license allows readers to share and adapt the material, provided they give appropriate credit, provide a link to the license, indicate if changes were made, and distribute any contributions under the same license as the original.