An Automated Progressive Data Cleaning Framework for Lung Cancer Medical Data using Machine Learning


  • B. Samirana Acharya, Research Scholar, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad-500075, Telangana, India
  • K. Ramasubramanian, Associate Professor, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad-500075, Telangana, India


Medical data pre-processing, lung cancer data pre-processing, outlier detection, noise reduction, missing value imputation


With the rapid growth of computational algorithms and data management, the demand for automated medical analysis and diagnosis is also increasing. The fundamental requirement of medical analysis is rapid results with minimal, ideally zero, error. Manual processes depend heavily on human intervention and are therefore more error-prone, so the analysis of life-threatening diseases such as lung cancer should be automated. The challenge with computer-driven automation is that the quality of the data determines the accuracy of the final outcome. Data cleaning, also called data pre-processing, is therefore a major area of concern when building automated frameworks for disease detection. Many researchers have worked towards an effective pre-processing framework. Nonetheless, these attempts have been criticised for several reasons, chiefly that they were not designed for medical information, where parameters such as precision, missing values and the dimensionality of the data play a major role. A few parallel research efforts have focused more closely on medical information pre-processing while building their frameworks. However, these methods are highly complex and hard to adapt because of their strong dependency on the dataset. This paper therefore proposes a novel framework for medical data pre-processing, built from a small set of proposed algorithms: an adaptive, threshold-driven method for outlier detection and imputation, domain-specific missing value detection and imputation, and meta-information-specific noise reduction. When benchmarked algorithms are combined with the proposed framework, the outcomes show nearly 50% improvement, which is attributed to this adaptation.
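The abstract names the stages of the framework but does not give the algorithms themselves. The following is only a minimal sketch of how two of those stages (missing value imputation followed by threshold-driven outlier detection and imputation) could be chained progressively. Median imputation, the interquartile-range fence, the multiplier `k`, the function names and the sample `ages` column are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_outliers(values, k=1.5):
    """Replace values outside the fence [Q1 - k*IQR, Q3 + k*IQR] with the inlier median.

    The IQR fence is a stand-in for the paper's adaptive threshold, which the
    abstract does not define.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    fill = median(v for v in values if low <= v <= high)
    return [v if low <= v <= high else fill for v in values]


def clean_column(values, k=1.5):
    """Run the stages in order: fill missing values first, then treat outliers."""
    return impute_outliers(impute_missing(values), k)


# Hypothetical patient-age column with two missing entries and one entry error.
ages = [61, 58, None, 64, 70, 590, 66, None, 59]
cleaned = clean_column(ages)
print(cleaned)
```

Filling missing values before computing the quartiles keeps the column length fixed, so the outlier threshold is estimated on a complete column. A real pipeline would likely tune the threshold per attribute and add the noise reduction stage.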








How to Cite

Acharya, B. S., & Ramasubramanian, K. (2023). An Automated Progressive Data Cleaning Framework for Lung Cancer Medical Data using Machine Learning. International Journal of Intelligent Systems and Applications in Engineering, 11(4), 146–157. Retrieved from



Research Article
