A Genuine GLCM-based Feature Extraction for Breast Tissue Classification on Mammograms

In this paper, a breast tissue type detection system is designed and verified on a publicly available mammogram dataset constructed by the Mammographic Image Analysis Society (MIAS). The database covers three fundamental breast tissue types: fatty, fatty-glandular, and dense-glandular. At the pre-processing stage, median filtering and morphological operations are applied for noise reduction and artifact suppression, respectively; the pectoral muscle is then removed using a region growing algorithm. Next, 88-dimensional texture features are computed from the Gray-Level Co-Occurrence Matrices (GLCMs) of the mammogram images. In addition, a previously introduced 108-dimensional feature ensemble is computed and cascaded with the 88-dimensional texture features. Finally, classification is performed with Fisher’s Linear Discriminant Analysis (FLDA) classifier in four classification cases: one-stage classification, first fatty – then others, first fatty-glandular – then others, and first dense-glandular – then others. A maximum classification accuracy of 72.93% is achieved using the texture features alone, and it increases to 82.48% when the cascade features are used. This result indicates that the cascade features are more representative than the texture features alone. The maximum classification accuracy is attained in the “first fatty-glandular – then others” case, which is consistent with the fact that the fatty-glandular tissue type is easily confused with the fatty and dense-glandular tissue types.


Introduction
Breast cancer is the second leading cause of death among women worldwide (Jemal et al.). Although early diagnosis helps reduce mortality, detecting suspicious masses in mammograms becomes harder as the breast tissue becomes denser. Hence, a Computer-Aided Detection (CAD) system that first determines the breast tissue type of a mammogram and then detects and diagnoses the type of breast cancer would be useful. Three major problems arise in breast tissue type classification: digitization noise, artifacts such as mammogram labels, and pectoral muscle regions in the images.
Since digitization noise appears as a high-frequency component in an image, a smoothing filter such as a mean or median filter is needed for noise reduction (L.M. Mina (Bick et al., 1995), using gradients (Mendez et al., 1996), and active contours (M.A. Wirth and A. Stapinski, 2003). Arguably the most important intensity-based problem for breast tissue classification is the presence of pectoral muscles in the mammogram images. A pectoral muscle appears as a roughly triangular region in one of the top corners of the mammogram, with brighter intensities than the breast parenchyma. The studies on pectoral muscle removal in the literature generally focus on intensity-based and wavelet-based approaches. These approaches are examined under three main topics: line detection techniques, statistical techniques, and other techniques (Ganesan et al.). Intensity-based approaches either use pixel intensities directly (Saltanat et
The aim of this paper is to design a CAD system for breast tissue type classification of mammogram images. To this end, noise reduction and artifact suppression are first carried out on the mammogram images in the database using a median filter and morphological operations, respectively. The pectoral muscle is then removed using a region growing algorithm. These pre-processing operations are explained in detail in Section 2. The feature extraction procedure explained in Section 3 is applied to the pre-processed mammogram images. The experimental study and the conclusions are given in Sections 4 and 5, respectively.

Pre-Processing
Digitization noise, low- and high-level artifacts in the background, and the presence of pectoral muscles, as shown on the sample mammogram image in (Figure.1), obstruct intensity-based breast tissue type classification of mammogram images. Hence, a pre-processing stage is essential in order to reduce noise, suppress artifacts, and remove pectoral muscles from the original mammogram images.

Noise Reduction
Smoothing filters are used for noise reduction, although they cause a loss of detail in the image. Hence, it is essential to use filters that remove noise while preserving edge details. In this paper, noise reduction is carried out via median filtering. The median filter is a commonly preferred non-linear filter for noise reduction (Neal et al., 1981). It replaces each pixel with the median of the intensities in a pre-defined neighbourhood, which preserves edge information while suppressing isolated intensity differences between neighbouring pixels.
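The behaviour described above can be illustrated with a minimal sketch. The 3x3 window, the edge-replication border policy, and the toy image are assumptions made for illustration; the paper does not state the filter parameters it used.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2-D list of gray levels.

    Border pixels are handled by replicating the nearest edge pixel
    (an assumption; the paper does not state its border policy).
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = []
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr = min(max(r + dr, 0), rows - 1)  # clamp row index
                    cc = min(max(c + dc, 0), cols - 1)  # clamp column index
                    window.append(image[rr][cc])
            out[r][c] = median(window)
    return out


# Toy image: a dark region, a bright region, and one noisy pixel (255).
noisy = [
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 255, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
]
smoothed = median_filter(noisy)
```

In this toy case the isolated bright pixel is replaced by its neighbourhood median, while the vertical edge between the dark and bright regions stays where it was.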

Artifact Suppression
Morphological operations are applied to suppress both low- and high-level artifacts after the noise reduction step. To this end, the mammogram images are converted into binary images. Then, the largest connected region in each binary image is assumed to be the breast parenchyma, since its area should be greater than the area occupied by any artifact.
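The "keep the largest region" idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the fixed threshold, the 4-connectivity, and the function names are hypothetical, and the morphological operations themselves are not reproduced.

```python
from collections import deque


def largest_component_mask(image, threshold):
    """Binarize `image` and keep only the largest 4-connected foreground region."""
    rows, cols = len(image), len(image[0])
    binary = [[image[r][c] > threshold for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    mask = [[False] * cols for _ in range(rows)]
    for y, x in best:
        mask[y][x] = True
    return mask


def suppress_artifacts(image, threshold):
    """Zero out every pixel outside the largest foreground region."""
    mask = largest_component_mask(image, threshold)
    return [[px if keep else 0 for px, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Toy image: a small bright "label" (top right) and a larger breast region (left).
image = [
    [0, 0, 0, 0, 250, 250],
    [60, 70, 0, 0, 0, 0],
    [65, 80, 75, 0, 0, 0],
    [70, 85, 0, 0, 0, 0],
]
cleaned = suppress_artifacts(image, threshold=20)
```

Here the two-pixel label is removed because the seven-pixel region on the left is larger; the intensities inside the retained region are left unchanged.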

Pectoral Muscle Removal
A region growing algorithm is used for pectoral muscle removal in this study. Region growing, a region-based segmentation method, partitions the pixels of an image into sub-regions according to pre-defined similarity conditions (R.C. Gonzalez and R.E. Woods, 2007). The algorithm enlarges regions by aggregating pixels with similar properties. For this purpose, a similarity condition and a seed point or a set of seed points are defined first. The specified seed or seed set is taken as the initial sub-region, and the 4- or 8-neighbors of each pixel in the region are tested against the similarity condition; neighbors that satisfy it are added to the region, and the process repeats until no further pixels can be added.
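A minimal sketch of seeded region growing is given below. The similarity condition (absolute difference from the seed intensity within a tolerance), the seed position, and the tolerance value are assumptions for illustration; the paper does not specify them in this section.

```python
from collections import deque


def region_grow(image, seed, tolerance, connectivity=4):
    """Grow a region from `seed` over pixels similar to the seed intensity.

    Similarity condition (assumed): |I(pixel) - I(seed)| <= tolerance.
    """
    rows, cols = len(image), len(image[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy image: a bright triangular "pectoral muscle" in the top-left corner.
image = [
    [200, 195, 190, 90, 80],
    [198, 192, 95, 85, 80],
    [194, 100, 90, 85, 80],
    [95, 90, 85, 80, 80],
]
muscle = region_grow(image, seed=(0, 0), tolerance=15)

# Remove the grown region by setting its pixels to zero.
removed = [[0 if (r, c) in muscle else px for c, px in enumerate(row)]
           for r, row in enumerate(image)]
```

Placing the seed in a top corner follows the observation that the pectoral muscle appears there; choosing which corner and which tolerance would depend on the image and is left open here.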

Gray-Level Co-Occurrence Matrix (GLCM)
The GLCM is one of the most commonly used methods for texture analysis. It tabulates how often pairs of gray levels occur at two pixels separated by a specified displacement and direction in an image (A. Eleyan  , and its neighborhood.
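As a hedged illustration of how a GLCM is built, the sketch below counts gray-level pairs for one displacement. The 4-level toy image, the horizontal displacement of one pixel, and the symmetric counting are assumptions; the displacements, directions, and number of gray levels used in the paper are not given in this section.

```python
def glcm(image, dx, dy, levels, symmetric=True):
    """Count co-occurrences of gray levels (i, j) at displacement (dx, dy).

    image     : 2-D list of integer gray levels in range(levels)
    dx, dy    : column and row offset of the second pixel of each pair
    symmetric : if True, each pair is also counted as (j, i)
    """
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1
    return counts


def normalize(counts):
    """Turn co-occurrence counts into joint probabilities p(i, j)."""
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


# Toy 4x4 image with 4 gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = glcm(image, dx=1, dy=0, levels=4)  # horizontal neighbours
p = normalize(counts)
```

Large values on the diagonal of `counts` indicate that horizontally adjacent pixels often share the same gray level, which is the kind of structure the texture features summarize.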

GLCM Texture Features
The texture features introduced by Haralick et al. (Haralick et al., 1973), Soh and Tsatsoulis (Soh and Tsatsoulis, 1999), and Clausi (Clausi, 2002) are extracted from the GLCMs of the mammograms in this paper. These features and their mathematical representations are given in (Table.1); they include autocorrelation (Soh and Tsatsoulis, 1999) and contrast. The extracted texture features give information about the homogeneity, symmetry, complexity, and contrast in the GLCMs of the mammogram images.
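A few such features can be sketched from a normalized GLCM p(i, j). The formulas below are the commonly used textbook forms and are not copied from the paper's table; the 0-based gray-level indexing, the choice of four features, and the toy matrix are assumptions.

```python
def texture_features(p):
    """Compute four common GLCM texture features from a normalized GLCM.

    contrast        = sum (i - j)^2 * p(i, j)
    energy          = sum p(i, j)^2            (angular second moment)
    homogeneity     = sum p(i, j) / (1 + (i - j)^2)   (inverse difference moment form)
    autocorrelation = sum i * j * p(i, j)      (0-based gray levels, an assumption)
    """
    levels = len(p)
    contrast = energy = homogeneity = autocorrelation = 0.0
    for i in range(levels):
        for j in range(levels):
            v = p[i][j]
            contrast += (i - j) ** 2 * v
            energy += v * v
            homogeneity += v / (1 + (i - j) ** 2)
            autocorrelation += i * j * v
    return {
        "contrast": contrast,
        "energy": energy,
        "homogeneity": homogeneity,
        "autocorrelation": autocorrelation,
    }


# Toy symmetric co-occurrence counts for 4 gray levels, normalized to probabilities.
counts = [
    [4, 2, 1, 0],
    [2, 4, 0, 0],
    [1, 0, 6, 1],
    [0, 0, 1, 2],
]
total = sum(map(sum, counts))
p = [[c / total for c in row] for row in counts]
features = texture_features(p)
```

Computing such features for several displacements and directions and stacking the results is one way a multi-dimensional texture vector can arise; how the paper arrives at exactly 88 dimensions is not specified in this section.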

Experimental Study
In this paper, a CAD system that classifies mammogram images into the fatty, fatty-glandular, and dense-glandular breast tissue types is proposed. First, the mammogram images are pre-processed: median filtering and morphological operations are applied for noise reduction and artifact suppression, respectively, and the pectoral muscle is then removed using a region growing algorithm. Second, at the feature extraction stage, 88-dimensional texture features are computed from the GLCMs of the mammogram images. Finally, classification is performed using Fisher's Linear Discriminant Analysis (FLDA) in four different classification cases.
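To make the classifier concrete, the following is a minimal two-class Fisher discriminant on 2-dimensional toy features. It is a sketch under stated assumptions: the paper works with 88- or 196-dimensional vectors, and the midpoint decision threshold, the toy data, and all function names here are hypothetical.

```python
def mean_vector(samples):
    """Component-wise mean of a list of 2-D points."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(2)]


def scatter_matrix(samples, mean):
    """2x2 within-class scatter: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        d = [x[0] - mean[0], x[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_fisher_2d(class0, class1):
    """Return (w, threshold) with w = Sw^-1 (m1 - m0).

    The threshold is the midpoint of the projected class means (an assumption).
    """
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    proj0 = w[0] * m0[0] + w[1] * m0[1]
    proj1 = w[0] * m1[0] + w[1] * m1[1]
    return w, 0.5 * (proj0 + proj1)


def predict(w, threshold, x):
    """Assign class 1 if the projection of x exceeds the threshold, else class 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Made-up 2-D feature vectors for two classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
w, threshold = fit_fisher_2d(class0, class1)
```

The projection direction maximizes the separation of the class means relative to the within-class scatter; a three-class problem can then be handled either by a multi-class extension or by chaining binary decisions.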

Database
A publicly available mammogram database constructed by the Mammographic Image Analysis Society (MIAS) is used in this paper (Suckling et al., 1994). The database covers three health status classes (normal, benign, and malignant) for each of three breast tissue type classes (fatty, fatty-glandular, and dense-glandular). It contains 322 mediolateral oblique (MLO) mammogram images (106 fatty, 104 fatty-glandular, and 112 dense-glandular) with 330 diagnoses (207 normal, 69 benign, and 54 malignant). The images in the MIAS database are 1024x1024 pixels at a resolution of 200 μm/pixel and are stored in the ".pgm" format. All mammogram images in the database are resized to 256x256 pixels for ease of processing. Sample images of each class in the MIAS database are shown in (Figure.

Feature Vector Construction
In this paper, the texture features, given in (Table.1

Classification
Breast tissue type classification is performed in one-stage and two-stage processes in this paper. In the one-stage classification process, mammogram images are directly categorized as having fatty, fatty-glandular, or dense-glandular tissue types. The two-stage classification process is carried out in three different ways: first fatty – then others, first fatty-glandular – then others, and first dense-glandular – then others.
In the first stage of the "first fatty – then others" classification process, the mammogram images are classified as fatty or non-fatty. Then, in the second stage, the mammograms labeled as non-fatty are classified as fatty-glandular or dense-glandular. Similarly, in the "first fatty-glandular – then others" classification case, mammograms are first classified as fatty-glandular or non-fatty-glandular, and then the non-fatty-glandular mammograms are classified as fatty or dense-glandular. Finally, in the "first dense-glandular – then others" classification case, mammograms are first categorized as dense-glandular or non-dense-glandular, and then the non-dense-glandular mammograms are classified as fatty or fatty-glandular.
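The two-stage logic can be sketched independently of the classifier used at each stage. In the sketch below the two stage classifiers are hypothetical stand-ins operating on a made-up scalar "density score"; in the paper each stage is an FLDA classifier applied to the extracted feature vectors.

```python
TISSUE_TYPES = ("fatty", "fatty-glandular", "dense-glandular")


def two_stage_classify(sample, first_type, is_first_type, split_remaining):
    """Classify one sample with a 'first X, then others' cascade.

    is_first_type(sample)   -> True if the sample belongs to first_type (stage 1)
    split_remaining(sample) -> one of the two remaining tissue types (stage 2)
    """
    if is_first_type(sample):
        return first_type
    label = split_remaining(sample)
    if label == first_type or label not in TISSUE_TYPES:
        raise ValueError("stage 2 must return one of the two remaining types")
    return label


# Hypothetical stand-ins for the two trained binary classifiers: each sample is
# a single made-up "density score" in [0, 1] rather than a real feature vector.
def looks_fatty_glandular(score):
    return 0.35 <= score < 0.65


def fatty_or_dense(score):
    return "fatty" if score < 0.5 else "dense-glandular"


scores = [0.10, 0.50, 0.90]
labels = [
    two_stage_classify(s, "fatty-glandular", looks_fatty_glandular, fatty_or_dense)
    for s in scores
]
```

The other two cascades follow by passing a different `first_type` together with the matching pair of stage classifiers.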

Performance Evaluation
In this paper, a breast tissue type detection system is designed for four different classification cases using the FLDA classifier. Average and fold-by-fold classification accuracies of the FLDA classifier when texture features are used in the four classification processes are given in ( strengthens the data representability of GLCM texture features. Moreover, the maximum average classification accuracy is again attained in the "first fatty-glandular – then others" classification case. Radiologists already know that fatty-glandular tissue can easily be confused with the fatty and dense-glandular tissue types. Hence, it is sensible to detect the fatty-glandular tissue type first, and then to re-categorize the non-fatty-glandular mammograms as fatty or dense-glandular. This observation is consistent with the classification process that gives the maximum classification accuracy.
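The fold-by-fold and average accuracies mentioned above can be computed as in the following sketch. The number of folds, the labels, and the predictions are made up for illustration; they are not the paper's data or results.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def fold_report(folds):
    """Return (per-fold accuracies, their unweighted average).

    folds: list of (true_labels, predicted_labels) pairs, one per test fold.
    """
    per_fold = [accuracy(t, p) for t, p in folds]
    return per_fold, sum(per_fold) / len(per_fold)


# Two made-up test folds of four mammograms each.
folds = [
    (["fatty", "fatty-glandular", "dense-glandular", "fatty"],
     ["fatty", "fatty", "dense-glandular", "fatty"]),
    (["dense-glandular", "fatty-glandular", "fatty", "fatty-glandular"],
     ["dense-glandular", "dense-glandular", "fatty-glandular", "fatty-glandular"]),
]
per_fold, average = fold_report(folds)
```

The average here weights every fold equally; if folds differ in size, pooling all predictions before computing accuracy gives a slightly different figure.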

Conclusion
In this paper, a Computer-Aided Detection (CAD) system for breast tissue type classification is designed and verified on a popular mammogram database compiled by the Mammographic Image Analysis Society (MIAS). This database consists of mammograms of three fundamental breast tissue types, namely fatty, fatty-glandular, and dense-glandular. The 88-dimensional texture features computed from the GLCMs of the mammogram images are concatenated with the 108-dimensional feature ensemble. The resulting 196-dimensional feature vectors, one per mammogram image, are classified using Fisher's Linear Discriminant Analysis (FLDA) classifier in four different classification cases: one-stage classification, first fatty – then others, first fatty-glandular – then others, and first dense-glandular – then others.
A maximum classification accuracy of 72.93% is achieved using the texture features alone, and it increases to 82.48% when the final 196-dimensional feature vectors are employed. This result indicates that the final concatenated feature vectors are more descriptive than the texture features alone. In addition, increasing the dimensionality of the feature vectors in this study yields more representative vectors, so that the mammogram images are characterized more effectively.