Authors Sayan Hazra & Sankalpa Chowdhury
Understanding how and why we are here is one of the fundamental questions for the human race. Part of the answer to this question lies in the origins of galaxies, such as our own Milky Way. Yet questions remain about how the Milky Way (or any of the other ~100 billion galaxies in our Universe) was formed and has evolved. Galaxies come in all shapes, sizes and colors: from beautiful spirals to huge ellipticals. Understanding the distribution, location and types of galaxies as a function of shape, size, and color are critical pieces for solving this puzzle. (Source)
With each passing day telescopes around and above the Earth capture more and more images of distant galaxies. As better and bigger telescopes continue to collect these images, the datasets begin to explode in size. In order to better understand how the different shapes (or morphologies) of galaxies relate to the physics that create them, such images need to be sorted and classified.
Galaxies in this set have already been classified once with the help of hundreds of thousands of volunteers, who collectively classified the shapes of these images by eye in a successful citizen science crowdsourcing project. However, this approach becomes less feasible as data sets grow to contain hundreds of millions (or even billions) of galaxies. Here we implement a deep learning model to classify a large number of galaxies with high accuracy.
To use the repository, clone it using
$ git clone https://github.com/sankalpachowdhury/Galaxy-Classification-using-CNN.git
File Structure
├── File_Structure
│ └── generating_git_file_structure.ipynb
├── Galaxy-Classification CNN models
│ ├── Galaxy_classification_CNN_final_model_26_08_20.ipynb
│ └── Galaxy_classification_CNN_model_comparisons_26_08_20.ipynb
├── Images
│ ├── Decisiontree2.PNG
│ ├── Decisiontree.PNG
│ ├── final_model.png
│ ├── hubble_t.jpg
│ └── README.md
├── LICENSE
├── Model testing
│ ├── 24_08_20_Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│ ├── 25_08_20_Model2_Galaxy_classification_Early_St_&_MCh(2).ipynb
│ ├── Copy_of_Copy_of_Model1_Galaxy_classification_sankalpa_v5.ipynb
│ ├── Copy_of_Model1_Galaxy_classification_augmentation.ipynb
│ ├── Copy_of_Model1_Galaxy_classification_sankalpa_v3.ipynb
│ ├── Copy_of_Model1_Galaxy_classification_sankalpa_v5.ipynb
│ ├── Copy_of_Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│ ├── Copy_of_Untitled35.ipynb
│ ├── data_augmentation.ipynb
│ ├── Decisiontree_to_3classes.ipynb
│ ├── Decisiontree_to_3classes_sankalpa_v1.ipynb
│ ├── Galaxy_clasification.ipynb
│ ├── Galaxy_classification_sankalpa_v2.ipynb
│ ├── Model1_Galaxy_classification_sankalpa_v2.ipynb
│ ├── Model1_Galaxy_classification_sankalpa_v3-2.ipynb
│ ├── Model1_Galaxy_classification_sankalpa_v3.ipynb
│ ├── Model1_Galaxy_classification_sankalpa_v4.ipynb
│ ├── Model1_Galaxy_classification_sankalpa_v5.ipynb
│ ├── Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│ ├── Model2_of_Galaxy_classification_sankalpa_v2.ipynb
│ ├── new
│ └── SayanDa_version.ipynb
├── Python files
│ ├── All_in_one.py
│ ├── CNN_model.py
│ ├── data_augmentation.py
│ ├── data_segmentation.py
│ ├── data_storing.py
│ ├── image_generator.py
│ ├── images_visualization.py
│ ├── intermediate_activation_vis.py
│ ├── model_compilation_training.py
│ ├── model_evaluation_visualization.py
│ └── modules.py
├── README.md
└── Weights and bias
├── best_model-27-08-2020-final.h5
└── best_model (2).h5
Galaxy-Classification CNN models-> This folder contains the notebooks corresponding to the finalized models.
- Galaxy_classification_CNN_final_model_26_08_20.ipynb -- Final model
- Galaxy_classification_CNN_model_comparisons_26_08_20.ipynb -- Model comparison
Model testing contains all tested models and other notebooks.
Python files-> These files can be used to run the final Galaxy Classifier model locally:
- All_in_one.py -- Contains the implementation of all the steps of the final model building process (described below). To run the trained model directly from the stored weights and biases, replace best_model.h5 with best_model-27-08-2020-final.h5 in the following line of "All_in_one.py", and evaluate.
model_param_file = 'best_model.h5'
- CNN_model.py -- Contains CNN model creation
- data_augmentation.py -- Data augmentation. This requires the Augmentor package:
! pip install Augmentor
- data_segmentation.py -- Segmenting images based on the survey result and the decision tree discussed later.
- data_storing.py -- Loading and storing raw data
- image_generator.py -- Implementing image_generator
- images_visualization.py -- Image visualization steps
- intermediate_activation_vis.py -- Visualizing intermediate activation layers
- model_compilation_training.py -- Model compilation and training
- model_evaluation_visualization.py -- Trained model evaluation
- modules.py -- Contains the necessary Python modules
Weights and bias-> Contains the weights and biases of the final model after optimization.
- best_model-27-08-2020-final.h5 -- Contains the weights and biases of the final model. It can be loaded directly in the notebooks and in "All_in_one.py" for evaluation and classification.
Data preparation and segregation are done based on the decision tree referenced from the dataset.
- Dataset Source:
The dataset is hosted as a Kaggle challenge: Kaggle Galaxy Zoo
- Decision tree: Galaxy Zoo 2 paper
The Galaxy Zoo 2 survey was based on a series of interconnected decision steps, each asking a significant question, as shown:
Weighting the responses
For the first set of responses (smooth, features/disk, star/artifact), the values in each category are simply the likelihood of the galaxy falling in each category; these values sum to 1.0. For each subsequent question, the probabilities are first computed (these will sum to 1.0) and then multiplied by the value which led to that new set of responses.
Example: Suppose for a galaxy 80% of users identify it as smooth, 15% as having features/disk, and 5% as a star/artifact.
Class1.1 = 0.80
Class1.2 = 0.15
Class1.3 = 0.05
For the 80% of users that identified the galaxy as "smooth", they also recorded responses for the galaxy's relative roundness. These votes were for 50% completely round, 25% in-between, and 25% cigar-shaped. The values in the solution file are thus:
Class 7.1 = 0.80 * 0.50 = 0.40
Class 7.2 = 0.80 * 0.25 = 0.20
Class 7.3 = 0.80 * 0.25 = 0.20
The reason for this weighting is to emphasize that a good solution must get the high-level, large-scale morphology categories correct. The best solutions, though, will also have high levels of accuracy on the detailed solutions that are further down the decision tree.
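The weighting rule can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example above; the function name `weight_responses` and the dictionary layout are illustrative and not taken from the repository.

```python
def weight_responses(parent_value, vote_fractions):
    """Scale the vote fractions of a follow-up question by the value
    of the answer that led to that question."""
    total = sum(vote_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("vote fractions of one question must sum to 1.0")
    return {name: parent_value * frac for name, frac in vote_fractions.items()}


# First question: the values are plain likelihoods and sum to 1.0.
class1 = {"Class1.1": 0.80, "Class1.2": 0.15, "Class1.3": 0.05}

# Roundness question, asked only of the users who answered "smooth" (Class1.1).
class7 = weight_responses(
    class1["Class1.1"],
    {"Class7.1": 0.50, "Class7.2": 0.25, "Class7.3": 0.25},
)

for name, value in class7.items():
    print(f"{name} = {value:.2f}")
```

The weighted values of a follow-up question add up to the value of the answer that led to it (here 0.80), not to 1.0.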
Based on the referenced decision tree, the images are segregated into the three main classes of Hubble's tuning fork, which are
1. Elliptical, 2. Lenticular, 3. Spiral
- Image Data
The source contains 65000 images of galaxies. Each image is a 424 x 424 RGB file (.jpeg), and the galaxy ID is used as the image file name.
- Labels
Based on the survey responses recorded in training_solutions_rev1.csv, the galaxy IDs are classified into the three categories mentioned above, using the concepts from the weighting the responses section and the decision tree architecture described in the Galaxy Zoo 2 paper.
NOTE: Galaxy-IDs are used to map images to the respective labels.
- Train-Test split
The dataset of images is segmented into Train and Validation sets. ~90% of the images go into the Training set and the remainder into the Validation set.
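A minimal sketch of such a split is given here, assuming a random shuffle with a fixed seed. The README does not say how the images were actually assigned to the two sets, so the shuffle, the seed, and the helper name `split_ids` are assumptions.

```python
import random


def split_ids(galaxy_ids, train_fraction=0.9, seed=0):
    """Shuffle the galaxy ids and split them into train and validation lists."""
    ids = list(galaxy_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


# Hypothetical ids, used only for illustration.
train_ids, val_ids = split_ids(range(100000, 100100))
print(len(train_ids), len(val_ids))
```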
- Data Classification and image segregation
Classification
- The recorded survey is loaded into a dataframe from the csv.
- Based on the decision tree the galaxy ids corresponding to the three classes are stored into three lists.
Images segregation
- The galaxy images are segregated into three different folders, named after the corresponding classes, with the help of the lists returned from the previous step.
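The segregation step could look roughly like the sketch below. It assumes the three id lists already exist and that each image is stored as `<galaxy id>.jpg`. The function name, the folder names, and the file extension are illustrative assumptions, and the decision-tree rules that produce the lists are not reproduced here.

```python
import shutil
from pathlib import Path


def segregate_images(ids_by_class, source_dir, target_dir, extension=".jpg"):
    """Copy each galaxy image into a folder named after its class.

    ids_by_class maps a class name to the list of galaxy ids in that class.
    Returns the number of images copied per class.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = {}
    for class_name, galaxy_ids in ids_by_class.items():
        class_dir = target_dir / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        count = 0
        for galaxy_id in galaxy_ids:
            src = source_dir / f"{galaxy_id}{extension}"
            if src.exists():  # skip ids that have no matching image file
                shutil.copy(src, class_dir / src.name)
                count += 1
        copied[class_name] = count
    return copied
```

A call such as `segregate_images({"elliptical": [...], "lenticular": [...], "spiral": [...]}, "images_raw", "images_sorted")` would then create one folder per class; the directory names in this call are placeholders.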
- Image Augmentation
Image augmentation is used as a preprocessing technique, using the Augmentor library, which helps to reduce overfitting.
Augmentation techniques used
1. Rotation
- Rotate 90 degrees (Probability = 0.5)
- Rotate 270 degrees (Probability = 0.5)
2. Mirroring
- Horizontal flip (Probability = 0.5)
- Vertical flip (Probability = 0.5)
3. Resizing
- Augmented image size = (150,150) | (Probability = 0.5)
Target training samples after augmentation = 8000 for each class,
Target validation samples after augmentation = 1000 for each class
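The operations listed above map onto the Augmentor pipeline API roughly as follows. This is a sketch rather than the contents of data_augmentation.py: the helper names and the per-class directory layout are assumptions, and Augmentor is imported inside the function so the snippet can be read without the package installed.

```python
AUGMENT_PROBABILITY = 0.5
IMAGE_SIZE = (150, 150)  # (width, height) of the augmented images


def build_pipeline(class_dir):
    """Create an Augmentor pipeline for the images of one class folder."""
    import Augmentor  # third-party package: pip install Augmentor

    pipeline = Augmentor.Pipeline(source_directory=class_dir)
    pipeline.rotate90(probability=AUGMENT_PROBABILITY)
    pipeline.rotate270(probability=AUGMENT_PROBABILITY)
    pipeline.flip_left_right(probability=AUGMENT_PROBABILITY)  # horizontal flip
    pipeline.flip_top_bottom(probability=AUGMENT_PROBABILITY)  # vertical flip
    pipeline.resize(
        probability=AUGMENT_PROBABILITY,
        width=IMAGE_SIZE[0],
        height=IMAGE_SIZE[1],
    )
    return pipeline


def augment_class(class_dir, target_samples):
    """Generate target_samples augmented images for one class folder."""
    build_pipeline(class_dir).sample(target_samples)
```

For example, `augment_class("train/spiral", 8000)` would generate the training target for one class. The path is a placeholder, and by default Augmentor writes the samples to an `output` folder inside the source directory.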
The galaxy classification problem is solved with deep learning classification methods, using Convolutional Neural Networks (CNN). CNN models are known to work well on image classification problems. Here, different model architectures are used with different combinations of hyperparameters. The model architectures used are defined in detail in the colab notebooks Tested models. After testing several custom models as well as open source models like ResNet50 and evaluating their performance, two optimal models were selected, whose architectures are given below-->
- Filter shape: (3,3)
- Pool shape (Constant): (2,2)
- Input shape: (150,150)
- Batch Normalization parameters (default):
  - momentum = 0.99,
  - epsilon = 0.001,
  - renorm_momentum = 0.99
Layer1:
a. Convolution: No. of filters: 32, Activation: RELU, Batch normalization
b. Max-pool
Layer2:
a. Convolution: No. of filters: 32, Activation: RELU, Batch normalization
b. Max-pool
Layer3:
a. Convolution: No. of filters: 64, Activation: RELU, Batch normalization
b. Max-pool
Layer4:
a. Convolution: No. of filters: 64, Activation: RELU, Batch normalization
b. Max-pool
Layer5:
a. Convolution: No. of filters: 128, Activation: RELU
b. Max-pool
Layer6: Flatten
Layer7:
Fully connected layer: No. of nodes: 512, Activation: RELU
Layer8:
Fully connected layer: No. of nodes: 128, Activation: RELU
Output layer:
Softmax layer: No. of units: 3, Activation: Softmax
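To see how the five convolution and max-pool blocks listed above shrink a 150 x 150 input, the spatial size can be traced by hand. The sketch assumes unpadded ('valid') convolutions with stride 1 and pooling with a stride equal to the pool size, which are the Keras defaults; the README does not state the padding, so treat the numbers as illustrative.

```python
def conv_pool_output_sizes(input_size=150, n_blocks=5, kernel=3, pool=2):
    """Return the spatial size after each convolution + max-pool block."""
    sizes = []
    size = input_size
    for _ in range(n_blocks):
        size = size - kernel + 1  # 'valid' convolution with stride 1
        size = size // pool       # max-pool with stride equal to the pool size
        sizes.append(size)
    return sizes


sizes = conv_pool_output_sizes()
# The last convolution block uses 128 filters, so the flatten layer sees:
flattened_units = sizes[-1] * sizes[-1] * 128
print(sizes, flattened_units)
```

Under these assumptions the feature map is 2 x 2 after the fifth block, so the flatten layer passes 512 values to the first fully connected layer.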
- Filter shape: (3,3)
- Pool shape (Constant): (2,2)
- Input shape: (150,150)
- Batch Normalization parameters (default):
  - momentum = 0.99,
  - epsilon = 0.001,
  - renorm_momentum = 0.99
Layer1:
a. Convolution: No. of filters: 64, Activation: RELU,
b. Max-pool
Layer2:
a. Convolution: No. of filters: 64, Activation: RELU,
b. Max-pool
Layer3:
a. Convolution: No. of filters: 128, Activation: RELU,
b. Max-pool
Layer4:
a. Convolution: No. of filters: 128, Activation: RELU,
b. Max-pool
Layer5:
a. Convolution: No. of filters: 128, Activation: RELU
b. Max-pool
Layer6: Flatten
Layer7: Dropout layer(Probability = 0.5)
Layer8:
Fully connected layer: No. of nodes: 512, Activation: RELU
Output layer:
Softmax layer: No. of units: 3, Activation: Softmax
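The layer list above could be written in Keras roughly as follows. This is a sketch, not the code from CNN_model.py: it assumes three input channels (RGB) and default padding. Batch normalization is left out because the layer list above does not mention it, even though its parameters are given. The function name `build_model_2` is illustrative, and TensorFlow is imported inside the function.

```python
CONV_FILTERS = [64, 64, 128, 128, 128]  # filters of the five convolution layers


def build_model_2(input_shape=(150, 150, 3), n_classes=3):
    """Build a CNN following the layer list above (no batch normalization)."""
    from tensorflow.keras import Input, Sequential, layers

    model = Sequential()
    model.add(Input(shape=input_shape))
    for filters in CONV_FILTERS:
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

Calling `build_model_2().summary()` prints the resulting layer shapes and parameter counts.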
The two models are implemented using Keras and TensorFlow. The details of the model implementations and the performance comparison can be found in Galaxy_classification_CNN_model_comparisons_26_08_20.
Model compilation and Introducing optimization technique
- Optimizer: Adam (Learning rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-07)
- Cost function: Categorical cross entropy
- Metrics: Accuracy
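With these settings, the compilation step might look like the sketch below. The helper name `compile_model` and the `ADAM_SETTINGS` dictionary are illustrative; the keyword names are those of the Keras `Adam` optimizer.

```python
ADAM_SETTINGS = {
    "learning_rate": 0.001,
    "beta_1": 0.9,
    "beta_2": 0.999,
    "epsilon": 1e-07,
}


def compile_model(model):
    """Compile a Keras model with the optimizer, loss and metric listed above."""
    from tensorflow.keras.optimizers import Adam

    model.compile(
        optimizer=Adam(**ADAM_SETTINGS),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```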
Implementing Callbacks
- In the custom callback function myCallback, the baseline val_loss is taken as 0.2500; once this baseline is reached, the callback stops the training.
- The best model is stored in the best_model.h5 file with the help of the ModelCheckpoint callback.
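A possible shape for these two callbacks is sketched below. It assumes that "reaching the baseline" means the validation loss falling to 0.25 or lower, and that the checkpoint monitors `val_loss` and keeps only the best model. Both are assumptions, as are the helper names; the repository's myCallback may differ.

```python
VAL_LOSS_BASELINE = 0.25


def reached_baseline(logs, baseline=VAL_LOSS_BASELINE):
    """Return True when the epoch logs report a val_loss at or below the baseline."""
    val_loss = (logs or {}).get("val_loss")
    return val_loss is not None and val_loss <= baseline


def make_callbacks(checkpoint_path="best_model.h5"):
    """Create the custom stopping callback and the model checkpoint."""
    from tensorflow.keras.callbacks import Callback, ModelCheckpoint

    class MyCallback(Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Stop training as soon as the validation loss reaches the baseline.
            if reached_baseline(logs):
                self.model.stop_training = True

    checkpoint = ModelCheckpoint(
        checkpoint_path, monitor="val_loss", save_best_only=True
    )
    return [MyCallback(), checkpoint]
```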
Training the Model
- Training is implemented using fit_generator
Arguments :
- Training data: train_generator (Contains train data)
- Validation data: validation_generator (Contains validation data)
- Epochs: 140 (Because of the Google Colab buffer memory restriction, the model was run twice sequentially (70 + 70).)
- Callbacks: Custom_callback, Model_checkpoint
- Verbose: 1
History of the model training is stored for the analysis.
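The training call with the arguments above might be sketched as follows. `fit_generator` is deprecated in TensorFlow 2.x, where `fit` accepts generators directly, so the sketch uses `fit`. Splitting the 140 epochs into two runs of 70 mirrors the note above; the function name and the way the histories are collected are assumptions.

```python
RUNS = 2
EPOCHS_PER_RUN = 70  # 2 x 70 = 140 epochs in total


def train_in_runs(model, train_generator, validation_generator, callbacks):
    """Train the model in sequential runs and return one History per run."""
    histories = []
    for _ in range(RUNS):
        history = model.fit(
            train_generator,
            validation_data=validation_generator,
            epochs=EPOCHS_PER_RUN,
            callbacks=callbacks,
            verbose=1,
        )
        histories.append(history)  # kept for later analysis of the training curves
    return histories
```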
An EarlyStopping object was also created, with a validation loss baseline of 0.2791, but it was not used. Instead, the custom callback was applied to the model.
Hyperparameters:
- Minibatch size: 64
- Epochs: 10
- Steps per Epochs : 8
- Optimizer : Adam
- Learning Rate : 0.001
- Loss function : Categorical Cross Entropy
Both models were validated on the Validation set, where the accuracy reaches up to ~89%.
The model trainings are visualized in the colab notebook, which indicates that the performance of Model 2 is much more stable than that of Model 1, which is why Model 2 was finalized. The implementation of Model 2 is in a separate notebook, Galaxy_classification_CNN_final_model_26_08_20.
The final model training performance is given below:
- Galaxy zoo 2 paper: https://blog.galaxyzoo.org/category/paper/
- Kaggle Galaxy: https://github.com/benanne/kaggle-galaxies