Galaxy Classification using CNN based on SLOAN digital sky survey data.

sayan0506/Galaxy-Classification-using-CNN

 
 

Galaxy-Classification-using-CNN

Authors Sayan Hazra & Sankalpa Chowdhury

Problem Statement :

Understanding how and why we are here is one of the fundamental questions for the human race. Part of the answer to this question lies in the origins of galaxies, such as our own Milky Way. Yet questions remain about how the Milky Way (or any of the other ~100 billion galaxies in our Universe) was formed and has evolved. Galaxies come in all shapes, sizes and colors: from beautiful spirals to huge ellipticals. Understanding the distribution, location and types of galaxies as a function of shape, size, and color are critical pieces for solving this puzzle. (Source)

With each passing day telescopes around and above the Earth capture more and more images of distant galaxies. As better and bigger telescopes continue to collect these images, the datasets begin to explode in size. In order to better understand how the different shapes (or morphologies) of galaxies relate to the physics that create them, such images need to be sorted and classified.

Galaxies in this set have already been classified once through the help of hundreds of thousands of volunteers, who collectively classified the shapes of these images by eye in a successful citizen science crowdsourcing project. However, this approach becomes less feasible as data sets grow to contain hundreds of millions (or even billions) of galaxies. Here we implement a deep learning model to classify large numbers of galaxies with high accuracy.

How to use this repository?

To use the repository, clone it with $ git clone https://github.com/sankalpachowdhury/Galaxy-Classification-using-CNN.git

File Structure

├── File_Structure
│   └── generating_git_file_structure.ipynb
├── Galaxy-Classification CNN models
│   ├── Galaxy_classification_CNN_final_model_26_08_20.ipynb
│   └── Galaxy_classification_CNN_model_comparisons_26_08_20.ipynb
├── Images
│   ├── Decisiontree2.PNG
│   ├── Decisiontree.PNG
│   ├── final_model.png
│   ├── hubble_t.jpg
│   └── README.md
├── LICENSE
├── Model testing
│   ├── 24_08_20_Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│   ├── 25_08_20_Model2_Galaxy_classification_Early_St_&_MCh(2).ipynb
│   ├── Copy_of_Copy_of_Model1_Galaxy_classification_sankalpa_v5.ipynb
│   ├── Copy_of_Model1_Galaxy_classification_augmentation.ipynb
│   ├── Copy_of_Model1_Galaxy_classification_sankalpa_v3.ipynb
│   ├── Copy_of_Model1_Galaxy_classification_sankalpa_v5.ipynb
│   ├── Copy_of_Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│   ├── Copy_of_Untitled35.ipynb
│   ├── data_augmentation.ipynb
│   ├── Decisiontree_to_3classes.ipynb
│   ├── Decisiontree_to_3classes_sankalpa_v1.ipynb
│   ├── Galaxy_clasification.ipynb
│   ├── Galaxy_classification_sankalpa_v2.ipynb
│   ├── Model1_Galaxy_classification_sankalpa_v2.ipynb
│   ├── Model1_Galaxy_classification_sankalpa_v3-2.ipynb
│   ├── Model1_Galaxy_classification_sankalpa_v3.ipynb
│   ├── Model1_Galaxy_classification_sankalpa_v4.ipynb
│   ├── Model1_Galaxy_classification_sankalpa_v5.ipynb
│   ├── Model2_Galaxy_classification_Early_St_&_MCh.ipynb
│   ├── Model2_of_Galaxy_classification_sankalpa_v2.ipynb
│   ├── new
│   └── SayanDa_version.ipynb
├── Python files
│   ├── All_in_one.py
│   ├── CNN_model.py
│   ├── data_augmentation.py
│   ├── data_segmentation.py
│   ├── data_storing.py
│   ├── image_generator.py
│   ├── images_visualization.py
│   ├── intermediate_activation_vis.py
│   ├── model_compilation_training.py
│   ├── model_evaluation_visualization.py
│   └── modules.py
├── README.md
└── Weights and bias
    ├── best_model-27-08-2020-final.h5
    └── best_model (2).h5

Galaxy-Classification CNN models-> This folder contains the notebooks corresponding to the finalized models.

  • Galaxy_classification_CNN_final_model_26_08_20.ipynb -- Final model
  • Galaxy_classification_CNN_model_comparisons_26_08_20.ipynb -- Model comparison

Model testing contains all tested models and other notebooks.

Python files-> These files can be used to run the final Galaxy Classifier model locally:

  • All_in_one.py -- Contains the implementation of all the steps of the final model building process (mentioned below). To use the trained model directly with the stored weights and biases, replace best_model.h5 with best_model-27-08-2020-final.h5 in the following line of "All_in_one.py", and evaluate.

model_param_file = 'best_model.h5'

  • CNN_model.py -- Contains CNN model creation
  • data_augmentation.py -- Data augmentation (requires the Augmentor package: pip install Augmentor)
  • data_segmentation.py -- Segmenting images based on the survey results and the decision tree discussed later.
  • data_storing.py -- Loading and storing raw data
  • image_generator.py -- Implementing image_generator
  • images_visualization.py -- Image visualization steps
  • intermediate_activation_vis.py -- Visualizing intermediate activation layers
  • model_compilation_training.py -- Model compilation and training
  • model_evaluation_visualization.py -- Trained model evaluation
  • modules.py -- Contains the necessary Python modules

Weights and bias-> Contains the weights and biases of the final model after optimization.

  • best_model-27-08-2020-final.h5 -- Contains the weights and biases of the final model. It can be loaded directly in the notebooks and in "All_in_one.py" for evaluation and classification.
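Loading the stored file for classification could look roughly like the sketch below. It assumes the .h5 file holds a complete Keras model (not weights only) and that TensorFlow is installed. The helper names `load_final_model` and `top_class`, and the class order in `CLASS_NAMES`, are illustrative assumptions and are not taken from the repository.

```python
# Assumed class order; check the order used by the training generator.
CLASS_NAMES = ("elliptical", "lenticular", "spiral")


def load_final_model(path="Weights and bias/best_model-27-08-2020-final.h5"):
    """Load the stored model (assumes a full Keras model was saved)."""
    from tensorflow.keras.models import load_model  # third-party dependency
    return load_model(path)


def top_class(probabilities, class_names=CLASS_NAMES):
    """Return the class name with the highest softmax probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best]
```

With a preprocessed image batch `batch`, `top_class(load_final_model().predict(batch)[0])` would then give the predicted class of the first image.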

Dataset

Data preparation and segregation are done based on the decision tree referenced from the Dataset.

  • Dataset Source:

The dataset is hosted as a Kaggle challenge. Kaggle Galaxy Zoo

The Galaxy Zoo 2 survey was built on a decision tree of interconnected questions, as shown: img1

Weighting the responses

For the first set of responses (smooth, features/disk, star/artifact), the values in each category are simply the likelihood of the galaxy falling in each category, and they sum to 1.0. For each subsequent question, the probabilities are first computed (these will sum to 1.0) and then multiplied by the value which led to that new set of responses.

Example: Suppose for a galaxy 80% of users identify it as smooth, 15% as having features/disk, and 5% as a star/artifact.

    Class1.1 = 0.80
    Class1.2 = 0.15
    Class1.3 = 0.05

For the 80% of users that identified the galaxy as "smooth", they also recorded responses for the galaxy's relative roundness. These votes were for 50% completely round, 25% in-between, and 25% cigar-shaped. The values in the solution file are thus:

    Class 7.1 = 0.80 * 0.50 = 0.40
    Class 7.2 = 0.80 * 0.25 = 0.20
    Class 7.3 = 0.80 * 0.25 = 0.20

The reason for this weighting is to emphasize that a good solution must get the high-level, large-scale morphology categories correct. The best solutions, though, will also have high levels of accuracy on the detailed solutions that are further down the decision tree.
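The arithmetic of the example above can be reproduced in a few lines of Python. This is only a sketch of the weighting rule as described here; the function name `weight_responses` and the dictionary layout are illustrative and are not part of the Kaggle files.

```python
def weight_responses(parent_value, vote_fractions):
    """Scale the vote fractions of a follow-up question by the value of
    the answer that led to it."""
    total = sum(vote_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("vote fractions for one question must sum to 1.0")
    return {name: parent_value * frac for name, frac in vote_fractions.items()}


# First question: the values are the vote fractions themselves.
question1 = {"Class1.1": 0.80, "Class1.2": 0.15, "Class1.3": 0.05}

# Follow-up question, asked only of users who answered "smooth" (Class1.1).
roundness_votes = {"Class7.1": 0.50, "Class7.2": 0.25, "Class7.3": 0.25}
question7 = weight_responses(question1["Class1.1"], roundness_votes)

for name, value in question7.items():
    print(f"{name} = {value:.2f}")
```

The weighted values of a follow-up question sum to the value of the parent answer (0.80 here), not to 1.0.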

Based on the referenced decision tree, the images are segregated into the three main classes of Hubble's tuning fork, which are

1. Elliptical, 2. Lenticular, 3. Spiral

tuningfork

  • Image Data

The source contains 65000 images of galaxies. Each image is a 424 x 424 RGB file (.jpeg), and the galaxy ID is used as the image file name.

  • Labels

Based on the survey responses recorded in training_solutions_rev1.csv, the galaxy IDs are classified into the three categories mentioned above, using the concepts from the Weighting the responses section and the decision tree architecture described in the Galaxy Zoo 2 paper.

NOTE: Galaxy-IDs are used to map images to the respective labels.

Decision tree

  • Train-Test split

The dataset of images is segmented into Train and Validation sets. ~90% of the images are taken into Training set and the remaining for the Validation set.
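A ~90/10 split over galaxy IDs can be sketched as follows. The README does not say how the split was drawn, so the random shuffle, the fixed seed, and the function name `split_ids` are assumptions made for illustration.

```python
import random


def split_ids(galaxy_ids, train_fraction=0.9, seed=0):
    """Shuffle the galaxy IDs reproducibly and cut them into two lists."""
    ids = sorted(galaxy_ids)
    random.Random(seed).shuffle(ids)  # fixed seed: assumption, for repeatability
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Example with 100 made-up galaxy IDs.
train_ids, val_ids = split_ids(range(100000, 100100))
print(len(train_ids), len(val_ids))
```

Splitting by ID before any augmentation keeps augmented copies of one galaxy from landing in both sets.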

Preprocessing

  • Data Classification and image segregation

    Classification

    • The recorded survey is loaded into a dataframe from the csv.
    • Based on the decision tree the galaxy ids corresponding to the three classes are stored into three lists.

    Images segregation

    • The galaxy images are segregated into three different folders, named after the classes, with the help of the lists returned from the previous step.
  • Image Augmentation

    Image augmentation is used as a preprocessing step, implemented with the Augmentor package, which helps to reduce overfitting.

    Augmentation techniques used

    1. Rotation

    • Rotate 90 degree(Probability = 0.5)

    • Rotate 270 degree(Probability = 0.5)

    2. Mirroring

    • Horizontal flip(Probability = 0.5)

    • Vertical flip(Probability = 0.5)

    3. Resizing

    • Augmented image size = (150,150) | (Probability = 0.5)

    Target training samples after augmentation = 8000 for each class,
    Target validation samples after augmentation = 1000 for each class
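The segregation and augmentation steps above can be sketched as follows. This is a minimal illustration, not the code from data_segmentation.py or data_augmentation.py: the function names, the folder layout, and the `.jpg` extension are assumptions. The Augmentor calls mirror the operations and probabilities listed above, and Augmentor is imported lazily so the segregation helper works without it.

```python
import shutil
from pathlib import Path


def segregate_images(source_dir, target_dir, ids_by_class, extension=".jpg"):
    """Copy each galaxy image into a folder named after its class.

    ids_by_class maps a class name to the list of galaxy IDs in that class.
    Returns the number of images copied per class.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = {}
    for class_name, galaxy_ids in ids_by_class.items():
        class_dir = target_dir / class_name
        class_dir.mkdir(parents=True, exist_ok=True)
        copied[class_name] = 0
        for galaxy_id in galaxy_ids:
            image_path = source_dir / f"{galaxy_id}{extension}"
            if image_path.exists():  # skip IDs that have no image on disk
                shutil.copy2(image_path, class_dir / image_path.name)
                copied[class_name] += 1
    return copied


def augment_class(class_dir, n_samples=8000, size=150):
    """Generate n_samples augmented images for one class folder."""
    import Augmentor  # third-party: pip install Augmentor

    pipeline = Augmentor.Pipeline(str(class_dir))
    pipeline.rotate90(probability=0.5)
    pipeline.rotate270(probability=0.5)
    pipeline.flip_left_right(probability=0.5)   # horizontal flip
    pipeline.flip_top_bottom(probability=0.5)   # vertical flip
    pipeline.resize(probability=0.5, width=size, height=size)
    pipeline.sample(n_samples)  # by default, written to an "output" subfolder
```

Calling `augment_class` once per class folder with `n_samples=8000` for training and `n_samples=1000` for validation would match the targets stated above.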

Models

The galaxy classification problem is solved with deep learning, using Convolutional Neural Networks (CNNs), which are known to work well for image classification. Different model architectures were tried with different combinations of hyperparameters; they are defined in detail in the Colab notebooks under Tested models. After testing several custom models as well as open-source models such as ResNet50 and evaluating their performance, two optimal models were selected, whose architectures are given below:

Model 1 architecture (sequential)

  • Filter shape: (3,3)

  • Pool shape(Constant): (2,2)

  • Input shape: (150,150)

  • Batch Normalization parameters(default):

    • momentum = 0.99,

    • epsilon = 0.001,

    • renorm_momentum = 0.99

Layer1:

a. Convolution: No. of filters: 32, Activation: RELU, Batch normalization

b. Max-pool

Layer2:

a. Convolution: No. of filters: 32, Activation: RELU, Batch normalization

b. Max-pool

Layer3:

a. Convolution: No. of filters: 64, Activation: RELU, Batch normalization

b. Max-pool

Layer4:

a. Convolution: No. of filters: 64, Activation: RELU, Batch normalization

b. Max-pool

Layer5:

a. Convolution: No. of filters: 128, Activation: RELU

b. Max-pool

Layer6: Flatten

Layer7:

Fully connected layer: No. of nodes: 512, Activation: RELU

Layer8:

Fully connected layer: No. of nodes: 128, Activation: RELU

Output layer:

Softmax layer: No. of units: 3, Activation: Softmax


Model 2 architecture (sequential)

  • Filter shape: (3,3)

  • Pool shape(Constant): (2,2)

  • Input shape: (150,150)

  • Batch Normalization parameters(default):

    • momentum = 0.99,

    • epsilon = 0.001,

    • renorm_momentum = 0.99

Layer1:

a. Convolution: No. of filters: 64, Activation: RELU,

b. Max-pool

Layer2:

a. Convolution: No. of filters: 64, Activation: RELU,

b. Max-pool

Layer3:

a. Convolution: No. of filters: 128, Activation: RELU,

b. Max-pool

Layer4:

a. Convolution: No. of filters: 128, Activation: RELU,

b. Max-pool

Layer5:

a. Convolution: No. of filters: 128, Activation: RELU

b. Max-pool

Layer6: Flatten

Layer7: Dropout layer(Probability = 0.5)

Layer8:

Fully connected layer: No. of nodes: 512, Activation: RELU

Output layer:

Softmax layer: No. of units: 3, Activation: Softmax

The two models are implemented using Keras and TensorFlow. The details of the model implementations and the performance comparison can be found in the Galaxy_classification_CNN_model_comparisons_26_08_20 notebook.
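The two layer listings above translate into Keras roughly as in the sketch below. Several details are assumptions, since the listings do not state them: 3 input channels (RGB), Keras default 'valid' padding and stride 1, and batch normalization placed after the ReLU convolution. The names `MODEL_CONFIGS`, `flattened_features`, and `build_model` are illustrative and are not taken from CNN_model.py.

```python
MODEL_CONFIGS = {
    "model1": {
        "filters": [32, 32, 64, 64, 128],
        "batch_norm": [True, True, True, True, False],
        "dropout": None,
        "dense": [512, 128],
    },
    "model2": {
        "filters": [64, 64, 128, 128, 128],
        "batch_norm": [False] * 5,
        "dropout": 0.5,
        "dense": [512],
    },
}


def flattened_features(config, side=150):
    """Number of features after Flatten, assuming 3x3 'valid' convolutions
    (side - 2) each followed by 2x2 max-pooling (floor division by 2)."""
    for _ in config["filters"]:
        side = (side - 2) // 2
    return side * side * config["filters"][-1]


def build_model(config, input_shape=(150, 150, 3)):
    """Build a Sequential model from one of the configs above."""
    from tensorflow.keras import layers, models  # third-party dependency

    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for n_filters, use_bn in zip(config["filters"], config["batch_norm"]):
        model.add(layers.Conv2D(n_filters, (3, 3), activation="relu"))
        if use_bn:
            model.add(layers.BatchNormalization())  # Keras default parameters
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    if config["dropout"] is not None:
        model.add(layers.Dropout(config["dropout"]))
    for units in config["dense"]:
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(3, activation="softmax"))  # three galaxy classes
    return model
```

Under these assumptions the feature map shrinks from 150 to 2 pixels per side over the five blocks, so both models pass 2 x 2 x 128 = 512 features to the dense layers.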

Training

Model compilation and Introducing optimization technique

  • Optimizer: Adam (learning rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-07)
  • Cost function: Categorical cross entropy
  • metrics : Accuracy

Implementing Callbacks

  • The custom callback myCallback uses a baseline val_loss of 0.2500; once the validation loss reaches this baseline, the callback stops the training.

  • The best model is stored in the best_model.h5 file with the help of the ModelCheckpoint callback.
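The two callbacks could be set up roughly as in the sketch below. It is not the repository's myCallback: the `<=` comparison, the `monitor="val_loss"` and `save_best_only=True` settings, and the helper names `should_stop` and `make_callbacks` are assumptions made for illustration.

```python
def should_stop(logs, baseline=0.25):
    """True once the reported validation loss has reached the baseline."""
    val_loss = (logs or {}).get("val_loss")
    return val_loss is not None and val_loss <= baseline


def make_callbacks(checkpoint_path="best_model.h5", baseline=0.25):
    """Return [custom stopping callback, checkpoint callback]."""
    import tensorflow as tf  # third-party dependency

    class MyCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            if should_stop(logs, baseline):
                print(f"val_loss reached {baseline}; stopping training")
                self.model.stop_training = True

    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path,
        monitor="val_loss",
        save_best_only=True,  # keep only the best model seen so far
    )
    return [MyCallback(), checkpoint]
```

Keeping the stopping rule in a plain function makes it easy to check without starting a training run.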

Training the Model

  • Training is implemented using fit_generator

Arguments :

  • Training data: train_generator (Contains train data)
  • Validation data: validation_generator (Contains validation data)
  • Epochs: 140 (because of Google Colab memory restrictions, the model was trained in two sequential runs of 70 epochs each).
  • Callbacks: Custom_callback, Model_checkpoint
  • Verbose: 1

History of the model training is stored for the analysis.
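Compilation and the two-run training schedule could be sketched as below. Recent TensorFlow 2 releases deprecate fit_generator, and Model.fit accepts generators directly, so the sketch calls fit. Continuing the epoch count with `initial_epoch` is one way to express 70 + 70 epochs and is an assumption; the function names are illustrative. In practice the second run may have started in a new Colab session from the saved checkpoint.

```python
def compile_model(model):
    """Compile with the optimizer, loss, and metric listed above."""
    from tensorflow.keras.optimizers import Adam  # third-party dependency

    model.compile(
        optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def train_in_sessions(model, train_generator, validation_generator,
                      callbacks, sessions=2, epochs_per_session=70):
    """Train in consecutive runs and return one history object per run."""
    histories = []
    for session in range(sessions):
        history = model.fit(
            train_generator,
            validation_data=validation_generator,
            initial_epoch=session * epochs_per_session,
            epochs=(session + 1) * epochs_per_session,  # index of the final epoch
            callbacks=callbacks,
            verbose=1,
        )
        histories.append(history)
    return histories
```

The returned list keeps the history of each run separately, so the loss and accuracy curves can be concatenated for analysis.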

An EarlyStopping object with a validation-loss baseline of 0.2791 was also defined but not used; the custom callback was used for the model instead.

Hyperparameters:

  • Minibatch size: 64
  • Epochs: 10
  • Steps per Epochs : 8
  • Optimizer : Adam
  • Learning Rate : 0.001
  • Loss function : Categorical Cross Entropy

Testing and Evaluation

Both models were validated on the Validation set, where the accuracy reaches up to ~89%.

Analysis

The training runs are visualized in the Colab notebook, which indicates that Model 2 is much more stable than Model 1; this is why Model 2 was chosen as the final model. Model 2 is implemented in a separate notebook, Galaxy_classification_CNN_final_model_26_08_20.

The final model training performance is given below: graph1

Reference

  1. Galaxy zoo 2 paper: https://blog.galaxyzoo.org/category/paper/
  2. Kaggle Galaxy: https://github.com/benanne/kaggle-galaxies
