The task of Content-Based Image Retrieval (CBIR) is becoming increasingly complex due to the large number of images available on the internet. The task involves retrieving images similar to an input image given by the user. To speed up the computation of similar images, the proposed work uses Apache Spark and Alluxio (previously known as Tachyon). Spark is an open-source engine for processing Big Data; its parallelism reduces computation time. Alluxio, on the other hand, is a virtual distributed storage system. Although models using Spark for CBIR have been proposed earlier, the proposed model aims to reduce image retrieval time by modifying the feature extraction mechanism. The Histogram of Oriented Gradients (HOG) feature descriptor is used to measure the similarity between images, and the K Nearest Neighbours (KNN) algorithm is used and optimized to compute the top K images most similar to a query image.
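To make the retrieval step concrete, here is a minimal brute-force sketch of top-K nearest-neighbour search over feature vectors. It assumes Euclidean distance and plain NumPy arrays; the function name `top_k_similar` and the random stand-in data are hypothetical and are not taken from the project's code, which runs KNN on Spark.

```python
import numpy as np


def top_k_similar(query, features, k=5):
    """Return (indices, distances) of the k rows of `features` closest to `query`."""
    features = np.asarray(features, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Euclidean distance from the query to every stored feature vector.
    distances = np.linalg.norm(features - query, axis=1)
    k = min(k, len(distances))
    if k <= 0:
        return np.array([], dtype=int), np.array([])
    # argpartition selects the k smallest distances without a full sort;
    # the argsort afterwards orders only those k candidates.
    candidates = np.argpartition(distances, k - 1)[:k]
    order = candidates[np.argsort(distances[candidates])]
    return order, distances[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.random((1000, 36))  # stand-in for stored HOG descriptors
    query = database[42] + 0.001       # slightly perturbed copy of row 42
    indices, dists = top_k_similar(query, database, k=3)
    print(indices, dists)
```

A brute-force scan like this is linear in the number of stored images, which is the part a Spark-based implementation can spread across executors.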
- TinyImageNet.tar.xz - dataset archive
- README.md - readme file
- CBIR Report_19BCE1328_19BCE1295_19BCE1614.docx - project report
- Code - uploadIMages.py, feature_extraction.py, similarity.ipynb
- Reference Papers.zip - archive of reference papers
- CBIR REVIEW 3 PPT.pptx - final ppt
- CBIR_video - demonstration video
- System with more than 4 GB of RAM (more than 8 GB recommended for better performance)
- Any Linux-based operating system (Ubuntu 20.04 preferred)
- Apache Spark installed
- Alluxio installed
- opencv-python
- numpy
- pandas
- pyspark
- scikit-image (imported as skimage)
- pillow
- Clone the GitHub repository.
- Extract TinyImageNet.tar.xz in the repository directory to obtain the dataset.
- Start Alluxio using the following commands:
$ cd <PATH_TO_ALLUXIO>
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local SudoMount
- Visit http://localhost:19999 (master web UI) and http://localhost:30000 (worker web UI) to check whether Alluxio has started.
- Run UploadIMages.py to store the images in the Alluxio file system.
- Run feature_extraction.py to extract HOG features from the images and store them in Alluxio in Parquet format.
- Finally, run similarity.ipynb to run KNN and retrieve similar images.
- To stop Alluxio:
$ ./bin/alluxio-stop.sh local
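The feature extraction step above can be sketched for a single image as follows. This is only an illustration using scikit-image's `hog` function; the helper name `extract_hog` and the parameter values (9 orientations, 8x8 pixel cells, 2x2 cell blocks) are assumptions and may differ from what feature_extraction.py actually uses.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog


def extract_hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Return a 1-D HOG descriptor for one image (hypothetical helper)."""
    image = np.asarray(image)
    if image.ndim == 3:
        # Drop any alpha channel and convert colour images to grayscale.
        image = rgb2gray(image[..., :3])
    return hog(
        image,
        orientations=orientations,
        pixels_per_cell=pixels_per_cell,
        cells_per_block=cells_per_block,
        block_norm="L2-Hys",
        feature_vector=True,
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random 64x64 RGB image standing in for one Tiny ImageNet sample.
    fake_image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    descriptor = extract_hog(fake_image)
    print(descriptor.shape)
```

With these assumed parameters, a 64x64 image gives 8x8 cells and 7x7 overlapping blocks, so the descriptor has 7 * 7 * 2 * 2 * 9 = 1764 values. Every image then maps to a vector of the same length, which is what a distance-based KNN search needs.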
Refer to this link to install Spark on your system.
Download Alluxio from this page. Select the desired release followed by the distribution built for default Hadoop. Unpack the downloaded file with the following commands.
$ tar -xzf alluxio-2.7.2-bin.tar.gz
$ cd alluxio-2.7.2
In the ${ALLUXIO_HOME}/conf directory, create the conf/alluxio-site.properties configuration file by copying the template file.
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
Set alluxio.master.hostname in conf/alluxio-site.properties to localhost.
$ echo "alluxio.master.hostname=localhost" >> conf/alluxio-site.properties
Alluxio provides commands to ensure the system environment is ready for running Alluxio services. Run the following command to validate the environment for running Alluxio locally:
$ ./bin/alluxio validateEnv local
Alluxio needs to be formatted before it is started. The following commands format the Alluxio journal and worker storage directories and then start Alluxio locally.
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local SudoMount
The Alluxio client jar must be available on all nodes where Spark drivers or executors run. Place the client jar at the same local path (e.g. /<PATH_TO_ALLUXIO>/client/alluxio-2.7.2-client.jar) on each node, then add the following lines to spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.7.2-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-2.7.2-client.jar
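For a quick local experiment, the same two properties can also be passed when the Spark session is created from Python. The sketch below is an assumption-laden illustration, not the project's code: `alluxio_uri` and `build_session` are hypothetical helpers, `/features.parquet` is a made-up path, and the jar path is the placeholder from above. Classpath settings only take effect if they are supplied before the driver JVM starts, so spark-defaults.conf remains the more reliable place for them.

```python
# Placeholder path copied from the configuration lines above; adjust to your install.
ALLUXIO_CLIENT_JAR = "/<PATH_TO_ALLUXIO>/client/alluxio-2.7.2-client.jar"


def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI; 19998 is Alluxio's default master RPC port."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def build_session(app_name="cbir-sketch", client_jar=ALLUXIO_CLIENT_JAR):
    """Create a SparkSession with the Alluxio client jar on both classpaths."""
    # Imported lazily so alluxio_uri() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.driver.extraClassPath", client_jar)
        .config("spark.executor.extraClassPath", client_jar)
        .getOrCreate()
    )


if __name__ == "__main__":
    spark = build_session()
    # Hypothetical location; use whatever path the features were written to.
    df = spark.read.parquet(alluxio_uri("/features.parquet"))
    df.printSchema()
    spark.stop()
```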