SamanthaSHan/geocluster-kmeans-pyspark

K-means for Geo-location Clustering in Spark

This is a final group project for CSE427S (SP18) - Cloud Computing with Big Data Applications

We implemented the K-means algorithm with two distance measures: Euclidean and great-circle. The algorithm iteratively updates the locations of the k cluster centroids until it converges, using a convergence distance of α = 0.1 km. We processed our data on Amazon EMR. For detailed results and the full report, please refer to Final write-up.pdf.
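The iteration described above can be sketched in plain Python, without Spark. This is only an illustration and not the project's code: the function names, the haversine formula for great-circle distance, the Earth radius constant, the arithmetic mean of latitude and longitude as the centroid update, and the use of summed centroid movement as the convergence test are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def euclidean_distance(p, q):
    """Straight-line distance between two (lat, lon) pairs, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def great_circle_distance(p, q):
    """Haversine distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def kmeans(points, centroids, distance, alpha=0.1, max_iter=100):
    """Hypothetical helper: iterate until total centroid movement < alpha."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of the members (assumed); keep empty clusters.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                n = len(members)
                new_centroids.append((sum(m[0] for m in members) / n,
                                      sum(m[1] for m in members) / n))
            else:
                new_centroids.append(old)
        # Convergence test: movement measured with the chosen distance,
        # so alpha is in km only for the great-circle measure.
        shift = sum(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < alpha:
            break
    return centroids
```

For example, `kmeans(points, initial_centroids, great_circle_distance)` runs the loop with α = 0.1 km. With the Euclidean measure the points here are raw degrees, so α would need to be expressed in the same units.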

This is a driver that parses an input file and processes it. It takes the following four command-line arguments:

<input_path> <output_path> <distance_measure> <k>

where input_path is the path to the input data (e.g. home/cloudera/cse427/final_project/dbpedia), output_path is the path where the output data should be saved, distance_measure is either “euclidean” or “greatcircle” (case-insensitive), and k is an integer greater than 0.
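The argument rules above could be validated along these lines. This is a minimal sketch, not the driver's actual parsing code: the function name `parse_args` and the error messages are hypothetical.

```python
VALID_MEASURES = ("euclidean", "greatcircle")


def parse_args(argv):
    """Hypothetical validator for the four driver arguments.

    argv is the argument list without the script name,
    e.g. sys.argv[1:].
    """
    if len(argv) != 4:
        raise ValueError(
            "usage: <input_path> <output_path> <distance_measure> <k>")
    input_path, output_path, measure, k_text = argv
    # The distance measure is case-insensitive, so normalize it first.
    measure = measure.lower()
    if measure not in VALID_MEASURES:
        raise ValueError("distance_measure must be euclidean or greatcircle")
    k = int(k_text)  # raises ValueError if k is not an integer
    if k <= 0:
        raise ValueError("k must be an integer larger than 0")
    return input_path, output_path, measure, k
```

Calling `parse_args(sys.argv[1:])` at the top of a driver would reject a bad distance measure or a non-positive k before any Spark job starts.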

This is a collection of helper functions used in kmeans.py

Our project was the top-ranked
