K-means for Geo-location Clustering in Spark

This is a final group project for CSE427S (SP18) - Cloud Computing with Big Data Applications.

We implemented the K-means algorithm with two distance measures: Euclidean and great-circle. The algorithm iteratively updates the locations of the k cluster centroids until it converges, using a convergence distance of α = 0.1 km. We processed our data on Amazon EMR. For detailed results and the full report, please refer to Final Write-up.pdf.
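The iteration described above can be sketched in plain Python as follows. This is a single-machine illustration, not the Spark code in kmeans.py: the function name `kmeans`, the `max_iter` cap, the first-k initialization, and the stopping rule (total centroid movement below α) are all assumptions made for the example. It also averages latitude and longitude directly, which is a simplification when the great-circle measure is used.

```python
import math


def kmeans(points, k, distance, alpha=0.1, max_iter=100):
    """Hypothetical plain-Python k-means over (lat, lon) tuples.

    `distance` is any function taking two points and returning a number.
    """
    centroids = list(points[:k])  # assumed initialization: the first k points
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                lat = sum(m[0] for m in members) / len(members)
                lon = sum(m[1] for m in members) / len(members)
                new_centroids.append((lat, lon))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid

        # Assumed stopping rule: total centroid movement falls below alpha.
        moved = sum(distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < alpha:
            break
    return centroids
```

Passing the distance function as an argument keeps the loop identical for both measures; only the notion of "nearest" changes.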

kmeans.py

This is the driver that parses an input file and processes it. It takes the following four command-line arguments:

<input_path> <output_path> <distance_measure> <k>

where input_path is the path to the input data (e.g. home/cloudera/cse427/final_project/dbpedia), output_path is the path where the output data should be saved, distance_measure is either “euclidean” or “greatcircle” (case-insensitive), and k is a positive integer giving the number of clusters.
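One way to validate these four arguments is sketched below. It uses the standard-library `argparse` module, which is an assumption: the actual driver may read `sys.argv` directly. The helper names `positive_int` and `parse_args` are hypothetical.

```python
import argparse


def positive_int(text):
    """Convert text to an int and reject anything not larger than 0."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("k must be an integer larger than 0")
    return value


def parse_args(argv):
    """Parse <input_path> <output_path> <distance_measure> <k>."""
    parser = argparse.ArgumentParser(prog="kmeans.py")
    parser.add_argument("input_path")
    parser.add_argument("output_path")
    # Lower-casing before the choices check makes the measure case-insensitive.
    parser.add_argument("distance_measure", type=str.lower,
                        choices=["euclidean", "greatcircle"])
    parser.add_argument("k", type=positive_int)
    return parser.parse_args(argv)
```

For example, `parse_args(["in_dir", "out_dir", "GreatCircle", "5"])` yields `distance_measure == "greatcircle"` and `k == 5`, while a value of `0` for k makes `argparse` exit with a usage error.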

point_utils.py

This is a collection of helper functions used in kmeans.py.
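The README does not list the helpers themselves, so the following is only a sketch of what the two distance measures could look like. The function names, the `(latitude, longitude)` tuple layout in degrees, the haversine formulation, and the mean Earth radius of 6371 km are assumptions for illustration, not the contents of point_utils.py.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def euclidean_distance(p, q):
    """Straight-line distance between two (lat, lon) pairs, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def great_circle_distance(p, q):
    """Haversine distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Note that the Euclidean version treats degrees as flat coordinates, so its result is in degrees rather than kilometers; only the great-circle version is directly comparable to a threshold expressed in km.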

Our project was the top-ranked

SamanthaSHan/geocluster-kmeans-pyspark
