We have implemented K-means algorithm for Euclidean and Great Circle distance measurement. Our algorithm iteratively updates the location of k-cluster centroids until it converges to α = 0.1 km, where α is converging distance. We processed our data on Amazon EMR. For the detailed result and full report, please refer to Final write-up.pdf
This is a driver that parses an input file and processes it. It takes following 4 command line arguments
<input_path> <output_path> <distance_measure> <k>
where input_path
is the path to the input data (e.g. home/cloudera/cse427/final_project/dbpedia), output_path
is the path where output data should be saved, distance_measure
is either “euclidean” or “greatcircle” (case-insensitive), and k
is an integer larger than 0.
This is a collection of helper functions used in kmeans.py
- Samantha Han([email protected])
- Hakkyung Lee ([email protected])
- Skylar Nam ([email protected])
- Zefang Tang ([email protected])