
Workshop materials for AMLD2019 on PySpark.


ginnocen/pyspark_amld2019

 
 


PySpark Workshop

This repository contains the materials for the PySpark workshop at AMLD2019.

Part 1: PySpark for Big Data Processing

1.1 Installation:

1.1.1 Method 1: Running PySpark locally (e.g. on your laptop)

macOS or Linux:

See INSTALLATION_UNIX.md in the docs folder.

Windows:

See INSTALLATION_WINDOWS.md in the docs folder.

1.1.2 Method 2: Running PySpark on Google Colab

See GOOGLECOLAB_README.md in the docs folder.
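
Whichever method you choose, a quick check can confirm that the environment is usable before opening the notebooks. The following is a minimal sketch and not part of the workshop materials: the helper names check_prerequisites and smoke_test and the app name install-check are hypothetical, and it assumes that a java executable on the PATH is how Spark will find its JVM (a JAVA_HOME-only setup would be reported as missing here).

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether the pyspark package and a java executable can be found."""
    return {
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "java": shutil.which("java") is not None,
    }


def smoke_test():
    """Start a local single-threaded session and count a five-row DataFrame."""
    # Imported lazily so that check_prerequisites() works without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("install-check")  # hypothetical name, any string works
        .getOrCreate()
    )
    try:
        return spark.range(5).count()
    finally:
        spark.stop()


status = check_prerequisites()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")

# Only try to start Spark when both prerequisites were located.
if all(status.values()):
    try:
        print("rows counted:", smoke_test())
    except Exception as exc:  # for example, an unsupported Java version
        print("Spark session failed to start:", exc)
```

If either prerequisite is reported as missing, the installation guides referenced above are the place to look first.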

1.2 Agenda:

1.2.1 Data processing in PySpark

If you run PySpark on your laptop, start with the notebook data_processing_start.ipynb in the src folder.

If you run PySpark on Google Colab, start with the notebook data_processing_gc_start.ipynb in the src folder.

1.2.2 Machine learning in PySpark (MLlib)

If you run PySpark on your laptop, start with the notebook spark_mllib_start.ipynb in the src folder.

If you run PySpark on Google Colab, start with the notebook spark_mllib_gc_start.ipynb in the src folder.
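
The agenda above pairs each topic with a laptop notebook and a Google Colab notebook. As a rough illustration, that mapping can be written down as a small lookup. This sketch is not part of the repository: the names NOTEBOOKS, running_on_colab, and starting_notebook and the topic keys are hypothetical, and checking sys.modules for google.colab is only a common heuristic for detecting Colab, not a guaranteed test.

```python
import sys

# (topic, on_colab) -> starting notebook, as listed in the agenda.
NOTEBOOKS = {
    ("data_processing", False): "src/data_processing_start.ipynb",
    ("data_processing", True): "src/data_processing_gc_start.ipynb",
    ("mllib", False): "src/spark_mllib_start.ipynb",
    ("mllib", True): "src/spark_mllib_gc_start.ipynb",
}


def running_on_colab():
    """Heuristic: Colab runtimes normally have google.colab already imported."""
    return "google.colab" in sys.modules


def starting_notebook(topic, on_colab=None):
    """Return the notebook to open for a topic in the given environment."""
    if on_colab is None:
        on_colab = running_on_colab()
    return NOTEBOOKS[(topic, on_colab)]


for topic in ("data_processing", "mllib"):
    print(topic, "->", starting_notebook(topic))
```

Passing on_colab explicitly skips the detection step, which is useful if the heuristic guesses wrong in your environment.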

Part 2: Running PySpark in Jupyter Notebook on Amazon Clusters

See AWS_README.md in the docs folder.
