AWS EMR issues #30

Open
DallanQ opened this issue May 20, 2021 · 2 comments

DallanQ commented May 20, 2021

I had some problems running on AWS EMR with the default mrjob.conf. In case anyone else runs into similar issues: I needed to make two minor changes to mrjob.conf. First, change python2.7 and python2.7-pip to python2 and python2-pip; second, change boto to boto3. The two affected lines then read:

- sudo yum install -y python2 python2-pip python2-devel gcc-c++
- sudo pip-2.7 install boto3 mrjob simplejson warc
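
For context, here is a minimal sketch of where these two lines might sit in mrjob.conf. It assumes they belong to the `bootstrap` list of the EMR runner, which is where mrjob expects commands to run on each node at cluster startup. The surrounding keys are an assumption for illustration and are not copied from this project's file, so any other settings in the real mrjob.conf should stay as they are.

```yaml
# Sketch only: the nesting under runners/emr/bootstrap is assumed,
# and all other runner settings are omitted.
runners:
  emr:
    bootstrap:
    # python2 / python2-pip replace the python2.7 / python2.7-pip package names
    - sudo yum install -y python2 python2-pip python2-devel gcc-c++
    # boto3 replaces boto
    - sudo pip-2.7 install boto3 mrjob simplejson warc
```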

I really like this project; it is much simpler to use than cc-pyspark. I hope it continues to work past July 15, when boto3 stops supporting Python 2.7.

@sebastian-nagel

Thanks, @DallanQ! I'll update the mrjob.conf but should also get a Python3 version ready (see #11), just in case.

DallanQ commented May 20, 2021

Thank you @sebastian-nagel! This really is a terrific project. It makes it so easy to process the CommonCrawl.
