R package for packaging workloads, uploading and running on AWS #34
I find RStudio Server, running on a permanent AWS instance, works pretty well.
I can see where @Daniel-t is coming from. Although it's easy to get R up on AWS, that's really just the start. If I wanted to get set up for a big task on a meaty instance, and minimise costs, I would go about it like this:
I've never run a cluster job, but I imagine it would work similarly, except with a second phase of testing on multiple micros after the imaging. I played around with Docker/Rocker on the weekend and it struck me that this could really cut down the effort to deploy big tasks. If there were some way I could dockerize an R project with all its dependencies from my dev environment, then I could do testing locally and just deploy the docker image directly on AWS. See: EC2 Container Service. This is something of use in the context of #12 as well. With this ability in hand, we could then focus our automation efforts on the ECS API to facilitate easy deployment/cleanup of containers from within R. Possible extensions are deployment on other cloud computing services.
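For concreteness, here is a rough sketch of one way to dockerize an R project from within R itself, assuming Docker is installed locally. The rocker/r-base image is real, but the package list, script name and image tag are placeholders:

```r
# Write a minimal Dockerfile for the project and build it with the local
# Docker daemon; the resulting image could then be pushed to a registry
# and run on ECS. Package names and the entry script are illustrative.
dockerfile <- c(
  "FROM rocker/r-base",
  "RUN R -e \"install.packages(c('dplyr', 'ggplot2'))\"",
  "COPY . /project",
  "WORKDIR /project",
  "CMD [\"Rscript\", \"run_analysis.R\"]"
)
writeLines(dockerfile, "Dockerfile")
system2("docker", c("build", "-t", "my-r-job", "."))
```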
If you're looking to automate things, AWS have a command-line interface that is quite powerful and, if set up correctly, seems able to handle instance deployment. I use the aws-cli to manage a free micro instance, but presumably it could also manage starting and stopping instances.
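As a minimal sketch, the CLI can be driven from R with system2(), assuming the `aws` command is installed and `aws configure` has already been run; the instance ID below is a placeholder:

```r
# Start an existing EC2 instance, poll its state, and stop it again.
instance <- "i-0123456789abcdef0"

system2("aws", c("ec2", "start-instances", "--instance-ids", instance))

# Returns e.g. "pending" or "running".
state <- system2("aws", c("ec2", "describe-instances",
                          "--instance-ids", instance,
                          "--query", shQuote("Reservations[0].Instances[0].State.Name"),
                          "--output", "text"),
                 stdout = TRUE)

system2("aws", c("ec2", "stop-instances", "--instance-ids", instance))
```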
There's an R wrapper being written for that interface that seems to be coming along, as mentioned in #12: https://github.com/cloudyr. I did a bit of poking around re docker and found this: dockertest, which may be a good start for creating containers from R projects. On that same jaunt I discovered the Google Cloud Platform, which also supports containers and has some different pricing models (maybe better?).
I'm not sure docker would be the correct technology in this case. In the majority of situations you'd only be looking at running a single application on a server (e.g. R); adding docker would be another layer of abstraction which I can't see providing much value over a straight server instance.
Regarding @MilesMcBain's comment around process flow, I was thinking of something more streamlined again that didn't require the prework to be done on a cloud (although it could be). I based this proposal on AWS because it's what I'm most familiar with; ideally the solution should be extensible to other platforms (e.g. MS Azure, Google, etc.). For the (distant) future, it would also be nice to be able to leverage AWS spot instances (cheap excess capacity) for running jobs. For example, an AWS server with 32 vCPUs and 60GB of memory (c3.8xlarge) normally costs US$2.117 an hour (in Sydney), however the spot price for the same is presently $0.3234 per hour. There are issues with spot instances, in that they get stopped if the spot price goes over your current bid, however for some pieces of analysis they would serve nicely.
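A hedged sketch of requesting such a spot instance from R via the AWS CLI; the bid price, AMI, key name and instance type are placeholders:

```r
# Write a minimal launch specification and submit a spot request.
# Assumes the `aws` CLI is installed and configured.
spec <- '{"ImageId": "ami-12345678", "InstanceType": "c3.8xlarge", "KeyName": "my-key"}'
writeLines(spec, "spot-spec.json")

system2("aws", c("ec2", "request-spot-instances",
                 "--spot-price", "0.35",
                 "--instance-count", "1",
                 "--launch-specification", "file://spot-spec.json"))
```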
Running a single application on a server is the canonical docker use-case. See: Dockerfile Best Practices. However, if we don't want to muddy the waters with Docker right off the bat, at the very least we would want to ensure R package dependencies are migrated to the cloud automatically to save us setting them up. Packrat will be of use for this. @Daniel-t, you make a great point about spot instances. Google have a similar concept with their pre-emptible instances. It would be cool if the API that helps you deploy your R project in the cloud could also help you make it resistant to being terminated before job completion. I think this in itself is a very interesting sub-problem.
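A minimal sketch of how Packrat could carry a project's dependencies to the cloud instance; the destination path is a placeholder:

```r
# Capture the project's package dependencies locally, bundle them with the
# project, and restore the bundle on the remote machine.
library(packrat)

packrat::init()               # start tracking this project's dependencies
packrat::snapshot()           # record the exact package versions in use
bundle <- packrat::bundle()   # tarball of the project plus its packages

# On the cloud instance, after copying the tarball across:
# packrat::unbundle(bundle, where = "~/projects")
```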
I like your idea @Daniel-t and I think it aligns well with the AWS/SNOW idea that was proposed previously for the conference. At present I do a lot of work on AWS and had similar thoughts to you: I wanted to be able to spin up EC2 instances using my ssh key, check whether they had launched, export functions to those R instances, give them jobs to do and get back the results (all from R!) with a collection of easy-to-use tools. I have achieved this using a combination of SNOW (you could just have a single AWS worker that is running your jobs or a cluster of workers), shell scripts and AWS CLI scripts, but I think this sort of thing could be packaged up into a much nicer R package than the collection of clunky scripts that I currently have running on my machine. In essence, I build a cluster of AWS workers but where my laptop operates as the head node.

I'd love to talk to you and other like-minded cluster-holics about creating something that takes all the hard work out of running big jobs on AWS from R. It'd also be great if this idea could be generalised so that it would be useful to users of other cloud compute systems (Google, Microsoft etc).
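For anyone unfamiliar with the SNOW-style setup being described, here is a hedged sketch using the base parallel package, with EC2 instances as workers and the laptop as head node. The hostnames and user are placeholders; it assumes R is installed on the workers and SSH key access is already configured (e.g. via ssh-agent or ~/.ssh/config):

```r
library(parallel)

workers <- c("ec2-54-0-0-1.compute.amazonaws.com",
             "ec2-54-0-0-2.compute.amazonaws.com")

# PSOCK (SNOW-style) cluster launched over SSH; homogeneous = FALSE means
# Rscript is found on each worker's own PATH.
cl <- makePSOCKcluster(workers, user = "ubuntu", homogeneous = FALSE)

fit_model <- function(data, seed) { set.seed(seed); mean(data) + rnorm(1) }  # stand-in
training_data <- rnorm(1000)                                                 # stand-in

# Export the objects the workers need, run the jobs, collect the results.
clusterExport(cl, c("fit_model", "training_data"))
results <- parLapply(cl, 1:100, function(i) fit_model(training_data, seed = i))
stopCluster(cl)
```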
This sounds really cool. I'm familiar with snow and networking with ssh. I don't have much experience with AWS or Docker though. Here are just some random thoughts I have.
The biggest issue I have with my current approach (and one that I'm not network-savvy enough to have overcome so far) is that to operate a cluster in this way, with my local machine as the head node, I have to open ports on my firewall so that the SNOW workers can send back their results. Obviously this is fine when I am at home and am the network administrator, but it doesn't work when I am at work. I'm wondering if anyone has any bright ideas (maybe using a reverse ssh tunnel, for example) on how to overcome the firewall problem for SNOW without the need to forward ports?
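One speculative sketch of the reverse-tunnel idea, so the worker never has to dial the laptop's public address; the host, user and port are placeholders and key-based SSH access to the EC2 worker is assumed:

```r
library(parallel)

# Forward port 11000 on the EC2 host back to port 11000 on this machine
# (equivalent to running `ssh -N -R 11000:localhost:11000 ubuntu@ec2-host`
# in a separate terminal).
system2("ssh", c("-f", "-N", "-R", "11000:localhost:11000", "ubuntu@ec2-host"))

# With manual = TRUE, makePSOCKcluster() prints the command to start the
# worker by hand on the EC2 host; because master = "localhost", the worker
# connects to its own end of the tunnel rather than to our firewalled IP.
cl <- makePSOCKcluster("ec2-host", port = 11000,
                       master = "localhost", manual = TRUE)
```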
My original thoughts around this proposal (note I was thinking single instances, not clusters) were to:
Regarding the spot instance resiliency mentioned by @MilesMcBain, AWS recently introduced classes of spot instances which are guaranteed to run for a certain period of time (from memory: 1/3/6 hours). The other options for being resistant to termination are 'bidding' at a higher price than everyone else (you only pay the minimum required to get your instance) or having some checkpointing in the code so that, if the server does get terminated, it can restart with minimal losses. I'm not familiar with SNOW and have only passing familiarity with Docker (I wrote my first Dockerfile last weekend). Regarding the use of Docker in this use case, I'm still not convinced that it provides much benefit on a server that is intended to exist only for the duration of the job, however I don't see it as being a hindrance, so I'm happy to go with the thoughts of the group. (FYI, installing RStudio with dependencies on most Linux distributions is two commands.)
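A minimal sketch of the checkpointing idea, so a job interrupted by a spot termination can resume with little lost work; the file name and work function are placeholders:

```r
# Resume from a saved results file if it exists, and periodically write
# progress back to disk (which could be synced to S3 between checkpoints).
checkpoint_file <- "results.rds"
results <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else list()

do_task <- function(i) { Sys.sleep(1); i^2 }   # stand-in for the real work

for (i in seq_len(1000)) {
  id <- as.character(i)
  if (!is.null(results[[id]])) next            # already done before termination
  results[[id]] <- do_task(i)
  if (i %% 10 == 0) saveRDS(results, checkpoint_file)  # periodic checkpoint
}
saveRDS(results, checkpoint_file)
```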
I was very sorry to miss the auunconf event, but hopefully next time. Anyway, the good news is that over the last two years I've been working on a project that has, among other things, built tools that enable the type of workflows being discussed in this thread. In particular:
To spin up the cluster, we use a new tool called clusterous, developed by a team at SIRCA. The aim we set the team at SIRCA was to make this process easier for scientists, and to enable them to easily upload their code and data and retrieve results. Clusterous is language-agnostic and can be used with any number of workflows. To enable R-based workflows, @richfitz developed a couple of tools:
Other features include logging of output, the ability to query task run times, etc. It's all still a little rough, but we have now successfully used the tools and workflow in several projects. @jscamac and I will hopefully put together a minimal demo in the near future, demonstrating how they all come together. I'd be keen to hear from anyone who is interested in using these tools, what features might be missing, and whether we might build on this basis in future unconf events.
In past projects I've often found myself constrained by the resources of my local machine and wondered why I couldn't package up my scripts and data (similar to publishing a Shiny app), upload them to Amazon, start an EC2 instance, run my job, then download the results.
Of course I could do this manually, but that would get tedious: I often work offline, so I can't work in AWS full time, and I only need the full compute power/memory of an EC2 instance once I'm ready to work on the full dataset.
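For concreteness, a hedged sketch of that manual workflow driven entirely from R with base tools plus the `aws`/`scp`/`ssh` command-line utilities; the host, key, script and file names are all placeholders and assume SSH access to a running instance with R installed:

```r
host <- "ubuntu@ec2-54-0-0-1.compute.amazonaws.com"
key  <- "~/.ssh/my-key.pem"

# 1. Package up the scripts and data.
tar("job.tar.gz", files = c("run_analysis.R", "data/"), compression = "gzip")

# 2. Upload to the instance and run the job there.
system2("scp", c("-i", key, "job.tar.gz", paste0(host, ":~")))
system2("ssh", c("-i", key, host,
                 shQuote("tar xzf job.tar.gz && Rscript run_analysis.R")))

# 3. Download the results.
system2("scp", c("-i", key, paste0(host, ":~/results.rds"), "results.rds"))
```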
So my proposal is an R package which:
There may be some points of convergence with #12.