Get Cosmoflow running on Rivanna via a prebuilt singularity image #1

Open
varunpav opened this issue Feb 17, 2023 · 7 comments

@varunpav
Collaborator

Email Chain:

Thank you for pointing this out. This dockerfile is the most recent:
https://github.com/mlcommons/hpc/blob/main/cosmoflow/builds/Dockerfile
The other dockerfile was for running on Cori CPU, and yeah it's a bit old and should be removed.

I also think we may want to add a requirements.txt file to the code in general, so that it is used both within the images and natively.

I like that idea. At this point I should tell you that we may swap this tensorflow implementation out for a pytorch one. It is not finalized yet but I aim to have it ready and validated before we freeze mlperf hpc v3 in June. I can similarly try to prepare that one with requirements.txt + dockerfile.

Please note that we only have Singularity on the machine, not Docker.

You can use dockerfiles and/or docker images, though, right?
We run shifter at NERSC and just convert the docker images into shifter images.
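
The same idea carries over to Singularity, which can pull a Docker image from a registry and convert it into a .sif file. A minimal sketch follows. The image name `example/cosmoflow:latest` and the output name `cosmoflow.sif` are placeholders and do not come from this thread. The script only prints the command unless `RUN=1` is set and singularity is on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: convert a Docker image into a Singularity .sif file.
# IMAGE and SIF are hypothetical placeholders, not names from this thread.
IMAGE="${IMAGE:-example/cosmoflow:latest}"
SIF="${SIF:-cosmoflow.sif}"

# Print the pull command so it can be inspected before it is run.
# $1 = Docker image reference, $2 = output .sif file
pull_cmd() {
    printf 'singularity pull %s docker://%s\n' "$2" "$1"
}

if [ "${RUN:-0}" = "1" ] && command -v singularity >/dev/null 2>&1; then
    # Real conversion: only when explicitly requested and singularity exists.
    singularity pull "$SIF" "docker://$IMAGE"
else
    # Default: dry run, just show what would be executed.
    pull_cmd "$IMAGE" "$SIF"
fi
```

For an image that exists only in a local Docker daemon, `singularity build cosmoflow.sif docker-daemon://<image>:<tag>` is the analogous route, assuming Docker and Singularity are installed on the same machine.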

@varunpav
Collaborator Author

@laszewsk
Member

As discussed, it is now just a matter of reading and understanding the documentation.

The second part is here:
https://www.rc.virginia.edu/userinfo/rivanna/software/containers/

Running Image Non-Interactively as Slurm jobs
Example script:

#!/usr/bin/env bash
#SBATCH -J tftest
#SBATCH -o tftest-%A.out
#SBATCH -e tftest-%A.err
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -c 1
#SBATCH -t 00:01:00
#SBATCH -A mygroup

module purge
module load singularity

containerdir=~
singularity run --nv $containerdir/tensorflow-2.1.0-py37.sif tensorflowtest.py

The .sif image in this example needs to be replaced.
Look at the cori_shifter script and replace the shifter command with a singularity command similar to the one above.
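
One possible shape for the adapted job script is sketched below. It reuses the sbatch header from the Rivanna example and swaps the shifter call for `singularity exec --nv`. The image path, the config file `configs/cosmo.yaml`, the way train.py takes its config, and the 30-minute time limit are all assumptions that need to be checked against the actual repo and allocation. The script prints the job file to stdout rather than submitting anything.

```shell
#!/usr/bin/env bash
# Sketch: generate a Slurm job script that runs train.py through Singularity.
# The .sif path and the config path are hypothetical placeholders.
SIF="${SIF:-$HOME/cosmoflow.sif}"
CONFIG="${CONFIG:-configs/cosmo.yaml}"

# $1 = path to the .sif image, $2 = argument passed to train.py
write_job() {
    cat <<EOF
#!/usr/bin/env bash
#SBATCH -J cosmoflow
#SBATCH -o cosmoflow-%A.out
#SBATCH -e cosmoflow-%A.err
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -c 1
#SBATCH -t 00:30:00
#SBATCH -A mygroup

module purge
module load singularity

# Where the shifter script calls "shifter", call "singularity exec --nv".
singularity exec --nv $1 python train.py $2
EOF
}

# Print the generated job script for review.
write_job "$SIF" "$CONFIG"
```

To use it, redirect the output to a file (for example `cosmoflow-singularity.slurm`, a made-up name) and submit that file with `sbatch`.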

@laszewsk
Member

Use the sbatch parameters I distributed previously. Nate uses them successfully.

@laszewsk
Member

Create a completely new README-singularity.md that does this. You do not have to modify main/README.md for now. Focus on Singularity. You can copy the portions on how to ssh and log into Rivanna from main, as well as the git code management.

Remember that you have two repos: dsc-spidal/mlcommons-cosmoflow and mlcommons/hpc.

@varunpav
Collaborator Author

Install Docker on the local machine and test the docker pull on Steve's image
Verify that the hello world image can be run
Update readme and document steps
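
The first two steps could be scripted roughly as follows. The name of Steve's image is not given in this thread, so Docker's public `hello-world` image stands in as the default. The `run` helper is a made-up dry-run wrapper: it only echoes each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: pull a Docker image and run it once as a smoke test.
# hello-world is a stand-in; set IMAGE to the real image name when known.
IMAGE="${IMAGE:-hello-world}"

# Echo the command in dry-run mode (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run docker pull "$IMAGE"       # fetch the image from the registry
run docker run --rm "$IMAGE"   # start a container and remove it on exit
```

Running it with `DRY_RUN=0 IMAGE=<image>` executes the two docker commands for real, and the commands it prints can be pasted into the README as the documented steps.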

@varunpav
Collaborator Author

varunpav commented Mar 5, 2023

Utilize the image to run the train.py script on a small dataset

@varunpav
Collaborator Author

The image now runs properly on a small dataset. Next, get it working on the larger dataset.
