diff --git a/archived/docs/ADOPTERS.md b/archived/docs/ADOPTERS.md deleted file mode 100644 index fc3838e8d..000000000 --- a/archived/docs/ADOPTERS.md +++ /dev/null @@ -1,16 +0,0 @@ -# Adopters Of Arena - -Below are the adopters of project Arena. If you are using Arena to improve efficiency and productivity in Machine Learning with Kubernetes, please feel free to add yourself to the following list by opening a pull request. There are several adoption phases, as follows: - -* **Evaluation:** Aware of Arena and interested; evaluating its features and scope -* **Testing:** Considering Arena as a candidate; testing it against a Kubernetes cluster -* **Staging:** Decided to use Arena; testing it in a pre-production environment -* **Production:** Already running Arena in a production environment - -| Organization | Contact | Phases | Description of Use | | ------------ | ------- | ----------- | ------------------ | | [Weibo](https://www.weibo.com) | [@phoenixwu0229](https://github.com/phoenixwu0229) | **Production** | Weibo ML Platform | | [HUYA](https://www.huya.com) | [@BobLiu20](https://github.com/bobliu20) | **Production** | HUYA AI Platform | | [Microsoft](https://www.microsoft.com) | [@chaowangnk1](https://github.com/chaowangnk1) | **Testing** | AzureML DataCache internal benchmark system | | [Unisound](https://www.unisound.com) | [@xieydd](https://github.com/xieydd) | **Production** | Unisound ATLAS AI Platform | | [DOUYU](https://www.douyu.com) | [@gongcan1219](https://github.com/gongcan1219) | **Production** | DOUYU AI Platform | diff --git a/archived/docs/cli/arena.md b/archived/docs/cli/arena.md deleted file mode 100644 index 060bf5acd..000000000 --- a/archived/docs/cli/arena.md +++ /dev/null @@ -1,40 +0,0 @@ -## arena - -arena is the command line interface to Arena - -### Synopsis - -arena is the command line interface to Arena - -``` -arena [flags] -``` - -### Options - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - -h, --help help for arena - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena completion](arena_completion.md) - output shell completion code for the specified shell (bash or zsh) -* [arena data](arena_data.md) - manage data. -* [arena delete](arena_delete.md) - delete a training job and its associated pods -* [arena get](arena_get.md) - display details of a training job -* [arena list](arena_list.md) - list all the training jobs -* [arena logs](arena_logs.md) - print the logs for a task of the training job -* [arena logviewer](arena_logviewer.md) - display Log Viewer URL of a training job -* [arena prune](arena_prune.md) - prune history job -* [arena serve](arena_serve.md) - Serve a job. -* [arena submit](arena_submit.md) - Submit a job. -* [arena top](arena_top.md) - Display Resource (GPU) usage. 
-* [arena version](arena_version.md) - Print version information - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_completion.md b/archived/docs/cli/arena_completion.md deleted file mode 100644 index 0770281a7..000000000 --- a/archived/docs/cli/arena_completion.md +++ /dev/null @@ -1,43 +0,0 @@ -## arena completion - -output shell completion code for the specified shell (bash or zsh) - -### Synopsis - -Write bash or zsh shell completion code to standard output. - -For bash, ensure you have bash completions installed and enabled. -To access completions in your current shell, run -$ source <(arena completion bash) -Alternatively, write it to a file and source in .bash_profile - -For zsh, output to a file in a directory referenced by the $fpath shell -variable. - - -``` -arena completion SHELL [flags] -``` - -### Options - -``` - -h, --help help for completion -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_data.md b/archived/docs/cli/arena_data.md deleted file mode 100644 index c15e31736..000000000 --- a/archived/docs/cli/arena_data.md +++ /dev/null @@ -1,39 +0,0 @@ -## arena data - -manage data. - -### Synopsis - -manage data volumes. - -Available Commands: - list,ls List the data volumes. - - -``` -arena data [flags] -``` - -### Options - -``` - -h, --help help for data -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena -* [arena data list](arena_data_list.md) - list all the data volume. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_data_list.md b/archived/docs/cli/arena_data_list.md deleted file mode 100644 index 466d66a53..000000000 --- a/archived/docs/cli/arena_data_list.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena data list - -list all the data volume. - -### Synopsis - -list all the data volume. - -``` -arena data list [flags] -``` - -### Options - -``` - --allNamespaces show all the namespaces - -h, --help help for list -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena data](arena_data.md) - manage data. 
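As a quick usage sketch (the data volumes shown come from PVCs that must already exist in the target namespace; nothing here creates one):

```
# list the data volumes in the current namespace
arena data list

# list the data volumes across every namespace
arena data list --allNamespaces
```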
- -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_delete.md b/archived/docs/cli/arena_delete.md deleted file mode 100644 index 41ad004b9..000000000 --- a/archived/docs/cli/arena_delete.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena delete - -delete a training job and its associated pods - -### Synopsis - -delete a training job and its associated pods - -``` -arena delete a training job [flags] -``` - -### Options - -``` - -h, --help help for delete - --type string The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional) -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_get.md b/archived/docs/cli/arena_get.md deleted file mode 100644 index f91dfba3f..000000000 --- a/archived/docs/cli/arena_get.md +++ /dev/null @@ -1,37 +0,0 @@ -## arena get - -display details of a training job - -### Synopsis - -display details of a training job - -``` -arena get training job [flags] -``` - -### Options - -``` - -e, --events Specify if show pending pod's events. - -h, --help help for get - -o, --output string Output format. One of: json|yaml|wide - --type string The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional) -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_list.md b/archived/docs/cli/arena_list.md deleted file mode 100644 index f5f4d0838..000000000 --- a/archived/docs/cli/arena_list.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena list - -list all the training jobs - -### Synopsis - -list all the training jobs - -``` -arena list [flags] -``` - -### Options - -``` - --allNamespaces show all the namespaces - -h, --help help for list -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. 
One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_logs.md b/archived/docs/cli/arena_logs.md deleted file mode 100644 index 9c6027fd0..000000000 --- a/archived/docs/cli/arena_logs.md +++ /dev/null @@ -1,41 +0,0 @@ -## arena logs - -print the logs for a task of the training job - -### Synopsis - -print the logs for a task of the training job - -``` -arena logs training job [flags] -``` - -### Options - -``` - -f, --follow Specify if the logs should be streamed. - -h, --help help for logs - -i, --instance string Specify the task instance to get log - --since string Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used. - --since-time string Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used. - --tail int Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines otherwise 10, if a selector is provided. (default -1) - --timestamps Include timestamps on each line in the log output - --type string The training type to show logging, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional) -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_logviewer.md b/archived/docs/cli/arena_logviewer.md deleted file mode 100644 index 28a2086fc..000000000 --- a/archived/docs/cli/arena_logviewer.md +++ /dev/null @@ -1,34 +0,0 @@ -## arena logviewer - -display Log Viewer URL of a training job - -### Synopsis - -display Log Viewer URL of a training job - -``` -arena logviewer job [flags] -``` - -### Options - -``` - -h, --help help for logviewer -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. 
One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_prune.md b/archived/docs/cli/arena_prune.md deleted file mode 100644 index ab5c9d799..000000000 --- a/archived/docs/cli/arena_prune.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena prune - -prune history job - -### Synopsis - -prune history job - -``` -arena prune history job [flags] -``` - -### Options - -``` - -h, --help help for prune - -s, --since duration Clean job that live longer than relative duration like 5s, 2m, or 3h. (default -1ns) -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve.md b/archived/docs/cli/arena_serve.md deleted file mode 100644 index ecbe327ae..000000000 --- a/archived/docs/cli/arena_serve.md +++ /dev/null @@ -1,43 +0,0 @@ -## arena serve - -Serve a job. - -### Synopsis - -serve a job. - -Available Commands: - tensorflow,tf Submit a TensorFlow Serving Job. - tensorrt,trt Submit a TensorRT Job - -``` -arena serve [flags] -``` - -### Options - -``` - -h, --help help for serve -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena -* [arena serve delete](arena_serve_delete.md) - delete a serving job and its associated pods -* [arena serve list](arena_serve_list.md) - list all the serving jobs -* [arena serve tensorflow](arena_serve_tensorflow.md) - Submit tensorflow serving job to deploy and serve machine learning models. -* [arena serve tensorrt](arena_serve_tensorrt.md) - Submit tensorRT inference serving job to deploy and serve machine learning models. 
-* [arena serve traffic-split](arena_serve_traffic-split.md) - Adjust traffic routing dynamically for tfserving jobs - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve_delete.md b/archived/docs/cli/arena_serve_delete.md deleted file mode 100644 index 4e369297d..000000000 --- a/archived/docs/cli/arena_serve_delete.md +++ /dev/null @@ -1,34 +0,0 @@ -## arena serve delete - -delete a serving job and its associated pods - -### Synopsis - -delete a serving job and its associated pods - -``` -arena serve delete a serving job [flags] -``` - -### Options - -``` - -h, --help help for delete -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve_list.md b/archived/docs/cli/arena_serve_list.md deleted file mode 100644 index f747f531e..000000000 --- a/archived/docs/cli/arena_serve_list.md +++ /dev/null @@ -1,34 +0,0 @@ -## arena serve list - -list all the serving jobs - -### Synopsis - -list all the serving jobs - -``` -arena serve list [flags] -``` - -### Options - -``` - -h, --help help for list -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve_tensorflow.md b/archived/docs/cli/arena_serve_tensorflow.md deleted file mode 100644 index e5696ce63..000000000 --- a/archived/docs/cli/arena_serve_tensorflow.md +++ /dev/null @@ -1,54 +0,0 @@ -## arena serve tensorflow - -Submit tensorflow serving job to deploy and serve machine learning models. - -### Synopsis - -Submit tensorflow serving job to deploy and serve machine learning models. - -``` -arena serve tensorflow [flags] -``` - -### Options - -``` - --command string the command will inject to container's command. - --cpu string the request cpu of each replica to run the serve. - -d, --data stringArray specify the trained models datasource to mount for serving, like : - --enableIstio enable Istio for serving or not (disable Istio by default) - -e, --envs stringArray the environment variables - --exposeService expose service using Istio gateway for external access or not (not expose by default) - --gpumemory int the limit GPU memory of each replica to run the serve. - --gpus int the limit GPU count of each replica to run the serve. 
- -h, --help help for tensorflow - --image string the docker image name of serve job, and the default image is tensorflow/serving:latest (default "tensorflow/serving:latest") - --imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent") - --memory string the request memory of each replica to run the serve. - --modelConfigFile string Corresponding with --model_config_file in tensorflow serving - --modelName string the model name for serving - --modelPath string the model path for serving in the container - --port int the port of tensorflow gRPC listening port (default 8500) - --replicas int the replicas number of the serve job. (default 1) - --restfulPort int the port of tensorflow RESTful listening port (default 8501) - --servingName string the serving name - --servingVersion string the serving version - --versionPolicy string support latest, latest:N, specific:N, all -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve_tensorrt.md b/archived/docs/cli/arena_serve_tensorrt.md deleted file mode 100644 index 3e595c767..000000000 --- a/archived/docs/cli/arena_serve_tensorrt.md +++ /dev/null @@ -1,55 +0,0 @@ -## arena serve tensorrt - -Submit tensorRT inference serving job to deploy and serve machine learning models. - -### Synopsis - -Submit tensorRT inference serving job to deploy and serve machine learning models. - -``` -arena serve tensorrt [flags] -``` - -### Options - -``` - --allowMetrics Open Metric - --command string the command will inject to container's command. - --cpu string the request cpu of each replica to run the serve. - -d, --data stringArray specify the trained models datasource to mount for serving, like : - --enableIstio enable Istio for serving or not (disable Istio by default) - -e, --envs stringArray the environment variables - --exposeService expose service using Istio gateway for external access or not (not expose by default) - --gpumemory int the limit GPU memory of each replica to run the serve. - --gpus int the limit GPU count of each replica to run the serve. - --grpcPort int the port of grpc serving server (default 8001) - -h, --help help for tensorrt - --httpPort int the port of http serving server (default 8000) - --image string the docker image name of serve job, and the default image is registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3 (default "registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3") - --imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent") - --memory string the request memory of each replica to run the serve. - --metricPort int the port of metrics server (default 8002) - --modelName string the model name for serving - --modelPath string the model path for serving in the container - --modelStore string the path of tensorRT model path - --replicas int the replicas number of the serve job. 
(default 1) - --servingName string the serving name - --servingVersion string the serving version -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_serve_traffic-router-split.md b/archived/docs/cli/arena_serve_traffic-router-split.md deleted file mode 100644 index c3da38a18..000000000 --- a/archived/docs/cli/arena_serve_traffic-router-split.md +++ /dev/null @@ -1,36 +0,0 @@ -## arena serve traffic-router-split - -Adjust traffic routing dynamically for tfserving jobs - -### Synopsis - -Adjust traffic routing dynamically for tfserving jobs - -``` -arena serve traffic-router-split [flags] -``` - -### Options - -``` - -h, --help help for traffic-router-split - --servingName string the serving name - --versions string Model versions which the traffic will be routed to, e.g. [1,2,3] (default "[]") - --weights string Weight percentage values for each model version which the traffic will be routed to,e.g. [70,20,10] (default "[]") -``` - -### Options inherited from parent commands - -``` - --arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - --namespace string the namespace of the job (default "default") - --pprof enable cpu profile -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. - -###### Auto generated by spf13/cobra on 7-Sep-2018 diff --git a/archived/docs/cli/arena_serve_traffic-split.md b/archived/docs/cli/arena_serve_traffic-split.md deleted file mode 100644 index 0012c919d..000000000 --- a/archived/docs/cli/arena_serve_traffic-split.md +++ /dev/null @@ -1,37 +0,0 @@ -## arena serve traffic-split - -Adjust traffic routing dynamically for tfserving jobs - -### Synopsis - -Adjust traffic routing dynamically for tfserving jobs - -``` -arena serve traffic-split [flags] -``` - -### Options - -``` - -h, --help help for traffic-split - --servingName string the serving name - --servingVersions string Model versions which the traffic will be routed to, e.g. 1,2,3 - --weights string Weight percentage values for each model version which the traffic will be routed to,e.g. 70,20,10 -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena serve](arena_serve.md) - Serve a job. 
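For illustration, a hedged example of splitting traffic between two versions of an already-deployed tfserving job (the serving name `mymodel` and versions `1`/`2` are placeholders that must match a job you deployed earlier):

```
# route 80% of traffic to model version 1 and 20% to version 2
arena serve traffic-split \
    --servingName=mymodel \
    --servingVersions=1,2 \
    --weights=80,20
```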
- -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_submit.md b/archived/docs/cli/arena_submit.md deleted file mode 100644 index 70c845e12..000000000 --- a/archived/docs/cli/arena_submit.md +++ /dev/null @@ -1,47 +0,0 @@ -## arena submit - -Submit a job. - -### Synopsis - -Submit a job. - -Available Commands: - tfjob,tf Submit a TFJob. - horovod,hj Submit a Horovod Job. - mpijob,mpi Submit a MPIJob. - standalonejob,sj Submit a standalone Job. - tfserving,tfserving Submit a Serving Job. - sparkjob,spark Submit a Spark Job. - -``` -arena submit [flags] -``` - -### Options - -``` - -h, --help help for submit -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena -* [arena submit horovodjob](arena_submit_horovodjob.md) - Submit horovodjob as training job. -* [arena submit mpijob](arena_submit_mpijob.md) - Submit MPIjob as training job. -* [arena submit standalonejob](arena_submit_standalonejob.md) - Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead. -* [arena submit tfjob](arena_submit_tfjob.md) - Submit TFJob as training job. -* [arena submit sparkjob](arena_submit_sparkjob.md) - Submit SparkJob as training job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_submit_horovodjob.md b/archived/docs/cli/arena_submit_horovodjob.md deleted file mode 100644 index 638a0fbb4..000000000 --- a/archived/docs/cli/arena_submit_horovodjob.md +++ /dev/null @@ -1,51 +0,0 @@ -## arena submit horovodjob - -Submit horovodjob as training job. - -### Synopsis - -Submit horovodjob as training job. - -``` -arena submit horovodjob [flags] -``` - -### Options - -``` - -a, --annotation stringArray the annotations - --cpu string the cpu resource to use for the training, like 1 for 1 core. - -d, --data stringArray specify the datasource to mount to the job, like : - --data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data - -e, --env stringArray the environment variables - --gpus int the GPU count of each worker to run the training. - -h, --help help for horovodjob - --image string the docker image name of training job - --memory string the memory resource to use for the training, like 1Gi. - --name string override name - --rdma enable RDMA - --retry int retry times. - --sshPort int ssh port. - --sync-image string the docker image of syncImage - --sync-mode string syncMode: support rsync, hdfs, git - --sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git - --workers int the worker number to run the distributed training. (default 1) - --working-dir string working directory to extract the code. 
If using syncMode, the $workingDir/code contains the code (default "/root") -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena submit](arena_submit.md) - Submit a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_submit_mpijob.md b/archived/docs/cli/arena_submit_mpijob.md deleted file mode 100644 index cc9d97b8f..000000000 --- a/archived/docs/cli/arena_submit_mpijob.md +++ /dev/null @@ -1,53 +0,0 @@ -## arena submit mpijob - -Submit MPIjob as training job. - -### Synopsis - -Submit MPIjob as training job. - -``` -arena submit mpijob [flags] -``` - -### Options - -``` - -a, --annotation stringArray the annotations - --cpu string the cpu resource to use for the training, like 1 for 1 core. - -d, --data stringArray specify the datasource to mount to the job, like : - --data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data - -e, --env stringArray the environment variables - --gpus int the GPU count of each worker to run the training. - -h, --help help for mpijob - --image string the docker image name of training job - --logdir string the training logs dir, default is /training_logs (default "/training_logs") - --memory string the memory resource to use for the training, like 1Gi. - --name string override name - --rdma enable RDMA - --retry int retry times. - --sync-image string the docker image of syncImage - --sync-mode string syncMode: support rsync, hdfs, git - --sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git - --tensorboard enable tensorboard - --tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel") - --workers int the worker number to run the distributed training. (default 1) - --working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root") -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena submit](arena_submit.md) - Submit a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_submit_sparkjob.md b/archived/docs/cli/arena_submit_sparkjob.md deleted file mode 100644 index b548774d4..000000000 --- a/archived/docs/cli/arena_submit_sparkjob.md +++ /dev/null @@ -1,37 +0,0 @@ -## arena submit sparkjob - -Submit SparkJob as training job. - -### Synopsis - -Submit SparkJob as training job. 
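For orientation before the flag reference below, a minimal submission might look like the following sketch; the image name, jar path, and main class are placeholders and must point at a real Spark image of your own:

```
# submit a SparkPi-style job with two workers; all values here are illustrative
arena submit sparkjob \
    --name=demo-spark \
    --image=registry.example.com/spark:2.4.0 \
    --main-class=org.apache.spark.examples.SparkPi \
    --jar=local:///opt/spark/examples/jars/spark-examples.jar \
    --workers=2
```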
- -``` -arena submit sparkjob [flags] -``` - -### Options - -``` - --image string the docker image name of training job - --jar string jar path in image - --main-class string main class of your jar - --name string override name - --workers int the worker number to run the distributed training. (default 1) -``` - -### Options inherited from parent commands - -``` - --arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena submit](arena_submit.md) - Submit a job. - diff --git a/archived/docs/cli/arena_submit_standalonejob.md b/archived/docs/cli/arena_submit_standalonejob.md deleted file mode 100644 index 2d7871d59..000000000 --- a/archived/docs/cli/arena_submit_standalonejob.md +++ /dev/null @@ -1,52 +0,0 @@ -## arena submit standalonejob (deprecated) - -**Warning: standalonejob has been deprecated, please use [tfjob](../userguide/1-tfjob-standalone.md) instead.** - -Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead. - -### Synopsis - -Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead. - -``` -arena submit standalonejob [flags] -``` - -### Options - -``` - -a, --annotation stringArray the annotations - --cpu string the cpu resource to use for the training, like 1 for 1 core. - -d, --data stringArray specify the datasource to mount to the job, like : - --data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data - -e, --env stringArray the environment variables - --gpus int the GPU count of each worker to run the training. - -h, --help help for standalonejob - --image string the docker image name of training job - --memory string the memory resource to use for the training, like 1Gi. - --name string override name - --rdma enable RDMA - --retry int retry times. - --sync-image string the docker image of syncImage - --sync-mode string syncMode: support rsync, hdfs, git - --sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git - --workers int the worker number to run the distributed training. (default 1) - --working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root") -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena submit](arena_submit.md) - Submit a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_submit_tfjob.md b/archived/docs/cli/arena_submit_tfjob.md deleted file mode 100644 index f83f7f161..000000000 --- a/archived/docs/cli/arena_submit_tfjob.md +++ /dev/null @@ -1,68 +0,0 @@ -## arena submit tfjob - -Submit TFJob as training job. 
- -### Synopsis - -Submit TFJob as training job. - -``` -arena submit tfjob [flags] -``` - -### Options - -``` - -a, --annotation stringArray the annotations - --chief enable chief, which is required for estimator. - --chief-cpu string the cpu resource to use for the Chief, like 1 for 1 core. - --chief-memory string the memory resource to use for the Chief, like 1Gi. - --chief-port int the port of the chief. - --clean-task-policy string How to clean tasks after Training is done, only support Running, None. (default "Running") - -d, --data stringArray specify the datasource to mount to the job, like : - --data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data - -e, --env stringArray the environment variables - --evaluator enable evaluator, which is optional for estimator. - --evaluator-cpu string the cpu resource to use for the evaluator, like 1 for 1 core. - --evaluator-memory string the memory resource to use for the evaluator, like 1Gi. - --gpus int the GPU count of each worker to run the training. - -h, --help help for tfjob - --image string the docker image name of training job - --logdir string the training logs dir, default is /training_logs (default "/training_logs") - --name string override name - --ps int the number of the parameter servers. - --ps-cpu string the cpu resource to use for the parameter servers, like 1 for 1 core. - --ps-image string the docker image for tensorflow workers - --ps-memory string the memory resource to use for the parameter servers, like 1Gi. - --ps-port int the port of the parameter server. - --rdma enable RDMA - --retry int retry times. - --sync-image string the docker image of syncImage - --sync-mode string syncMode: support rsync, hdfs, git - --sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git - --tensorboard enable tensorboard - --tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel") - --worker-cpu string the cpu resource to use for the worker, like 1 for 1 core. - --worker-image string the docker image for tensorflow workers - --worker-memory string the memory resource to use for the worker, like 1Gi. - --worker-port int the port of the worker. - --workers int the worker number to run the distributed training. (default 1) - --working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root") -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena submit](arena_submit.md) - Submit a job. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_top.md b/archived/docs/cli/arena_top.md deleted file mode 100644 index 98fdffb7e..000000000 --- a/archived/docs/cli/arena_top.md +++ /dev/null @@ -1,41 +0,0 @@ -## arena top - -Display Resource (GPU) usage. - -### Synopsis - -Display Resource (GPU) usage. 
- -Available Commands: - node Display Resource (GPU) usage of nodes - job Display Resource (GPU) usage of pods - - -``` -arena top [flags] -``` - -### Options - -``` - -h, --help help for top -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena -* [arena top job](arena_top_job.md) - Display Resource (GPU) usage of jobs. -* [arena top node](arena_top_node.md) - Display Resource (GPU) usage of nodes. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_top_job.md b/archived/docs/cli/arena_top_job.md deleted file mode 100644 index 75847d742..000000000 --- a/archived/docs/cli/arena_top_job.md +++ /dev/null @@ -1,37 +0,0 @@ -## arena top job - -Display Resource (GPU) usage of jobs. - -### Synopsis - -Display Resource (GPU) usage of jobs. - -``` -arena top job [flags] -``` - -### Options - -``` - --allNamespaces show all the namespaces - -h, --help help for job - -i, --instance string Display instance top info - -r, --refresh Display continuously -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena top](arena_top.md) - Display Resource (GPU) usage. - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_top_node.md b/archived/docs/cli/arena_top_node.md deleted file mode 100644 index 3d17a5c88..000000000 --- a/archived/docs/cli/arena_top_node.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena top node - -Display Resource (GPU) usage of nodes. - -### Synopsis - -Display Resource (GPU) usage of nodes. - -``` -arena top node [flags] -``` - -### Options - -``` - -d, --details Display details - -h, --help help for node -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena top](arena_top.md) - Display Resource (GPU) usage. 
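A small usage sketch combining the flags above (the output shape depends on your cluster):

```
# summarize GPU allocation across all nodes
arena top node

# include per-node details, such as the pods currently holding GPUs
arena top node -d
```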
- -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/cli/arena_version.md b/archived/docs/cli/arena_version.md deleted file mode 100644 index 94093345e..000000000 --- a/archived/docs/cli/arena_version.md +++ /dev/null @@ -1,35 +0,0 @@ -## arena version - -Print version information - -### Synopsis - -Print version information - -``` -arena version [flags] -``` - -### Options - -``` - -h, --help help for version - --short print just the version number -``` - -### Options inherited from parent commands - -``` - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -### SEE ALSO - -* [arena](arena.md) - arena is the command line interface to Arena - -###### Auto generated by spf13/cobra on 24-Apr-2019 diff --git a/archived/docs/dev/dev_tf_plugin.md b/archived/docs/dev/dev_tf_plugin.md deleted file mode 100644 index 9eccda780..000000000 --- a/archived/docs/dev/dev_tf_plugin.md +++ /dev/null @@ -1,50 +0,0 @@ -## The TFJob plugin framework - -Use this framework if you'd like to customize or enhance the TFJob with your own chart or code. - - -## Developer Workflow - -### Step 1: Implement the following interface (optional) - -``` -// Customized runtime for tf training -type tfRuntime interface { - // check the tfjob args - check(tf *submitTFJobArgs) (err error) - // transform the tfjob - transform(tf *submitTFJobArgs) (err error) - - getChartName() string -} -``` - -You can refer to the implementation of the default tf runtime in [../../cmd/arena/commands/training_plugin_interface.go](training_plugin_interface.go). - - -### Step 2. Create your own chart - -If you don't need custom code for `check` or `transform`, you can create the chart in the same directory as tfjob and mpijob. For example, if the chart name is `mock`: - -``` -cd /charts -cp -r tfjob mock -``` - -## User Workflow - -Just run the command with the annotation `runtime={your runtime}`: - -``` -arena submit tf \ ---name=test \ ---annotation="runtime=mock" \ ---workers=1 \ ---chief \ ---chief-cpu=4 \ ---evaluator \ ---evaluator-cpu=4 \ ---worker-cpu=2 \ -"python test.py" -``` - diff --git a/archived/docs/installation/INSTALL_FROM_BINARY.md b/archived/docs/installation/INSTALL_FROM_BINARY.md deleted file mode 100644 index f60888766..000000000 --- a/archived/docs/installation/INSTALL_FROM_BINARY.md +++ /dev/null @@ -1,118 +0,0 @@ -## Setup - -This documentation assumes you have a Kubernetes cluster already available. - -If you need help setting up a Kubernetes cluster, please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/). - -If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/). - -Arena doesn't have to run within the Kubernetes cluster; it can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs. 
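Before starting, a quick sanity check from the machine where you will run `arena` (the node name below is a placeholder for one of your own GPU nodes):

```
# confirm the cluster is reachable with your kubeconfig
kubectl get nodes

# if GPU support is enabled, the node should advertise nvidia.com/gpu
kubectl describe node <your-gpu-node> | grep -i "nvidia.com/gpu"
```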
- -### Requirements - - * Linux OS - * Kubernetes >= 1.11, kubectl >= 1.11 - * helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later - * tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller) - -### Steps - -1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config` - -2\. Download the latest installer from the [Release Page](https://github.com/kubeflow/arena/releases), and rename it to `arena-installer.tar.gz` - -3\. Untar the installer package - -``` -# tar -xvf arena-installer.tar.gz -``` - -4\. Set up environment variables for customization - -4.1\. If you'd like to train and serve in hostNetwork - -``` -export USE_HOSTNETWORK=true -``` - -4.2\. If you'd like to customize the Kubernetes namespace of the arena infrastructure - -``` -export NAMESPACE={your namespace} -``` - -4.3\. If you'd like to use your private docker registry instead of `ACR (Alibaba Cloud Container Registry)`: - -``` -export DOCKER_REGISTRY={your docker registry} -``` - -4.4\. If you'd like to deploy prometheus in `ACK (Alibaba Container Service for Kubernetes)` - -``` -export USE_PROMETHEUS=true -export PLATFORM=ack -``` - -4.5\. If you'd like to use a cloud LoadBalancer - -``` -export USE_LOADBALANCER=true -``` - -5\. Install arena - -``` -# cd arena-installer -# sudo ./install.sh -``` - -6\. Enable shell autocompletion - -On Linux, please use bash. - -On CentOS Linux, you may need to install the bash-completion package, which is not installed by default. - -``` -yum install bash-completion -y -``` - -On Debian or Ubuntu Linux you may need to install it with - -``` -apt-get install bash-completion -``` - -To add arena autocompletion to your current shell, run `source <(arena completion bash)`. - -On MacOS, please use bash. - -You can install it with Homebrew: - -``` -brew install bash-completion@2 -``` - -To add arena autocompletion to your profile, so that it is automatically loaded in future shells, run: - -``` -echo "source <(arena completion bash)" >> ~/.bashrc -chmod u+x ~/.bashrc -``` - -For MacOS, add the following to your `~/.bashrc` file: - -``` -echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc -``` - -Then you can use [tab] to auto-complete the command - -``` -# arena list -NAME STATUS TRAINER AGE NODE -tf1 PENDING TFJOB 0s N/A -caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120 -# arena get [tab] -caffe-1080ti-1 tf1 -``` diff --git a/archived/docs/installation/INSTALL_FROM_SOURCE.md b/archived/docs/installation/INSTALL_FROM_SOURCE.md deleted file mode 100644 index 4c66761b2..000000000 --- a/archived/docs/installation/INSTALL_FROM_SOURCE.md +++ /dev/null @@ -1,157 +0,0 @@ -## Setup - -This documentation assumes you have a Kubernetes cluster already available. - -If you need help setting up a Kubernetes cluster, please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/). - -If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/). - -Arena doesn't have to run within the Kubernetes cluster; it can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs. 
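A minimal pre-flight check against the requirements below (a sketch assuming kubectl is already on your PATH; helm itself is installed as `arena-helm` in step 3):

```
# the client version should be >= 1.11
kubectl version --short

# should report v2.14.1 or later once step 3 is done
arena-helm version --short
```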
- -### Requirements - - * Kubernetes >= 1.11, kubectl >= 1.11 - * helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later - * tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller) - -### Steps - -1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config` - -2\. Install the kubectl client - -Please follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/) - -3\. Install the Helm client - -- Download the Helm client from [github.com](https://github.com/helm/helm/releases) -- Unpack it (tar -zxvf helm-v2.14.1-linux-amd64.tgz) -- Find the `helm` binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm) - -Then run `helm list` to check whether the Kubernetes cluster can be managed successfully by helm. - -``` -# arena-helm list -# echo $? -0 -``` - -4\. Download the charts - -``` -mkdir /charts -git clone https://github.com/kubeflow/arena.git -cp -r arena/charts/* /charts -``` - -5\. Install the TFJob Controller - -``` -kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml -kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml -kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml -``` - -6\. Install the Dashboard - -``` -kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml -``` - -7\. Install the MPIJob Controller - -``` -kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml -``` - -8\. Build arena - -Prerequisites: - -- Go >= 1.8 - -``` -mkdir -p $(go env GOPATH)/src/github.com/kubeflow -cd $(go env GOPATH)/src/github.com/kubeflow -git clone https://github.com/kubeflow/arena.git -cd arena -make -``` - -The `arena` binary is located in the `arena/bin` directory. You may want to add the directory to `$PATH`. - - -9\. Install and configure kube-arbitrator for gang scheduling (optional) - -``` -kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml -``` - -10\. Enable shell autocompletion - -On Linux, please use bash. - -On CentOS Linux, you may need to install the bash-completion package, which is not installed by default. - -``` -yum install bash-completion -y -``` - -To add arena autocompletion to your current shell, run `source <(arena completion bash)`. - -To add arena autocompletion to your profile, so that it is automatically loaded in future shells, run: - -``` -echo "source <(arena completion bash)" >> ~/.bashrc -``` - -Then you can use [tab] to auto-complete the command - -``` -# arena list -NAME STATUS TRAINER AGE NODE -tf1 PENDING TFJOB 0s N/A -caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120 -# arena get [tab] -caffe-1080ti-1 tf1 -``` - - -11\. Enable the host network for training (optional) - -Training does not use `useHostNetwork` by default. If you'd like to run training in the host network, you can run the command below: - -``` -find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g" -``` - -12\. Enable LoadBalancer in the public cloud (optional) - - Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, whose cloud providers support `LoadBalancer` services. If you want to access tensorboard on the internet directly, you can run the command below: - - -``` -find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g" -``` - -> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers. - - -13\. 
Enable Ingress in the public cloud (optional) - -If you have an ingress controller configured, you can access tensorboard through ingress. You can run the command below: - -``` -find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g" -``` - -> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers. - - -14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional) - -``` -find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g" -``` - -> Warning: this may cause docker images to be out of date if they have already been downloaded to the node. diff --git a/archived/docs/installation_cn/README.md b/archived/docs/installation_cn/README.md deleted file mode 100755 index 5752ee835..000000000 --- a/archived/docs/installation_cn/README.md +++ /dev/null @@ -1,154 +0,0 @@ -## 部署 - -本文档假设您已经有可用的 Kubernetes 集群。 - -如果您需要有关 Kubernetes 集群设置的帮助,请参阅 [Kubernetes 设置](https://kubernetes.io/docs/setup/)。 - -如果您希望使用 GPU,请务必按照 Kubernetes [GPU 启用说明](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) 操作。 - -Arena 并非必需在 Kubernetes 集群内运行。它也可以在您的笔记本电脑中运行。如果您可以运行 `kubectl` 以管理 Kubernetes 集群,那么也可以使用 `arena` 管理训练作业。 - -### 要求 - - * Kubernetes >= 1.11, kubectl >= 1.11 - * helm 版本 [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) 或更新版本 - * 此外还要部署与 helm 版本相同的 tiller(https://docs.helm.sh/using_helm/#installing-tiller) - -### 步骤 - -1\.通过使用 `export KUBECONFIG=/etc/kubernetes/admin.conf` 或创建一个 `~/.kube/config` 来准备 kubeconfig 文件 - -2\.安装 kubectl 客户端 - -请按照 [kubectl 安装指南] 操作(https://kubernetes.io/docs/tasks/tools/install-kubectl/) - -3\.安装 Helm 客户端 - -- 从 [github.com] 下载 Helm 客户端(https://github.com/helm/helm/releases) -- 将下载到的文件解压缩 (tar -zxvf helm-v2.8.2-linux-amd64.tgz) -- 在解压缩目录中找到 `helm` 二进制文件,将其移到所需目标位置 (mv linux-amd64/helm /usr/local/bin/arena-helm) - -然后运行 `helm list` 以检查 helm 能否成功管理 kubernetes。 - -``` -#helm list -#echo $? 
-0 -``` - -4\.下载 Chart - -``` -mkdir /charts -git clone https://github.com/kubeflow/arena.git -cp -r arena/charts/* /charts -``` - -5\.安装 TFJob 控制器 - -``` -kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml -kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml -kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml -``` - -6\.安装控制台 (可选) - -``` -kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml -``` - -7\.安装 MPIJob 控制器 - -``` -kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml -``` - -8\.安装 arena - -先决条件: - -- Go >= 1.8 - -``` -mkdir -p $(go env GOPATH)/src/github.com/kubeflow -cd $(go env GOPATH)/src/github.com/kubeflow -git clone https://github.com/kubeflow/arena.git -cd arena -make -``` - -`arena` 二进制文件位于 `arena/bin` 目录下。您可能希望将目录添加到 `$PATH`。 - - -9\.安装并为群调度配置 kube-arbitrator(可选) - -``` -kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml -``` - -10\.启用 shell 自动完成 - -在 Linux 上,请使用 bash - -在 CentOS Linux 上,您可能需要安装默认并未安装的 bash-completion 包。 - -``` -yum install bash-completion -y -``` - -要为当前 shell 添加 arena 自动完成,请运行 source <(arena completion bash)。 - -通过如下方法向您的配置文件添加 arena 自动完成功能,以便将来 shell 运行时可以自动加载此功能: - -``` -echo "source <(arena completion bash)" >> ~/.bashrc -``` - -然后,你可以使用 [TAB] 来自动完成命令 - -``` -#arena list -NAME STATUS TRAINER AGE NODE -tf1 PENDING TFJOB 0s N/A -caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120 -#arena get [tab] -caffe-1080ti-1 tf1 -``` - - -11\.为训练启用主机网络(可选) - -默认情况下,训练并非 `useHostNetwork`。如果您希望在 HostNetwork 中运行训练。可以运行如下命令: - -``` -find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g" -``` - -12\.在公共云中启用 Loadbalancer - - Kubernetes 可在 AWS、GCE、Azure 和阿里云中运行,其云提供商支持 `LoadBalancer`。如果您希望在互联网上直接访问 tensorboard,可以运行如下代码: - -``` -find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g" -``` - -> 警告:我们不鼓励将服务公开给互联网,因为这种做法会导致服务受黑客攻击。 - -13\. 在公共云中启用 Ingress - -Kubernetes 可在 AWS、GCE、Azure 和阿里云中运行,其云提供商支持 `Ingress`。如果您希望在互联网上直接通过统一入口访问 tensorboard,可以运行如下代码: - -``` -find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g" -``` - -> 警告:我们不鼓励将服务公开给互联网,因为这种做法会导致服务受黑客攻击。 - -14\. 将 imagePullPolicy 策略由 `Always` 修改为 `IfNotPresent` (可选) - -``` -find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g" -``` - -> 警告: 这会导致容器镜像可能不是最新更新版本。 diff --git a/archived/docs/userguide/1-tfjob-logviewer.jpg b/archived/docs/userguide/1-tfjob-logviewer.jpg deleted file mode 100644 index 22cd1222a..000000000 Binary files a/archived/docs/userguide/1-tfjob-logviewer.jpg and /dev/null differ diff --git a/archived/docs/userguide/1-tfjob-standalone.md b/archived/docs/userguide/1-tfjob-standalone.md deleted file mode 100644 index 9c96f72da..000000000 --- a/archived/docs/userguide/1-tfjob-standalone.md +++ /dev/null @@ -1,138 +0,0 @@ - -Here is an example of how you can use `Arena` for machine learning training. It downloads the source code from a git URL. - -1. 
The first step is to check the available resources:

```
arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```

There are 3 nodes with GPUs available for running training jobs.


2\. Now we can submit a training job with `arena`; it will download the source code from GitHub:

```
# arena submit tf \
        --name=tf-git \
        --gpus=1 \
        --image=tensorflow/tensorflow:1.5.0-devel-gpu \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
        "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify one with `--workingDir`. You can also specify the branch to pull the code from by adding `--env GIT_SYNC_BRANCH=main` to the parameters when submitting the job.
> If you are using a private git repo, you can use the following command:

```
# arena submit tf \
        --name=tf-git \
        --gpus=1 \
        --image=tensorflow/tensorflow:1.5.0-devel-gpu \
        --syncMode=git \
        --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
        --env=GIT_SYNC_USERNAME=yourname \
        --env=GIT_SYNC_PASSWORD=yourpwd \
        "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```

Notice: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to sync the source code. You can set the environment variables defined in the git-sync project.

3\. List all the jobs

```
# arena list
NAME    STATUS   TRAINER  AGE  NODE
tf-git  RUNNING  tfjob    0s   192.168.1.120
```

4\. Check the resource usage of the job

```
# arena top job
NAME    STATUS   TRAINER  AGE  NODE           GPU(Requests)  GPU(Allocated)
tf-git  RUNNING  TFJOB    17s  192.168.1.120  1              1


Total Allocated GPUs of Training Job:
1

Total Requested GPUs of Training Job:
1
```

5\. Check the resource usage of the cluster

```
# arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           1
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```


6\. Get the details of the specific job

```
# arena get tf-git
NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
tf-git  RUNNING  TFJOB    5s   tf-git-tfjob-worker-0  192.168.1.120
```
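If you want to look at the Kubernetes objects behind the job, the TFJob created above can also be inspected directly with `kubectl` (a sketch; the object name follows the `tfjob.kubeflow.org/tf-git created` line above, while the label used to select the pods is an assumption about how arena names its resources):

```
# The TFJob custom resource created by arena
kubectl get tfjob tf-git -o yaml
# The pods belonging to the job (the release label is an assumption)
kubectl get pods -l release=tf-git
```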
7\. Check logs

```
# arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
```

8\. Get more information about the training job in the Log Viewer

```
# arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
```

![](1-tfjob-logviewer.jpg)


Congratulations! You've run the first training job with `arena` successfully.

diff --git a/archived/docs/userguide/10-rdma-integration.md b/archived/docs/userguide/10-rdma-integration.md
deleted file mode 100644
index 272250393..000000000
--- a/archived/docs/userguide/10-rdma-integration.md
+++ /dev/null
@@ -1,45 +0,0 @@
Arena supports RDMA for distributed training. RDMA devices can be allocated to worker jobs by adding the `--rdma` parameter.

1. Deploy the RDMA device plugin

```
# Deploy RDMA device plugin
kubectl create -f kubernetes-artifacts/rdma/rdma-config.yaml
kubectl create -f kubernetes-artifacts/rdma/device-plugin.yaml
```

2\. Label the nodes that have an InfiniBand device

```
# Label RDMA NODE (the node name is a placeholder)
kubectl label node <your-rdma-node> accelerator/rdma=true
```

```
# Check Device plugin status
kubectl -n arena-system get ds
NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR           AGE
rdma-sriov-dp-ds   1         1         1       1            1           accelerator/rdma=true   46d
```

3\. Enable the arena RDMA config

```
find /charts/ -name values.yaml | xargs sed -i "/enableRDMA/s/false/true/g"
```

4\. Submit a TensorFlow training job using RDMA

```
# arena submit mpi --name=mpi-dist \
        --rdma \
        --gpus=1 \
        --workers=2 \
        --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
        --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
        --syncMode=git \
        --syncSource=https://github.com/tensorflow/benchmarks.git \
        --tensorboard \
        "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
diff --git a/archived/docs/userguide/11-sparkjob-distributed.md b/archived/docs/userguide/11-sparkjob-distributed.md
deleted file mode 100644
index fe5c8fd81..000000000
--- a/archived/docs/userguide/11-sparkjob-distributed.md
+++ /dev/null
@@ -1,201 +0,0 @@

Arena supports and simplifies running distributed spark jobs.

### 1. To run a distributed spark job, you need to specify:
- The spark job image which contains the main class jar (required)
- The main class of your jar (required)
- The jar path in the container (required)
- The number of executors (default: 1)
- The resource cpu request of the driver pod (default: 1)
- The resource memory request of the driver pod (default: 500m)
- The resource cpu request of the executor pod (default: 1)
- The resource memory request of the executor pod (default: 500m)
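The exact flag names for these settings can be listed from the CLI itself; arena's commands are generated with spf13/cobra, so every subcommand supports `--help`:

```
arena submit sparkjob --help
```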
### 2. How to create the spark job image

Arena spark jobs are based on [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). You can create a spark job image with the [docker-image-tool](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images) utility.

### 3. How to use Arena spark jobs

##### install the spark operator
```
# arena-system is the default namespace; if it does not exist, please create it.
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-operator.yaml
```

##### create the RBAC for spark jobs
The spark job needs the service account `spark` to create executors.
```
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-rbac.yaml
```
The default namespace is `default`. If you want to run spark jobs in other namespaces, change the namespace in spark-rbac.yaml and create a new service account.
##### submit a spark job
```
arena submit sparkjob --name=demo --image=registry.aliyuncs.com/acs/spark:v2.4.0 --main-class=org.apache.spark.examples.SparkPi --jar=local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```
The result looks like this:
```
configmap/demo-sparkjob created
configmap/demo-sparkjob labeled
sparkapplication.sparkoperator.k8s.io/demo created
INFO[0005] The Job demo has been submitted successfully
INFO[0005] You can run `arena get demo --type sparkjob` to check the job status
```
##### get the spark job status
```
arena get --type=sparkjob demo
```
When the job succeeds, you will see the result below.
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 15s

NAME   STATUS     TRAINER   AGE  INSTANCE      NODE
demo1  SUCCEEDED  SPARKJOB  1h   demo1-driver  N/A
```

##### watch the logs of the spark job
```
arena logs -f demo
```
You will get the logs of the spark driver pod.
-```$xslt -2019-05-08T08:25:21.904409561Z ++ id -u -2019-05-08T08:25:21.904639867Z + myuid=0 -2019-05-08T08:25:21.904649704Z ++ id -g -2019-05-08T08:25:21.904901542Z + mygid=0 -2019-05-08T08:25:21.904909072Z + set +e -2019-05-08T08:25:21.905241846Z ++ getent passwd 0 -2019-05-08T08:25:21.905608733Z + uidentry=root:x:0:0:root:/root:/bin/ash -2019-05-08T08:25:21.905623028Z + set -e -2019-05-08T08:25:21.905629226Z + '[' -z root:x:0:0:root:/root:/bin/ash ']' -2019-05-08T08:25:21.905633894Z + SPARK_K8S_CMD=driver -2019-05-08T08:25:21.905757494Z + case "$SPARK_K8S_CMD" in -2019-05-08T08:25:21.90622059Z + shift 1 -2019-05-08T08:25:21.906232126Z + SPARK_CLASSPATH=':/opt/spark/jars/*' -2019-05-08T08:25:21.906236316Z + env -2019-05-08T08:25:21.906239651Z + grep SPARK_JAVA_OPT_ -2019-05-08T08:25:21.90624307Z + sort -t_ -k4 -n -2019-05-08T08:25:21.906585896Z + sed 's/[^=]*=\(.*\)/\1/g' -2019-05-08T08:25:21.906908601Z + readarray -t SPARK_EXECUTOR_JAVA_OPTS -2019-05-08T08:25:21.906917535Z + '[' -n '' ']' -2019-05-08T08:25:21.906999069Z + '[' -n '' ']' -2019-05-08T08:25:21.907003871Z + PYSPARK_ARGS= -2019-05-08T08:25:21.907006605Z + '[' -n '' ']' -2019-05-08T08:25:21.907008951Z + R_ARGS= -2019-05-08T08:25:21.907012105Z + '[' -n '' ']' -2019-05-08T08:25:21.907148385Z + '[' '' == 2 ']' -2019-05-08T08:25:21.907994286Z + '[' '' == 3 ']' -2019-05-08T08:25:21.908014459Z + case "$SPARK_K8S_CMD" in -2019-05-08T08:25:21.908018653Z + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") -2019-05-08T08:25:21.908023924Z + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.20.90.160 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal -2019-05-08T08:25:23.326681135Z 2019-05-08 08:25:23 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable -2019-05-08T08:25:23.829843117Z 2019-05-08 08:25:23 INFO SparkContext:54 - Running Spark version 2.4.0 -2019-05-08T08:25:23.8529898Z 2019-05-08 08:25:23 INFO SparkContext:54 - Submitted application: Spark Pi -2019-05-08T08:25:23.94670344Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls to: root -2019-05-08T08:25:23.946735076Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls to: root -2019-05-08T08:25:23.946740267Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls groups to: -2019-05-08T08:25:23.946744543Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls groups to: -2019-05-08T08:25:23.946748767Z 2019-05-08 08:25:23 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() -2019-05-08T08:25:24.273960575Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078. 
-2019-05-08T08:25:24.307632934Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering MapOutputTracker -2019-05-08T08:25:24.339548141Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering BlockManagerMaster -2019-05-08T08:25:24.339577986Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information -2019-05-08T08:25:24.340887925Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up -2019-05-08T08:25:24.359682519Z 2019-05-08 08:25:24 INFO DiskBlockManager:54 - Created local directory at /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/blockmgr-5532fd8b-64b9-492c-b94d-308b55d60a71 -2019-05-08T08:25:24.388529744Z 2019-05-08 08:25:24 INFO MemoryStore:54 - MemoryStore started with capacity 110.0 MB -2019-05-08T08:25:24.413347888Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering OutputCommitCoordinator -2019-05-08T08:25:24.560654618Z 2019-05-08 08:25:24 INFO log:192 - Logging initialized @2462ms -2019-05-08T08:25:24.654721075Z 2019-05-08 08:25:24 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown -2019-05-08T08:25:24.680943254Z 2019-05-08 08:25:24 INFO Server:419 - Started @2586ms -2019-05-08T08:25:24.715867156Z 2019-05-08 08:25:24 INFO AbstractConnector:278 - Started ServerConnector@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} -2019-05-08T08:25:24.715897312Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040. -2019-05-08T08:25:24.76123501Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1450078a{/jobs,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.762173789Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@534ca02b{/jobs/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.763361524Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29a23c3d{/jobs/job,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.764374535Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fe46b62{/jobs/job/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.764919809Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@591fd34d{/stages,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.765687152Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61e45f87{/stages/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.766434602Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c9b78e3{/stages/stage,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.769934319Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5491f68b{/stages/stage/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.769949155Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@736ac09a{/stages/pool,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.769966711Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ecd665{/stages/pool/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.77037559Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@45394b31{/storage,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.772696599Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1ec7d8b3{/storage/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.772709487Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@3b0ca5e1{/storage/rdd,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.773014833Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb3131b{/storage/rdd/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.77546416Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54dcbb9f{/environment,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.775478151Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74fef3f7{/environment/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.775882882Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a037324{/executors,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.780702953Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@69eb86b4{/executors/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.780717178Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@585ac855{/executors/threadDump,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.78072195Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb8f9e2{/executors/threadDump/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.793805533Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a933be2{/static,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.808511998Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@378bd86d{/,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.808532751Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2189e7a7{/api,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.808537695Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@644abb8f{/jobs/job/kill,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.80854206Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a411233{/stages/stage/kill,null,AVAILABLE,@Spark} -2019-05-08T08:25:24.808546336Z 2019-05-08 08:25:24 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://demo1-1557303918993-driver-svc.default.svc:4040 -2019-05-08T08:25:24.834767942Z 2019-05-08 08:25:24 INFO SparkContext:54 - Added JAR file:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar at spark://demo1-1557303918993-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.4.0.jar with timestamp 1557303924832 -2019-05-08T08:25:26.274526541Z 2019-05-08 08:25:26 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes. -2019-05-08T08:25:26.455658752Z 2019-05-08 08:25:26 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079. 
-2019-05-08T08:25:26.47651031Z 2019-05-08 08:25:26 INFO NettyBlockTransferService:54 - Server created on demo1-1557303918993-driver-svc.default.svc:7079 -2019-05-08T08:25:26.476533172Z 2019-05-08 08:25:26 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy -2019-05-08T08:25:26.503099521Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None) -2019-05-08T08:25:26.506168762Z 2019-05-08 08:25:26 INFO BlockManagerMasterEndpoint:54 - Registering block manager demo1-1557303918993-driver-svc.default.svc:7079 with 110.0 MB RAM, BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None) -2019-05-08T08:25:26.529524775Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None) -2019-05-08T08:25:26.529543725Z 2019-05-08 08:25:26 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None) -2019-05-08T08:25:26.661414752Z 2019-05-08 08:25:26 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c777e7b{/metrics/json,null,AVAILABLE,@Spark} -2019-05-08T08:25:30.459756195Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.20.90.161:52168) with ID 1 -2019-05-08T08:25:30.534179215Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 -2019-05-08T08:25:30.679510273Z 2019-05-08 08:25:30 INFO BlockManagerMasterEndpoint:54 - Registering block manager 172.20.90.161:36718 with 110.0 MB RAM, BlockManagerId(1, 172.20.90.161, 36718, None) -2019-05-08T08:25:30.906713226Z 2019-05-08 08:25:30 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38 -2019-05-08T08:25:30.93537711Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions -2019-05-08T08:25:30.936000643Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38) -2019-05-08T08:25:30.936506781Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Parents of final stage: List() -2019-05-08T08:25:30.938152322Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Missing parents: List() -2019-05-08T08:25:30.958509715Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents -2019-05-08T08:25:31.128459296Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 110.0 MB) -2019-05-08T08:25:31.172704042Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 110.0 MB) -2019-05-08T08:25:31.178025215Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on demo1-1557303918993-driver-svc.default.svc:7079 (size: 1256.0 B, free: 110.0 MB) -2019-05-08T08:25:31.182000364Z 2019-05-08 08:25:31 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161 -2019-05-08T08:25:31.202640906Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1)) 
2019-05-08T08:25:31.203502967Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2019-05-08T08:25:31.245126257Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.90.161, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.805815672Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.90.161:36718 (size: 1256.0 B, free: 110.0 MB)
2019-05-08T08:25:31.946492966Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.90.161, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.957903365Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 727 ms on 172.20.90.161 (executor 1) (1/2)
2019-05-08T08:25:31.99308236Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 47 ms on 172.20.90.161 (executor 1) (2/2)
2019-05-08T08:25:31.994764897Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-05-08T08:25:31.995390219Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.998 s
2019-05-08T08:25:32.003622135Z 2019-05-08 08:25:32 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.094511 s
2019-05-08T08:25:32.005407995Z Pi is roughly 3.1436157180785904
2019-05-08T08:25:32.011499948Z 2019-05-08 08:25:32 INFO AbstractConnector:318 - Stopped Spark@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:32.014105609Z 2019-05-08 08:25:32 INFO SparkUI:54 - Stopped Spark web UI at http://demo1-1557303918993-driver-svc.default.svc:4040
2019-05-08T08:25:32.01861939Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2019-05-08T08:25:32.019973046Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2019-05-08T08:25:32.025136562Z 2019-05-08 08:25:32 WARN ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
2019-05-08T08:25:32.087137746Z 2019-05-08 08:25:32 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-05-08T08:25:32.097659039Z 2019-05-08 08:25:32 INFO MemoryStore:54 - MemoryStore cleared
2019-05-08T08:25:32.098360561Z 2019-05-08 08:25:32 INFO BlockManager:54 - BlockManager stopped
2019-05-08T08:25:32.104432515Z 2019-05-08 08:25:32 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-05-08T08:25:32.10761075Z 2019-05-08 08:25:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-05-08T08:25:32.114734944Z 2019-05-08 08:25:32 INFO SparkContext:54 - Successfully stopped SparkContext
2019-05-08T08:25:32.117170277Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Shutdown hook called
2019-05-08T08:25:32.118273045Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bdb4e416-5ab7-420c-905e-ef43c30fb187
2019-05-08T08:25:32.120019227Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/spark-06dbab1f-13aa-474c-a1db-8845e14627bf
```

##### delete the spark job
```
arena delete --type=sparkjob demo
```
You will find that the spark job is deleted.
```
sparkapplication.sparkoperator.k8s.io "demo1" deleted
time="2019-05-08T17:27:06+08:00" level=info msg="The Job demo1 has been deleted successfully"
configmap "demo1-sparkjob" deleted
```

Congratulations! You've run the distributed spark job with `arena` successfully.
diff --git a/archived/docs/userguide/12-volcanojob.md b/archived/docs/userguide/12-volcanojob.md
deleted file mode 100644
index 6316d4f67..000000000
--- a/archived/docs/userguide/12-volcanojob.md
+++ /dev/null
@@ -1,156 +0,0 @@

# Arena supports and simplifies volcano jobs

Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms currently missing from
Kubernetes that are commonly required by many classes of batch & elastic workloads, including:

1. machine learning/deep learning,
2. bioinformatics/genomics, and
3. other "big data" applications.

## prerequisites

- a k8s deployment
- volcano deployed following the steps in kubernetes-artifacts/volcano-operator/README.md

### 1. To run a batch/distributed volcano job, you may need to specify:

```
--minAvailable int       The minimal available pods to run for this Job. default value is 1 (default 1)
--name string            override name
--queue string           Specifies the queue that will be used in the scheduler, default queue is used this leaves empty (default "default")
--schedulerName string   Specifies the scheduler Name, default is volcano when not specified (default "volcano")
--taskCPU string         cpu request for each task replica / pod. default value is 250m (default "250m")
--taskImages strings     the docker images of different tasks of volcano job. default used 3 tasks with ubuntu,nginx and busybox images (default [ubuntu,nginx,busybox])
--taskMemory string      memory request for each task replica/pod.default value is 128Mi) (default "128Mi")
--taskName string        the task name of volcano job, default value is task (default "task")
--taskPort int           the task port number. default value is 2222 (default 2222)
--taskReplicas int       the task replica's number to run the distributed tasks. default value is 1 (default 1)
```

### 2. More information about volcano jobs

Arena volcano jobs are based on [volcano](https://github.com/volcano-sh/volcano).
You can get more information about volcano from https://volcano.sh/

### 3. How to use Arena volcano jobs

##### install volcano

Deploy volcano following the steps in kubernetes-artifacts/volcano-operator/README.md.

To install the chart with the release name `volcano-release`:

```bash
$ helm install --name volcano-release kubernetes-artifacts/volcano-operator
```

To verify that all deployments are running, use the command below:

```bash
 kubectl get deployment --all-namespaces | grep {release_name}
```
You should get output similar to the following, where the three deployments for the controllers, admission and scheduler are running:

```bash
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
{release_name}-admission     1/1     1            1           4s
{release_name}-controllers   1/1     1            1           4s
{release_name}-scheduler     1/1     1            1           4s
```

To verify that all pods are running, use the command below:

```bash
 kubectl get pods --all-namespaces | grep {release_name}
```

You should get output similar to the following, where the pods for the controllers, admission, admission-init and scheduler are running:

```bash
NAMESPACE   NAME                                           READY   STATUS      RESTARTS   AGE
default     volcano-release-admission-cbfdb8549-dz5hg      1/1     Running     0          33s
default     volcano-release-admission-init-7xmzd           0/1     Completed   0          33s
default     volcano-release-controllers-7967fffb8d-7vnn9   1/1     Running     0          33s
default     volcano-release-scheduler-746f6557d8-9pfg6     1/1     Running     0          33s
```
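helm can also summarize the state of the release in one command (a sketch; assumes the Helm v2 client used for the installation above):

```bash
# Shows the resources belonging to the volcano-release release
helm status volcano-release
```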
##### submit a volcano job

```
arena submit volcanojob --name=demo
```

The result looks like this:
```
configmap/demo-volcanojob created
configmap/demo-volcanojob labeled
job.batch.volcano.sh/demo created
INFO[0003] The Job demo has been submitted successfully
INFO[0003] You can run `arena get demo --type volcanojob` to check the job status
```

If you want to provide more command line parameters:
```
./bin/arena submit volcanojob --name demo12 --taskImages busybox,busybox --taskReplicas 2
```

In the above case it creates two tasks, each with 2 replicas, as shown below:
```
arena get --type volcanojob demo12
```
The result is as below:
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 2m

NAME    STATUS     TRAINER     AGE  INSTANCE         NODE
demo12  SUCCEEDED  VOLCANOJOB  2m   demo12-task-0-0  11.245.101.184
demo12  SUCCEEDED  VOLCANOJOB  2m   demo12-task-0-1  11.245.101.184
demo12  SUCCEEDED  VOLCANOJOB  2m   demo12-task-1-0  11.245.101.184
demo12  SUCCEEDED  VOLCANOJOB  2m   demo12-task-1-1  11.245.101.184
```
##### get the volcano job status

```
arena get --type=volcanojob demo
```
When the job is running or has succeeded, you will see the result below.
```
STATUS: RUNNING/SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 45s

NAME  STATUS     TRAINER     AGE  INSTANCE       NODE
demo  SUCCEEDED  VOLCANOJOB  59s  demo-task-0-0  11.245.101.184
demo  RUNNING    VOLCANOJOB  59s  demo-task-1-0  11.245.101.184
demo  SUCCEEDED  VOLCANOJOB  59s  demo-task-2-0  11.245.101.184

```
##### list arena jobs

```
arena list
```
You can observe the output below:
```
NAME  STATUS   TRAINER     AGE  NODE
demo  RUNNING  VOLCANOJOB  2m   11.245.101.184
```

##### delete the volcano job

```
arena delete --type=volcanojob demo
```
You will find that the volcano job is deleted.
```
job.batch.volcano.sh "demo" deleted
configmap "demo-volcanojob" deleted
INFO[0000] The Job demo has been deleted successfully
```

Congratulations! You've run the batch/distributed volcano job with `arena` successfully.
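As a final tip, each of these arena commands manipulates a Volcano `Job` custom resource under the hood, so you can also inspect it directly with `kubectl` (a sketch; the resource type follows the `job.batch.volcano.sh "demo" deleted` line above):

```
# The Volcano job object created by arena (while the job still exists)
kubectl get jobs.batch.volcano.sh demo -o yaml
# The pods created for each task replica
kubectl get pods | grep demo-task
```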
diff --git a/archived/docs/userguide/13-preempted-mpijob.md b/archived/docs/userguide/13-preempted-mpijob.md deleted file mode 100644 index 43dda0a10..000000000 --- a/archived/docs/userguide/13-preempted-mpijob.md +++ /dev/null @@ -1,169 +0,0 @@ - -# Arena supports Priority and Preemption for MPIJob - -## prerequisites - -- k8s > 1.11 - -1.Create `PriorityClass` with the yaml below: - -```yaml -apiVersion: scheduling.k8s.io/v1beta1 -description: Used for the critical app -kind: PriorityClass -metadata: - name: critical -value: 1100000 - ---- - -apiVersion: scheduling.k8s.io/v1beta1 -description: Used for the medium app -kind: PriorityClass -metadata: - name: medium -value: 1000000 -``` - -Save the template that applies in a file named `pc.yaml`, and create the `PriorityClass`: - -``` -kubectl create -f pc.yaml -``` - -2.There is only 1 GPU available in the Kubernetes cluster - -``` -# arena top node -NAME IPADDRESS ROLE GPU(Total) GPU(Allocated) -192.168.0.20 192.168.0.20 master 0 0 -192.168.0.21 192.168.0.21 master 0 0 -192.168.0.22 192.168.0.22 master 0 0 -192.168.0.23 192.168.0.23 1 0 ------------------------------------------------------------------------------------------ -Allocated/Total GPUs In Cluster: -0/1 (0%) -``` - -3.Run the MPI training Job with `medium` priority: - - -The following command is an example. - -``` -# arena submit mpi \ - --name=medium \ - --priority=medium \ - --gpus=1 \ - --workers=1 \ - --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \ - "mpirun tail -f /dev/null" -configmap/medium-mpijob created -configmap/medium-mpijob labeled -mpijob.kubeflow.org/medium created -INFO[0000] The Job medium has been submitted successfully -INFO[0000] You can run `arena get medium --type mpijob` to check the job status -``` - -4.Get the details of the specific job - -``` -# arena get medium -STATUS: RUNNING -NAMESPACE: default -PRIORITY: medium -TRAINING DURATION: 58s - -NAME STATUS TRAINER AGE INSTANCE NODE -medium RUNNING MPIJOB 58s medium-launcher-sz5xj 192.168.0.23 -medium RUNNING MPIJOB 58s medium-worker-0 192.168.0.23 -``` - -5.The only one GPU is used by MPI training Job `medium` - -``` -# arena top node -d - -NAME: cn-hangzhou.192.168.0.23 -IPADDRESS: 192.168.0.23 -ROLE: - -NAMESPACE NAME GPU REQUESTS GPU LIMITS -default medium-worker-0 1 1 - -Total GPUs In Node cn-hangzhou.192.168.0.23: 1 -Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%) ------------------------------------------------------------------------------------------ - -Allocated/Total GPUs In Cluster: 1/1 (100%) -``` - -6.Run the MPI training Job with `critical` priority: - -``` -# arena submit mpi \ - --name=critical \ - --priority=critical \ - --gpus=1 \ - --workers=1 \ - --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \ - "mpirun tail -f /dev/null" -``` - -7.Check MPI Training Job `medium`, and find it's preempted by critical-worker-0 - -``` -# kubectl get events --field-selector involvedObject.name=medium-worker-0 -LAST SEEN TYPE REASON OBJECT MESSAGE -15m Normal Scheduled pod/medium-worker-0 Successfully assigned default/medium-worker-0 to 192.168.0.23 -14m Normal Pulled pod/medium-worker-0 Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine -14m Normal Created pod/medium-worker-0 Created container mpi -14m Normal Started pod/medium-worker-0 Started container mpi -2m32s Normal Preempted pod/medium-worker-0 by default/critical-worker-0 
on node 192.168.0.23
2m32s       Normal    Killing     pod/medium-worker-0   Stopping container mpi
```

8.Check the details of the MPI training job `medium`; it has turned to FAILED:

```
# arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m

NAME    STATUS  TRAINER  AGE  INSTANCE               NODE
medium  FAILED  MPIJOB   20m  medium-launcher-sz5xj  192.168.0.23
```

9.Check the details of the MPI training job `critical`; it's running:

```
# arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
critical  RUNNING  MPIJOB   10m  critical-launcher-mfffs  192.168.0.23
critical  RUNNING  MPIJOB   10m  critical-worker-0        192.168.0.23
```

10.And we can see that the only GPU is used by the MPI training job `critical`:

```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE:

NAMESPACE  NAME               GPU REQUESTS  GPU LIMITS
default    critical-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
-----------------------------------------------------------------------------------------
```

Congratulations! You've run jobs with priorities and preemption with `arena` successfully.
diff --git a/archived/docs/userguide/14-submit-with-node-selector.md b/archived/docs/userguide/14-submit-with-node-selector.md
deleted file mode 100644
index 83113b79b..000000000
--- a/archived/docs/userguide/14-submit-with-node-selector.md
+++ /dev/null
@@ -1,160 +0,0 @@


Arena supports assigning jobs to particular k8s nodes (currently only mpi jobs and tf jobs are supported).

Here are some usage examples.

1.Query the k8s cluster information:
```
# kubectl get nodes
NAME                       STATUS   ROLES    AGE    VERSION
cn-beijing.192.168.3.225   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.226   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.227   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.228   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.229   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.230   Ready    <none>   2d22h  v1.12.6-aliyun.1
```
2.Give labels to some nodes. For example, add the label "gpu_node=ok" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the label "ssd_node=ok" to node "cn-beijing.192.168.3.230":
```
# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
node/cn-beijing.192.168.3.228 labeled
# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
node/cn-beijing.192.168.3.229 labeled
# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
node/cn-beijing.192.168.3.230 labeled
```
## for MPI job
1.When submitting a job, you can assign it to particular nodes with the "--selector" option:
```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --selector gpu_node=ok \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --tensorboard \
    --loglevel debug \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2.Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 21s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-launcher-7jn4q  192.168.3.229
mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-worker-0        192.168.3.229

Your tensorboard will be available on:
http://192.168.3.225:31611
```
The jobs are running on node cn-beijing.192.168.3.229 (IP 192.168.3.229).

3.You can use "--selector" multiple times. For example, you can use "--selector gpu_node=ok --selector ssd_node=ok" in the arena submit command, which means that the job should run on nodes that have both the "gpu_node=ok" and "ssd_node=ok" labels.

## for tf job

1.Because there are four roles ("PS", "Worker", "Evaluator", "Chief") in a tf job, you can use "--selector" to assign nodes, which takes effect for all roles. For example:
```
arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --selector ssd_node=ok \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"
```
Use the following command to check the job status:

```
# arena get tf
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    24s  tf-ps-0      192.168.3.230
tf    PENDING  TFJOB    24s  tf-worker-0  192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:31867
```

The jobs (including "PS" and "Worker") are running on cn-beijing.192.168.3.230 (IP 192.168.3.230, label "ssd_node=ok").

2.You can also assign nodes per role. For example, if you want to run the "PS" job on nodes that have the label "ssd_node=ok" and the "Worker" job on nodes that have the label "gpu_node=ok", use the "--ps-selector" and "--worker-selector" options:
```
arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --ps-selector ssd_node=ok \
    --worker-selector gpu_node=ok \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"
```

Then check the job's status:

```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    23s  tf-ps-0      192.168.3.230
tf    RUNNING  TFJOB    23s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:30162
```

The "PS" job is running on cn-beijing.192.168.3.230 (IP 192.168.3.230, label "ssd_node=ok") and the "Worker" job is running on cn-beijing.192.168.3.228 (IP 192.168.3.228, label "gpu_node=ok").

3.If you use "--selector" in the "arena submit tf" command and also use "--ps-selector" (or "--worker-selector", "--evaluator-selector", "--chief-selector"), the value of "--ps-selector" overrides the value of "--selector". For example:

```
arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --ps-selector ssd_node=ok \
    --selector gpu_node=ok \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"
```

The "PS" job will run on nodes labeled "ssd_node=ok" and the other jobs on nodes labeled "gpu_node=ok". Now verify this conclusion by checking the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    39s  tf-ps-0      192.168.3.230
tf    RUNNING  TFJOB    39s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:32105
```
As you can see, the "PS" job is running on a node with the label "ssd_node=ok" and the other jobs are running on nodes with the label "gpu_node=ok".
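Before choosing selectors, it can help to list which nodes currently carry a given label; a plain kubectl query does this (a sketch):

```
# List the nodes that have the gpu_node=ok label
kubectl get nodes -l gpu_node=ok
```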
diff --git a/archived/docs/userguide/14-submit-with-node-toleration.md b/archived/docs/userguide/14-submit-with-node-toleration.md
deleted file mode 100644
index 2ee14c598..000000000
--- a/archived/docs/userguide/14-submit-with-node-toleration.md
+++ /dev/null
@@ -1,85 +0,0 @@


Arena supports submitting a job that tolerates k8s nodes with taints (currently only mpi jobs and tf jobs are supported).

Here are some usage examples.

1.Query the k8s cluster information:
```
# kubectl get nodes
NAME                       STATUS   ROLES    AGE    VERSION
cn-beijing.192.168.3.225   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.226   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.227   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.228   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.229   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.230   Ready    <none>   2d22h  v1.12.6-aliyun.1
```
2.Add taints to some k8s nodes. For example, add the taint "gpu_node=invalid:NoSchedule" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the taint "ssd_node=invalid:NoSchedule" to node "cn-beijing.192.168.3.230". After this, no ordinary pod can be scheduled onto these nodes:
```
# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.228 tainted
# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.229 tainted
# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
node/cn-beijing.192.168.3.230 tainted
```
3.When submitting a job, you can let it tolerate tainted nodes with the "--toleration" option:
```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --toleration ssd_node \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --tensorboard \
    --loglevel debug \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.230
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:30052
```
The jobs are running on node cn-beijing.192.168.3.230 (IP 192.168.3.230, taint "ssd_node=invalid").

4.You can use "--toleration" multiple times. For example, you can use "--toleration gpu_node --toleration ssd_node" in the arena submit command, which means that the job tolerates nodes with the taint "gpu_node=invalid" and with the taint "ssd_node=invalid", as shown below:

```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --toleration ssd_node \
    --toleration gpu_node \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --tensorboard \
    --loglevel debug \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job status:

```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.229
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:30052
```

5.You can use "--toleration all" to tolerate all node taints.
diff --git a/archived/docs/userguide/15-custom-serving-sample-beijing.jpg b/archived/docs/userguide/15-custom-serving-sample-beijing.jpg
deleted file mode 100644
index 9ed795482..000000000
Binary files a/archived/docs/userguide/15-custom-serving-sample-beijing.jpg and /dev/null differ
diff --git a/archived/docs/userguide/15-custom-serving-sample-beijing_out.jpg b/archived/docs/userguide/15-custom-serving-sample-beijing_out.jpg
deleted file mode 100644
index 8f0bc2074..000000000
Binary files a/archived/docs/userguide/15-custom-serving-sample-beijing_out.jpg and /dev/null differ
diff --git a/archived/docs/userguide/15-custom-serving-sample.md b/archived/docs/userguide/15-custom-serving-sample.md
deleted file mode 100644
index fd2300d82..000000000
--- a/archived/docs/userguide/15-custom-serving-sample.md
+++ /dev/null
@@ -1,80 +0,0 @@
# Serving Trained Model with arena

You can use arena to deploy your trained model as RESTful APIs. To illustrate the usage, we use the sample project [fast-style-transfer](https://github.com/floydhub/fast-style-transfer); to save time, we use its pre-trained model and add the model to the docker image.
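The image used in this example already bundles the pre-trained model. If you package your own model instead, the image build is the usual docker build-and-push flow (a sketch; the registry and tag are placeholders):

```
# Build an image containing the serving script and the trained model, then push it
docker build -t <your-registry>/fast-style-transfer:latest .
docker push <your-registry>/fast-style-transfer:latest
```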

### 1.Serve Mode

We use the app.py script in the project to start a RESTful server. You can use arena to deploy the trained model:

```
# arena serve custom \
    --name=fast-style-transfer \
    --gpus=1 \
    --version=alpha \
    --replicas=1 \
    --restful-port=5000 \
    --image=happy365/fast-style-transfer:latest \
    "python app.py"
```

Check the status of the custom serving job:

```
# arena serve list
NAME                 TYPE    VERSION  DESIRED  AVAILABLE  ENDPOINT_ADDRESS  PORTS
fast-style-transfer  CUSTOM  alpha    1        0          172.21.8.94       grpc:8001,restful:5000
```

Because the docker image is very large, pulling it takes some time. We can use kubectl to check the pod status:

```
# kubectl get po
NAME                                                        READY   STATUS              RESTARTS   AGE
fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj   0/1     ContainerCreating   0          6m44s
```

### 2.Access the service

We can use a client to access the service. Run the following command to create a client:
```
# kubectl run sample-client \
    --generator=run-pod/v1 \
    --image=happy365/arena-serve-custem-sample-client:latest \
    --command -- \
    /bin/sleep infinity
```

Then we can query the status of sample-client:
```
# kubectl get po sample-client
NAME            READY   STATUS    RESTARTS   AGE
sample-client   1/1     Running   0          87s

```
We need to query the service name; it is a combination of the job name and the version (the sample job name is fast-style-transfer and the version is alpha):

```
# kubectl get svc fast-style-transfer-alpha
NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
fast-style-transfer-alpha   ClusterIP   172.21.1.114   <none>        5000/TCP   31m
```

Now we can use the "kubectl exec" command to log in to the container:

```
# kubectl exec -ti sample-client /bin/sh
#
```

Then we use the "curl" command to access the custom serving job:
```
# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
```
The input is an image named "beijing.jpg" ![beijing.jpg](15-custom-serving-sample-beijing.jpg), which is stored in "/root/input"; the output is stored in "/root/output". You can use the "kubectl cp" command to copy the output image from the container to the host:
```
# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
```
Now you can view the resulting image at ~/beijing_out.jpg: ![beijing_out.jpg](15-custom-serving-sample-beijing_out.jpg)



diff --git a/archived/docs/userguide/16-assign-config-file.md b/archived/docs/userguide/16-assign-config-file.md
deleted file mode 100644
index 71000e533..000000000
--- a/archived/docs/userguide/16-assign-config-file.md
+++ /dev/null
@@ -1,73 +0,0 @@
# Assign configuration files for jobs

You can pass configuration files to containers when submitting jobs.

This feature only supports the following job types:

* tfjob
* mpijob

## 1.usage

You can use `--config-file <host path>:<container path>` to assign a configuration file to a container, with the following rules (the angle-bracket placeholders are reconstructed; they were lost in the original text):

* if `<host path>` is assigned and `<container path>` is not, `<container path>` is taken to be the same as `<host path>`
* `<container path>` must be a file with an absolute path
* you can use `--config-file` more than once in a command, e.g.: "--config-file /tmp/test1.conf:/etc/config/test1.conf --config-file /tmp/test2.conf:/etc/config/test2.conf"
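For example, under these rules an MPI job can mount a file with the same flag (a sketch; the job name is illustrative and the image is the one used in the MPI examples elsewhere in these docs):

```
arena submit mpi \
    --name=mpi-with-config \
    --gpus=1 \
    --workers=1 \
    --config-file /tmp/test1.conf:/etc/config/test1.conf \
    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
```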

## 2.sample

First, we create a test file named "test-config.json" at "/tmp/test-config.json". We want to push this file to the containers of a tfjob (or mpijob), with the in-container path "/etc/config/config.json".
```
# cat /tmp/test-config.json
{
    "key": "job-config"

}
```
Second, use the following command to create the tfjob:
```
# arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --config-file /tmp/test-config.json:/etc/config/config.json \
    "python /app/main.py"
```
Wait a minute, then get the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    16s  tf-ps-0      192.168.7.18
tf    RUNNING  TFJOB    16s  tf-worker-0  192.168.7.16

Your tensorboard will be available on:
http://192.168.7.10:31825
```
Use kubectl to check whether the file is in the containers:
```
# kubectl exec -ti tf-ps-0 -- cat /etc/config/config.json
{
    "key": "job-config"

}
# kubectl exec -ti tf-worker-0 -- cat /etc/config/config.json
{
    "key": "job-config"

}

```

As you can see, the file is in the containers.
diff --git a/archived/docs/userguide/17-pytorchjob-standalone.md b/archived/docs/userguide/17-pytorchjob-standalone.md
deleted file mode 100644
index 5fc5da6d0..000000000
--- a/archived/docs/userguide/17-pytorchjob-standalone.md
+++ /dev/null
@@ -1,95 +0,0 @@
This example shows how to use `Arena` to submit a PyTorch stand-alone job. The example downloads the source code from a git URL.

1. The first step is to check the available resources.
   ```
   ➜ arena top node
   NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
   cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
   cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
   cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
   cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
   cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
   cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
   -----------------------------------------------------------------------------------------
   Allocated/Total GPUs In Cluster:
   0/12 (0%)
   ```
   There are 3 nodes with GPUs available for running training jobs.

2. Submit a PyTorch training job. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
   ```
   # Single gpu card
   ➜ arena --loglevel info submit pytorch \
       --name=pytorch-local-git \
       --gpus=1 \
       --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
       --sync-mode=git \
       --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
       "python /root/code/mnist-pytorch/mnist.py --backend gloo"
   configmap/pytorch-local-git-pytorchjob created
   configmap/pytorch-local-git-pytorchjob labeled
   pytorchjob.kubeflow.org/pytorch-local-git created
   INFO[0000] The Job pytorch-local-git has been submitted successfully
   INFO[0000] You can run `arena get pytorch-local-git --type pytorchjob` to check the job status
   ```

   > The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify one with `--workingDir`.
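   > For example, to have the code extracted under `/workspace` instead, only one flag changes (a sketch; the path is illustrative, and the script path assumes git-sync extracts the code under the new working directory):

   ```
   ➜ arena --loglevel info submit pytorch \
       --name=pytorch-local-git \
       --gpus=1 \
       --workingDir=/workspace \
       --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
       --sync-mode=git \
       --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
       "python /workspace/code/mnist-pytorch/mnist.py --backend gloo"
   ```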
- - > If you are using the private git repo, you can use the following command: - - ``` - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-local-git \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --env=GIT_SYNC_USERNAME=yourname \ - --env=GIT_SYNC_PASSWORD=yourpwd \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo" - ``` - -3. List all the jobs. - ``` - ➜ arena list - NAME STATUS TRAINER AGE NODE - pytorch-local-git SUCCEEDED PYTORCHJOB 21h N/A - ``` - -4. Get the details of the this job. - ``` - ➜ arena get pytorch-local-git - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 35s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-local-git SUCCEEDED PYTORCHJOB 23h pytorch-local-git-master-0 172.16.0.210 - ``` - -5. Check logs. - ``` - ➜ arena logs pytorch-local-git - WORLD_SIZE: 1, CURRENT_RANK: 0 - args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000) - Using CUDA - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint8 = np.dtype([("qint8", np.int8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint16 = np.dtype([("qint16", np.int16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint32 = np.dtype([("qint32", np.int32, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - np_resource = np.dtype([("resource", np.ubyte, 1)]) - Train Epoch: 1 [0/60000 (0%)] loss=2.3000 - Train Epoch: 1 [640/60000 (1%)] loss=2.2135 - Train Epoch: 1 [1280/60000 (2%)] loss=2.1705 - Train Epoch: 1 [1920/60000 (3%)] loss=2.0767 - Train Epoch: 1 [2560/60000 (4%)] loss=1.8681 - ... 
   ```
diff --git a/archived/docs/userguide/18-pytorchjob-distributed.md b/archived/docs/userguide/18-pytorchjob-distributed.md
deleted file mode 100644
index 91a6c2fc2..000000000
--- a/archived/docs/userguide/18-pytorchjob-distributed.md
+++ /dev/null
@@ -1,131 +0,0 @@
This example shows how to use `Arena` to submit a PyTorch distributed job. The example downloads the source code from a git URL.

1. The first step is to check the available resources.
   ```
   ➜ arena top node
   NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
   cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
   cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
   cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
   cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
   cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
   cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
   -----------------------------------------------------------------------------------------
   Allocated/Total GPUs In Cluster:
   0/12 (0%)
   ```
   There are 3 nodes with GPUs available for running training jobs.

2. Submit a PyTorch distributed training job with 2 nodes and one GPU card each. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
   ```
   ➜ arena --loglevel info submit pytorch \
       --name=pytorch-dist-git \
       --gpus=1 \
       --workers=2 \
       --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
       --sync-mode=git \
       --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
       "python /root/code/mnist-pytorch/mnist.py --backend gloo"
   configmap/pytorch-dist-git-pytorchjob created
   configmap/pytorch-dist-git-pytorchjob labeled
   pytorchjob.kubeflow.org/pytorch-dist-git created
   INFO[0000] The Job pytorch-dist-git has been submitted successfully
   INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status
   ```

   > The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify one with `--workingDir`.

   > `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in the pytorch-operator). The parameter defaults to 1 and can be omitted, in which case the job runs as a stand-alone job.


3. List all the jobs.
   ```
   ➜ arena list
   NAME              STATUS     TRAINER     AGE  NODE
   pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  N/A
   ```

4. Get the details of this job. There are 2 instances of this job, and instance `pytorch-dist-git-master-0` is rank0. Arena simplifies the process of submitting distributed jobs with `PyTorch-Operator`:
a `Service` is created for the `master` instance so that the other nodes can access it through the `Service` name, and the environment variables needed to initialize the PyTorch distributed process group (dist.init_process_group) are injected into each instance: `MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, `RANK`. `MASTER_PORT` is assigned automatically; `MASTER_ADDR` is "localhost" in the `master` instance and the `Service` name of the `master` in the other instances; `WORLD_SIZE` is the total number of instances; and `RANK` is the serial number of the current node, which is 0 for the `master`, while for a `Worker` instance it is the index in its instance-name suffix plus one.
For example, the `RANK` of instance `pytorch-dist-git-worker-0` below is `0 + 1 = 1`. -In Arena, the value of the `--workers` parameter includes one `master` instance, because the `master` also takes part in the training. - ``` - ➜ arena get pytorch-dist-git - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 1m - - NAME              STATUS     TRAINER     AGE  INSTANCE                   NODE - pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-master-0  172.16.0.210 - pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-worker-0  172.16.0.210 - ``` - -5. Check logs. - ``` - ➜ arena logs pytorch-dist-git - WORLD_SIZE: 2, CURRENT_RANK: 0 - args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000) - Using CUDA - Using distributed PyTorch with gloo backend - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint8 = np.dtype([("qint8", np.int8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint16 = np.dtype([("qint16", np.int16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint32 = np.dtype([("qint32", np.int32, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - np_resource = np.dtype([("resource", np.ubyte, 1)]) - Train Epoch: 1 [0/60000 (0%)]	loss=2.3000 - Train Epoch: 1 [640/60000 (1%)]	loss=2.2135 - Train Epoch: 1 [1280/60000 (2%)]	loss=2.1705 - Train Epoch: 1 [1920/60000 (3%)]	loss=2.0767 - Train Epoch: 1 [2560/60000 (4%)]	loss=1.8681 - Train Epoch: 1 [3200/60000 (5%)]	loss=1.4142 - Train Epoch: 1 [3840/60000 (6%)]	loss=1.0009 - ... - ``` - - > For a distributed job with multiple instances, the default output is the log of rank0 (the `master` instance).
If you want to view the log of a specific instance, pass its name with `-i`, for example: - - ``` - ➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 - WORLD_SIZE: 2, CURRENT_RANK: 1 - args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000) - Using CUDA - Using distributed PyTorch with gloo backend - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint8 = np.dtype([("qint8", np.int8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint16 = np.dtype([("qint16", np.int16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint32 = np.dtype([("qint32", np.int32, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - np_resource = np.dtype([("resource", np.ubyte, 1)]) - Train Epoch: 1 [0/60000 (0%)]	loss=2.3000 - Train Epoch: 1 [640/60000 (1%)]	loss=2.2135 - Train Epoch: 1 [1280/60000 (2%)]	loss=2.1705 - Train Epoch: 1 [1920/60000 (3%)]	loss=2.0767 - Train Epoch: 1 [2560/60000 (4%)]	loss=1.8681 - Train Epoch: 1 [3200/60000 (5%)]	loss=1.4142 - ``` - - > In addition, you can view just the last few lines of the logs with the `-t <lines>` parameter, for example: - - ``` - ➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5 - Train Epoch: 1 [58880/60000 (98%)]	loss=0.2048 - Train Epoch: 1 [59520/60000 (99%)]	loss=0.0646 - - accuracy=0.9661 - - ``` - > For more parameters, see `arena logs --help` -
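- - > If you want to double-check the injected environment variables described in step 4, they are recorded in the pod spec and remain visible even after the job finishes. A quick check (it assumes the pods have not been cleaned up): - ``` - # show the env vars pytorch-operator injected into the worker - ➜ kubectl get pod pytorch-dist-git-worker-0 -o jsonpath='{.spec.containers[0].env}' - ```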
 \ No newline at end of file diff --git a/archived/docs/userguide/19-pytorchjob-tensorboard.md b/archived/docs/userguide/19-pytorchjob-tensorboard.md deleted file mode 100644 index a30ea0a14..000000000 --- a/archived/docs/userguide/19-pytorchjob-tensorboard.md +++ /dev/null @@ -1,75 +0,0 @@ -This guide walks through the steps to submit a pytorch distributed job and visualize it with `Tensorboard`. The sample downloads the source code from a git URL. - -1. The first step is to check the available resources. - ``` - ➜ arena top node - NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated) - cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0 - cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0 - cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0 - cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0 - cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0 - cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ``` - There are 3 available nodes with GPUs for running training jobs. - -2. Submit a pytorch distributed training job with 2 workers, each using one GPU card. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git). - ``` - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-dist-tensorboard \ -         --gpus=1 \ -         --workers=2 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --tensorboard \ -         --logdir=/root/logs \ -         "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs" - configmap/pytorch-dist-tensorboard-pytorchjob created - configmap/pytorch-dist-tensorboard-pytorchjob labeled - service/pytorch-dist-tensorboard-tensorboard created - deployment.apps/pytorch-dist-tensorboard-tensorboard created - pytorchjob.kubeflow.org/pytorch-dist-tensorboard created - INFO[0000] The Job pytorch-dist-tensorboard has been submitted successfully - INFO[0000] You can run `arena get pytorch-dist-tensorboard --type pytorchjob` to check the job status - ``` - - > The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify one with `--workingDir`. - - > `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in pytorch-operator). The parameter defaults to 1, in which case the job runs as a stand-alone job. - - > `logdir` indicates where tensorboard reads the PyTorch event logs. - -3. List all the jobs. - ``` - ➜ arena list - NAME                       STATUS     TRAINER     AGE  NODE - pytorch-dist-tensorboard   SUCCEEDED  PYTORCHJOB  22h  N/A - ``` - -4. Get the details of this job. - ``` - ➜ arena get pytorch-dist-tensorboard - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 15m - - NAME                      STATUS     TRAINER     AGE  INSTANCE                           NODE - pytorch-dist-tensorboard  SUCCEEDED  PYTORCHJOB  22h  pytorch-dist-tensorboard-master-0  172.16.0.210 - pytorch-dist-tensorboard  SUCCEEDED  PYTORCHJOB  22h  pytorch-dist-tensorboard-worker-0  172.16.0.210 - - Your tensorboard will be available on: - http://172.16.0.205:30583 - ```
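- - > Besides the NodePort shown above, `kubectl port-forward` can tunnel to the tensorboard service that Arena created (a sketch; it assumes tensorboard listens on the default port 6006 inside the pod, so check the service first and adjust if needed): - ``` - # inspect the ports exposed by the tensorboard service - ➜ kubectl get svc pytorch-dist-tensorboard-tensorboard - # forward a local port to it, then open http://localhost:6006 - ➜ kubectl port-forward svc/pytorch-dist-tensorboard-tensorboard 6006:6006 - ```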
- - > Notice: you can access the tensorboard by using `172.16.0.205:30583`. You can consider `sshuttle` if you can't access the tensorboard directly from your laptop. For example: - ``` - # you can install sshuttle==0.74 on your Mac with python2.7 - ➜ pip install sshuttle==0.74 - # 0/0 -> 0.0.0.0/0 - ➜ sshuttle -r root@39.104.17.205 0/0 - ``` - ![](19-pytorchjob-tensorboard.png) diff --git a/archived/docs/userguide/19-pytorchjob-tensorboard.png b/archived/docs/userguide/19-pytorchjob-tensorboard.png deleted file mode 100644 index 9192baa00..000000000 Binary files a/archived/docs/userguide/19-pytorchjob-tensorboard.png and /dev/null differ diff --git a/archived/docs/userguide/2-tensorboard.jpg b/archived/docs/userguide/2-tensorboard.jpg deleted file mode 100644 index c11e3fa60..000000000 Binary files a/archived/docs/userguide/2-tensorboard.jpg and /dev/null differ diff --git a/archived/docs/userguide/2-tfjob-tensorboard.md b/archived/docs/userguide/2-tfjob-tensorboard.md deleted file mode 100644 index 713c967a1..000000000 --- a/archived/docs/userguide/2-tfjob-tensorboard.md +++ /dev/null @@ -1,109 +0,0 @@ - -Here is an example of how you can use `Arena` for machine learning training. It downloads the source code from a git URL, and uses Tensorboard to visualize the TensorFlow computation graph and plot quantitative metrics. - -1. The first step is to check the available resources - -``` -arena top node -NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated) -i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0 -i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0 -i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0 -i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0 -i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0 -i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0 ------------------------------------------------------------------------------------------ -Allocated/Total GPUs In Cluster: -0/3 (0%) -``` - -There are 3 available nodes with GPUs for running training jobs. - - -2\. Now we can submit a training job with the `arena` CLI; it will download the source code from github - -``` -# arena submit tf \ -             --name=tf-tensorboard \ -             --gpus=1 \ -             --image=tensorflow/tensorflow:1.5.0-devel-gpu \ -             --env=TEST_TMPDIR=code/tensorflow-sample-code/ \ -             --syncMode=git \ -             --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \ -             --tensorboard \ -             --logdir=/training_logs \ -             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000" -configmap/tf-tensorboard-tfjob created -configmap/tf-tensorboard-tfjob labeled -service/tf-tensorboard-tensorboard created -deployment.extensions/tf-tensorboard-tensorboard created -tfjob.kubeflow.org/tf-tensorboard created -INFO[0001] The Job tf-tensorboard has been submitted successfully -INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status -``` - -> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify one with `--workingDir`. - -> `logdir` indicates where tensorboard reads the TensorFlow event logs - -3\. List all the jobs - -``` -# arena list -NAME            STATUS   TRAINER  AGE  NODE -tf-tensorboard  RUNNING  TFJOB    0s   192.168.1.119 -``` - -4\. Check the resource usage of the job - -``` -# arena top job -NAME            STATUS   TRAINER  AGE  NODE           GPU(Requests)  GPU(Allocated) -tf-tensorboard  RUNNING  TFJOB    26s  192.168.1.119  1              1 - - -Total Allocated GPUs of Training Job: -0 - -Total Requested GPUs of Training Job: -1 -``` - - - -5\.
Check the resource usage of the cluster - - -``` -# arena top node -NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated) -i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0 -i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           1 -i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0 -i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0 -i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0 -i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0 ------------------------------------------------------------------------------------------ -Allocated/Total GPUs In Cluster: -1/3 (33%) -``` - - -6\. Get the details of the specific job - -``` -# arena get tf-tensorboard -NAME            STATUS   TRAINER  AGE  INSTANCE                               NODE -tf-tensorboard  RUNNING  tfjob    15s  tf-tensorboard-tfjob-586fcf4d6f-vtlxv  192.168.1.119 -tf-tensorboard  RUNNING  tfjob    15s  tf-tensorboard-tfjob-worker-0          192.168.1.119 - -Your tensorboard will be available on: -192.168.1.117:30670 -``` - -> Notice: you can access the tensorboard by using `192.168.1.117:30670`. You can consider `sshuttle` if you can't access the tensorboard directly from your laptop. For example: `sshuttle -r root@47.89.59.51 192.168.0.0/16` - - -![](2-tensorboard.jpg) - -Congratulations! You've run the training job with `arena` successfully, and you can also check the tensorboard easily. \ No newline at end of file diff --git a/archived/docs/userguide/20-pytorchjob-distributed-data.md b/archived/docs/userguide/20-pytorchjob-distributed-data.md deleted file mode 100644 index a0aa16dd4..000000000 --- a/archived/docs/userguide/20-pytorchjob-distributed-data.md +++ /dev/null @@ -1,123 +0,0 @@ -This example shows how to use `Arena` to submit a pytorch distributed job and mount an NFS data volume. The sample downloads the source code from a git URL. - -1. Set up an NFS server (refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html). - ```shell - # install the nfs server - ➜ yum install nfs-utils -y - # create the local directory for the NFS server - ➜ mkdir -p /root/nfs/data - # configure the nfs server - ➜ cat /etc/exports - /root/nfs/data *(rw,no_root_squash) - # start the nfs server - ➜ systemctl start nfs; systemctl start rpcbind - ➜ systemctl enable nfs - Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service. - ``` -2. Download the training data to the NFS shared directory. - ```shell - # get information about the NFS server with showmount; 172.16.0.200 is the host IP of the NFS server - ➜ showmount -e 172.16.0.200 - Export list for 172.16.0.200: - /root/nfs/data * - # enter the shared directory - ➜ cd /root/nfs/data - # put the training data in the shared directory - ➜ pwd - /root/nfs/data - # MNIST -> that's the training data we need - ➜ ll - total 8.0K - drwxr-xr-x 4  502 games 4.0K 6月  17 16:05 data - drwxr-xr-x 4 root root  4.0K 6月  23 15:17 MNIST - ``` -3. Create the PV. - ```shell - # Note: copy-pasting may break the yaml indentation - ➜ cat nfs-pv.yaml - apiVersion: v1 - kind: PersistentVolume - metadata: - name: pytorchdata - labels: - pytorchdata: nas-mnist - spec: - persistentVolumeReclaimPolicy: Retain - capacity: - storage: 10Gi - accessModes: - - ReadWriteMany - nfs: - server: 172.16.0.200 - path: "/root/nfs/data" - - ➜ kubectl create -f nfs-pv.yaml - persistentvolume/pytorchdata created - ➜ kubectl get pv | grep pytorchdata - pytorchdata   10Gi   RWX   Retain   Bound   default/pytorchdata   7m38s - ```
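   > Before binding the PV with a PVC, you can sanity-check that the export is mountable from any worker node (a quick sketch; it assumes `nfs-utils` is installed on that node): - ```shell - # mount the export temporarily and list the training data - ➜ mount -t nfs 172.16.0.200:/root/nfs/data /mnt - ➜ ls /mnt - data  MNIST - ➜ umount /mnt - ```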
 -4. Create the PVC. - ```shell - ➜ cat nfs-pvc.yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - name: pytorchdata - annotations: - description: "this is the mnist demo" - owner: Tom - spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 5Gi - selector: - matchLabels: - pytorchdata: nas-mnist - - ➜ kubectl create -f nfs-pvc.yaml - persistentvolumeclaim/pytorchdata created - ➜ kubectl get pvc | grep pytorchdata - pytorchdata   Bound   pytorchdata   10Gi   RWX   2m3s - ``` -5. Check the data volume. - ```shell - ➜ arena data list - NAME         ACCESSMODE     DESCRIPTION             OWNER  AGE - pytorchdata  ReadWriteMany  this is the mnist demo  Tom    2m - ``` -6. Submit the pytorch job, mounting the distributed storage volume with `--data pvc_name:container_path`. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-data \ -         --gpus=1 \ -         --workers=2 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --data=pytorchdata:/mnist_data \ -         "python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data" - configmap/pytorch-data-pytorchjob created - configmap/pytorch-data-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-data created - INFO[0000] The Job pytorch-data has been submitted successfully - INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status - ``` -7. Get the status of the volume `pytorchdata` in one of the instances with `kubectl describe`. - ```shell - # Get the details of this job - ➜ arena get pytorch-data - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 56s - - NAME          STATUS     TRAINER     AGE  INSTANCE               NODE - pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-master-0  172.16.0.210 - pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-worker-0  172.16.0.210 - - # Get the status of volume `pytorchdata` from `pytorch-data-master-0` - ➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3 - ``` - ![](20-pytorchjob-distributed-data.png) \ No newline at end of file diff --git a/archived/docs/userguide/20-pytorchjob-distributed-data.png b/archived/docs/userguide/20-pytorchjob-distributed-data.png deleted file mode 100644 index 7731a3c1c..000000000 Binary files a/archived/docs/userguide/20-pytorchjob-distributed-data.png and /dev/null differ diff --git a/archived/docs/userguide/21-pytorchjob-with-node-selector.md b/archived/docs/userguide/21-pytorchjob-with-node-selector.md deleted file mode 100644 index 4f93b0842..000000000 --- a/archived/docs/userguide/21-pytorchjob-with-node-selector.md +++ /dev/null @@ -1,54 +0,0 @@ -## Arena supports assigning pytorch jobs to particular k8s nodes - -1. Get the k8s cluster information: - ```shell - ➜ kubectl get nodes - NAME                        STATUS   ROLES    AGE     VERSION - cn-huhehaote.172.16.0.205   Ready    master   4h19m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.206   Ready    master   4h18m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.207   Ready    master   4h17m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.208   Ready    <none>   4h13m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.209   Ready    <none>   4h13m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.210   Ready    <none>   4h13m   v1.16.9-aliyun.1 - ``` -2. Give some nodes a label, for example: - ```shell - # label 172.16.0.208 with gpu_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=ok - node/cn-huhehaote.172.16.0.208 labeled - # label 172.16.0.209 with gpu_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=ok - node/cn-huhehaote.172.16.0.209 labeled - # label 172.16.0.210 with ssd_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=ok - node/cn-huhehaote.172.16.0.210 labeled - ```
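   > You can list the nodes matching a label exactly the way the scheduler will, which is a handy check before submitting (not part of the original guide): - ```shell - ➜ kubectl get nodes -l gpu_node=ok - ```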
 -3. When submitting a pytorch job, you can use `--selector` to decide which nodes the job runs on - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-selector \ -         --gpus=1 \ -         --workers=2 \ -         --selector gpu_node=ok \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         "python /root/code/mnist-pytorch/mnist.py --backend gloo" - configmap/pytorch-selector-pytorchjob created - configmap/pytorch-selector-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-selector created - INFO[0000] The Job pytorch-selector has been submitted successfully - INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status - ``` -4. Get the job details. You can see that the job runs only on the node with IP 172.16.0.209, which carries the label `gpu_node=ok`. - ```shell - ➜ arena get pytorch-selector - STATUS: PENDING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 14s - - NAME              STATUS   TRAINER     AGE  INSTANCE                   NODE - pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-master-0  172.16.0.209 - pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-worker-0  172.16.0.209 - ``` \ No newline at end of file diff --git a/archived/docs/userguide/22-pytorchjob-with-node-toleration.md b/archived/docs/userguide/22-pytorchjob-with-node-toleration.md deleted file mode 100644 index a69c9ea94..000000000 --- a/archived/docs/userguide/22-pytorchjob-with-node-toleration.md +++ /dev/null @@ -1,96 +0,0 @@ -## Arena supports submitting a pytorch job that tolerates k8s node taints - -1. Get the k8s cluster information: - ```shell - ➜ kubectl get node - NAME                        STATUS   ROLES    AGE     VERSION - cn-huhehaote.172.16.0.205   Ready    master   5h13m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.206   Ready    master   5h12m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.207   Ready    master   5h11m   v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.208   Ready    <none>   5h7m    v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.209   Ready    <none>   5h7m    v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.210   Ready    <none>   5h7m    v1.16.9-aliyun.1 - ``` -2. Add some taints to k8s nodes, for example: - ```shell - # taint --> gpu_node - ➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.208 tainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.209 tainted - # taint --> ssd_node - ➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.210 tainted - ``` -3. If we tainted the wrong nodes or want to restore a node's schedulability, we can remove the taints with the following commands: - ```shell - ➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node- - node/cn-huhehaote.172.16.0.208 untainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node- - node/cn-huhehaote.172.16.0.209 untainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node- - node/cn-huhehaote.172.16.0.210 untainted - ```
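   > At any point you can review which taints are currently active on the nodes (a quick check using kubectl's custom columns): - ```shell - ➜ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints - ```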
 -4. When submitting a job, you can use the `--toleration` option to tolerate nodes with matching taints, for example `--toleration=gpu_node`. This parameter can be used multiple times with different taint keys. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-toleration \ -         --gpus=1 \ -         --workers=2 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --tensorboard \ -         --logdir=/root/logs \ -         --toleration gpu_node \ -         "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs" - configmap/pytorch-toleration-pytorchjob created - configmap/pytorch-toleration-pytorchjob labeled - service/pytorch-toleration-tensorboard created - deployment.apps/pytorch-toleration-tensorboard created - pytorchjob.kubeflow.org/pytorch-toleration created - INFO[0000] The Job pytorch-toleration has been submitted successfully - INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status - ``` -5. Get the details of this job. - ```shell - arena get pytorch-toleration - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 2m - - NAME                STATUS   TRAINER     AGE  INSTANCE                     NODE - pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-master-0  172.16.0.209 - pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-worker-0  172.16.0.209 - - Your tensorboard will be available on: - http://172.16.0.205:32091 - ``` -6. You can use `--toleration all` to tolerate all node taints. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-toleration-all \ -         --gpus=1 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --toleration all \ -         "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo" - configmap/pytorch-toleration-all-pytorchjob created - configmap/pytorch-toleration-all-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-toleration-all created - INFO[0000] The Job pytorch-toleration-all has been submitted successfully - INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status - ``` -7. Get the details of this job. - ```shell - ➜ arena get pytorch-toleration-all - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 33s - - NAME                    STATUS   TRAINER     AGE  INSTANCE                         NODE - pytorch-toleration-all  RUNNING  PYTORCHJOB  33s  pytorch-toleration-all-master-0  172.16.0.210 - ``` \ No newline at end of file diff --git a/archived/docs/userguide/23-pytorchjob-assign-config-file.md b/archived/docs/userguide/23-pytorchjob-assign-config-file.md deleted file mode 100644 index d7744464b..000000000 --- a/archived/docs/userguide/23-pytorchjob-assign-config-file.md +++ /dev/null @@ -1,49 +0,0 @@ -## Assign configuration files for pytorch jobs - -You can pass configuration files to containers when submitting jobs. - -1. Prepare the configuration file on the machine from which you submit the job. - ```shell - # prepare your config-file - ➜ cat /tmp/test-config.json - { -     "key": "job-config" - } - ``` -2. Submit the job, specifying the configuration file to mount with `--config-file`.
 - ```shell - # arena submits the job with --config-file ${host-config-file}:${container-config-file} - # this parameter can be used multiple times to mount multiple configuration files - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-config-file \ -         --gpus=1 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --config-file /tmp/test-config.json:/etc/config/config.json \ -         "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo" - configmap/pytorch-config-file-pytorchjob created - configmap/pytorch-config-file-pytorchjob labeled - configmap/pytorch-config-file-a9cbad1b8719778 created - pytorchjob.kubeflow.org/pytorch-config-file created - INFO[0000] The Job pytorch-config-file has been submitted successfully - INFO[0000] You can run `arena get pytorch-config-file --type pytorchjob` to check the job status - ``` -3. Get the details of this job. - ```shell - ➜ arena get pytorch-config-file --type pytorchjob - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 51s - - NAME                 STATUS   TRAINER     AGE  INSTANCE                      NODE - pytorch-config-file  RUNNING  PYTORCHJOB  51s  pytorch-config-file-master-0  172.16.0.210 - ``` -4. Use kubectl to check whether the file is in the container: - ``` - ➜ kubectl exec -ti pytorch-config-file-master-0 -- cat /etc/config/config.json - { -     "key": "job-config" - } - ``` \ No newline at end of file diff --git a/archived/docs/userguide/24-pytorchjob-preempted.md b/archived/docs/userguide/24-pytorchjob-preempted.md deleted file mode 100644 index 11b155ccb..000000000 --- a/archived/docs/userguide/24-pytorchjob-preempted.md +++ /dev/null @@ -1,130 +0,0 @@ -## Arena supports priority and preemption for pytorch jobs - -1. Create the `PriorityClass` objects with the yaml below. Two priorities are defined here: `critical` and `medium`. - ```shell - # declarations of critical and medium - ➜ cat priorityClass.yaml - apiVersion: scheduling.k8s.io/v1beta1 - description: Used for the critical app - kind: PriorityClass - metadata: - name: critical - value: 1100000 - - --- - - apiVersion: scheduling.k8s.io/v1beta1 - description: Used for the medium app - kind: PriorityClass - metadata: - name: medium - value: 1000000 - - # Create two priority objects: critical and medium - ➜ kubectl create -f priorityClass.yaml - priorityclass.scheduling.k8s.io/critical created - priorityclass.scheduling.k8s.io/medium created - ``` -2. Check the available resources. There are 3 GPU nodes in total, and each node has 4 gpu cards. - ```shell - ➜ arena top node - NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated) - cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0 - cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0 - cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0 - cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0 - cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0 - cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ```
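   > Before submitting anything, you can confirm that the two `PriorityClass` objects from step 1 exist: - ```shell - ➜ kubectl get priorityclass critical medium - NAME       VALUE     GLOBAL-DEFAULT   AGE - critical   1100000   false            1m - medium     1000000   false            1m - ```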
 -3. Submit a `medium`-priority GPU job that uses 3 nodes with 4 cards each, occupying all of the GPU resources. To make the effect easier to observe, we increase the number of training epochs so that the job runs longer. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-priority-medium \ -         --gpus=4 \ -         --workers=3 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --priority=medium \ -         "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200" - configmap/pytorch-priority-medium-pytorchjob created - configmap/pytorch-priority-medium-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-priority-medium created - INFO[0000] The Job pytorch-priority-medium has been submitted successfully - INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status - ``` -4. Get the details of this job. You can see that it is running. - ```shell - ➜ arena get pytorch-priority-medium - STATUS: RUNNING - NAMESPACE: default - PRIORITY: medium - TRAINING DURATION: 3m - - NAME                     STATUS   TRAINER     AGE  INSTANCE                          NODE - pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-master-0  172.16.0.208 - pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-0  172.16.0.210 - pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-1  172.16.0.209 - ``` -5. Check the GPU card usage: all cards are occupied. - ```shell - ➜ arena top node - NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated) - cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0 - cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0 - cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0 - cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           4 - cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           4 - cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           4 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 12/12 (100%) - ``` -6. Submit a job with `critical` priority to trigger preemption. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-priority-critical \ -         --gpus=1 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --priority=critical \ -         "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50" - configmap/pytorch-priority-critical-pytorchjob created - configmap/pytorch-priority-critical-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-priority-critical created - INFO[0000] The Job pytorch-priority-critical has been submitted successfully - INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status - ``` -7. Get the details of this job. - ```shell - ➜ arena get pytorch-priority-critical - STATUS: RUNNING - NAMESPACE: default - PRIORITY: critical - TRAINING DURATION: 22s - - NAME                       STATUS   TRAINER     AGE  INSTANCE                            NODE - pytorch-priority-critical  RUNNING  PYTORCHJOB  22s  pytorch-priority-critical-master-0  172.16.0.208 - ``` -8. Check the status of the `medium`-priority job. It has become `FAILED`, and one of its instances has been deleted due to preemption.
 - ```shell - ➜ arena get pytorch-priority-medium - STATUS: FAILED - NAMESPACE: default - PRIORITY: medium - TRAINING DURATION: 1m - - NAME                     STATUS  TRAINER     AGE  INSTANCE                          NODE - pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-master-0  172.16.0.210 - pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-worker-0  172.16.0.209 - ``` -9. Check the events of `pytorch-priority-medium`, and you can see that its instance `pytorch-priority-medium-worker-1` has been evicted. The reason is that `pytorch-priority-critical-master-0` requested the resources of the same node, and the node had no spare GPU resources, so the low-priority job was preempted by the high-priority job. - ```shell - ➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1 - ``` - ![](24-pytorchjob-preempted.png) \ No newline at end of file diff --git a/archived/docs/userguide/24-pytorchjob-preempted.png b/archived/docs/userguide/24-pytorchjob-preempted.png deleted file mode 100644 index 2ae8978b6..000000000 Binary files a/archived/docs/userguide/24-pytorchjob-preempted.png and /dev/null differ diff --git a/archived/docs/userguide/25-pytorchjob-clean-pod-policy.md b/archived/docs/userguide/25-pytorchjob-clean-pod-policy.md deleted file mode 100644 index b46aab20a..000000000 --- a/archived/docs/userguide/25-pytorchjob-clean-pod-policy.md +++ /dev/null @@ -1,40 +0,0 @@ -## Specify the pod clean-up policy for finished pytorch jobs - -1. Submit a job with `--clean-task-policy` set to `All`. After the job finishes (`SUCCEEDED` or `FAILED`), all of its instances (pods) will be deleted; the default is `None`, which retains all pods. - ```shell - ➜ arena --loglevel info submit pytorch \ -         --name=pytorch-clean-policy \ -         --gpus=1 \ -         --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ -         --sync-mode=git \ -         --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -         --clean-task-policy=All \ -         "python /root/code/mnist-pytorch/mnist.py --backend gloo" - configmap/pytorch-clean-policy-pytorchjob created - configmap/pytorch-clean-policy-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-clean-policy created - INFO[0000] The Job pytorch-clean-policy has been submitted successfully - INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status - ``` - -2. Get the job details. After the job finishes, the instance `pytorch-clean-policy-master-0` is deleted. - ```shell - # RUNNING - ➜ arena get pytorch-clean-policy - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 18s - - NAME                  STATUS   TRAINER     AGE  INSTANCE                       NODE - pytorch-clean-policy  RUNNING  PYTORCHJOB  18s  pytorch-clean-policy-master-0  172.16.0.209 - - # FINISHED - ➜ arena get pytorch-clean-policy - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 37s - - NAME  STATUS  TRAINER  AGE  INSTANCE  NODE - ```
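   > A quick way to confirm that the pods were actually removed (not part of the original guide): - ```shell - ➜ kubectl get pods | grep pytorch-clean-policy || echo "no pods left" - no pods left - ```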
 \ No newline at end of file diff --git a/archived/docs/userguide/26-submitjob-with-private-registry.md b/archived/docs/userguide/26-submitjob-with-private-registry.md deleted file mode 100644 index ac749d830..000000000 --- a/archived/docs/userguide/26-submitjob-with-private-registry.md +++ /dev/null @@ -1,168 +0,0 @@ -# Submit the training jobs with ImagePullSecrets - -You can use a private registry when submitting jobs (including tensorboard images). -Assume the following images are in your private registry. -```shell -# pytorch -registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime -# tf -registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu -# mpi -registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 -# tensorboard (--tensorboard-image) -registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel -``` - -## Contents -* Create ImagePullSecrets -* TFJob With Secret -* MPIJob With Secret -* PyTorchJob With Secret -* Load imagePullSecrets from the Arena configuration - - -## Create ImagePullSecrets -* Create a [Secret](https://kubernetes.io/docs/concepts/configuration/secret/) with kubectl. In this case, it's an [imagePullSecret](https://kubernetes.io/docs/concepts/containers/images/). - ```shell script - kubectl create secret docker-registry [$Reg_Secret] --docker-server=[$Registry] --docker-username=[$Username] --docker-password=[$Password] --docker-email=[$Email] - ``` - > Note: - > [$Reg_Secret] is the name of the secret; you can choose it yourself. - > [$Registry] is your private registry address. - > [$Username] is the username of your private registry. - > [$Password] is the password of your private registry. - > [$Email] is your email address (optional). - - For example: - ```shell - kubectl create secret docker-registry \ -     lumo-secret \ -     --docker-server=registry.cn-huhehaote.aliyuncs.com \ -     --docker-username=******@test.aliyunid.com \ -     --docker-password=****** - secret/lumo-secret created - ``` - You can check that the secret was created. - ```shell - # kubectl get secrets | grep lumo-secret - lumo-secret   kubernetes.io/dockerconfigjson   1   52s - ``` - -## TFJob With Secret -Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets. -1. Submit the tf job. - ```shell - arena submit tf \ -     --name=tf-git-with-secret \ -     --working-dir=/root \ -     --gpus=1 \ -     --image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu \ -     --sync-mode=git \ -     --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \ -     --data=training-data:/mnist_data \ -     --tensorboard \ -     --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \ -     --logdir=/mnist_data/tf_data/logs \ -     --image-pull-secrets=lumo-secret \ -     "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --log_dir /mnist_data/tf_data/logs --data_dir /mnist_data/tf_data/" - ``` - > Note: - > If you have several `imagePullSecrets` to use, you can pass `--image-pull-secrets` multiple times. - ```shell - arena submit tf \ -     --name=tf-git-with-secret \ -     ... \ -     --image-pull-secrets=lumo-secret \ -     --image-pull-secrets=king-secret \ -     --image-pull-secrets=test-secret -     ... - ``` -2. Get the details of the job. - ```shell - # arena get tf-git-with-secret - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 17s - - NAME                STATUS   TRAINER  AGE  INSTANCE                    NODE - tf-git-with-secret  RUNNING  TFJOB    17s  tf-git-with-secret-chief-0  172.16.0.202 - - Your tensorboard will be available on: - http://172.16.0.198:30080 - ```
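   > To verify that the secret was attached to the pods, you can read it back from the pod spec (a quick check; the pod name comes from the output above): - ```shell - kubectl get pod tf-git-with-secret-chief-0 -o jsonpath='{.spec.imagePullSecrets}' - ```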
 - -## MPIJob With Secret -Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets. -1. Submit the mpi job. - ```shell - arena submit mpi \ -     --name=mpi-dist-with-secret \ -     --gpus=1 \ -     --workers=2 \ -     --image=registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \ -     --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \ -     --sync-mode=git \ -     --sync-source=https://github.com/tensorflow/benchmarks.git \ -     --tensorboard \ -     --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \ -     --image-pull-secrets=lumo-secret \ -     "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10" - ``` -2. Get the details of the job. - ```shell - # arena get mpi-dist-with-secret - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 9m - - NAME                  STATUS   TRAINER  AGE  INSTANCE                             NODE - mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-launcher-v8sgt  172.16.0.201 - mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-worker-0        172.16.0.201 - mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-worker-1        172.16.0.202 - - Your tensorboard will be available on: - http://172.16.0.198:30450 - ``` - -## PyTorchJob With Secret -Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets. -1. Submit the pytorch job. - ```shell - arena submit pytorch \ -     --name=pytorch-git-with-secret \ -     --gpus=1 \ -     --working-dir=/root \ -     --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime \ -     --sync-mode=git \ -     --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ -     --data=training-data:/mnist_data \ -     --tensorboard \ -     --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \ -     --logdir=/mnist_data/pytorch_data/logs \ -     --image-pull-secrets=lumo-secret \ -     "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend nccl --dir /mnist_data/pytorch_data/logs --data /mnist_data/pytorch_data/" - ``` -2. Get the details of the job. - ```shell - # arena get pytorch-git-with-secret - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 2m - - NAME                     STATUS   TRAINER     AGE  INSTANCE                          NODE - pytorch-git-with-secret  RUNNING  PYTORCHJOB  2m   pytorch-git-with-secret-master-0  172.16.0.202 - - Your tensorboard will be available on: - http://172.16.0.198:31155 - ``` -## Load imagePullSecrets from the Arena configuration -If you don't want to pass `--image-pull-secrets` every time you submit a job, you can set it in the Arena configuration instead. -Open the file `~/.arena/config` (create it if it does not exist) and fill in the following configuration. -```shell -imagePullSecrets=lumo-secret,king-secret -``` -> Note: -> `--image-pull-secrets` will overwrite `~/.arena/config`.
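 -For instance, the configuration file can be created in one step (a sketch; it assumes the two secrets above already exist): -```shell -mkdir -p ~/.arena -cat > ~/.arena/config <<'EOF' -imagePullSecrets=lumo-secret,king-secret -EOF -```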
 diff --git a/archived/docs/userguide/27-kfserving-custom.jpg b/archived/docs/userguide/27-kfserving-custom.jpg deleted file mode 100644 index 62de5471d..000000000 Binary files a/archived/docs/userguide/27-kfserving-custom.jpg and /dev/null differ diff --git a/archived/docs/userguide/27-kfserving-custom.md b/archived/docs/userguide/27-kfserving-custom.md deleted file mode 100644 index b50429dbd..000000000 --- a/archived/docs/userguide/27-kfserving-custom.md +++ /dev/null @@ -1,62 +0,0 @@ -This guide walks through the steps to deploy and serve a custom model with KFServing. - -1. Setup - -Follow the KFServing [guide](https://github.com/kubeflow/kfserving#install-kfserving) to install KFServing. As for the prerequisites, you should ensure that 8G of memory and 4 CPU cores are available in your environment. - -2. Submit your serving job to KFServing -```shell script -arena serve kfserving --name=max-object-detector --port=5000 --image=codait/max-object-detector --model-type=custom -configmap/max-object-detector-202008221942-kfserving created -configmap/max-object-detector-202008221942-kfserving labeled -inferenceservice.serving.kubeflow.org/max-object-detector-202008221942 created -``` -3. List the serving job you just deployed -```shell script -arena serve list -NAME                 TYPE       VERSION       DESIRED  AVAILABLE  ENDPOINT_ADDRESS  PORTS -max-object-detector  KFSERVING  202008221942  1        1          10.97.52.65       http:80 -``` -4. Test the model service -##### Determine the ingress IP and ports -The first step is to [determine the ingress IP](https://github.com/kubeflow/kfserving/blob/master/README.md#determine-the-ingress-ip-and-ports) and ports, and set INGRESS_HOST and INGRESS_PORT. - -This example uses the [codait/max-object-detector](https://github.com/IBM/MAX-Object-Detector) image. The Max Object Detector api server expects a POST request to the /model/predict endpoint that includes an image multipart/form-data and an optional threshold query string. - -```shell script -MODEL_NAME=max-object-detector-202008221942 -SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3) -INGRESS_HOST=localhost -INGRESS_PORT=80 -curl -v -F "image=@27-kfserving-custom.jpg" http://${INGRESS_HOST}:${INGRESS_PORT}/model/predict -H "Host: ${SERVICE_HOSTNAME}" -* Trying ::1... -* TCP_NODELAY set -* Connected to localhost (::1) port 80 (#0) -> POST /model/predict HTTP/1.1 -> Host: max-object-detector-202008221942.default.example.com -> User-Agent: curl/7.64.1 -> Accept: */* -> Content-Length: 125769 -> Content-Type: multipart/form-data; boundary=------------------------56b67bc60fc7bdc7 -> Expect: 100-continue -> -< HTTP/1.1 100 Continue -* We are completely uploaded and fine -< HTTP/1.1 200 OK -< content-length: 380 -< content-type: application/json -< date: Sun, 23 Aug 2020 03:27:14 GMT -< server: istio-envoy -< x-envoy-upstream-service-time: 3566 -< -{"status": "ok", "predictions": [{"label_id": "1", "label": "person", "probability": 0.9440352320671082, "detection_box": [0.12420991063117981, 0.12507185339927673, 0.8423266410827637, 0.5974075794219971]}, {"label_id": "18", "label": "dog", "probability": 0.8645510673522949, "detection_box": [0.10447663068771362, 0.17799144983291626, 0.8422801494598389, 0.7320016026496887]}]} -* Connection #0 to host localhost left intact -* Closing connection 0 -``` -5. Delete the serving job -```shell script -arena serve delete max-object-detector --version=202008221942 -inferenceservice.serving.kubeflow.org "max-object-detector-202008221942" deleted -configmap "max-object-detector-202008221942-kfserving" deleted -INFO[0001] The Serving job max-object-detector with version 202008221942 has been deleted successfully -```
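 -Step 4 above assumes INGRESS_HOST and INGRESS_PORT are already set. On clusters where the istio ingress gateway is exposed as a LoadBalancer, they can be derived as follows (a sketch following the KFServing docs; the values differ per cluster): -```shell script -INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}') -INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}') -```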
 \ No newline at end of file diff --git a/archived/docs/userguide/28-elastictraining-tensorflow2-mnist.md b/archived/docs/userguide/28-elastictraining-tensorflow2-mnist.md deleted file mode 100644 index f21f082a4..000000000 --- a/archived/docs/userguide/28-elastictraining-tensorflow2-mnist.md +++ /dev/null @@ -1,175 +0,0 @@ -This guide walks through the steps to submit an elastic training job with horovod. - -1. Build an image for the training environment. -You can use the [registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1]() image directly. -In addition, you can also build your own image with the help of the document [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image). - -2. Submit an elastic training job. The example code comes from [tensorflow2_mnist_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/tensorflow2_mnist_elastic.py). - ```shell script - arena submit etjob \ -     --name=elastic-training \ -     --gpus=1 \ -     --workers=3 \ -     --max-workers=9 \ -     --min-workers=1 \ -     --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \ -     --working-dir=/examples \ -     "horovodrun -     -np \$((\${workers}*\${gpus})) -     --min-np \$((\${minWorkers}*\${gpus})) -     --max-np \$((\${maxWorkers}*\${gpus})) -     --host-discovery-script /usr/local/bin/discover_hosts.sh -     python /examples/elastic/tensorflow2_mnist_elastic.py -     " - ``` - Output: - ``` - configmap/elastic-training-etjob created - configmap/elastic-training-etjob labeled - trainingjob.kai.alibabacloud.com/elastic-training created - INFO[0000] The Job elastic-training has been submitted successfully - INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status - ``` - -3. List your job. - ```shell script - arena list - ``` - Output: - ``` - NAME              STATUS   TRAINER  AGE  NODE - elastic-training  RUNNING  ETJOB    52s  192.168.0.116 - ``` - -4. Get your job details. - ```shell script - arena get elastic-training - ``` - Output: - ``` - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 1m - - NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE - elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116 - elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114 - elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116 - elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116 - ``` -5. Check the logs. - ```shell script - arena logs elastic-training --tail 10 - ``` - Output: - ``` - Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2170	Loss: 0.021992 - Tue Sep  8 08:32:50 2020[0]<stdout>:Step #2180	Loss: 0.000902 - Tue Sep  8 08:32:50 2020[1]<stdout>:Step #2180	Loss: 0.023190 - Tue Sep  8 08:32:50 2020[2]<stdout>:Step #2180	Loss: 0.013149 - Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2190	Loss: 0.029536 - Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2190	Loss: 0.017537 - Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2190	Loss: 0.018273 - Tue Sep  8 08:32:51 2020[2]<stdout>:Step #2200	Loss: 0.038399 - Tue Sep  8 08:32:51 2020[0]<stdout>:Step #2200	Loss: 0.007017 - Tue Sep  8 08:32:51 2020[1]<stdout>:Step #2200	Loss: 0.017495 - ``` - - -6. Scale out your job. This will add one worker to the job. - ```shell script - arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m - ``` - Output: - ``` - configmap/elastic-training-1599548177-scaleout created - configmap/elastic-training-1599548177-scaleout labeled - scaleout.kai.alibabacloud.com/elastic-training-1599548177 created - INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully - ``` - -7. Get your job details. We can see that the new worker (elastic-training-worker-3) is "RUNNING".
 - ```shell script - arena get elastic-training - ``` - Output: - ``` - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 2m - - NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE - elastic-training  RUNNING  ETJOB    2m   elastic-training-launcher  192.168.0.116 - elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-0  192.168.0.114 - elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-1  192.168.0.116 - elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-2  192.168.0.116 - elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-3  192.168.0.117 - ``` - -8. Check the logs. - ```shell script - arena logs elastic-training --tail 10 - ``` - Output: - ``` - Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3140	Loss: 0.014412 - Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3140	Loss: 0.004425 - Tue Sep  8 08:33:33 2020[3]<stdout>:Step #3150	Loss: 0.000513 - Tue Sep  8 08:33:33 2020[2]<stdout>:Step #3150	Loss: 0.062282 - Tue Sep  8 08:33:33 2020[1]<stdout>:Step #3150	Loss: 0.020650 - Tue Sep  8 08:33:33 2020[0]<stdout>:Step #3150	Loss: 0.008056 - Tue Sep  8 08:33:34 2020[3]<stdout>:Step #3160	Loss: 0.002170 - Tue Sep  8 08:33:34 2020[2]<stdout>:Step #3160	Loss: 0.009676 - Tue Sep  8 08:33:34 2020[1]<stdout>:Step #3160	Loss: 0.051425 - Tue Sep  8 08:33:34 2020[0]<stdout>:Step #3160	Loss: 0.023769 - ``` - -9. Scale in your job. This will remove one worker from the job. - ```shell script - arena scalein etjob --name="elastic-training" --count=1 --timeout=1m - ``` - Output: - ``` - configmap/elastic-training-1599554041-scalein created - configmap/elastic-training-1599554041-scalein labeled - scalein.kai.alibabacloud.com/elastic-training-1599554041 created - INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully - ``` - -10. Get your job details. We can see that `elastic-training-worker-3` has been removed. - ```shell script - arena get elastic-training - ``` - Output: - ``` - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 3m - - NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE - elastic-training  RUNNING  ETJOB    3m   elastic-training-launcher  192.168.0.116 - elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-0  192.168.0.114 - elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-1  192.168.0.116 - elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-2  192.168.0.116 - ``` - -11. Check the logs. - ```shell script - arena logs elastic-training --tail 10 - ``` - Output: - ``` - Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5210	Loss: 0.005627 - Tue Sep  8 08:34:43 2020[2]<stdout>:Step #5220	Loss: 0.002142 - Tue Sep  8 08:34:43 2020[1]<stdout>:Step #5220	Loss: 0.002978 - Tue Sep  8 08:34:43 2020[0]<stdout>:Step #5220	Loss: 0.011404 - Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5230	Loss: 0.000689 - Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5230	Loss: 0.024597 - Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5230	Loss: 0.040936 - Tue Sep  8 08:34:44 2020[0]<stdout>:Step #5240	Loss: 0.000125 - Tue Sep  8 08:34:44 2020[2]<stdout>:Step #5240	Loss: 0.026498 - Tue Sep  8 08:34:44 2020[1]<stdout>:Step #5240	Loss: 0.000308 - ```
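 -Under the hood, `scaleout` and `scalein` add or remove worker pods, so you can also watch a scale operation progress with plain kubectl (a sketch): -```shell script -kubectl get pods -w | grep elastic-training-worker -```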
 \ No newline at end of file diff --git a/archived/docs/userguide/29-elastictraining-pytorch-synthetic.md b/archived/docs/userguide/29-elastictraining-pytorch-synthetic.md deleted file mode 100644 index e795f2883..000000000 --- a/archived/docs/userguide/29-elastictraining-pytorch-synthetic.md +++ /dev/null @@ -1,182 +0,0 @@ -This guide walks through the steps to submit an elastic training job with horovod. - -1. Build an image for the training environment. -You can use the [registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1]() image directly. -In addition, you can also build your own image with the help of the document [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image). - -2. Submit an elastic training job. The example code comes from [pytorch_synthetic_benchmark_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch_synthetic_benchmark_elastic.py). - ```shell script - arena submit etjob \ -     --name=elastic-training-synthetic \ -     --gpus=1 \ -     --workers=3 \ -     --max-workers=9 \ -     --min-workers=1 \ -     --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \ -     --working-dir=/examples \ -     "horovodrun -     --verbose -     --log-level=DEBUG -     -np \$((\${workers}*\${gpus})) -     --min-np \$((\${minWorkers}*\${gpus})) -     --max-np \$((\${maxWorkers}*\${gpus})) -     --start-timeout 100 -     --elastic-timeout 1000 -     --host-discovery-script /usr/local/bin/discover_hosts.sh -     python /examples/elastic/pytorch_synthetic_benchmark_elastic.py -     --num-iters=10000 -     --num-warmup-batches=0" - ``` - Output: - ``` - configmap/elastic-training-synthetic-etjob created - configmap/elastic-training-synthetic-etjob labeled - trainingjob.kai.alibabacloud.com/elastic-training-synthetic created - INFO[0000] The Job elastic-training-synthetic has been submitted successfully - INFO[0000] You can run `arena get elastic-training-synthetic --type etjob` to check the job status - ``` - -3. List your job. - ```shell script - arena list - ``` - Output: - ``` - NAME                        STATUS   TRAINER  AGE  NODE - elastic-training-synthetic  RUNNING  ETJOB    2m   192.168.0.112 - ``` - -4. Get your job details. - ```shell script - arena get elastic-training-synthetic - ``` - Output: - ``` - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 3m - - NAME                        STATUS   TRAINER  AGE  INSTANCE                             NODE - elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-launcher  192.168.0.112 - elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-0  192.168.0.116 - elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-1  192.168.0.117 - elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-2  192.168.0.116 - ``` - -5. Check the logs. - ```shell script - arena logs elastic-training-synthetic --tail 10 - ``` - Output: - ``` - Tue Sep  8 09:24:20 2020[0]<stdout>:Iter #54: 95.3 img/sec per GPU - Tue Sep  8 09:24:23 2020[0]<stdout>:Iter #55: 95.3 img/sec per GPU - Tue Sep  8 09:24:27 2020[0]<stdout>:Iter #56: 94.6 img/sec per GPU - Tue Sep  8 09:24:30 2020[0]<stdout>:Iter #57: 97.1 img/sec per GPU - Tue Sep  8 09:24:33 2020[0]<stdout>:Iter #58: 99.7 img/sec per GPU - Tue Sep  8 09:24:36 2020[0]<stdout>:Iter #59: 99.8 img/sec per GPU - Tue Sep  8 09:24:40 2020[0]<stdout>:Iter #60: 98.0 img/sec per GPU - Tue Sep  8 09:24:43 2020[0]<stdout>:Iter #61: 97.1 img/sec per GPU - Tue Sep  8 09:24:46 2020[0]<stdout>:Iter #62: 96.1 img/sec per GPU - Tue Sep  8 09:24:50 2020[0]<stdout>:Iter #63: 100.4 img/sec per GPU - ``` - - -6. Scale out your job. This will add one worker to the job. - ```shell script - arena scaleout etjob --name="elastic-training-synthetic" --count=1 --timeout=1m - ``` - Output: - ``` - configmap/elastic-training-synthetic-1599557124-scaleout created - configmap/elastic-training-synthetic-1599557124-scaleout labeled - scaleout.kai.alibabacloud.com/elastic-training-synthetic-1599557124 created - INFO[0000] The scaleout job elastic-training-synthetic-1599557124 has been submitted successfully - ``` - -7. Get your job details. We can see that the new worker (elastic-training-synthetic-worker-3) is "RUNNING".
- ```shell script
- arena get elastic-training-synthetic
- ```
- Output:
- ```
- STATUS: RUNNING
- NAMESPACE: default
- PRIORITY: N/A
- TRAINING DURATION: 5m
-
- NAME STATUS TRAINER AGE INSTANCE NODE
- elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-launcher 192.168.0.112
- elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-0 192.168.0.116
- elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-1 192.168.0.117
- elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-2 192.168.0.116
- elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-3 192.168.0.112
- ```
-
-8. Check logs.
- ```shell script
- arena logs elastic-training-synthetic --tail 10
- ```
- Output:
- ```
- Tue Sep 8 09:26:03 2020[0]:Iter #76: 65.0 img/sec per GPU
- Tue Sep 8 09:26:08 2020[0]:Iter #77: 64.0 img/sec per GPU
- Tue Sep 8 09:26:13 2020[0]:Iter #78: 65.4 img/sec per GPU
- Tue Sep 8 09:26:18 2020[0]:Iter #79: 64.4 img/sec per GPU
- Tue Sep 8 09:26:23 2020[0]:Iter #80: 62.9 img/sec per GPU
- Tue Sep 8 09:26:28 2020[0]:Iter #81: 64.0 img/sec per GPU
- Tue Sep 8 09:26:33 2020[0]:Iter #82: 64.4 img/sec per GPU
- Tue Sep 8 09:26:38 2020[0]:Iter #83: 64.9 img/sec per GPU
- Tue Sep 8 09:26:43 2020[0]:Iter #84: 62.7 img/sec per GPU
- Tue Sep 8 09:26:48 2020[0]:Iter #85: 64.2 img/sec per GPU
- ```
-
-9. Scale in your job. This removes one worker from the running job.
- ```shell script
- arena scalein etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
- ```
- Output:
- ```
- configmap/elastic-training-synthetic-1599557271-scalein created
- configmap/elastic-training-synthetic-1599557271-scalein labeled
- scalein.kai.alibabacloud.com/elastic-training-synthetic-1599557271 created
- INFO[0000] The scalein job elastic-training-synthetic-1599557271 has been submitted successfully
- ```
-
-10. Get your job details. We can see that `elastic-training-synthetic-worker-3` has been removed.
- ```shell script
- arena get elastic-training-synthetic
- ```
- Output:
- ```
- STATUS: RUNNING
- NAMESPACE: default
- PRIORITY: N/A
- TRAINING DURATION: 7m
-
- NAME STATUS TRAINER AGE INSTANCE NODE
- elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-launcher 192.168.0.112
- elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-0 192.168.0.116
- elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-1 192.168.0.117
- elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-2 192.168.0.116
- ```
-
-11. Check logs.
- ```shell script
- arena logs elastic-training-synthetic --tail 10
- ```
- Output:
- ```
- DEBUG:root:host elastic-training-synthetic-worker-3 has been blacklisted, ignoring exit from local_rank=0
- Process 3 exit with status code 134.
- Tue Sep 8 09:27:56 2020[0]:Iter #97: 96.0 img/sec per GPU
- Tue Sep 8 09:28:00 2020[0]:Iter #98: 95.4 img/sec per GPU
- Tue Sep 8 09:28:03 2020[0]:Iter #99: 96.9 img/sec per GPU
- Tue Sep 8 09:28:06 2020[0]:Iter #100: 97.2 img/sec per GPU
- Tue Sep 8 09:28:10 2020[0]:Iter #101: 98.5 img/sec per GPU
- Tue Sep 8 09:28:13 2020[0]:Iter #102: 95.8 img/sec per GPU
- Tue Sep 8 09:28:16 2020[0]:Iter #103: 97.3 img/sec per GPU
- Tue Sep 8 09:28:20 2020[0]:Iter #104: 97.3 img/sec per GPU
- Tue Sep 8 09:28:23 2020[0]:Iter #105: 98.9 img/sec per GPU
- ```
\ No newline at end of file
diff --git a/archived/docs/userguide/3-tensorboard.jpg b/archived/docs/userguide/3-tensorboard.jpg
deleted file mode 100644
index 86c2aeb05..000000000
Binary files a/archived/docs/userguide/3-tensorboard.jpg and /dev/null differ
diff --git a/archived/docs/userguide/3-tfjob-distributed.md b/archived/docs/userguide/3-tfjob-distributed.md
deleted file mode 100644
index b3cca279b..000000000
--- a/archived/docs/userguide/3-tfjob-distributed.md
+++ /dev/null
@@ -1,72 +0,0 @@
-
-
-Arena supports and simplifies distributed TensorFlow training (PS/worker mode).
-
-
-1. To run a distributed TensorFlow training job, you need to specify:
-
- - GPUs of each worker (only for GPU workloads)
- - The number of workers (required)
- - The number of PS (required)
- - The docker image of the worker (required)
- - The docker image of the PS (required)
- - The port of the worker (default is 22222)
- - The port of the PS (default is 22223)
-
-The following command is an example. It defines 2 workers and 1 PS, and each worker has 1 GPU. The source code for the worker and PS is located in git, and TensorBoard is enabled.
-
-```
-# arena submit tf \
-             --name=tf-dist-git \
-             --gpus=1 \
-             --workers=2 \
-             --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
-             --sync-mode=git \
-             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-             --ps=1 \
-             --ps-image=tensorflow/tensorflow:1.5.0-devel \
-             --tensorboard \
-             "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"
-
-configmap/tf-dist-git-tfjob created
-configmap/tf-dist-git-tfjob labeled
-service/tf-dist-git-tensorboard created
-deployment.extensions/tf-dist-git-tensorboard created
-tfjob.kubeflow.org/tf-dist-git created
-INFO[0001] The Job tf-dist-git has been submitted successfully
-INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
-```
-
-**Note**: If the job or pod fails and the logs show that the git code could not be cloned, the cause is usually cross-border network connectivity rather than arena itself, especially when running containers inside some countries such as China.
-
-2\. Get the details of the specific job
-
-```
-# arena get tf-dist-git
-NAME STATUS TRAINER AGE INSTANCE NODE
-tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-594d59789c-lrfsk 192.168.1.119
-tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-ps-0 192.168.1.118
-tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-0 192.168.1.119
-tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-1 192.168.1.120
-
-Your tensorboard will be available on:
-192.168.1.117:32298
-```
-
-3\. Check the tensorboard
-
-![](3-tensorboard.jpg)
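-
-If the TensorBoard NodePort above is not reachable from your workstation, a port-forward is a quick alternative. This is a minimal sketch, assuming the deployment name from the submit output and TensorBoard's default container port 6006:
-
-```
-# forward the tensorboard deployment to localhost
-kubectl port-forward deployment/tf-dist-git-tensorboard 6006:6006
-# then browse http://localhost:6006
-```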
-
-4\. Get the TFJob dashboard
-
-```
-# arena logviewer tf-dist-git
-Your LogViewer will be available on:
-192.168.1.120:8080/tfjobs/ui/#/default/tf-dist-git-tfjob
-```
-
-
-![](4-tfjob-logviewer-distributed.jpg)
-
-Congratulations! You've run the distributed training job with `arena` successfully.
\ No newline at end of file
diff --git a/archived/docs/userguide/30-tfjob-with-role-sequence.md b/archived/docs/userguide/30-tfjob-with-role-sequence.md
deleted file mode 100644
index c3b9814d3..000000000
--- a/archived/docs/userguide/30-tfjob-with-role-sequence.md
+++ /dev/null
@@ -1,78 +0,0 @@
-A distributed TensorFlow job can have several roles: Worker, PS, Chief and Evaluator. Sometimes you need to control the order in which they are created; for example, you may need to create the "Worker" role first and the "PS" role second. This guide shows you how.
-
-1. Assume that you want to submit a distributed TensorFlow job with four roles (Worker, PS, Chief, Evaluator), and you want them started in the sequence "Worker,Chief,PS,Evaluator". Simply add the option "--role-sequence" when submitting the job. The following command is an example:
-
-```
-$ arena submit tfjob \
---name=tf-distributed-test \
---role-sequence "Worker,Chief,PS,Evaluator" \
---chief \
---evaluator \
---gpus=1 \
---workers=1 \
---worker-image=cheyang/tf-mnist-distributed:gpu \
---ps-image=cheyang/tf-mnist-distributed:cpu \
---ps=1 \
---tensorboard \
---tensorboard-image="registry.cn-hongkong.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel" \
-"python /app/main.py"
-```
-
-The option "--role-sequence Worker,Chief,PS,Evaluator" is equivalent to "--role-sequence w,c,p,e", where "w" stands for "Worker", "c" for "Chief", "p" for "PS" and "e" for "Evaluator".
-
-2. Make sure that at least one pod belonging to the tfjob "tf-distributed-test" has the annotation "job-role-sequence=Worker,Chief,PS,Evaluator":
-
-```
-$ kubectl get po -l tf-job-name=tf-distributed-test
-NAME READY STATUS RESTARTS AGE
-tf-distributed-test-chief-0 0/1 ContainerCreating 0 5m47s
-tf-distributed-test-evaluator-0 0/1 ContainerCreating 0 5m47s
-tf-distributed-test-ps-0 1/1 Running 0 5m47s
-tf-distributed-test-worker-0 0/1 ContainerCreating 0 5m47s
-
-$ kubectl get po tf-distributed-test-worker-0 -o yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  annotations:
-    job-role-sequence: Worker,Chief,PS,Evaluator
-    kubernetes.io/psp: ack.privileged
-    requestGPUsOfJobOwner: "3"
-  creationTimestamp: 2021-02-22T03:07:49Z
-....
-
-```
-
-3. You can validate it by querying the tf-operator logs.
-
-```
-$ kubectl get po -n arena-system
-NAME READY STATUS RESTARTS AGE
-et-operator-576887864c-lvmrs 1/1 Running 1 19d
-mpi-operator-66b4cf9b76-kl2fm 1/1 Running 0 26d
-pytorch-operator-8545c46f98-cffgw 1/1 Running 4 26d
-tf-job-dashboard-78478bfc45-msbzn 1/1 Running 0 19d
-tf-job-operator-554d594cff-5vxfg 1/1 Running 0 101m
-```
-
-Query the logs of tf-job-operator-554d594cff-5vxfg.
-
-```
-$ kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system | grep "the Role Sequence" | tail -n 1
-{"filename":"tensorflow/controller.go:453","job":"default.tf-distributed-test","level":"info","msg":"the Role Sequence of job tf-distributed-test is: [Worker Chief PS Evaluator]","time":"2021-02-01T13:22:23Z","uid":"7db02629-4591-4e0c-a938-c6e4a1cfc074"}
-```
-
-
-As you can see, the sequence in which the tf-operator handles the tfjob roles matches the sequence you specified.
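-
-Alternatively, you can read the annotation straight from one of the pods instead of grepping the operator logs. A minimal sketch, using the pod name from step 2:
-
-```
-# prints the role sequence recorded on the pod
-$ kubectl get po tf-distributed-test-worker-0 -o jsonpath='{.metadata.annotations.job-role-sequence}'
-Worker,Chief,PS,Evaluator
-```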
-
-If you don't want to specify the role sequence every time you submit a tfjob, you can save it in the arena configuration file "~/.arena/config", like:
-
-```
-tfjob_role_sequence = Worker,PS,Chief,Evaluator
-```
-
-or
-
-```
-tfjob_role_sequence = w,p,c,e
-```
diff --git a/archived/docs/userguide/31-support-multiple-users.md b/archived/docs/userguide/31-support-multiple-users.md
deleted file mode 100644
index 0cf0a1c84..000000000
--- a/archived/docs/userguide/31-support-multiple-users.md
+++ /dev/null
@@ -1,128 +0,0 @@
-
-## Support Multiple Users
-
-In some scenarios, you may want multiple users to share arena while holding different permissions on the Kubernetes cluster. This guide shows how to set that up.
-
-Assume there are three arena users whose privileges are described in the following table:
-
-
-| User Name | User Namespace | Quota | Additional Privileges |
-| --------- | -------------- | ----- | --------------------- |
-| alex | workplace1 | - | - |
-| bob | workplace2 | limits.cpu: "10", limits.memory: "20Gi", requests.cpu: "5", requests.memory: "10Gi" | list the jobs in the cluster scope |
-| tom | workplace3 | requests.nvidia.com/gpu: 20 | list the jobs in the namespace scope |
-
-The following steps describe how to generate the kubeconfig file for each user.
-
-1. Prepare the user configuration file. You can refer to ~/charts/user/values.yaml or /charts/user/values.yaml when writing your own.
-
-The user alex doesn't need a configuration file, because he uses the default configuration.
-
-The user bob's configuration file is defined as:
-
-```
-quota:
-  limits.cpu: "10"
-  requests.cpu: "5"
-  requests.memory: "10Gi"
-  limits.memory: "20Gi"
-
-clusterRoles:
-  - apiGroups:
-    - batch
-    resources:
-    - jobs
-    verbs:
-    - list
-```
-
-Store it as /tmp/bob.yaml.
-
-The user tom's configuration file is defined as:
-
-```
-quota:
-  requests.nvidia.com/gpu: 5
-
-roles:
-  - apiGroups:
-    - batch
-    resources:
-    - jobs
-    verbs:
-    - list
-```
-Store it as /tmp/tom.yaml.
-
-
-2. Generate the user kubeconfig; the script 'arena-gen-kubeconfig.sh' can help you:
-
-```
-$ arena-gen-kubeconfig.sh -h
-
-Usage:
-
-    arena-gen-kubeconfig.sh [OPTION1] [OPTION2] ...
-
-Options:
-    --user-name           Specify the user name
-    --user-namespace      Specify the user namespace
-    --user-config         Specify the user config,refer the ~/charts/user/values.yaml or /charts/user/values.yaml
-    --force               If the user has been existed,force to update the user
-    --delete              Delete the user
-    --output              Specify the output kubeconfig file or the user manifest yaml
-    --admin-kubeconfig    Specify the Admin kubeconfig file
-    --cluster-url         Specify the Cluster URL,if not specified,the script will detect the cluster url
-    --create-user-yaml    Only generate the user manifest yaml,don't apply it and create kubeconfig file
-```
-
-First, create the kubeconfig file for alex:
-
-```
-$ arena-gen-kubeconfig.sh --user-name alex --user-namespace workplace1 --output /tmp/alex.kubeconfig --force
-
-2021-02-08/11:38:44 DEBUG found arena charts in /Users/yangjunfeng/charts
-2021-02-08/11:38:44 DEBUG the user configuration not set,use the default configuration file
-resourcequota/arena-quota-alex created
-serviceaccount/alex created
-clusterrole.rbac.authorization.k8s.io/arena:workplace1:alex configured
-clusterrolebinding.rbac.authorization.k8s.io/arena:workplace1:alex configured
-role.rbac.authorization.k8s.io/arena:alex created
-rolebinding.rbac.authorization.k8s.io/arena:alex created
-configmap/arena-user-alex created
-Cluster "https://192.168.1.42:6443" set.
-User "alex" set.
-Context "registry" created.
-Switched to context "registry".
-2021-02-08/11:38:48 DEBUG kubeconfig written to file /tmp/alex.kubeconfig
-```
-As you can see, the kubeconfig file has been created (/tmp/alex.kubeconfig).
-
-Second, create the kubeconfig file for bob:
-
-```
-$ arena-gen-kubeconfig.sh --user-name bob --user-namespace workplace2 --user-config /tmp/bob.yaml --output /tmp/bob.kubeconfig --force
-```
-The kubeconfig file will be stored at /tmp/bob.kubeconfig.
-
-Third, create the kubeconfig file for tom:
-
-```
-$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --user-config /tmp/tom.yaml --output /tmp/tom.kubeconfig --force
-```
-The kubeconfig file will be stored at /tmp/tom.kubeconfig.
-
-3. To make the kubeconfig file take effect, set the KUBECONFIG environment variable:
-
-```
-$ export KUBECONFIG=/tmp/alex.kubeconfig
-
-```
-
-4. Now you can use arena to submit your training jobs.
-
-5. If you want to delete a user, execute a command like:
-
-```
-$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --delete
-```
diff --git a/archived/docs/userguide/4-tfjob-distributed-data.md b/archived/docs/userguide/4-tfjob-distributed-data.md
deleted file mode 100644
index 8089c11ec..000000000
--- a/archived/docs/userguide/4-tfjob-distributed-data.md
+++ /dev/null
@@ -1,110 +0,0 @@
-
-`arena` allows you to mount multiple data volumes into training jobs. The following example mounts a `data volume` into a training job.
-
-
-1. Create `/data` on the NFS server, and prepare the `mnist data`:
-
-```
-# mkdir -p /nfs
-# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
-# mkdir -p /data
-# cd /data
-# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz
-# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz
-# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz
-# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz
-# cd /
-# umount /nfs
-```
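-
-Before wiring this into Kubernetes, it can be worth confirming that the export is actually visible from a client. A minimal sketch, assuming the `showmount` NFS utility is installed:
-
-```
-# list the exports offered by the NFS server
-# showmount -e NFS_SERVER_IP
-```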
-
-2\. Create Persistent Volume. Modify `NFS_SERVER_IP` to your own value.
-
-```
-# cat nfs-pv.yaml
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: tfdata
-  labels:
-    tfdata: nas-mnist
-spec:
-  persistentVolumeReclaimPolicy: Retain
-  capacity:
-    storage: 10Gi
-  accessModes:
-  - ReadWriteMany
-  nfs:
-    server: NFS_SERVER_IP
-    path: "/data"
-
-# kubectl create -f nfs-pv.yaml
-```
-
-3\. Create Persistent Volume Claim.
-
-```
-# cat nfs-pvc.yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: tfdata
-  annotations:
-    description: "this is the mnist demo"
-    owner: Tom
-spec:
-  accessModes:
-    - ReadWriteMany
-  resources:
-    requests:
-       storage: 5Gi
-  selector:
-    matchLabels:
-      tfdata: nas-mnist
-# kubectl create -f nfs-pvc.yaml
-```
-
-> Notice: it is suggested to add `description` and `owner`.
-
-4\. Check the data volume
-
-```
-# arena data list
-NAME ACCESSMODE DESCRIPTION OWNER AGE
-tfdata ReadWriteMany this is for mnist demo myteam 43d
-```
-
-5\. Now we can submit a distributed training job with `arena`. It downloads the source code from GitHub and mounts the data volume `tfdata` at `/mnist_data`.
-
-```
-# arena submit tf --name=tf-dist-data \
-    --gpus=1 \
-    --workers=2 \
-    --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
-    --syncMode=git \
-    --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-    --ps=1 \
-    --psImage=tensorflow/tensorflow:1.5.0-devel \
-    --tensorboard \
-    --data=tfdata:/mnist_data \
-    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
-```
-
-> `--data` specifies the data volume to mount into all tasks of the job, in the form `<name>:<mount path>`. In this example, the data volume is `tfdata`, and the target directory is `/mnist_data`.
-
-
-6\. From the logs, we can see that the training data is extracted from `/mnist_data` instead of being downloaded from the internet directly.
-
-```
-# arena logs tf-dist-data
-...
-Extracting /mnist_data/train-images-idx3-ubyte.gz
-Extracting /mnist_data/train-labels-idx1-ubyte.gz
-Extracting /mnist_data/t10k-images-idx3-ubyte.gz
-Extracting /mnist_data/t10k-labels-idx1-ubyte.gz
-...
-Accuracy at step 960: 0.9753
-Accuracy at step 970: 0.9739
-Accuracy at step 980: 0.9756
-Accuracy at step 990: 0.9777
-Adding run metadata for 999
-```
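-
-Since the intro above mentions that multiple data volumes can be mounted, a natural extension is to repeat `--data`. This is a sketch under two assumptions: the flag can be passed more than once, and a second PVC named `tfoutput` has been created the same way as `tfdata`:
-
-```
-# tfoutput is a hypothetical second volume for the training logs
-# arena submit tf --name=tf-dist-multi \
-    --gpus=1 \
-    --workers=2 \
-    --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
-    --syncMode=git \
-    --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-    --ps=1 \
-    --psImage=tensorflow/tensorflow:1.5.0-devel \
-    --data=tfdata:/mnist_data \
-    --data=tfoutput:/training_logs \
-    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
-```
\ No newline at end of file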
diff --git a/archived/docs/userguide/4-tfjob-logviewer-distributed.jpg b/archived/docs/userguide/4-tfjob-logviewer-distributed.jpg
deleted file mode 100644
index 59f520062..000000000
Binary files a/archived/docs/userguide/4-tfjob-logviewer-distributed.jpg and /dev/null differ
diff --git a/archived/docs/userguide/5-mpi-logviewer.jpg b/archived/docs/userguide/5-mpi-logviewer.jpg
deleted file mode 100644
index 93344d23d..000000000
Binary files a/archived/docs/userguide/5-mpi-logviewer.jpg and /dev/null differ
diff --git a/archived/docs/userguide/5-mpi-tensorboard.jpg b/archived/docs/userguide/5-mpi-tensorboard.jpg
deleted file mode 100644
index 9797367e4..000000000
Binary files a/archived/docs/userguide/5-mpi-tensorboard.jpg and /dev/null differ
diff --git a/archived/docs/userguide/5-mpijob-distributed.md b/archived/docs/userguide/5-mpijob-distributed.md
deleted file mode 100644
index 3e34936d2..000000000
--- a/archived/docs/userguide/5-mpijob-distributed.md
+++ /dev/null
@@ -1,56 +0,0 @@
-
-Arena supports and simplifies distributed TensorFlow training (MPI mode).
-
-
-1. To run distributed training with MPI support, you need to specify:
-
- - GPUs of each worker (only for GPU workloads)
- - The number of workers (required)
- - The docker image of the MPI worker (required)
-
-
-The following command is an example. It defines 2 workers, and each worker has 1 GPU. TensorBoard is enabled.
-
-```
-# arena submit mpi \
-    --name=mpi-dist \
-    --gpus=1 \
-    --workers=2 \
-    --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
-    --sync-mode=git \
-    --sync-source=https://github.com/tensorflow/benchmarks.git \
-    --tensorboard \
-    "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
-```
-
-2\. Get the details of the specific job
-
-```
-# arena get mpi-dist
-NAME STATUS TRAINER AGE INSTANCE NODE
-mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-launcher-ndnw8 192.168.1.120
-mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-worker-0 192.168.1.119
-mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-worker-1 192.168.1.120
-
-Your tensorboard will be available on:
-192.168.1.117:32559
-```
-
-3\. Check the tensorboard
-
-![](5-mpi-tensorboard.jpg)
-
-
-4\. Get the MPI dashboard
-
-```
-# arena logviewer mpi-dist
-Your LogViewer will be available on:
-192.168.1.119:9090/#!/log/default/mpi-dist-mpijob-launcher-ndnw8/mpi?namespace=default
-```
-
-
-![](5-mpijob-logviewer.jpg)
-
-Congratulations! You've run the distributed MPI training job with `arena` successfully.
\ No newline at end of file
diff --git a/archived/docs/userguide/6-tfjob-gangschd.md b/archived/docs/userguide/6-tfjob-gangschd.md
deleted file mode 100644
index c404e0d46..000000000
--- a/archived/docs/userguide/6-tfjob-gangschd.md
+++ /dev/null
@@ -1,67 +0,0 @@
-
-Arena supports distributed TensorFlow training with gang scheduling by using [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator).
-
-When running distributed TensorFlow, it is best to ensure `all` or `nothing` scheduling. Gang scheduling helps with such cases.
-
-
-> Notice: the current [kubernetes gang scheduler](https://github.com/kubernetes-incubator/kube-arbitrator/tree/release-0.1) is not production ready. For example, it doesn't support Pod Affinity and PodFitsHostPorts for scheduling.
-
-> Limitation: when using the gang scheduler, the tensorboard feature doesn't work well.
-
-1. To enable the gang scheduler, edit `/charts/tfjob/values.yaml`
-
-Change `enableGangScheduler: false` to `enableGangScheduler: true`
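-
-A one-liner for that edit (a sketch; it assumes the chart is located at the `/charts/tfjob` path used above):
-
-```
-# flip the gang scheduler switch in the tfjob chart values
-sed -i 's/enableGangScheduler: false/enableGangScheduler: true/' /charts/tfjob/values.yaml
-```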
- -``` -# arena submit tf --name=tf-dist-git \ - --gpus=1 \ - --workers=2 \ - --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \ - --syncMode=git \ - --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \ - --ps=1 \ - --psImage=tensorflow/tensorflow:1.5.0-devel \ - "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs" -configmap/tf-dist-git-tfjob created -configmap/tf-dist-git-tfjob labeled -service/tf-dist-git-tensorboard created -deployment.extensions/tf-dist-git-tensorboard created -tfjob.kubeflow.org/tf-dist-git created -INFO[0001] The Job tf-dist-git has been submitted successfully -INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status - -``` - -If there are no enough resources, all the instances of the job are `PENDING`. If it's not gang scheduler, you can see some of the pods are `RUNNING` and others are `PENDING`. - -``` -# arena get tf-dist-data -NAME STATUS TRAINER AGE INSTANCE NODE -tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-ps-0 N/A -tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-0 N/A -tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-1 N/A -tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-2 N/A -tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-3 N/A -``` - -When there are enough resources, the the instances become `RUNNING` - -``` -NAME STATUS TRAINER AGE INSTANCE NODE -tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-ps-0 192.168.1.115 -tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-0 192.168.1.119 -tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-1 192.168.1.118 -tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-2 192.168.1.120 -``` \ No newline at end of file diff --git a/archived/docs/userguide/7-tf-serving-gpu.md b/archived/docs/userguide/7-tf-serving-gpu.md deleted file mode 100644 index e2d111d1d..000000000 --- a/archived/docs/userguide/7-tf-serving-gpu.md +++ /dev/null @@ -1,238 +0,0 @@ -This guide walks through the steps required to deploy and serve a TensorFlow model with GPU using Kubernetes (K8s) and Arena. - -1\. Setup - -Before using `Arena` for TensorFlow serving with GPU, we need to setup the environment including Kubernetes cluster. - -Make sure that your Kubernetes cluster is running and follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/). - - -2\. 
Create Persistent Volume for Model Files - -Create /tfmodel in the NFS Server, and prepare mnist models by following the command: - -``` -mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /tfmodel/ -wget https://github.com/osswangxining/tensorflow-sample-code/raw/master/models/tensorflow/mnist.tar.gz -tar xvf mnist.tar.gz -``` - -Then create Persistent Volume and Persistent Volume Claim by following the command (using NFS as sample): - -Persistent Volume: -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: tfmodel - labels: - tfmodel: nas-mnist -spec: - persistentVolumeReclaimPolicy: Retain - capacity: - storage: 10Gi - accessModes: - - ReadWriteMany - nfs: - server: NFS_SERVER_IP - path: "/tfmodel" -``` - -Persistent Volume Claim: - -``` -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: tfmodel - annotations: - description: "this is tfmodel for mnist" - owner: tester -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 5Gi - selector: - matchLabels: - tfmodel: nas-mnist -``` - -Check the data volume: -``` -arena data list -NAME ACCESSMODE DESCRIPTION OWNER AGE -tfmodel ReadWriteMany this is tfmodel for mnist tester 31s -``` - - -3\. Tensorflow serving with GPU - -You can deploy and serve a Tensorflow model with GPU.If you want to serve a Tensorflow model with GPUMemory,please look at [GPUShare_UserGuide](7-tf-serving-gpushare.md) - -Submit tensorflow serving job to deploy and serve machine learning models using the following command. - -``` -arena serve tensorflow [flags] - -options: - --command string the command will inject to container's command. - --cpu string the request cpu of each replica to run the serve. - -d, --data stringArray specify the trained models datasource to mount for serving, like : - --enableIstio enable Istio for serving or not (disable Istio by default) - -e, --envs stringArray the environment variables - --exposeService expose service using Istio gateway for external access or not (not expose by default) - --gpumemory int the limit GPU memory of each replica to run the serve. - --gpus int the limit GPU count of each replica to run the serve. - -h, --help help for tensorflow - --image string the docker image name of serve job, default image is tensorflow/serving:latest (default "tensorflow/serving:latest") - --imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent") - --memory string the request memory of each replica to run the serve. - --modelConfigFile string Corresponding with --model_config_file in tensorflow serving - --modelName string the model name for serving - --modelPath string the model path for serving in the container - --port int the port of tensorflow gRPC listening port (default 8500) - --replicas int the replicas number of the serve job. (default 1) - --restfulPort int the port of tensorflow RESTful listening port (default 8501) - --servingName string the serving name - --servingVersion string the serving version - --versionPolicy string support latest, latest:N, specific:N, all - -Options inherited from parent commands - --arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - --namespace string the namespace of the job (default "default") - --pprof enable cpu profile -``` - -3.1\. 
View the GPU resource of your cluster - -Before you submit the serving task,make sure you have GPU in your cluster and you have deployed [k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes) -Using arena top node to see the GPU resource of your cluster. -``` -# arena top node -NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) -cn-shanghai.i-uf61h64dz1tmlob9hmtb 192.168.0.71 ready 1 0 -cn-shanghai.i-uf61h64dz1tmlob9hmtc 192.168.0.70 ready 1 0 -cn-shanghai.i-uf6347ba9krw8hj5yvsy 192.168.0.67 master ready 0 0 -cn-shanghai.i-uf662a07bhojl329pity 192.168.0.68 master ready 0 0 -cn-shanghai.i-uf69zddmom136duk79qu 192.168.0.69 master ready 0 0 -------------------------------------------------------------------------------------------- -Allocated/Total GPUs In Cluster: -0/2 (0%) -``` -If your cluster have enough GPU resource,you can submit a serving task. - -3.2\. Submit tensorflow serving task -you can submit a Tensorflow-GPU model with specific version policy as below. - -``` -arena serve tensorflow --name=mymnist1 --model-name=mnist1 --gpus=1 --image=tensorflow/serving:latest-gpu --data=tfmodel:/tfmodel --model-path=/tfmodel/mnist --versionPolicy=specific:1 - -``` - -Once this command is triggered, one Kubernetes service will be created to expose gRPC and RESTful APIs of mnist model.The task will assume the same gpus as it request. -After the command,using arena top node to see the gpu resource of the cluster. -``` -# arena top node -NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) -cn-shanghai.i-uf61h64dz1tmlob9hmtb 192.168.0.71 ready 1 0 -cn-shanghai.i-uf61h64dz1tmlob9hmtc 192.168.0.70 ready 1 1 -cn-shanghai.i-uf6347ba9krw8hj5yvsy 192.168.0.67 master ready 0 0 -cn-shanghai.i-uf662a07bhojl329pity 192.168.0.68 master ready 0 0 -cn-shanghai.i-uf69zddmom136duk79qu 192.168.0.69 master ready 0 0 -------------------------------------------------------------------------------------------- -Allocated/Total GPUs In Cluster: -1/2 (50%) - -``` -If you want to see the details of pod ,you can use arena top node -d. -``` -#arena top node -d -NAME: cn-shanghai.i-uf61h64dz1tmlob9hmtc -IPADDRESS: 192.168.0.70 -ROLE: - -NAMESPACE NAME GPU REQUESTS -default mymnist1-tensorflow-serving-76d5c7c8fc-2kwpw 1 - -Total GPUs In Node cn-shanghai.i-uf61h64dz1tmlob9hmtc: 1 -Allocated GPUs In Node cn-shanghai.i-uf61h64dz1tmlob9hmtc: 1 (100%) -``` - -4\. List all the serving jobs - -You can use the following command to list all the serving jobs. - -``` -# arena serve list - NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS - mymnist1 TENSORFLOW 1 1 172.19.10.38 serving:8500,http-serving:8501 -``` - - -5\. Test RESTful APIs of serving models - -Deploy the `sleep` pod so you can use `curl` to test above serving models via RESTful APIs. - -``` -# cat <: - --enableIstio enable Istio for serving or not (disable Istio by default) - -e, --envs stringArray the environment variables - --exposeService expose service using Istio gateway for external access or not (not expose by default) - --gpumemory int the limit GPU memory of each replica to run the serve. - --gpus int the limit GPU count of each replica to run the serve. - -h, --help help for tensorflow - --image string the docker image name of serve job, default image is tensorflow/serving:latest (default "tensorflow/serving:latest") - --imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent") - --memory string the request memory of each replica to run the serve. 
- --modelConfigFile string Corresponding with --model_config_file in tensorflow serving - --modelName string the model name for serving - --modelPath string the model path for serving in the container - --port int the port of tensorflow gRPC listening port (default 8500) - --replicas int the replicas number of the serve job. (default 1) - --restfulPort int the port of tensorflow RESTful listening port (default 8501) - --servingName string the serving name - --servingVersion string the serving version - --versionPolicy string support latest, latest:N, specific:N, all - -Options inherited from parent commands - --arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - --namespace string the namespace of the job (default "default") - --pprof enable cpu profile -``` - - -Before you submit the serving task,make sure you have enough GPUMemory in your GPUShare nodes. -Using arena top node -s to see the GPUMemory resource of your GPUShare nodes. - -``` -# arena top node -s -NAME IPADDRESS GPU0(Allocated/Total)(GiB) -cn-zhangjiakou.10.0.0.86 10.0.0.86 0/15 -cn-zhangjiakou.10.0.0.87 10.0.0.87 0/15 -cn-zhangjiakou.10.0.0.88 10.0.0.88 0/15 -------------------------------------------------------------- -Allocated/Total GPU Memory In GPUShare Node: -0/45 (GiB) (0%) -``` - -If your cluster have enough gpu memory resource ,you can submit a task as below. -``` -# arena serve tensorflow --name=mymnist2 --model-name=mnist2 --gpumemory=3 --image=tensorflow/serving:latest-gpu --data=tfmodel:/tfmodel --model-path=/tfmodel/mnist --versionPolicy=specific:2 -configmap/mymnist2-202007161051-tf-serving created -configmap/mymnist2-202007161051-tf-serving labeled -configmap/mymnist2-202007161051-tensorflow-serving-cm created -service/mymnist2-202007161051-tensorflow-serving created -deployment.apps/mymnist2-202007161051-tensorflow-serving created - ``` -Once this command is triggered, one Kubernetes service will be created to expose gRPC and RESTful APIs of mnist model.The task will assume the same gpu memory as it request. - -``` -# arena top node -s -NAME IPADDRESS GPU0(Allocated/Total)(GiB) -cn-zhangjiakou.10.0.0.86 10.0.0.86 3/15 -cn-zhangjiakou.10.0.0.87 10.0.0.87 0/15 -cn-zhangjiakou.10.0.0.88 10.0.0.88 0/15 -------------------------------------------------------------- -Allocated/Total GPU Memory In GPUShare Node: -3/45 (GiB) (6%) -``` -If you want to see the details of pod ,you can use arena top node -s -d. -``` -# arena top node -s -d - -NAME: cn-zhangjiakou.10.0.0.88 -IPADDRESS: 10.0.0.88 - -NAME NAMESPACE GPU0(Allocated) -Allocated(GiB) : 0 (0%) -Total(GiB) : 15 ----------- - -NAME: cn-zhangjiakou.10.0.0.86 -IPADDRESS: 10.0.0.86 - -NAME NAMESPACE GPU0(Allocated) -mymnist2-202007161051-tensorflow-serving-86446d46d8-dbsfw default 3 -Allocated(GiB) : 3 (20%) -Total(GiB) : 15 ----------- - -NAME: cn-zhangjiakou.10.0.0.87 -IPADDRESS: 10.0.0.87 - -NAME NAMESPACE GPU0(Allocated) -Allocated(GiB) : 0 (0%) -Total(GiB) : 15 ----------- - - -Allocated/Total GPU Memory In GPUShare Node: -3/45 (GiB) (6%) -``` -3\. List all the serving jobs -``` -# arena serve list -NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS -mymnist2 TENSORFLOW 202007161051 1 0 192.168.75.28 serving:8500,http-serving:8501 -``` - -4\. 
Test RESTful APIs of serving models - -Deploy the `sleep` pod so you can use `curl` to test above serving models via RESTful APIs. - -``` -# cat <: - --enable-istio enable Istio for serving or not (disable Istio by default) - -e, --envs stringArray the environment variables - --expose-service expose service using Istio gateway for external access or not (not expose by default) - --gpumemory int the limit GPU memory of each replica to run the serve. - --gpus int the limit GPU count of each replica to run the serve. - -h, --help help for tensorflow - --image string the docker image name of serve job, and the default image is tensorflow/serving:latest (default "tensorflow/serving:latest") - --image-pull-policy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent") - --memory string the request memory of each replica to run the serve. - --model-name string the model name for serving - --model-path string the model path for serving in the container - --modelConfigFile string Corresponding with --model_config_file in tensorflow serving - --name string the serving name - --port int the port of tensorflow gRPC listening port (default 8500) - --replicas int the replicas number of the serve job. (default 1) - --restfulPort int the port of tensorflow RESTful listening port (default 8501) - --version string the serving version - --versionPolicy string support latest, latest:N, specific:N, all - -Options inherited from parent commands: - --arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system") - --config string Path to a kube config. Only required if out-of-cluster - --loglevel string Set the logging level. One of: debug|info|warn|error (default "info") - -n, --namespace string the namespace of the job (default "default") - --pprof enable cpu profile - --trace enable trace -``` - -For example, you can submit a Tensorflow model with specific version policy as below. - -``` -arena serve tensorflow \ - --name=mymnist \ - --model-name=mnist \ - --image=tensorflow/serving:latest \ - --data=tfmodel:/tfmodel \ - --model-path=/tfmodel/mnist \ - --versionPolicy=specific:1 \ - --loglevel=debug -``` - -Once this command is triggered, one Kubernetes service will be created to expose gRPC and RESTful APIs of mnist model. - - -4\. Tensorflow serving with Istio enabled (optional) - -If you need to enable Istio for Tensorflow serving, you can append the parameter `--enableIstio` into the command above (disable Istio by default). - -For example, you can submit a Tensorflow model with Istio enabled as below. 
- -``` -$ arena serve tensorflow \ - --enableIstio \ - --name=mymnist \ - --servingVersion=v1 \ - --model-name=mnist \ - --data=myoss1pvc:/data2 \ - --model-path=/data2/models/mnist \ - --versionPolicy=specific:1 \ - -NAME: mymnist-v1 -LAST DEPLOYED: Wed Sep 26 17:28:13 2018 -NAMESPACE: default -STATUS: DEPLOYED - -RESOURCES: -==> v1/ConfigMap -NAME DATA AGE -mymnist-v1-tensorflow-serving-cm 1 1s - -==> v1/Service -NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE -mymnist ClusterIP 172.19.12.176 8500/TCP,8501/TCP 1s - -==> v1beta1/Deployment -NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE -mymnist-v1-tensorflow-serving 1 1 1 0 1s - -==> v1alpha3/DestinationRule -NAME AGE -mymnist 1s - -==> v1alpha3/VirtualService -mymnist 1s - -==> v1/Pod(related) -NAME READY STATUS RESTARTS AGE -mymnist-v1-tensorflow-serving-757b669bbb-5vsmf 0/2 Init:0/1 0 1s - - -NOTES: -Getting Started: - -**** NOTE: It may take a few minutes for the LoadBalancer IP to be available. **** -**** You can watch the status of by running 'kubectl get svc --namespace default -w mymnist-v1-tensorflow-serving' **** - export TF_SERVING_SERVICE_IP=$(kubectl get svc --namespace default mymnist-v1-tensorflow-serving -o jsonpath='{.status.loadBalancer.ingress[0].ip}') - echo docker run -it --rm cheyang/tf-mnist:grpcio_upgraded /serving/bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=$TF_SERVING_SERVICE_IP:9090` - -``` - -5\. List all the serving jobs - -You can use the following command to list all the serving jobs. - -``` -# arena serve list - NAME VERSION STATUS - mymnist-v1 v1 DEPLOYED -``` - -6\. Adjust traffic routing dynamically for tfserving jobs - -You can leverage Istio to control traffic routing to multiple versions of your serving models. - -Supposing you've performed step 4, and had v1 model serving deployed already. Now deploy one new version of Tensorflow model with Istio enabled: -``` -arena serve tensorflow \ - --enableIstio \ - --name=mymnist \ - --servingVersion=v2 \ - --modelName=mnist \ - --data=myoss1pvc:/data2 \ - --model-path=/data2/models/mnist -``` - -Then you can adjust traffic routing dynamically with relative weights for both two versions of tfserving jobs. - -``` -arena serve traffic-router-split \ - --name=mymnist \ - --servingVersions=v1,v2 \ - --weights=50,50 -``` - -7\. Test RESTful APIs of serving models - -Deploy the `sleep` pod so you can use `curl` to test above serving models via RESTful APIs. - -If you disable Istio, run the following: -``` -# cat < Notice: suggest to add `description` and `owner` - -3\. Check the data volume - -``` -# arena data list -NAME ACCESSMODE DESCRIPTION OWNER AGE -tfdata ReadWriteMany this is for mnist demo myteam 43d -``` - -4\. To run a distributed Tensorflow Training, you need to specify: - - - GPUs of each worker (Include chief and evaluator) - - Enable chief (required) - - Enable Evaluator (optional) - - The number of workers (required) - - The number of PS (required) - - The docker image of worker and master (required) - - The docker image of PS (required) - - The Port of Chief (default is 22221) - - The Port of Worker (default is 22222) - - The Port of PS (default is 22223) - -The following command is an example. In this example, it defines 1 chief worker, 1 workers, 1 PS and 1 evaluator, and each worker has 1 GPU. The source code of worker and PS are located in git, and the tensorboard are enabled. 
- -``` -# arena submit tf --name=tf-estimator \ - --gpus=1 \ - --workers=1 \ - --chief \ - --evaluator \ - --data=tfdata:/data/mnist \ - --logdir=/data/mnist/models \ - --workerImage=tensorflow/tensorflow:1.9.0-devel-gpu \ - --syncMode=git \ - --syncSource=https://github.com/cheyang/models.git \ - --ps=1 \ - --psImage=tensorflow/tensorflow:1.9.0-devel \ - --tensorboard \ - "bash code/models/dist_mnist_estimator.sh --data_dir=/data/mnist/MNIST_data --model_dir=/data/mnist/models" -configmap/tf-estimator-tfjob created -configmap/tf-estimator-tfjob labeled -service/tf-estimator-tensorboard created -deployment.extensions/tf-estimator-tensorboard created -tfjob.kubeflow.org/tf-estimator created -INFO[0001] The Job tf-estimator has been submitted successfully -INFO[0001] You can run `arena get tf-estimator --type tfjob` to check the job status - -``` - -> `--data` specifies the data volume to mount to all the tasks of the job, like :. In this example, the data volume is `tfdata`, and the target directory is `/data/mnist`. - - -5\. From the logs, we have found the training is started - -``` -# arena logs tf-estimator -2018-09-27T00:37:01.576672145Z 2018-09-27 00:37:01.576562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:chief/replica:0/task:0/device:GPU:0 with 15123 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0) -2018-09-27T00:37:01.578669608Z 2018-09-27 00:37:01.578523: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:22222} -2018-09-27T00:37:01.578685739Z 2018-09-27 00:37:01.578550: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-estimator-tfjob-ps-0:22223} -2018-09-27T00:37:01.578705274Z 2018-09-27 00:37:01.578562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tf-estimator-tfjob-worker-0:22222} -2018-09-27T00:37:01.579637826Z 2018-09-27 00:37:01.579454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:334] Started server with target: grpc://localhost:22222 -2018-09-27T00:37:01.701520696Z I0927 00:37:01.701258 140281586534144 tf_logging.py:115] Calling model_fn. -2018-09-27T00:37:02.172552485Z I0927 00:37:02.172167 140281586534144 tf_logging.py:115] Done calling model_fn. -2018-09-27T00:37:02.173930978Z I0927 00:37:02.173732 140281586534144 tf_logging.py:115] Create CheckpointSaverHook. -2018-09-27T00:37:02.431259294Z I0927 00:37:02.430984 140281586534144 tf_logging.py:115] Graph was finalized. -2018-09-27T00:37:02.4472109Z 2018-09-27 00:37:02.447018: I tensorflow/core/distributed_runtime/master_session.cc:1150] Start master session b0a6d2587e64ebef with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } -... -2018-09-27T00:37:33.250440133Z I0927 00:37:33.250036 140281586534144 tf_logging.py:115] global_step/sec: 21.8175 -2018-09-27T00:37:33.253100942Z I0927 00:37:33.252873 140281586534144 tf_logging.py:115] loss = 0.09276967, step = 500 (4.583 sec) -2018-09-27T00:37:37.764446795Z I0927 00:37:37.764101 140281586534144 tf_logging.py:115] Saving checkpoints for 600 into /data/mnist/models/model.ckpt. -2018-09-27T00:37:38.064104604Z I0927 00:37:38.063472 140281586534144 tf_logging.py:115] Loss for final step: 0.24215397. -``` - -6\. 
Check the training status and tensorboard
-
-```
-# arena get tf-estimator
-NAME STATUS TRAINER AGE INSTANCE NODE
-tf-estimator SUCCEEDED TFJOB 5h tf-estimator-tfjob-chief-0 N/A
-tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-evaluator-0 192.168.1.120
-tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-ps-0 192.168.1.119
-tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-worker-0 192.168.1.118
-
-Your tensorboard will be available on:
-192.168.1.117:31366
-```
-
-7\. Check the tensorboard from 192.168.1.117:31366 in this sample
-
-![](8-tfjob-estimator-tensorboard.jpg)
diff --git a/archived/docs/userguide/9-top-job-gpu-metric.md b/archived/docs/userguide/9-top-job-gpu-metric.md
deleted file mode 100644
index a10bfa168..000000000
--- a/archived/docs/userguide/9-top-job-gpu-metric.md
+++ /dev/null
@@ -1,65 +0,0 @@
-The command `arena top job <job name>` can display GPU monitoring metrics. Before using it, you must deploy Prometheus and a node exporter for GPU metrics.
-
-1\. Deploy Prometheus
-
-```
-kubectl apply -f kubernetes-artifacts/prometheus/prometheus.yaml
-```
-
-2\. Deploy the GPU node exporter
-
-* If your cluster is an ACK (Alibaba Cloud Kubernetes) cluster, you can just run:
-
-```
-# change the gpu exporter nodeSelector to the aliyun label
-sed -i 's|accelerator/nvidia_gpu|aliyun.accelerator/nvidia_count|g' kubernetes-artifacts/prometheus/gpu-exporter.yaml
-```
-
-* If your cluster is not an ACK cluster, you need to label your GPU nodes:
-
-```
-# label all your GPU nodes
-kubectl label node <your GPU node> accelerator/nvidia_gpu=true
-```
-
-* Deploy the gpu exporter
-
-```
-kubectl apply -f kubernetes-artifacts/prometheus/gpu-exporter.yaml
-```
-
-> Notice: the prometheus and gpu-exporter components should be deployed in the `kube-system` namespace so that `arena top job <job name>` can work.
-
-3\. You can check the GPU metrics with a Prometheus query:
-
-```
-# kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices'
-
-{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nvidia_gpu_num_devices","app":"node-gpu-exporter","instance":"172.16.1.144:9445","job":"kubernetes-service-endpoints","k8s_app":"node-gpu-exporter","kubernetes_name":"node-gpu-exporter","node_name":"mynode"},"value":[1543202894.919,"2"]}]}}
-
-```
-
-4\. Submit a training job with arena
-
-```
-arena submit tf --name=style-transfer \
-    --gpus=2 \
-    --workers=2 \
-    --workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
-    --workingDir=/neural-style \
-    --ps=1 \
-    --psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
-    "python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000000"
-```
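-
-While the job is running, the raw-query pattern from step 3 can also fetch utilization directly. A sketch, assuming the exporter publishes a duty-cycle metric named `nvidia_gpu_duty_cycle`:
-
-```
-# the metric name is an assumption; adjust to what your exporter exposes
-kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_duty_cycle'
-```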
-
-5\. Check GPU metrics for the job you deployed
-
-```
-# arena top job style-transfer
-INSTANCE NAME STATUS NODE GPU(Device Index) GPU(Duty Cycle) GPU(Memory MiB)
-style-transfer-tfjob-ps-0 Running 192.168.0.95 N/A N/A N/A
-style-transfer-tfjob-worker-0 Running 192.168.0.98 0 98% 15641MiB / 16276MiB
- 1 0% 15481MiB / 16276MiB
-style-transfer-tfjob-worker-1 Running 192.168.0.99 0 98% 15641MiB / 16276MiB
- 1 0% 15481MiB / 16276MiB
-```
diff --git a/archived/docs/userguide_cn/1-tfjob-logviewer.jpg b/archived/docs/userguide_cn/1-tfjob-logviewer.jpg
deleted file mode 100755
index 22cd1222a..000000000
Binary files a/archived/docs/userguide_cn/1-tfjob-logviewer.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/1-tfjob-standalone.md b/archived/docs/userguide_cn/1-tfjob-standalone.md
deleted file mode 100755
index dc6e74393..000000000
--- a/archived/docs/userguide_cn/1-tfjob-standalone.md
+++ /dev/null
@@ -1,139 +0,0 @@
-
-This example shows how to use `Arena` to train a machine learning model. The example downloads the source code from a git URL.
-
-1. The first step is to check the available GPU resources:
-
-```
-arena top node
-NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
-i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
-i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
-i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
-i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
-i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
-i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
------------------------------------------------------------------------------------------
-Allocated/Total GPUs In Cluster:
-0/3 (0%)
-```
-
-There are 3 available nodes with GPUs for running training jobs.
-
-
-2\. Now we can submit a training job with `arena`; this example downloads the source code from GitHub:
-
-```
-#arena submit tf \
-             --name=tf-git \
-             --gpus=1 \
-             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
-             --sync-mode=git \
-             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
-configmap/tf-git-tfjob created
-configmap/tf-git-tfjob labeled
-tfjob.kubeflow.org/tf-git created
-INFO[0000] The Job tf-git has been submitted successfully
-INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
-```
-
-> This downloads the source code and extracts it into the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`. You can also declare the branch to pull by adding `--env GIT_SYNC_BRANCH=main` to the submit command. `Note that new GitHub repos now use main as the default branch instead of master.`
-
-> If you are using a private git repository, you can use the following command:
-
-```
-#arena submit tf \
-             --name=tf-git \
-             --gpus=1 \
-             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
-             --sync-mode=git \
-             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-             --env=GIT_SYNC_USERNAME=yourname \
-             --env=GIT_SYNC_PASSWORD=yourpwd \
-             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
-```
-
-Note: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to synchronize the source code. You can set the environment variables defined in the git-sync project.
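-
-For example, a submit command pinned to the `main` branch would look like the following sketch (same flags as above; only the branch env line is new):
-
-```
-arena submit tf \
-    --name=tf-git-main \
-    --gpus=1 \
-    --image=tensorflow/tensorflow:1.5.0-devel-gpu \
-    --sync-mode=git \
-    --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-    --env=GIT_SYNC_BRANCH=main \
-    "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
-```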
-
-3\. List all jobs
-
-```
-#arena list
-NAME STATUS TRAINER AGE NODE
-tf-git RUNNING tfjob 0s 192.168.1.120
-```
-
-4\. Check the GPU resources used by the job
-
-```
-#arena top job
-NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
-tf-git RUNNING TFJOB 17s 192.168.1.120 1 1
-
-
-Total Allocated GPUs of Training Job:
-1
-
-Total Requested GPUs of Training Job:
-1
-```
-
-5\. Check the GPU resources used by the cluster
-
-```
-#arena top node
-NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
-i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
-i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
-i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 1
-i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
-i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
-i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
------------------------------------------------------------------------------------------
-Allocated/Total GPUs In Cluster:
-1/3 (33%)
-```
-
-
-6\. Get the details of a specific job
-
-```
-#arena get tf-git
-NAME STATUS TRAINER AGE INSTANCE NODE
-tf-git RUNNING TFJOB 5s tf-git-tfjob-worker-0 192.168.1.120
-```
-
-7\. Check the logs
-
-```
-#arena logs tf-git
-2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
-2018-07-22T23:56:20.841211064Z Instructions for updating:
-2018-07-22T23:56:20.841217002Z
-2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
-2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
-2018-07-22T23:56:20.841229492Z
-...
-2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
-2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
-2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
-2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
-2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
-2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
-2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
-2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
-2018-07-22T23:57:11.842961076Z Adding run metadata for 999
-```
-
-8\. More information about the training job is available in the log viewer
-
-```
-#arena logviewer tf-git
-Your LogViewer will be available on:
-192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
-```
-
-![](1-tfjob-logviewer.jpg)
-
-
-Congratulations! You have successfully completed your first training job with `arena`.
diff --git a/archived/docs/userguide_cn/13-preempted-mpijob.md b/archived/docs/userguide_cn/13-preempted-mpijob.md
deleted file mode 100644
index 7eee44ea7..000000000
--- a/archived/docs/userguide_cn/13-preempted-mpijob.md
+++ /dev/null
@@ -1,168 +0,0 @@
-
-# Example: MPIJob Preemption with Arena
-
-## Prerequisites
-
-- k8s > 1.11
-
-1. Create `PriorityClass` objects with the following YAML, which defines two priorities, `critical` and `medium`:
-
-```yaml
-apiVersion: scheduling.k8s.io/v1beta1
-description: Used for the critical app
-kind: PriorityClass
-metadata:
-  name: critical
-value: 1100000
-
----
-
-apiVersion: scheduling.k8s.io/v1beta1
-description: Used for the medium app
-kind: PriorityClass
-metadata:
-  name: medium
-value: 1000000
-```
-
-Save the content above to a `pc.yaml` file and create the objects with the following command:
-
-```
-kubectl create -f pc.yaml
-```
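-
-A quick check that both priority classes exist (a sketch; plain kubectl):
-
-```
-# both names should be listed with their values
-kubectl get priorityclass critical medium
-```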
-
-2. The arena command shows that there is only one available GPU card in the current Kubernetes cluster:
-
-```
-# arena top node
-NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
-192.168.0.20 192.168.0.20 master 0 0
-192.168.0.21 192.168.0.21 master 0 0
-192.168.0.22 192.168.0.22 master 0 0
-192.168.0.23 192.168.0.23 1 0
------------------------------------------------------------------------------------------
-Allocated/Total GPUs In Cluster:
-0/1 (0%)
-```
-
-3. Submit an MPI training job with priority `medium`.
-
-Refer to the following example:
-
-```
-# arena submit mpi \
-    --name=medium \
-    --priority=medium \
-    --gpus=1 \
-    --workers=1 \
-    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    "mpirun tail -f /dev/null"
-configmap/medium-mpijob created
-configmap/medium-mpijob labeled
-mpijob.kubeflow.org/medium created
-INFO[0000] The Job medium has been submitted successfully
-INFO[0000] You can run `arena get medium --type mpijob` to check the job status
-```
-
-4. Check the job's running status:
-
-```
-# arena get medium
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: medium
-TRAINING DURATION: 58s
-
-NAME STATUS TRAINER AGE INSTANCE NODE
-medium RUNNING MPIJOB 58s medium-launcher-sz5xj 192.168.0.23
-medium RUNNING MPIJOB 58s medium-worker-0 192.168.0.23
-```
-
-5. You can see that this job occupies the only GPU card:
-
-```
-# arena top node -d
-
-NAME: cn-hangzhou.192.168.0.23
-IPADDRESS: 192.168.0.23
-ROLE:
-
-NAMESPACE NAME GPU REQUESTS GPU LIMITS
-default medium-worker-0 1 1
-
-Total GPUs In Node cn-hangzhou.192.168.0.23: 1
-Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
------------------------------------------------------------------------------------------
-
-Allocated/Total GPUs In Cluster: 1/1 (100%)
-```
-
-6. Submit another MPI training job, this time with priority `critical`:
-
-```
-# arena submit mpi \
-    --name=critical \
-    --priority=critical \
-    --gpus=1 \
-    --workers=1 \
-    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    "mpirun tail -f /dev/null"
-```
-
-7. Check the events of the MPI job `medium`; you will find that it has been evicted. It was evicted because a pod of the more important job `critical` also requested GPU resources, and there is only one available GPU in the cluster, so `medium-worker-0` of the lower-priority job `medium` was preempted:
-
-```
-# kubectl get events --field-selector involvedObject.name=medium-worker-0
-LAST SEEN TYPE REASON OBJECT MESSAGE
-15m Normal Scheduled pod/medium-worker-0 Successfully assigned default/medium-worker-0 to 192.168.0.23
-14m Normal Pulled pod/medium-worker-0 Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
-14m Normal Created pod/medium-worker-0 Created container mpi
-14m Normal Started pod/medium-worker-0 Started container mpi
-2m32s Normal Preempted pod/medium-worker-0 by default/critical-worker-0 on node 192.168.0.23
-2m32s Normal Killing pod/medium-worker-0 Stopping container mpi
-```
-
-8. Check the details of the MPI job `medium`; it is now in the failed state:
-
-```
-# arena get medium
-STATUS: FAILED
-NAMESPACE: default
-PRIORITY: medium
-TRAINING DURATION: 12m
-
-NAME STATUS TRAINER AGE INSTANCE NODE
-medium FAILED MPIJOB 20m medium-launcher-sz5xj 192.168.0.23
-```
-
-9. Check the details of the MPI job `critical`; it is running:
-
-```
-# arena get critical
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: critical
-TRAINING DURATION: 10m
-
-NAME STATUS TRAINER AGE INSTANCE NODE
-critical RUNNING MPIJOB 10m critical-launcher-mfffs 192.168.0.23
-critical RUNNING MPIJOB 10m critical-worker-0 192.168.0.23
-```
-
-10. You can also see with `arena top node -d` that the GPU is now occupied by the MPI job `critical`:
-
-```
-# arena top node -d
-NAME: cn-hangzhou.192.168.0.23
-IPADDRESS: 192.168.0.23
-ROLE:
-
-NAMESPACE NAME GPU REQUESTS GPU LIMITS
-default critical-worker-0 1 1
-
-Total GPUs In Node cn-hangzhou.192.168.0.23: 1
-Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
------------------------------------------------------------------------------------------
-```
-
-Congratulations! You can now use arena to achieve priority preemption for MPIJobs.
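-
-To list preemption events across a namespace rather than for a single pod, the field-selector pattern from step 7 can be generalized (a sketch):
-
-```
-# list all events whose reason is Preempted
-kubectl get events --field-selector reason=Preempted
-```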
你已经可以通过 arena 实现 MPIJob 的优先级抢占了。
diff --git a/archived/docs/userguide_cn/14-submit-with-node-selector.md b/archived/docs/userguide_cn/14-submit-with-node-selector.md
deleted file mode 100644
index c6f2d2ee1..000000000
--- a/archived/docs/userguide_cn/14-submit-with-node-selector.md
+++ /dev/null
@@ -1,159 +0,0 @@
-
-Arena支持给提交的任务指定运行的节点(目前仅支持mpi和tf类型的任务)。
-
-下面展示一些使用例子。
-
-1.查询k8s集群信息:
-```
-# kubectl get nodes
-NAME                       STATUS   ROLES    AGE     VERSION
-cn-beijing.192.168.3.225   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.226   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.227   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.228   Ready    <none>   2d22h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.229   Ready    <none>   2d22h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.230   Ready    <none>   2d22h   v1.12.6-aliyun.1
-```
-2.为一些k8s节点指定标签。例如,为节点"cn-beijing.192.168.3.228"和节点"cn-beijing.192.168.3.229"指定标签"gpu_node=ok",为节点"cn-beijing.192.168.3.230"指定标签"ssd_node=ok"。
-```
-# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
-node/cn-beijing.192.168.3.228 labeled
-# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
-node/cn-beijing.192.168.3.229 labeled
-# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
-node/cn-beijing.192.168.3.230 labeled
-```
-## MPI类型的job
-1.提交job时,可以通过"--selector"选项指定这些job运行在哪些节点上:
-```
-# arena submit mpi --name=mpi-dist \
-    --gpus=1 \
-    --workers=1 \
-    --selector gpu_node=ok \
-    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    --tensorboard \
-    --loglevel debug \
-    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
-```
-2.查询job信息
-```
-# arena get mpi-dist
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 21s
-
-NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
-mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-launcher-7jn4q  192.168.3.229
-mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-worker-0        192.168.3.229
-
-Your tensorboard will be available on:
-http://192.168.3.225:31611
-```
-可以看到job已经运行在节点cn-beijing.192.168.3.229(ip是192.168.3.229,标签是"gpu_node=ok")上了。
-
-3.你可以多次使用"--selector"选项。例如:在arena的提交命令中使用"--selector gpu_node=ok --selector ssd_node=ok",代表job需要运行在同时拥有标签"gpu_node=ok"和"ssd_node=ok"的节点上。
-
-## TF类型的job
-
-1.tf类型的job中存在四种角色("PS","Worker","Evaluator","Chief"),你可以使用"--selector"来指定job运行在哪些节点上。
-```
-arena submit tfjob \
-    --name=tf \
-    --gpus=1 \
-    --workers=1 \
-    --selector ssd_node=ok \
-    --work-image=cheyang/tf-mnist-distributed:gpu \
-    --ps-image=cheyang/tf-mnist-distributed:cpu \
-    --ps=1 \
-    --tensorboard \
-    --loglevel debug \
-    "python /app/main.py"
-```
-使用如下命令检查作业状态:
-
-```
-# arena get tf
-STATUS: PENDING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 24s
-
-NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
-tf    RUNNING  TFJOB    24s  tf-ps-0      192.168.3.230
-tf    PENDING  TFJOB    24s  tf-worker-0  192.168.3.230
-
-Your tensorboard will be available on:
-http://192.168.3.225:31867
-```
-
-可以看到"PS"和"Worker"的实例都被调度到了节点cn-beijing.192.168.3.230(ip是192.168.3.230,标签是"ssd_node=ok")上。
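-
-补充说明(仅为示意):"--selector"的语义与 Kubernetes 原生的 nodeSelector 一致,这里假设 arena 会把该选项渲染为各实例 Pod 规格中的 nodeSelector 字段。如果想确认调度约束确实生效,可以用 kubectl 直接查看上面输出中的实例(例如 tf-ps-0):
-
-```
-# kubectl get pod tf-ps-0 -o jsonpath='{.spec.nodeSelector}'
-```
-
-预期输出会包含提交时指定的标签(例如 ssd_node=ok);不同 kubectl 版本的打印格式可能略有差异。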
- 
-2.你也可以单独指定某种角色的job运行在哪些节点上。例如:如果你希望把"PS" job运行在标签为"ssd_node=ok"的节点上,把"Worker" job运行在标签为"gpu_node=ok"的节点上,可以使用"--ps-selector"和"--worker-selector":
- 
-```
-arena submit tfjob \
-    --name=tf \
-    --gpus=1 \
-    --workers=1 \
-    --ps-selector ssd_node=ok \
-    --worker-selector gpu_node=ok \
-    --work-image=cheyang/tf-mnist-distributed:gpu \
-    --ps-image=cheyang/tf-mnist-distributed:cpu \
-    --ps=1 \
-    --tensorboard \
-    --loglevel debug \
-    "python /app/main.py"
-```
-检查job的状态:
-
-```
-# arena get tf
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 23s
-
-NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
-tf    RUNNING  TFJOB    23s  tf-ps-0      192.168.3.230
-tf    RUNNING  TFJOB    23s  tf-worker-0  192.168.3.228
-
-Your tensorboard will be available on:
-http://192.168.3.225:30162
-```
-"PS" job运行在节点cn-beijing.192.168.3.230(ip是192.168.3.230,标签是"ssd_node=ok")上,"Worker" job运行在节点cn-beijing.192.168.3.228(ip是192.168.3.228,标签是"gpu_node=ok")上。
-
-3.如果你同时使用"--selector"和"--ps-selector"(或者"--worker-selector"、"--evaluator-selector"、"--chief-selector"),那么"--ps-selector"的值会覆盖"--selector"的值,例如:
-
-```
-arena submit tfjob \
-    --name=tf \
-    --gpus=1 \
-    --workers=1 \
-    --ps-selector ssd_node=ok \
-    --selector gpu_node=ok \
-    --work-image=cheyang/tf-mnist-distributed:gpu \
-    --ps-image=cheyang/tf-mnist-distributed:cpu \
-    --ps=1 \
-    --tensorboard \
-    --loglevel debug \
-    "python /app/main.py"
-```
-理论上"--selector"会应用到所有角色的job中:上面的命令里,所有角色的job都会被调度到标签为gpu_node=ok的节点上;但因为指定了"--ps-selector","PS" job会被调度到标签为ssd_node=ok的节点上,而不是标签为gpu_node=ok的节点上。
-```
-# arena get tf
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 39s
-
-NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
-tf    RUNNING  TFJOB    39s  tf-ps-0      192.168.3.230
-tf    RUNNING  TFJOB    39s  tf-worker-0  192.168.3.228
-
-Your tensorboard will be available on:
-http://192.168.3.225:32105
-```
-正如你所看到的,"PS" job被调度到拥有标签"ssd_node=ok"的节点上,其他角色的job则被调度到标签为"gpu_node=ok"的节点上。
diff --git a/archived/docs/userguide_cn/14-submit-with-node-toleration.md b/archived/docs/userguide_cn/14-submit-with-node-toleration.md
deleted file mode 100644
index 91e2808d8..000000000
--- a/archived/docs/userguide_cn/14-submit-with-node-toleration.md
+++ /dev/null
@@ -1,83 +0,0 @@
-
-Arena支持将提交的job调度到带有污点(taint)的k8s节点上(目前仅支持mpi和tf类型的job)。
-
-下面展示一些使用例子。
-
-1.查询k8s集群信息:
-```
-# kubectl get nodes
-NAME                       STATUS   ROLES    AGE     VERSION
-cn-beijing.192.168.3.225   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.226   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.227   Ready    master   2d23h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.228   Ready    <none>   2d22h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.229   Ready    <none>   2d22h   v1.12.6-aliyun.1
-cn-beijing.192.168.3.230   Ready    <none>   2d22h   v1.12.6-aliyun.1
-```
-2.为k8s节点打上一些污点。例如:为节点"cn-beijing.192.168.3.228"和节点"cn-beijing.192.168.3.229"打上污点"gpu_node=invalid:NoSchedule",为节点"cn-beijing.192.168.3.230"打上污点"ssd_node=invalid:NoSchedule"。此后,不容忍这些污点的pod都无法调度到这些节点上了。
-```
-# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
-node/cn-beijing.192.168.3.228 tainted
-# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
-node/cn-beijing.192.168.3.229 tainted
-# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
-node/cn-beijing.192.168.3.230 tainted
-```
-3.当提交一个job时,你可以使用"--toleration"来容忍带有指定污点的k8s节点。
-```
-# arena submit mpi --name=mpi-dist \
-    --gpus=1 \
-    --workers=1 \
-    --toleration ssd_node \
-    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    --tensorboard \
-    --loglevel debug \
-    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
-```
-查询job信息:
-```
-# arena get mpi-dist
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 29s
-
-NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
-mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.230
-mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230
-
-Your tensorboard will be available on:
-http://192.168.3.225:30052
-```
-job已经运行在节点cn-beijing.192.168.3.230(ip为192.168.3.230,污点为"ssd_node=invalid:NoSchedule")上了。
-
-4.你可以在同一个命令中多次使用"--toleration"。例如,在命令中使用"--toleration gpu_node --toleration ssd_node",代表既容忍带有污点"gpu_node"的节点,又容忍带有污点"ssd_node"的节点。
-
-```
-# arena submit mpi --name=mpi-dist \
-    --gpus=1 \
-    --workers=1 \
-    --toleration ssd_node \
-    --toleration gpu_node \
-    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-    --tensorboard \
-    --loglevel debug \
-    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
-```
-查询job状态:
-
-```
-# arena get mpi-dist
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 29s
-
-NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
-mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.229
-mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230
-
-Your tensorboard will be available on:
-http://192.168.3.225:30052
-```
-5.你可以使用"--toleration all"来容忍所有节点上的所有污点。
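-
-作为补充(仅为示意):"--toleration"对应 Pod 规格中的 tolerations 字段,这里假设 arena 会为指定的 key 生成一条 operator 为 Exists 的容忍规则(这是 Kubernetes 对"只给 key、不给 value"的常见处理方式)。同样可以用 kubectl 验证,其中 mpi-dist-worker-0 取自上面的输出:
-
-```
-# kubectl get pod mpi-dist-worker-0 -o jsonpath='{.spec.tolerations}'
-```
-
-如果输出中出现 "ssd_node"(或 "gpu_node")对应的条目,说明容忍规则已经生效;具体字段内容取决于 arena 的实际实现,这里仅作参考。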
diff --git a/archived/docs/userguide_cn/15-custom-serving-sample-beijing.jpg b/archived/docs/userguide_cn/15-custom-serving-sample-beijing.jpg
deleted file mode 100644
index 9ed795482..000000000
Binary files a/archived/docs/userguide_cn/15-custom-serving-sample-beijing.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/15-custom-serving-sample-beijing_out.jpg b/archived/docs/userguide_cn/15-custom-serving-sample-beijing_out.jpg
deleted file mode 100644
index 8f0bc2074..000000000
Binary files a/archived/docs/userguide_cn/15-custom-serving-sample-beijing_out.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/15-custom-serving-sample.md b/archived/docs/userguide_cn/15-custom-serving-sample.md
deleted file mode 100644
index edb5a2503..000000000
--- a/archived/docs/userguide_cn/15-custom-serving-sample.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# 用arena服务训练模型
-
-你可以使用arena部署你的训练模型,并通过RESTful API的方式访问。为了说明怎样使用,我们会用一个案例[fast-style-transfer](https://github.com/floydhub/fast-style-transfer);同时为了节约时间,直接使用这个项目已经训练好的模型,并把模型加入docker镜像中。
-
-### 1.部署训练模型
-
-使用项目中的app.py这个脚本启动一个restful服务器,你可以使用如下的命令去部署模型:
-
-```
-# arena serve custom \
-    --name=fast-style-transfer \
-    --gpus=1 \
-    --version=alpha \
-    --replicas=1 \
-    --restful-port=5000 \
-    --image=happy365/fast-style-transfer:latest \
-    "python app.py"
-```
-检查这个自定义Serving Job的状态:
-
-```
-# arena serve list
-NAME                 TYPE    VERSION  DESIRED  AVAILABLE  ENDPOINT_ADDRESS  PORTS
-fast-style-transfer  CUSTOM  alpha    1        0          172.21.8.94       grpc:8001,restful:5000
-```
-因为docker镜像比较大,拉取它需要一定的时间,我们可以使用"kubectl"检查pod运行情况:
-
-```
-# kubectl get po
-NAME                                                        READY   STATUS              RESTARTS   AGE
-fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj   0/1     ContainerCreating   0          6m44s
-```
-
-### 2.访问服务
-
-我们可以使用一个带有curl命令的容器作为客户端去访问刚才创建的服务,但是首先需要创建这个客户端:
-```
-# kubectl run sample-client \
-    --generator=run-pod/v1 \
-    --image=happy365/arena-serve-custem-sample-client:latest \
-    --command -- \
-    /bin/sleep infinity
-```
-然后,可以查询客户端的状态:
-```
-# kubectl get po sample-client
-NAME            READY   STATUS    RESTARTS   AGE
-sample-client   1/1     Running   0          87s
-
-```
-在用客户端访问custom service之前,我们需要查询服务名称,它是任务名和版本的组合(本例中,任务名为fast-style-transfer,版本为alpha):
-
-```
-# kubectl get svc fast-style-transfer-alpha
-NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
-fast-style-transfer-alpha   ClusterIP   172.21.1.114   <none>        5000/TCP   31m
-```
-现在我们可以使用kubectl exec进入容器当中:
-
-```
-# kubectl exec -ti sample-client /bin/sh
-#
-```
-接着在容器当中使用curl命令去访问arena创建的自定义服务:
-```
-# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
-```
-在上面的命令中,输入文件名为"beijing.jpg" ![beijing.jpg](15-custom-serving-sample-beijing.jpg),存放路径为"/root/input";输出文件的路径为"/root/output/beijing_out.jpg"。现在退出容器,然后在master节点上执行kubectl cp命令,将结果从容器中拷贝出来:
-```
-# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
-```
-图片"beijing_out.jpg" ![beijing_out.jpg](15-custom-serving-sample-beijing_out.jpg)将会被复制到当前用户的家目录下。
-
-
-
diff --git a/archived/docs/userguide_cn/16-assign-config-file.md b/archived/docs/userguide_cn/16-assign-config-file.md
deleted file mode 100644
index eaf265102..000000000
--- a/archived/docs/userguide_cn/16-assign-config-file.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# 为训练任务提供配置文件
-
-在提交训练任务时,我们可以指定该训练任务所需的配置文件。
-
-目前,该功能只支持如下两种任务:
-
-* tfjob
-* mpijob
-
-## 1.用法
-
-当提交训练任务时,通过 `--config-file <宿主机文件>:<容器内文件>` 为训练任务指定配置文件,该选项有一些规则:
-
-* 如果指定了 <宿主机文件> 并且没有指定 <容器内文件>,我们认为 <容器内文件> 的值与 <宿主机文件> 相同
-* <容器内文件> 必须是一个文件,并且是绝对路径
-* 在一个提交命令中,可以多次指定 `--config-file`,例如: "--config-file /tmp/test1.conf:/etc/config/test1.conf --config-file /tmp/test2.conf:/etc/config/test2.conf"
-
-
-## 2.例子
-
-
-首先,我们创建一个名为"test-config.json"的文件,它的路径为"/tmp/test-config.json"。我们希望把这个文件放到训练任务的实例中,存放路径为"/etc/config/config.json"。这个文件内容如下:
-```
-# cat /tmp/test-config.json
-{
-    "key": "job-config"
-
-}
-```
-接着,使用如下命令提交一个tfjob:
-```
-# arena submit tfjob \
-    --name=tf \
-    --gpus=1 \
-    --workers=1 \
-    --work-image=cheyang/tf-mnist-distributed:gpu \
-    --ps-image=cheyang/tf-mnist-distributed:cpu \
-    --ps=1 \
-    --tensorboard \
-    --config-file /tmp/test-config.json:/etc/config/config.json \
-    "python /app/main.py"
-```
-等一段时间,查看任务状态:
-```
-# arena get tf
-STATUS: RUNNING
-NAMESPACE: default
-PRIORITY: N/A
-TRAINING DURATION: 16s
-
-NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
-tf    RUNNING  TFJOB    16s  tf-ps-0      192.168.7.18
-tf    RUNNING  TFJOB    16s  tf-worker-0  192.168.7.16
-
-Your tensorboard will be available on:
-http://192.168.7.10:31825
-```
-使用kubectl检查文件是否已经存放在了任务的实例中:
-```
-# kubectl exec -ti tf-ps-0 -- cat /etc/config/config.json
-{
-    "key": "job-config"
-
-}
-# kubectl exec -ti tf-worker-0 -- cat /etc/config/config.json
-{
-    "key": "job-config"
-
-}
-
-```
-
-可以看到,文件已经在实例中存在了。
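-
-顺带说明一下背后的机制(推测性示意):从 arena 提交时打印的 "configmap/... created" 日志可以推测,`--config-file` 很可能是通过 ConfigMap 加卷挂载实现的。可以用 kubectl 粗略验证这一点(configmap 的具体命名由 arena 自动生成,以下命令仅作示意):
-
-```
-# kubectl get configmap | grep tf
-# kubectl get pod tf-ps-0 -o jsonpath='{.spec.volumes}'
-```
-
-如果能在 volumes 中看到来源为 configmap 的卷,就与上面的行为对应上了;具体实现细节以 arena 的实际实现为准。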
diff --git a/archived/docs/userguide_cn/17-pytorchjob-standalone.md b/archived/docs/userguide_cn/17-pytorchjob-standalone.md
deleted file mode 100644
index 7488ccd88..000000000
--- a/archived/docs/userguide_cn/17-pytorchjob-standalone.md
+++ /dev/null
@@ -1,96 +0,0 @@
-
-这个示例展示了如何使用 `Arena` 提交一个 pytorch 单机的作业。该示例将从 git url 下载源代码。
-
-1. 
第一步是检查可用的GPU资源 - ``` - ➜ arena top node - NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) - cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0 - cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0 - cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0 - cn-huhehaote.172.16.0.208 172.16.0.208 ready 4 0 - cn-huhehaote.172.16.0.209 172.16.0.209 ready 4 0 - cn-huhehaote.172.16.0.210 172.16.0.210 ready 4 0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ``` - 有 3 个包含 GPU 的可用节点用于运行训练作业 - -2. 提交一个 pytorch 训练作业,本示例从 [阿里云 code](https://code.aliyun.com/370272561/mnist-pytorch.git) 下载源代码 - ``` - # 单机单卡 - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-local-git \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo" - configmap/pytorch-local-git-pytorchjob created - configmap/pytorch-local-git-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-local-git created - INFO[0000] The Job pytorch-local-git has been submitted successfully - INFO[0000] You can run `arena get pytorch-local-git --type pytorchjob` to check the job status - ``` - - > 这会下载源代码,并将其解压缩到工作目录的 `code/` 目录。默认的工作目录是 `/root`,您也可以使用 `--workingDir` 加以指定。 - - > 如果您正在使用非公开 git 代码库,则可以使用以下命令: - - ``` - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-local-git \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --env=GIT_SYNC_USERNAME=yourname \ - --env=GIT_SYNC_PASSWORD=yourpwd \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo" - ``` - -3. 列出所有作业 - ``` - ➜ arena list - NAME STATUS TRAINER AGE NODE - pytorch-local-git SUCCEEDED PYTORCHJOB 21h N/A - ``` - -4. 获取作业的详细信息 - ``` - ➜ arena get pytorch-local-git - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 35s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-local-git SUCCEEDED PYTORCHJOB 23h pytorch-local-git-master-0 172.16.0.210 - ``` - -5. 检查日志 - ``` - ➜ arena logs pytorch-local-git - WORLD_SIZE: 1, CURRENT_RANK: 0 - args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000) - Using CUDA - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint8 = np.dtype([("qint8", np.int8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. 
- _np_qint16 = np.dtype([("qint16", np.int16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint32 = np.dtype([("qint32", np.int32, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - np_resource = np.dtype([("resource", np.ubyte, 1)]) - Train Epoch: 1 [0/60000 (0%)] loss=2.3000 - Train Epoch: 1 [640/60000 (1%)] loss=2.2135 - Train Epoch: 1 [1280/60000 (2%)] loss=2.1705 - Train Epoch: 1 [1920/60000 (3%)] loss=2.0767 - Train Epoch: 1 [2560/60000 (4%)] loss=1.8681 - ... - ``` diff --git a/archived/docs/userguide_cn/18-pytorchjob-distributed.md b/archived/docs/userguide_cn/18-pytorchjob-distributed.md deleted file mode 100644 index 322b42e30..000000000 --- a/archived/docs/userguide_cn/18-pytorchjob-distributed.md +++ /dev/null @@ -1,132 +0,0 @@ - -这个示例展示了如何使用 `Arena` 提交一个 pytorch 分布式的作业。该示例将从 git url 下载源代码。 - -1. 第一步是检查可用的GPU资源 - ``` - ➜ arena top node - NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) - cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0 - cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0 - cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0 - cn-huhehaote.172.16.0.208 172.16.0.208 ready 4 0 - cn-huhehaote.172.16.0.209 172.16.0.209 ready 4 0 - cn-huhehaote.172.16.0.210 172.16.0.210 ready 4 0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ``` - 有 3 个包含 GPU 的可用节点用于运行训练作业 - -2. 提交一个 pytorch 2 机 1 卡的训练作业,本示例从 [阿里云 code](https://code.aliyun.com/370272561/mnist-pytorch.git) 下载源代码 - ``` - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-dist-git \ - --gpus=1 \ - --workers=2 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo" - configmap/pytorch-dist-git-pytorchjob created - configmap/pytorch-dist-git-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-dist-git created - INFO[0000] The Job pytorch-dist-git has been submitted successfully - INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status - ``` - - > 这会下载源代码,并将其解压缩到工作目录的 `code/` 目录。默认的工作目录是 `/root`,您也可以使用 `--workingDir` 加以指定。 - - > workers 为参与计算的节点总数(必须为正整数且大于等于 1),包括用于建立通信的 rank0 节点(对应 pytorch operator 中的 master 节点),默认为 1,可以不填写,即为单机作业。 - -3. 列出所有作业 - ``` - ➜ arena list - NAME STATUS TRAINER AGE NODE - pytorch-dist-git SUCCEEDED PYTORCHJOB 23h N/A - ``` - -4. 
获取作业的详细信息,可以看到这个作业有 2 个实例,其中 `pytorch-dist-git-master-0` 即为 rank0 的节点。Arena 借助 `PyTorch-Operator` 简化了分布式作业的提交流程:
-在 `PyTorch-Operator` 中,会为 `master` 实例创建一个 `Service`,便于其他节点通过 `Service` 的 name 访问;同时会在每个实例中注入环境变量 `MASTER_PORT`、`MASTER_ADDR`、`WORLD_SIZE`、`RANK`,用于
-pytorch 建立分布式进程组的初始化工作(dist.init_process_group)。其中 `MASTER_PORT` 自动分配;`MASTER_ADDR` 在 `master` 实例中是 localhost,在其他实例中是 `master` 的 `Service` name;
-`WORLD_SIZE` 是总实例数;`RANK` 是当前计算节点的序号,`master` 实例为 0,`worker` 实例为实例名尾缀的下标加一,例如,下面例子中的实例 `pytorch-dist-git-worker-0`,其 `RANK` 为 `0+1=1`。
-在 Arena 中,参数 --workers 填写的值包含了 1 个 `master` 节点,因为 `master` 节点也参与训练。
-   ```
-   ➜ arena get pytorch-dist-git
-   STATUS: SUCCEEDED
-   NAMESPACE: default
-   PRIORITY: N/A
-   TRAINING DURATION: 1m
-
-   NAME              STATUS     TRAINER     AGE  INSTANCE                   NODE
-   pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-master-0  172.16.0.210
-   pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-worker-0  172.16.0.210
-   ```
-
-5. 检查日志
-   ```
-   ➜ arena logs pytorch-dist-git
-   WORLD_SIZE: 2, CURRENT_RANK: 0
-   args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
-   Using CUDA
-   Using distributed PyTorch with gloo backend
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   _np_qint8 = np.dtype([("qint8", np.int8, 1)])
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   _np_qint16 = np.dtype([("qint16", np.int16, 1)])
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   _np_qint32 = np.dtype([("qint32", np.int32, 1)])
-   /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
-   np_resource = np.dtype([("resource", np.ubyte, 1)])
-   Train Epoch: 1 [0/60000 (0%)]	loss=2.3000
-   Train Epoch: 1 [640/60000 (1%)]	loss=2.2135
-   Train Epoch: 1 [1280/60000 (2%)]	loss=2.1705
-   Train Epoch: 1 [1920/60000 (3%)]	loss=2.0767
-   Train Epoch: 1 [2560/60000 (4%)]	loss=1.8681
-   Train Epoch: 1 [3200/60000 (5%)]	loss=1.4142
-   Train Epoch: 1 [3840/60000 (6%)]	loss=1.0009
-   ...
- ``` - - > 对于多实例的分布式作业,默认输出 rank0 (实例为 master 节点)的日志,如果想查看某一个实例的日志,可以通过 -i 实例名查看,例如: - - ``` - ➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 - WORLD_SIZE: 2, CURRENT_RANK: 1 - args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000) - Using CUDA - Using distributed PyTorch with gloo backend - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint8 = np.dtype([("qint8", np.int8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint16 = np.dtype([("qint16", np.int16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - _np_qint32 = np.dtype([("qint32", np.int32, 1)]) - /opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. - np_resource = np.dtype([("resource", np.ubyte, 1)]) - Train Epoch: 1 [0/60000 (0%)] loss=2.3000 - Train Epoch: 1 [640/60000 (1%)] loss=2.2135 - Train Epoch: 1 [1280/60000 (2%)] loss=2.1705 - Train Epoch: 1 [1920/60000 (3%)] loss=2.0767 - Train Epoch: 1 [2560/60000 (4%)] loss=1.8681 - Train Epoch: 1 [3200/60000 (5%)] loss=1.4142 - ``` - - > 另外,用户查看日志可以通过参数 -t 行数,可以查看尾部倒数几行的日志,如: - - ``` - ➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5 - Train Epoch: 1 [58880/60000 (98%)] loss=0.2048 - Train Epoch: 1 [59520/60000 (99%)] loss=0.0646 - - accuracy=0.9661 - - ``` - > 更多参数见 `arena logs --help` \ No newline at end of file diff --git a/archived/docs/userguide_cn/19-pytorchjob-tensorboard.md b/archived/docs/userguide_cn/19-pytorchjob-tensorboard.md deleted file mode 100644 index 54709899c..000000000 --- a/archived/docs/userguide_cn/19-pytorchjob-tensorboard.md +++ /dev/null @@ -1,77 +0,0 @@ - -这个示例展示了如何使用 `Arena` 提交一个 pytorch 分布式的作业,并通过 tensorboard 可视化。该示例将从 git url 下载源代码。 - -1. 
第一步是检查可用的GPU资源 - ``` - ➜ arena top node - NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) - cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0 - cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0 - cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0 - cn-huhehaote.172.16.0.208 172.16.0.208 ready 4 0 - cn-huhehaote.172.16.0.209 172.16.0.209 ready 4 0 - cn-huhehaote.172.16.0.210 172.16.0.210 ready 4 0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ``` - 有 3 个包含 GPU 的可用节点用于运行训练作业 - -2. 提交一个 pytorch 2 机 1 卡的训练作业,本示例从 [阿里云 code](https://code.aliyun.com/370272561/mnist-pytorch.git) 下载源代码 - ``` - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-dist-tensorboard \ - --gpus=1 \ - --workers=2 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --tensorboard \ - --logdir=/root/logs \ - "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs" - configmap/pytorch-dist-tensorboard-pytorchjob created - configmap/pytorch-dist-tensorboard-pytorchjob labeled - service/pytorch-dist-tensorboard-tensorboard created - deployment.apps/pytorch-dist-tensorboard-tensorboard created - pytorchjob.kubeflow.org/pytorch-dist-tensorboard created - INFO[0000] The Job pytorch-dist-tensorboard has been submitted successfully - INFO[0000] You can run `arena get pytorch-dist-tensorboard --type pytorchjob` to check the job status - ``` - - > 这会下载源代码,并将其解压缩到工作目录的 `code/` 目录。默认的工作目录是 `/root`,您也可以使用 `--workingDir` 加以指定。 - - > workers 为参与计算的节点总数(必须为正数且大于等于 1),包括用于建立通信的 rank0 节点(对应 pytorch operator 中的 master 节点),默认为 1,可以不填写,即为单机作业。 - - > `logdir` 指示 Tensorboard 在何处读取 PyTorch 的事件日志 - -3. 列出所有作业 - ``` - ➜ arena list - NAME STATUS TRAINER AGE NODE - pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h N/A - ``` - -4. 
获取作业的详细信息 - ``` - ➜ arena get pytorch-dist-tensorboard - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 15m - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-master-0 172.16.0.210 - pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-worker-0 172.16.0.210 - - Your tensorboard will be available on: - http://172.16.0.205:30583 - ``` - > 注意:您可以使用 `172.16.0.205:30583` 访问 Tensorboard。如果您通过笔记本电脑无法直接访问 Tensorboard,则可以考虑使用 `sshuttle`。例如: - - ``` - # you can install sshuttle==0.74 in your mac with python2.7 - ➜ pip install sshuttle==0.74 - # 0/0 -> 0.0.0.0/0 - ➜ sshuttle -r root@39.104.17.205 0/0 - ``` - ![](19-pytorchjob-tensorboard.png) diff --git a/archived/docs/userguide_cn/19-pytorchjob-tensorboard.png b/archived/docs/userguide_cn/19-pytorchjob-tensorboard.png deleted file mode 100644 index 9192baa00..000000000 Binary files a/archived/docs/userguide_cn/19-pytorchjob-tensorboard.png and /dev/null differ diff --git a/archived/docs/userguide_cn/2-tensorboard.jpg b/archived/docs/userguide_cn/2-tensorboard.jpg deleted file mode 100755 index c11e3fa60..000000000 Binary files a/archived/docs/userguide_cn/2-tensorboard.jpg and /dev/null differ diff --git a/archived/docs/userguide_cn/2-tfjob-tensorboard.md b/archived/docs/userguide_cn/2-tfjob-tensorboard.md deleted file mode 100755 index 7de02e18c..000000000 --- a/archived/docs/userguide_cn/2-tfjob-tensorboard.md +++ /dev/null @@ -1,109 +0,0 @@ - -这个示例展示了如何使用 `Arena` 进行机器学习模型训练。它会从 git url 下载源代码,并使用 Tensorboard 可视化 Tensorflow 训练状态。 - -1. 第一步是检查可用资源 - -``` -arena top node -NAME IPADDRESS ROLE GPU(Total) GPU(Allocated) -i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0 -i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0 -i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0 -i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0 -i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0 -i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0 ------------------------------------------------------------------------------------------ -Allocated/Total GPUs In Cluster: -0/3 (0%) -``` - -有 3 个带有 GPU 的可用节点用于运行训练作业。 - - -2\.现在,我们可以通过 `arena submit` 提交一个训练作业,这会从 github 下载源代码 - -``` -#arena submit tf \ - --name=tf-tensorboard \ - --gpus=1 \ - --image=tensorflow/tensorflow:1.5.0-devel-gpu \ - --env=TEST_TMPDIR=code/tensorflow-sample-code/ \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \ - --tensorboard \ - --logdir=/training_logs \ - "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000" -configmap/tf-tensorboard-tfjob created -configmap/tf-tensorboard-tfjob labeled -service/tf-tensorboard-tensorboard created -deployment.extensions/tf-tensorboard-tensorboard created -tfjob.kubeflow.org/tf-tensorboard created -INFO[0001] The Job tf-tensorboard has been submitted successfully -INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status -``` - -> 这会下载源代码,并将其解压缩到工作目录的 `code/` 目录。默认的工作目录是 `/root`,您也可以使用 `--workingDir` 加以指定。 - -> `logdir` 指示 Tensorboard 在何处读取 TensorFlow 的事件日志 - -3\.列出所有作业 - -``` -#arena list -NAME STATUS TRAINER AGE NODE -tf-tensorboard RUNNING TFJOB 0s 192.168.1.119 -``` - -4\.检查作业所使用的GPU资源 - -``` -#arena top job -NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated) -tf-tensorboard RUNNING TFJOB 26s 192.168.1.119 1 1 - - -Total Allocated GPUs of Training Job: -0 - -Total Requested GPUs of Training Job: -1 -``` - - - -5\.检查集群所使用的GPU资源 - - -``` -#arena top node 
-NAME IPADDRESS ROLE GPU(Total) GPU(Allocated) -i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0 -i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1 -i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0 -i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0 -i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0 -i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0 ------------------------------------------------------------------------------------------ -Allocated/Total GPUs In Cluster: -1/3 (33%) -``` - - -6\.获取特定作业的详细信息 - -``` -#arena get tf-tensorboard -NAME STATUS TRAINER AGE INSTANCE NODE -tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-586fcf4d6f-vtlxv 192.168.1.119 -tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-worker-0 192.168.1.119 - -Your tensorboard will be available on: -192.168.1.117:30670 -``` - -> 注意:您可以使用 `192.168.1.117:30670` 访问 Tensorboard。如果您通过笔记本电脑无法直接访问 Tensorboard,则可以考虑使用 `sshuttle`。例如:`sshuttle -r root@47.89.59.51 192.168.0.0/16` - - -![](2-tensorboard.jpg) - -恭喜!您已经使用 `arena` 成功运行了训练作业,而且还能轻松检查 Tensorboard。 diff --git a/archived/docs/userguide_cn/20-pytorchjob-distributed-data.md b/archived/docs/userguide_cn/20-pytorchjob-distributed-data.md deleted file mode 100644 index edfa4ce2e..000000000 --- a/archived/docs/userguide_cn/20-pytorchjob-distributed-data.md +++ /dev/null @@ -1,124 +0,0 @@ - -这个示例展示了如何使用 `Arena` 提交一个 pytorch 分布式的作业,挂载一个 NFS 数据卷。该示例将从 git url 下载源代码。 - -1. 搭建一个 nfs server(详情参考:https://www.cnblogs.com/weifeng1463/p/10037803.html) - ```shell - # install nfs server - ➜ yum install nfs-utils -y - # 创建 nfs server 的本地目录 - ➜ mkdir -p /root/nfs/data - # 配置 nfs server - ➜ cat /etc/exports - /root/nfs/data *(rw,no_root_squash) - # Start nfs server - ➜ systemctl start nfs; systemctl start rpcbind - ➜ systemctl enable nfs - Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service. - ``` -2. 在 nfs 的共享目录中,放入训练数据 - ```shell - # 查看 nfs 服务器的挂载目录,172.16.0.200 为 nfs 服务器端的 ip - ➜ showmount -e 172.16.0.200 - Export list for 172.16.0.200: - /root/nfs/data * - # 进入共享目录 - ➜ cd /root/nfs/data - # 提前准备好训练数据 - ➜ pwd - /root/nfs/data - # MNIST -> 就是我们需要用的训练数据 - ➜ ll - 总用量 8.0K - drwxr-xr-x 4 502 games 4.0K 6月 17 16:05 data - drwxr-xr-x 4 root root 4.0K 6月 23 15:17 MNIST - ``` -3. 创建 pv - ```shell - # 排版可能导致 yaml 缩进有问题 - ➜ cat nfs-pv.yaml - apiVersion: v1 - kind: PersistentVolume - metadata: - name: pytorchdata - labels: - pytorchdata: nas-mnist - spec: - persistentVolumeReclaimPolicy: Retain - capacity: - storage: 10Gi - accessModes: - - ReadWriteMany - nfs: - server: 172.16.0.200 - path: "/root/nfs/data" - - ➜ kubectl create -f nfs-pv.yaml - persistentvolume/pytorchdata created - ➜ kubectl get pv | grep pytorchdata - pytorchdata 10Gi RWX Retain Bound default/pytorchdata 7m38s - ``` -5. 创建 pvc - ```shell - ➜ cat nfs-pvc.yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - name: pytorchdata - annotations: - description: "this is the mnist demo" - owner: Tom - spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 5Gi - selector: - matchLabels: - pytorchdata: nas-mnist - - ➜ kubectl create -f nfs-pvc.yaml - persistentvolumeclaim/pytorchdata created - ➜ kubectl get pvc | grep pytorchdata - pytorchdata Bound pytorchdata 10Gi RWX 2m3s - ``` -7. 检查数据卷 - ```shell - ➜ arena data list - NAME ACCESSMODE DESCRIPTION OWNER AGE - pytorchdata ReadWriteMany this is the mnist demo Tom 2m - ``` -9. 
提交 pytorch 作业,通过 `--data pvc_name:container_path` 挂载分布式存储卷 - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-data \ - --gpus=1 \ - --workers=2 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --data=pytorchdata:/mnist_data \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data" - configmap/pytorch-data-pytorchjob created - configmap/pytorch-data-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-data created - INFO[0000] The Job pytorch-data has been submitted successfully - INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status - ``` -11. 通过 kubectl describe 查看存储卷 `pytorchdata` 在其中一个 instance 的情况 - ```shell - # 查看作业 pytorch-data 的实例情况 - ➜ arena get pytorch-data - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 56s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-master-0 172.16.0.210 - pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-worker-0 172.16.0.210 - - # 通过 kubectl describe 查看实例 pytorch-data-master-0 存储卷 pytorchdata 的情况 - ➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3 - ``` - ![](20-pytorchjob-distributed-data.png) \ No newline at end of file diff --git a/archived/docs/userguide_cn/20-pytorchjob-distributed-data.png b/archived/docs/userguide_cn/20-pytorchjob-distributed-data.png deleted file mode 100644 index 7731a3c1c..000000000 Binary files a/archived/docs/userguide_cn/20-pytorchjob-distributed-data.png and /dev/null differ diff --git a/archived/docs/userguide_cn/21-pytorchjob-with-node-selector.md b/archived/docs/userguide_cn/21-pytorchjob-with-node-selector.md deleted file mode 100644 index 0a9d1be9c..000000000 --- a/archived/docs/userguide_cn/21-pytorchjob-with-node-selector.md +++ /dev/null @@ -1,54 +0,0 @@ -## 指定节点运行 pytorch 任务 - -1. 查询 k8s 集群信息 - ```shell - ➜ kubectl get nodes - NAME STATUS ROLES AGE VERSION - cn-huhehaote.172.16.0.205 Ready master 4h19m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.206 Ready master 4h18m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.207 Ready master 4h17m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.208 Ready 4h13m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.209 Ready 4h13m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.210 Ready 4h13m v1.16.9-aliyun.1 - ``` -2. 给不同的节点打上不同的标签 - ```shell - # 172.16.0.208 打上 gpu_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=ok - node/cn-huhehaote.172.16.0.208 labeled - # 172.16.0.209 打上 gpu_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=ok - node/cn-huhehaote.172.16.0.209 labeled - # 172.16.0.210 打上 ssd_node=ok - ➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=ok - node/cn-huhehaote.172.16.0.210 labeled - ``` -3. 
提交 pytorch 作业的时候,通过 `--selector` 选定 job 运行在哪个标签的节点上 - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-selector \ - --gpus=1 \ - --workers=2 \ - --selector gpu_node=ok \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo" - configmap/pytorch-selector-pytorchjob created - configmap/pytorch-selector-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-selector created - INFO[0000] The Job pytorch-selector has been submitted successfully - INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status - ``` -4. 查询 job 信息,可以看到作业 `pytorch-selector` 只运行在带有标签 `gpu_node=ok` 的节点 172.16.0.209 上 - ```shell - ➜ arena get pytorch-selector - STATUS: PENDING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 14s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-master-0 172.16.0.209 - pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-worker-0 172.16.0.209 - ``` \ No newline at end of file diff --git a/archived/docs/userguide_cn/22-pytorchjob-with-node-toleration.md b/archived/docs/userguide_cn/22-pytorchjob-with-node-toleration.md deleted file mode 100644 index bc1d594e1..000000000 --- a/archived/docs/userguide_cn/22-pytorchjob-with-node-toleration.md +++ /dev/null @@ -1,96 +0,0 @@ -## 支持 pytorch 任务运行在 k8s 污点上 - -1. 查看 k8s 集群节点信息 - ```shell - ➜ kubectl get node - NAME STATUS ROLES AGE VERSION - cn-huhehaote.172.16.0.205 Ready master 5h13m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.206 Ready master 5h12m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.207 Ready master 5h11m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.208 Ready 5h7m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.209 Ready 5h7m v1.16.9-aliyun.1 - cn-huhehaote.172.16.0.210 Ready 5h7m v1.16.9-aliyun.1 - ``` -2. 给节点加上 taints - ```shell - # taint --> gpu_node - ➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.208 tainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.209 tainted - # taint --> ssd_node - ➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule - node/cn-huhehaote.172.16.0.210 tainted - ``` -3. 当我们加错节点的 taints 或者恢复节点的调度性,可以通过以下方式去除节点的 taints - ```shell - ➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node- - node/cn-huhehaote.172.16.0.208 untainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node- - node/cn-huhehaote.172.16.0.209 untainted - ➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node- - node/cn-huhehaote.172.16.0.210 untainted - ``` -4. 
提交作业,使用 `--toleration` 可以容忍的节点 key, 如:`--toleration=gpu_node`;该参数可以使用多次,带上不同 taint key - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-toleration \ - --gpus=1 \ - --workers=2 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --tensorboard \ - --logdir=/root/logs \ - --toleration gpu_node \ - "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs" - configmap/pytorch-toleration-pytorchjob created - configmap/pytorch-toleration-pytorchjob labeled - service/pytorch-toleration-tensorboard created - deployment.apps/pytorch-toleration-tensorboard created - pytorchjob.kubeflow.org/pytorch-toleration created - INFO[0000] The Job pytorch-toleration has been submitted successfully - INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status - ``` -5. 查看 job 是否被调度起来 - ```shell - arena get pytorch-toleration - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 2m - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-master-0 172.16.0.209 - pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-worker-0 172.16.0.209 - - Your tensorboard will be available on: - http://172.16.0.205:32091 - ``` -6. `--toleration` 通过指定 all, 标记这个任务可以容忍所有节点 - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-toleration-all \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --toleration all \ - "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo" - configmap/pytorch-toleration-all-pytorchjob created - configmap/pytorch-toleration-all-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-toleration-all created - INFO[0000] The Job pytorch-toleration-all has been submitted successfully - INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status - ``` -7. 查看 job 是否被调度起来 - ```shell - ➜ arena get pytorch-toleration-all - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 33s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-toleration-all RUNNING PYTORCHJOB 33s pytorch-toleration-all-master-0 172.16.0.210 - ``` \ No newline at end of file diff --git a/archived/docs/userguide_cn/23-pytorchjob-assign-config-file.md b/archived/docs/userguide_cn/23-pytorchjob-assign-config-file.md deleted file mode 100644 index d1c3cf40b..000000000 --- a/archived/docs/userguide_cn/23-pytorchjob-assign-config-file.md +++ /dev/null @@ -1,48 +0,0 @@ -## 为 pytorch 训练任务提供配置文件 -在提交训练任务时,我们可以指定该训练任务所需的配置文件。 - -1. 在提交的机器上准备好要挂载的配置文件 - ```shell - # prepare your config-file - ➜ cat /tmp/test-config.json - { - "key": "job-config" - } - ``` -2. 
提交作业,通过 `--config-file` 指定要挂载的配置文件 - ```shell - # arena submit job by --config-file ${host-config-file}:${container-config-file} - # 该参数支持多次使用,挂载多个配置文件 - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-config-file \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --config-file /tmp/test-config.json:/etc/config/config.json \ - "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo" - configmap/pytorch-config-file-pytorchjob created - configmap/pytorch-config-file-pytorchjob labeled - configmap/pytorch-config-file-a9cbad1b8719778 created - pytorchjob.kubeflow.org/pytorch-config-file created - INFO[0000] The Job pytorch-config-file has been submitted successfully - INFO[0000] You can run `arena get pytorch-config-file --type pytorchjob` to check the job status - ``` -3. 查询作业详情 - ```shell - ➜ arena get pytorch-config-file --type pytorchjob - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 51s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-config-file RUNNING PYTORCHJOB 51s pytorch-config-file-master-0 172.16.0.210 - ``` -4. 使用 kubectl 检测文件是否已经存放在了任务的实例中: - ``` - ➜ kubectl exec -ti pytorch-config-file-master-0 -- cat /etc/config/config.json - { - "key": "job-config" - } - ``` \ No newline at end of file diff --git a/archived/docs/userguide_cn/24-pytorchjob-preempted.md b/archived/docs/userguide_cn/24-pytorchjob-preempted.md deleted file mode 100644 index 0c2fb1010..000000000 --- a/archived/docs/userguide_cn/24-pytorchjob-preempted.md +++ /dev/null @@ -1,130 +0,0 @@ -## pytorch 任务支持抢占 - -1. 利用下列 yaml 创建 PriorityClass 对象,这里定义了两个优先级 `critical` 和 `medium`: - ```shell - # critical 和 medium 声明 - ➜ cat priorityClass.yaml - apiVersion: scheduling.k8s.io/v1beta1 - description: Used for the critical app - kind: PriorityClass - metadata: - name: critical - value: 1100000 - - --- - - apiVersion: scheduling.k8s.io/v1beta1 - description: Used for the medium app - kind: PriorityClass - metadata: - name: medium - value: 1000000 - - # 创建 critical 和 medium 两个 priority 对象 - ➜ kubectl create -f priorityClass.yaml - priorityclass.scheduling.k8s.io/critical created - priorityclass.scheduling.k8s.io/medium created - ``` -2. 检查一下集群资源使用情况。总共有 3 个节点,每个节点有 4 张 gpu 显卡。 - ```shell - ➜ arena top node - NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) - cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0 - cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0 - cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0 - cn-huhehaote.172.16.0.208 172.16.0.208 ready 4 0 - cn-huhehaote.172.16.0.209 172.16.0.209 ready 4 0 - cn-huhehaote.172.16.0.210 172.16.0.210 ready 4 0 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 0/12 (0%) - ``` -3. 
提交一个 3 节点 4 卡的 medium 的 gpu 作业,占满资源,为了验证效果,我们可以将训练的 epoch 调大一点,延长训练时间,方便实验查看 - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-priority-medium \ - --gpus=4 \ - --workers=3 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --priority=medium \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200" - configmap/pytorch-priority-medium-pytorchjob created - configmap/pytorch-priority-medium-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-priority-medium created - INFO[0000] The Job pytorch-priority-medium has been submitted successfully - INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status - ``` -4. 查看 `medium` 任务运行状态,已经全部运行起来 - ```shell - ➜ arena get pytorch-priority-medium - STATUS: RUNNING - NAMESPACE: default - PRIORITY: medium - TRAINING DURATION: 3m - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-master-0 172.16.0.208 - pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-0 172.16.0.210 - pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-1 172.16.0.209 - ``` -5. 查看节点 gpu 卡使用情况,已经全部被占用 - ```shell - ➜ arena top node - NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) - cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0 - cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0 - cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0 - cn-huhehaote.172.16.0.208 172.16.0.208 ready 4 4 - cn-huhehaote.172.16.0.209 172.16.0.209 ready 4 4 - cn-huhehaote.172.16.0.210 172.16.0.210 ready 4 4 - ----------------------------------------------------------------------------------------- - Allocated/Total GPUs In Cluster: - 12/12 (100%) - ``` -6. 提交一个单机 1 卡的 `critical` 的 gpu 作业,发起抢占 - ```shell - ➜ arena --loglevel info submit pytorch \ - --name=pytorch-priority-critical \ - --gpus=1 \ - --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \ - --priority=critical \ - "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50" - configmap/pytorch-priority-critical-pytorchjob created - configmap/pytorch-priority-critical-pytorchjob labeled - pytorchjob.kubeflow.org/pytorch-priority-critical created - INFO[0000] The Job pytorch-priority-critical has been submitted successfully - INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status - ``` -7. 查看 `critical` 任务运行状态,等待运行起来,完成抢占 - ```shell - ➜ arena get pytorch-priority-critical - arena get pytorch-priority-critical - STATUS: RUNNING - NAMESPACE: default - PRIORITY: critical - TRAINING DURATION: 22s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-priority-critical RUNNING PYTORCHJOB 22s pytorch-priority-critical-master-0 172.16.0.208 - ``` -8. 查看 `medium` 的作业状态,已经变成 `FAILED` 了, 有一个 instance 由于抢占被删除了 - ```shell - ➜ arena get pytorch-priority-medium - STATUS: FAILED - NAMESPACE: default - PRIORITY: medium - TRAINING DURATION: 1m - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-master-0 172.16.0.210 - pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-worker-0 172.16.0.209 - ``` -9. 
检查 `medium` 作业的 event,可以看到它的实例 `pytorch-priority-medium-worker-1` 被驱逐了。被驱逐的原因是:`critical` 作业的实例 `pytorch-priority-critical-master-0` 也在申请这个节点的资源,而节点已经没有多余的 gpu 资源,因此低优先级的 `medium` 作业被高优先级的任务抢占
-   ```shell
-   ➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1
-   ```
-   ![](24-pytorchjob-preempted.png)
\ No newline at end of file
diff --git a/archived/docs/userguide_cn/24-pytorchjob-preempted.png b/archived/docs/userguide_cn/24-pytorchjob-preempted.png
deleted file mode 100644
index 2ae8978b6..000000000
Binary files a/archived/docs/userguide_cn/24-pytorchjob-preempted.png and /dev/null differ
diff --git a/archived/docs/userguide_cn/25-pytorchjob-clean-pod-policy.md b/archived/docs/userguide_cn/25-pytorchjob-clean-pod-policy.md
deleted file mode 100644
index 14ac0e55f..000000000
--- a/archived/docs/userguide_cn/25-pytorchjob-clean-pod-policy.md
+++ /dev/null
@@ -1,40 +0,0 @@
-## 指定 pytorch 任务结束后 pod 的清理策略
-
-1. 提交一个作业,指定 `--clean-task-policy` 为 `All`:作业结束后(无论成功还是失败),将会删除所有 instance(pod);默认值为 `None`,即保留所有 pod
-   ```shell
-   ➜ arena --loglevel info submit pytorch \
-           --name=pytorch-clean-policy \
-           --gpus=1 \
-           --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
-           --sync-mode=git \
-           --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
-           --clean-task-policy=All \
-           "python /root/code/mnist-pytorch/mnist.py --backend gloo"
-   configmap/pytorch-clean-policy-pytorchjob created
-   configmap/pytorch-clean-policy-pytorchjob labeled
-   pytorchjob.kubeflow.org/pytorch-clean-policy created
-   INFO[0000] The Job pytorch-clean-policy has been submitted successfully
-   INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status
-   ```
-
-2. 
查看作业详情, 任务结束后,实例 `pytorch-clean-policy-master-0` 被删除 - ```shell - # RUNNING - ➜ arena get pytorch-clean-policy - STATUS: RUNNING - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 18s - - NAME STATUS TRAINER AGE INSTANCE NODE - pytorch-clean-policy RUNNING PYTORCHJOB 18s pytorch-clean-policy-master-0 172.16.0.209 - - # FINISHED - ➜ arena get pytorch-clean-policy - STATUS: SUCCEEDED - NAMESPACE: default - PRIORITY: N/A - TRAINING DURATION: 37s - - NAME STATUS TRAINER AGE INSTANCE NODE - ``` \ No newline at end of file diff --git a/archived/docs/userguide_cn/3-tensorboard.jpg b/archived/docs/userguide_cn/3-tensorboard.jpg deleted file mode 100755 index 86c2aeb05..000000000 Binary files a/archived/docs/userguide_cn/3-tensorboard.jpg and /dev/null differ diff --git a/archived/docs/userguide_cn/3-tfjob-distributed.md b/archived/docs/userguide_cn/3-tfjob-distributed.md deleted file mode 100755 index 892f97f16..000000000 --- a/archived/docs/userguide_cn/3-tfjob-distributed.md +++ /dev/null @@ -1,68 +0,0 @@ - - -Arena 为分布式 TensorFlow 训练(ps/worker 模式)提供了支持和简化。 - - -1.要运行分布式 Tensorflow 训练,您需要指定以下信息: - - - 各 Worker 的 GPU(仅 GPU 工作负载需要) - - Worker 的数量(必需) - - PS 的数量(必需) - - Worker 的 docker 镜像(必需) - - PS 的 docker 镜像(必需) - - Worker 的端口(默认为 22222) - - PS 的端口(默认为 22223) - -如下命令提供了一个示例。本例中定义了 2 个 Worker 和 1 个 PS,每个 Worker 有 1 个 GPU。Worker 和 PS 的源代码位于 git 中,Tensorboard 已启用。 - -``` -#arena submit tf --name=tf-dist-git \ - --gpus=1 \ - --workers=2 \ - --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \ - --sync-mode=git \ - --sync-source=https://github.com/xigang/tensorflow-sample-code.git \ - --ps=1 \ - --ps-image=tensorflow/tensorflow:1.5.0-devel \ - --tensorboard \ - "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs" -configmap/tf-dist-git-tfjob created -configmap/tf-dist-git-tfjob labeled -service/tf-dist-git-tensorboard created -deployment.extensions/tf-dist-git-tensorboard created -tfjob.kubeflow.org/tf-dist-git created -INFO[0001] The Job tf-dist-git has been submitted successfully -INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status -``` - -2\.获取特定作业的详细信息 - -``` -#arena get tf-dist-git -NAME STATUS TRAINER AGE INSTANCE NODE -tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-594d59789c-lrfsk 192.168.1.119 -tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-ps-0 192.168.1.118 -tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-0 192.168.1.119 -tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-1 192.168.1.120 - -Your tensorboard will be available on: -192.168.1.117:32298 -``` - -3\.检查 Tensorboard - -![](3-tensorboard.jpg) - - -4\.获取 TFJob 控制台 - -``` -#arena logviewer tf-dist-git -Your LogViewer will be available on: -192.168.1.120:8080/tfjobs/ui/#/default/tf-dist-git-tfjob -``` - - -![](4-tfjob-logviewer-distributed.jpg) - -恭喜!您已经成功使用 `arena` 运行了分布式训练作业。 diff --git a/archived/docs/userguide_cn/4-tfjob-distributed-data.md b/archived/docs/userguide_cn/4-tfjob-distributed-data.md deleted file mode 100755 index 5261e2dc3..000000000 --- a/archived/docs/userguide_cn/4-tfjob-distributed-data.md +++ /dev/null @@ -1,110 +0,0 @@ - -`arena` 允许在训练作业中挂载多个数据卷。下面的示例将数据卷挂载到训练作业中。 - - -1.您需要在 NFS Server 中创建 `/data` 并准备 `mnist data` - -``` -#mkdir -p /nfs -#mount -t nfs -o vers=4.0 NFS_SERVER_IP://nfs -#mkdir -p /data -#cd /data -#wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz -#wget 
https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz -#wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz -#wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz -#cd / -#umount /nfs -``` - -2\.创建持久卷。将 `NFS_SERVER_IP` 更改为您的相应 NFS Server IP 地址。 - -``` -#cat nfs-pv.yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: tfdata - labels: - tfdata: nas-mnist -spec: - persistentVolumeReclaimPolicy: Retain - capacity: - storage: 10Gi - accessModes: - - ReadWriteMany - nfs: - server: NFS_SERVER_IP - path: "/data" - - # kubectl create -f nfs-pv.yaml -``` - -3\.创建持久卷声明。 - -``` -#cat nfs-pvc.yaml -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: tfdata - annotations: - description: "this is the mnist demo" - owner: Tom -spec: - accessModes: - - ReadWriteMany - resources: - requests: - storage: 5Gi - selector: - matchLabels: - tfdata: nas-mnist -#kubectl create -f nfs-pvc.yaml -``` - -> 注意:建议添加 `description` 和 `owner` - -4\.检查数据卷 - -``` -#arena data list -NAME ACCESSMODE DESCRIPTION OWNER AGE -tfdata ReadWriteMany this is for mnist demo myteam 43d -``` - -5\.现在,我们可以通过 `arena` 提交分布式训练作业,它会从 github 下载源代码,并将数据卷 `tfdata` 挂载到 `/mnist_data`。 - -``` -#arena submit tf --name=tf-dist-data \ - --gpus=1 \ - --workers=2 \ - --work-image=tensorflow/tensorflow:1.5.0-devel-gpu \ - --sync-mode=git \ - --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \ - --ps=1 \ - --ps-image=tensorflow/tensorflow:1.5.0-devel \ - --tensorboard \ - --data=tfdata:/mnist_data \ - "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data" -``` - -> `--data` 指定了要挂载到作业的所有任务的数据卷,例如 :。在本例中,数据卷是 `tfdata`,目标目录是 `/mnist_data`。 - - -6\.通过日志,我们发现训练数据提取自 `/mnist_data`,而非直接通过互联网下载得到。 - -``` -#arena logs tf-dist-data -... -Extracting /mnist_data/train-images-idx3-ubyte.gz -Extracting /mnist_data/train-labels-idx1-ubyte.gz -Extracting /mnist_data/t10k-images-idx3-ubyte.gz -Extracting /mnist_data/t10k-labels-idx1-ubyte.gz -... 
diff --git a/archived/docs/userguide_cn/4-tfjob-logviewer-distributed.jpg b/archived/docs/userguide_cn/4-tfjob-logviewer-distributed.jpg
deleted file mode 100755
index 59f520062..000000000
Binary files a/archived/docs/userguide_cn/4-tfjob-logviewer-distributed.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/5-mpi-logviewer.jpg b/archived/docs/userguide_cn/5-mpi-logviewer.jpg
deleted file mode 100755
index 93344d23d..000000000
Binary files a/archived/docs/userguide_cn/5-mpi-logviewer.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/5-mpi-tensorboard.jpg b/archived/docs/userguide_cn/5-mpi-tensorboard.jpg
deleted file mode 100755
index 9797367e4..000000000
Binary files a/archived/docs/userguide_cn/5-mpi-tensorboard.jpg and /dev/null differ
diff --git a/archived/docs/userguide_cn/5-mpijob-distributed.md b/archived/docs/userguide_cn/5-mpijob-distributed.md
deleted file mode 100755
index 14921846d..000000000
--- a/archived/docs/userguide_cn/5-mpijob-distributed.md
+++ /dev/null
@@ -1,56 +0,0 @@
-
-Arena provides support for, and simplifies, distributed TensorFlow training (MPI mode).
-
-
-1. To run a distributed training job with MPI support, you need to specify:
-
- - The GPUs of each worker (only for GPU workloads)
- - The number of workers (required)
- - The docker image of the MPI workers (required)
-
-
-The command below provides an example. It defines 2 workers, each with 1 GPU. The source code is pulled from git, and Tensorboard is enabled.
-
-```
-#arena submit mpi --name=mpi-dist \
-              --gpus=1 \
-              --workers=2 \
-              --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
-              --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
-              --sync-mode=git \
-              --sync-source=https://github.com/tensorflow/benchmarks.git \
-              --tensorboard \
-              "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
-```
-
-2\. Get the details of the job
-
-```
-#arena get mpi-dist
-NAME      STATUS   TRAINER  AGE  INSTANCE                        NODE
-mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-launcher-ndnw8  192.168.1.120
-mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-worker-0        192.168.1.119
-mpi-dist  RUNNING  MPIJOB   1d   mpi-dist-mpijob-worker-1        192.168.1.120
-
-Your tensorboard will be available on:
-192.168.1.117:32559
-```
-
-3\. Check the Tensorboard
-
-![](5-mpi-tensorboard.jpg)
-
-
-4\. Get the MPI dashboard
-
-```
-#arena logviewer mpi-dist
-Your LogViewer will be available on:
-192.168.1.119:9090/#!/log/default/mpi-dist-mpijob-launcher-ndnw8/mpi?namespace=default
-```
-
-
-![](5-mpi-logviewer.jpg)
-
-Congratulations! You've successfully run a distributed MPI training job with `arena`.
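-The original guide stops here, but a quick sanity check that Horovod is actually using its GPUs can save a debugging round. A small sketch, using the instance names listed by `arena get mpi-dist` above and assuming `nvidia-smi` is on the image's PATH (it typically is in the horovod CUDA images):
-
-```
-# spot-check GPU utilization on one of the workers listed by `arena get mpi-dist`
-kubectl exec -n default mpi-dist-mpijob-worker-0 -- nvidia-smi
-
-# follow the launcher output while the benchmark runs
-arena logs mpi-dist
-```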
diff --git a/archived/docs/userguide_cn/6-tfjob-gangschd.md b/archived/docs/userguide_cn/6-tfjob-gangschd.md
deleted file mode 100755
index e8fde3602..000000000
--- a/archived/docs/userguide_cn/6-tfjob-gangschd.md
+++ /dev/null
@@ -1,67 +0,0 @@
-
-Arena supports running distributed TensorFlow training with gang scheduling by leveraging [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator).
-
-When running distributed TensorFlow, it is better to make sure the job gets `all` of its resources or `nothing`. Gang scheduling provides exactly that.
-
-
-> Notice: the current [kubernetes gang scheduler](https://github.com/kubernetes-incubator/kube-arbitrator/tree/release-0.1) is not production ready. For example, it doesn't support pod affinity or PodFitsHostPorts during scheduling.
-
-> Limitation: Tensorboard has known issues when the gang scheduler is in use.
-
-1. To enable the gang scheduler, first edit the file `/charts/tfjob/values.yaml`
-
-Change `enableGangScheduler: false` to `enableGangScheduler: true`
-
-2. To run a distributed TensorFlow training job, you need to specify:
-
- - The GPUs of each worker (only for GPU workloads)
- - The number of workers (required)
- - The number of PS (required)
- - The docker image of the workers (required)
- - The docker image of the PS (required)
- - The port of the workers (default 22222)
- - The port of the PS (default 22223)
-
-The command below provides an example. It defines 2 workers and 1 PS, each worker with 1 GPU. The source code for the workers and PS is pulled from git.
-
-```
-arena submit tf --name=tf-dist-git \
-              --gpus=1 \
-              --workers=2 \
-              --work-image=tensorflow/tensorflow:1.5.0-devel-gpu \
-              --sync-mode=git \
-              --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
-              --ps=1 \
-              --ps-image=tensorflow/tensorflow:1.5.0-devel \
-              "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs"
-configmap/tf-dist-git-tfjob created
-configmap/tf-dist-git-tfjob labeled
-service/tf-dist-git-tensorboard created
-deployment.extensions/tf-dist-git-tensorboard created
-tfjob.kubeflow.org/tf-dist-git created
-INFO[0001] The Job tf-dist-git has been submitted successfully
-INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
-
-```
-
-If there are not enough resources, all instances of the job stay `PENDING`. Without the gang scheduler, you would instead see some pods `RUNNING` while the others stay `PENDING`.
-
-```
-# arena get tf-dist-data
-NAME          STATUS   TRAINER  AGE  INSTANCE                     NODE
-tf-dist-data  PENDING  TFJOB    0s   tf-dist-data-tfjob-ps-0      N/A
-tf-dist-data  PENDING  TFJOB    0s   tf-dist-data-tfjob-worker-0  N/A
-tf-dist-data  PENDING  TFJOB    0s   tf-dist-data-tfjob-worker-1  N/A
-tf-dist-data  PENDING  TFJOB    0s   tf-dist-data-tfjob-worker-2  N/A
-tf-dist-data  PENDING  TFJOB    0s   tf-dist-data-tfjob-worker-3  N/A
-```
-
-When there are enough resources, the instances turn `RUNNING`
-
-```
-NAME          STATUS   TRAINER  AGE  INSTANCE                     NODE
-tf-dist-data  RUNNING  TFJOB    4s   tf-dist-data-tfjob-ps-0      192.168.1.115
-tf-dist-data  RUNNING  TFJOB    4s   tf-dist-data-tfjob-worker-0  192.168.1.119
-tf-dist-data  RUNNING  TFJOB    4s   tf-dist-data-tfjob-worker-1  192.168.1.118
-tf-dist-data  RUNNING  TFJOB    4s   tf-dist-data-tfjob-worker-2  192.168.1.120
-```
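-A small polling loop over `arena get` is one way to block until the gang scheduler has admitted the whole job. This sketch is not from the original guide; it only relies on the `PENDING` status strings shown above.
-
-```
-# poll until no instance of the job is PENDING anymore
-JOB=tf-dist-data
-while arena get "$JOB" | grep -q PENDING; do
-  echo "waiting for the gang scheduler to admit all instances of $JOB ..."
-  sleep 5
-done
-arena get "$JOB"
-```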
diff --git a/archived/docs/userguide_cn/7-tf-serving.md b/archived/docs/userguide_cn/7-tf-serving.md
deleted file mode 100755
index ae5cc3165..000000000
--- a/archived/docs/userguide_cn/7-tf-serving.md
+++ /dev/null
@@ -1,246 +0,0 @@
-This guide walks through the steps required to deploy and serve a TensorFlow model for prediction with Kubernetes (K8s) and Istio.
-
-1. Setup
-
-Before serving a TensorFlow model with `Arena`, we need to prepare the environment, including a Kubernetes cluster and Istio (optional).
-
-Make sure your Kubernetes cluster is running.
-
-Follow the Istio [documentation](https://istio.io/docs/setup/kubernetes/quick-start/#installation-steps) to install Istio. Once installed, you should see the `istio-pilot` and `istio-mixer` services in the `istio-system` namespace.
-
-By default, Istio [denies access to external traffic](https://istio.io/docs/tasks/traffic-management/egress.html). Since the TensorFlow Serving component may need to read model files from external storage, some cloud-specific [configuration](https://istio.io/docs/tasks/traffic-management/egress.html#calling-external-services-directly) is required.
-
-2\. Create Persistent Volumes for the model files
-
-Create /tfmodel on the NFS server, and prepare the mnist model with the following commands:
-
-```
-mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /tfmodel/
-wget https://github.com/osswangxining/tensorflow-sample-code/raw/master/models/tensorflow/mnist.tar.gz
-tar xvf mnist.tar.gz
-```
-
-Then create the Persistent Volume and Persistent Volume Claim (taking NFS as an example) with the manifests below:
-
-Persistent Volume:
-```
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: tfmodel
-  labels:
-    tfmodel: nas-mnist
-spec:
-  persistentVolumeReclaimPolicy: Retain
-  capacity:
-    storage: 10Gi
-  accessModes:
-  - ReadWriteMany
-  nfs:
-    server: NFS_SERVER_IP
-    path: "/tfmodel"
-```
-
-Persistent Volume Claim:
-
-```
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: tfmodel
-  annotations:
-    description: "this is tfmodel for mnist"
-    owner: tester
-spec:
-  accessModes:
-    - ReadWriteMany
-  resources:
-    requests:
-      storage: 5Gi
-  selector:
-    matchLabels:
-      tfmodel: nas-mnist
-```
-
-Check the data volume:
-```
-arena data list
-NAME     ACCESSMODE     DESCRIPTION                OWNER   AGE
-tfmodel  ReadWriteMany  this is tfmodel for mnist  tester  31s
-```
-
-
-3\. Serve TensorFlow with Istio disabled
-
-You can deploy and run TensorFlow model prediction without enabling Istio.
-
-Submit a TensorFlow serving job to deploy the model for prediction with the following command.
-
-```
-arena serve tensorflow [flags]
-
-options:
-      --command string           the command will inject to container's command.
-      --cpu string               the request cpu of each replica to run the serve.
-  -d, --data stringArray         specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
-      --enableIstio              enable Istio for serving or not (disable Istio by default)
-  -e, --envs stringArray         the environment variables
-      --gpus int                 the limit GPU count of each replica to run the serve.
-  -h, --help                     help for tensorflow
-      --image string             the docker image name of serve job, default image is tensorflow/serving:latest (default "tensorflow/serving:latest")
-      --memory string            the request memory of each replica to run the serve.
-      --modelConfigFile string   Corresponding with --model_config_file in tensorflow serving
-      --modelName string         the model name for serving
-      --modelPath string         the model path for serving in the container
-      --port int                 the port of tensorflow gRPC listening port (default 8500)
-      --replicas int             the replicas number of the serve job. (default 1)
-      --restfulPort int          the port of tensorflow RESTful listening port (default 8501)
-      --servingName string       the serving name
-      --servingVersion string    the serving version
-      --versionPolicy string     support latest, latest:N, specific:N, all
-
-Options inherited from parent commands:
-      --arenaNamespace string   The namespace of arena system service, like TFJob (default "arena-system")
-      --config string           Path to a kube config. Only required if out-of-cluster
-      --loglevel string         Set the logging level. One of: debug|info|warn|error (default "info")
-      --namespace string        the namespace of the job (default "default")
-      --pprof                   enable cpu profile
-```
-
-For example, you can submit a TensorFlow model with a specific version policy as below.
-
-```
-arena serve tensorflow --servingName=mymnist --modelName=mnist --image=tensorflow/serving:latest --data=tfmodel:/tfmodel --modelPath=/tfmodel/mnist --versionPolicy=specific:1 --loglevel=debug
-```
-
-After this command is triggered, the corresponding Kubernetes services are created to expose the gRPC and RESTful APIs.
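-The guide doesn't show how to call these APIs. As an illustration, a request against TensorFlow Serving's standard RESTful predict endpoint (the `--restfulPort` 8501 above) might look like the sketch below. The service name `mymnist` matches `--servingName` as in the Istio output of the next step, and the input tensor name `images` plus the 784-float payload are assumptions about the mnist model's signature.
-
-```
-# find the ClusterIP of the Service created for --servingName=mymnist (assumption)
-SERVING_IP=$(kubectl get svc mymnist --namespace default -o jsonpath='{.spec.clusterIP}')
-
-# TensorFlow Serving REST API: POST /v1/models/<modelName>:predict
-# payload shape depends on the model's signature; a flattened 28x28 mnist
-# image is 784 floats per instance (assumption; `python` must be on PATH)
-curl -s -X POST "http://$SERVING_IP:8501/v1/models/mnist:predict" \
-  -d "{\"instances\": [{\"images\": $(python -c 'print([0.0]*784)')}]}"
-```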
-
-4\. Serve TensorFlow with Istio enabled
-
-If you need Istio for TensorFlow serving, you can append the flag `--enableIstio` to the command above (Istio is disabled by default).
-
-For example, you can submit a TensorFlow model with Istio enabled as below.
-
-```
-#arena serve tensorflow --enableIstio --servingName=mymnist --servingVersion=v1 --modelName=mnist --data=myoss1pvc:/data2 --modelPath=/data2/models/mnist --versionPolicy=specific:1
-
-NAME:   mymnist-v1
-LAST DEPLOYED: Wed Sep 26 17:28:13 2018
-NAMESPACE: default
-STATUS: DEPLOYED
-
-RESOURCES:
-==> v1/ConfigMap
-NAME                              DATA  AGE
-mymnist-v1-tensorflow-serving-cm  1     1s
-
-==> v1/Service
-NAME     TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)            AGE
-mymnist  ClusterIP  172.19.12.176  <none>       8500/TCP,8501/TCP  1s
-
-==> v1beta1/Deployment
-NAME                           DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
-mymnist-v1-tensorflow-serving  1        1        1           0          1s
-
-==> v1alpha3/DestinationRule
-NAME     AGE
-mymnist  1s
-
-==> v1alpha3/VirtualService
-mymnist  1s
-
-==> v1/Pod(related)
-NAME                                            READY  STATUS    RESTARTS  AGE
-mymnist-v1-tensorflow-serving-757b669bbb-5vsmf  0/2    Init:0/1  0         1s
-
-
-NOTES:
-Getting Started:
-
-**** NOTE: It may take a few minutes for the LoadBalancer IP to be available. ****
-**** You can watch the status by running 'kubectl get svc --namespace default -w mymnist-v1-tensorflow-serving' ****
-  export TF_SERVING_SERVICE_IP=$(kubectl get svc --namespace default mymnist-v1-tensorflow-serving -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
-  echo docker run -it --rm cheyang/tf-mnist:grpcio_upgraded /serving/bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=$TF_SERVING_SERVICE_IP:9090
-
-```
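-Before sending traffic, it can help to wait until the deployment shown above is actually available; the pod reports `0/2 READY` at first because the Istio sidecar is still initializing. A minimal sketch, using the resource names from the output above:
-
-```
-# wait until the serving deployment reports all replicas available
-kubectl rollout status deployment/mymnist-v1-tensorflow-serving --namespace default
-
-# watch the pod come up (0/2 READY while the sidecar initializes)
-kubectl get pods --namespace default -w
-```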
-
-5\. List all serving jobs
-
-You can list all the prediction services with the following command.
-
-```
-#arena serve list
-  NAME        VERSION  STATUS
-  mymnist-v1  v1       DEPLOYED
-```
-
-6\. Adjust traffic routing dynamically for the tfserving jobs
-
-Deploy a new version of the TensorFlow model with Istio enabled:
-```
-#arena serve tensorflow --enableIstio --servingName=mymnist --servingVersion=v2 --modelName=mnist --data=myoss1pvc:/data2 --modelPath=/data2/models/mnist
-```
-
-Then you can adjust the traffic routing dynamically across both versions of the tfserving jobs.
-
-```
-#arena serve traffic-router-split --servingName=mymnist --servingVersions=v1,v2 --weights=50,50
-```
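-As a quick check that the 50/50 split was applied, you can inspect the Istio VirtualService that the output of step 4 showed being created. A sketch, assuming the Istio CRDs are installed as above:
-
-```
-# confirm the weight split landed on the VirtualService named `mymnist`
-kubectl get virtualservice mymnist --namespace default -o yaml | grep -A2 weight
-```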
-
-7\. Run a test
-
-Launch the `sleep` service so that `curl` can be used to generate load:
-
-```
-#cat <
-```
-
-> Notice: it is recommended to add `description` and `owner`
-
-3\. Check the data volume
-
-```
-#arena data list
-NAME    ACCESSMODE     DESCRIPTION             OWNER   AGE
-tfdata  ReadWriteMany  this is for mnist demo  myteam  43d
-```
-
-4\. To run a distributed TensorFlow training job with estimator, you need to specify:
-
- - The GPUs of each worker (including the chief and evaluator workers)
- - Enable the chief (required)
- - Enable the evaluator (required)
- - The number of workers (required)
- - The number of PS (required)
- - The docker image for the workers and the chief (required)
- - The docker image for the PS (required)
- - The port of the chief (default 22221)
- - The port of the workers (default 22222)
- - The port of the PS (default 22223)
-
-The command below provides an example. It defines 1 chief worker, 1 worker and 1 evaluator, each with 1 GPU. The source code for the workers and PS is pulled from git, and Tensorboard is enabled.
-
-```
-#arena submit tf --name=tf-estimator \
-              --gpus=1 \
-              --workers=1 \
-              --chief \
-              --evaluator \
-              --data=tfdata:/data/mnist \
-              --logdir=/data/mnist/models \
-              --work-image=tensorflow/tensorflow:1.9.0-devel-gpu \
-              --sync-mode=git \
-              --sync-source=https://github.com/cheyang/models.git \
-              --ps=1 \
-              --ps-image=tensorflow/tensorflow:1.9.0-devel \
-              --tensorboard \
-              "bash code/models/dist_mnist_estimator.sh --data_dir=/data/mnist/MNIST_data --model_dir=/data/mnist/models"
-configmap/tf-estimator-tfjob created
-configmap/tf-estimator-tfjob labeled
-service/tf-estimator-tensorboard created
-deployment.extensions/tf-estimator-tensorboard created
-tfjob.kubeflow.org/tf-estimator created
-INFO[0001] The Job tf-estimator has been submitted successfully
-INFO[0001] You can run `arena get tf-estimator --type tfjob` to check the job status
-
-```
-
-> `--data` specifies the data volume to mount for all tasks of the job, in the format `<data volume name>:<mount path inside the container>`. In this example the data volume is `tfdata` and the target directory is `/data/mnist`.
-
-
-5\. From the logs, we can see that training has started
-
-```
-#arena logs tf-estimator
-2018-09-27T00:37:01.576672145Z 2018-09-27 00:37:01.576562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:chief/replica:0/task:0/device:GPU:0 with 15123 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
-2018-09-27T00:37:01.578669608Z 2018-09-27 00:37:01.578523: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:22222}
-2018-09-27T00:37:01.578685739Z 2018-09-27 00:37:01.578550: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-estimator-tfjob-ps-0:22223}
-2018-09-27T00:37:01.578705274Z 2018-09-27 00:37:01.578562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tf-estimator-tfjob-worker-0:22222}
-2018-09-27T00:37:01.579637826Z 2018-09-27 00:37:01.579454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:334] Started server with target: grpc://localhost:22222
-2018-09-27T00:37:01.701520696Z I0927 00:37:01.701258 140281586534144 tf_logging.py:115] Calling model_fn.
-2018-09-27T00:37:02.172552485Z I0927 00:37:02.172167 140281586534144 tf_logging.py:115] Done calling model_fn.
-2018-09-27T00:37:02.173930978Z I0927 00:37:02.173732 140281586534144 tf_logging.py:115] Create CheckpointSaverHook.
-2018-09-27T00:37:02.431259294Z I0927 00:37:02.430984 140281586534144 tf_logging.py:115] Graph was finalized.
-2018-09-27T00:37:02.4472109Z 2018-09-27 00:37:02.447018: I tensorflow/core/distributed_runtime/master_session.cc:1150] Start master session b0a6d2587e64ebef with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }
-...
-2018-09-27T00:37:33.250440133Z I0927 00:37:33.250036 140281586534144 tf_logging.py:115] global_step/sec: 21.8175
-2018-09-27T00:37:33.253100942Z I0927 00:37:33.252873 140281586534144 tf_logging.py:115] loss = 0.09276967, step = 500 (4.583 sec)
-2018-09-27T00:37:37.764446795Z I0927 00:37:37.764101 140281586534144 tf_logging.py:115] Saving checkpoints for 600 into /data/mnist/models/model.ckpt.
-2018-09-27T00:37:38.064104604Z I0927 00:37:38.063472 140281586534144 tf_logging.py:115] Loss for final step: 0.24215397.
-```
-
-6\. Check the training status and Tensorboard
-
-```
-#arena get tf-estimator
-NAME          STATUS     TRAINER  AGE  INSTANCE                        NODE
-tf-estimator  SUCCEEDED  TFJOB    5h   tf-estimator-tfjob-chief-0      N/A
-tf-estimator  RUNNING    TFJOB    5h   tf-estimator-tfjob-evaluator-0  192.168.1.120
-tf-estimator  RUNNING    TFJOB    5h   tf-estimator-tfjob-ps-0         192.168.1.119
-tf-estimator  RUNNING    TFJOB    5h   tf-estimator-tfjob-worker-0     192.168.1.118
-
-Your tensorboard will be available on:
-192.168.1.117:31366
-```
-
-7\. Check the Tensorboard from 192.168.1.117:31366 in this example
-
-![](8-tfjob-estimator-tensorboard.jpg)
-
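-Once you are done with the examples, cleanup is straightforward. This sketch is not part of the original guide: `arena delete` removes a training job and its associated pods, and the PV/PVC removal assumes the `tfdata` names created earlier.
-
-```
-# remove the training job and its pods
-arena delete tf-estimator
-
-# optionally remove the data volume objects created for the demos
-kubectl delete pvc tfdata && kubectl delete pv tfdata
-```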