IBM WML CE on Triton User Menu
==============================================
Introduction
------------
To make full use of the advanced hardware, IBM provides
Watson Machine Learning Community Edition (WML CE), a set of
software packages for deep learning and machine learning development and
applications on state-of-the-art POWER systems equipped with
advanced NVIDIA GPUs. WML CE contains popular open source deep
learning frameworks such as TensorFlow and PyTorch, IBM-optimized Caffe,
IBM's machine learning library (Snap ML), and software for distributed
training (DDL) and large model support (LMS).
Using Anaconda
--------------
The WML CE packages are distributed as conda packages in `an online
conda repository `__.
You can use Anaconda or Miniconda to install and manage the packages.
On Triton, we have pre-installed Anaconda that is configured to point to the repository containing the WML CE packages.
After logging in to the system with ``ssh <username>@triton.ccs.miami.edu``, run ``ml wml_anaconda3`` to load the default version of
the WML-configured Anaconda `module `__.
We recommend using the pre-installed Anaconda, since it makes it easier for us to track down problems if you need assistance. However, you can also install Anaconda or Miniconda in your home directory by following `the WML CE system setup guide `__ and manage it yourself.
Installing WML CE packages
--------------------------
- $ ``ml wml_anaconda3``
- $ ``conda create -n <environment name> python=<Python version> powerai=<WML CE version>``

  For example, ``conda create -n my_wml_env python=3.7 powerai=1.7.0`` creates
  an environment named ``my_wml_env`` under ``~/.conda/envs`` with Python 3.7 and all the WML CE packages installed.

- $ ``conda activate <environment name>``

To install all the WML CE packages at once in an existing environment:

- (your environment)$ ``conda install powerai=1.6.2``

Or, to install an individual package:

- (your environment)$ ``conda install <package name>``
.. note::
WML CE 1.7.0 and 1.6.2 need Python 3.6 or 3.7; WML CE 1.6.1 needs Python 2.7 or 3.6.
You can check the packages included in each WML CE version at
https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/
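
The version constraints in the note above can be checked before an environment is created. The following is a minimal sketch, not part of WML CE: the function name ``wml_python_ok`` is hypothetical, and it encodes only the three WML CE versions listed in the note.

```shell
# Hypothetical helper (not part of WML CE): succeeds if the given Python
# version is supported by the given WML CE version, per the note above.
# Returns 1 for an unsupported pair and 2 for a WML CE version not listed.
wml_python_ok() {
    case "$1" in
        1.7.0|1.6.2)
            case "$2" in 3.6|3.7) return 0 ;; *) return 1 ;; esac ;;
        1.6.1)
            case "$2" in 2.7|3.6) return 0 ;; *) return 1 ;; esac ;;
        *)
            return 2 ;;
    esac
}

# Print the create command only if the version pair is compatible.
if wml_python_ok 1.7.0 3.7; then
    echo "conda create -n my_wml_env python=3.7 powerai=1.7.0"
fi
```

The command is only echoed here, so the sketch can be tried on a machine without conda.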
Using WML CE packages
---------------------
.. warning::

   Run only small tests on the login node using the command line interface. Formal jobs must
   be submitted via `LSF `__.

Small testing using the command line interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- $ ``ml wml_anaconda3``
- $ ``conda activate <environment name>``
- (your environment)$ ``python testing_program.py``
- $ ``conda deactivate``
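
A small test on the login node can be as simple as checking which frameworks import in the active environment. The sketch below is an illustration, not part of WML CE: the function name ``check_frameworks`` is hypothetical, and ``tensorflow`` and ``torch`` are the import names of TensorFlow and PyTorch.

```shell
# Hypothetical helper: report which Python modules can be imported by the
# interpreter found on PATH (normally the one in the active conda environment).
check_frameworks() {
    py=$(command -v python || command -v python3) || {
        echo "no python interpreter found"
        return 1
    }
    for module in "$@"; do
        if "$py" -c "import $module" >/dev/null 2>&1; then
            echo "$module: available"
        else
            echo "$module: missing"
        fi
    done
}

check_frameworks tensorflow torch
```

Run it after ``conda activate`` so that the interpreter on ``PATH`` is the one from your environment.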
Submitting jobs using LSF on Triton
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use ``#BSUB -q normal`` to submit jobs to the ``normal`` queue for testing on
Triton. This queue assignment is temporary and will change in the future.
Add ``#BSUB -gpu "num=<number of GPUs>"`` if you need GPUs in your job.
You can use up to 2 GPUs if Distributed Deep Learning (DDL) is not
involved.

A job script example:

::

    #!/bin/bash
    #BSUB -J "my_example"
    #BSUB -o "my_example_%J.out"
    #BSUB -e "my_example_%J.err"
    #BSUB -n 4
    #BSUB -gpu "num=1"
    #BSUB -q "normal"
    #BSUB -W 00:10

    ml wml_anaconda3
    conda activate <environment name>
    python path/to/your_program.py

If the file above is named ``my_job_script.job``, run the following command to submit the job:

$ ``bsub < my_job_script.job``

The output will appear in the ``my_example_<job ID>.out`` file after the job is done.
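
The job script and the submission step above can also be combined in one small shell script. This is a sketch under assumptions: ``my_wml_env`` is the example environment name used earlier, ``write_job_script`` is a hypothetical helper, and the ``bsub`` call is skipped on machines where LSF is not installed.

```shell
# Hypothetical helper: write an LSF job script for the given conda
# environment, Python program, number of GPUs, and output file name.
write_job_script() {
    env_name="$1"; program="$2"; num_gpus="$3"; out_file="$4"
    cat > "$out_file" << EOF
#!/bin/bash
#BSUB -J "my_example"
#BSUB -o "my_example_%J.out"
#BSUB -e "my_example_%J.err"
#BSUB -n 4
#BSUB -gpu "num=${num_gpus}"
#BSUB -q "normal"
#BSUB -W 00:10

ml wml_anaconda3
conda activate ${env_name}
python ${program}
EOF
}

write_job_script my_wml_env path/to/your_program.py 1 my_job_script.job

# Submit only where LSF is available; otherwise just show the script.
if command -v bsub >/dev/null 2>&1; then
    bsub < my_job_script.job
else
    cat my_job_script.job
fi
```

Generating the script from parameters keeps the ``#BSUB`` options in one place when the same job is run with different programs or GPU counts.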
Installing other packages not included in WML CE
------------------------------------------------
.. warning::
Installing other packages could cause conflicts with the installed WML CE packages.
If you really need to install other packages, try the steps below in order until the package is found.
1. ``conda install <package name>`` or ``conda install <package name>=<version>`` in the activated environment
   searches for the package in the `IBM WML CE repo `__,
   then in the `official repo hosted by Anaconda `__, as configured
   in ``wml_anaconda3``. The package is installed if it is found in either repo.
2. Search in `Anaconda Cloud `__ and **choose
   Platform** ``linux-ppc64le``, then click the name of the package you found.
   The detail page shows how to install the package from a specific channel, such as
   ``conda install -c <channel> <package name>``
3. Use ``pip install <package name>``
.. warning::
Issues may arise when using pip and conda together.
Only after conda has been used to install as many packages as possible should pip be used to install any remaining software.
Using DDL (Testing)
-------------------
`Getting started with DDL `__.
.. warning::
ddl-tensorflow operator and pytorch DDL are DEPRECATED and will be REMOVED in the next WML CE release. Please start using `horovod `__ with NCCL backend.

A job script example:

::

    #BSUB -L /bin/bash
    #BSUB -J "MNIST_DDL"
    #BSUB -o "MNIST_DDL.%J"
    #BSUB -n 12
    #BSUB -R "span[ptile=4]"
    #BSUB -gpu "num=2"
    #BSUB -q "normal"
    #BSUB -W 00:10

    module unload gcc
    ml wml_anaconda3
    ml xl
    ml smpi
    conda activate <environment name>

    # Workaround for GPU selection issue
    cat > launch.sh << EoF_l
    #! /bin/sh
    export CUDA_VISIBLE_DEVICES=0,1
    exec \$*
    EoF_l
    chmod +x launch.sh

    # Run the program
    export PAMI_IBV_ADAPTER_AFFINITY=0
    ddlrun ./launch.sh python /path/to/your_program.py

    # Clean up
    /bin/rm -f launch.sh

- ``#BSUB -n 12`` requests 12 CPU cores
- ``#BSUB -R "span[ptile=4]"`` asks for 4 cores per node, so 3 nodes (12 / 4) will be involved.
- ``#BSUB -gpu "num=2"`` requests 2 GPUs per node, and therefore 6 GPUs in total (2 * 3) are requested for this job.
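
The resource arithmetic described above can be written out explicitly. The variable names below are illustrative only; the values are those of the example job script.

```shell
cores=12          # from '#BSUB -n 12'
cores_per_node=4  # from '#BSUB -R "span[ptile=4]"'
gpus_per_node=2   # from '#BSUB -gpu "num=2"'

# Number of nodes is the total core count divided by cores per node.
nodes=$((cores / cores_per_node))
# GPUs are requested per node, so the total scales with the node count.
total_gpus=$((nodes * gpus_per_node))

echo "nodes=${nodes} total_gpus=${total_gpus}"
```

Changing ``-n`` or ``ptile`` changes the node count, and with it the total number of GPUs requested.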
Using LMS (Testing)
-------------------
`Getting started with TensorFlow large model support `__
LMS section of `Getting started with PyTorch `__
System Pre-installed WML CE packages
------------------------------------
We recommend that you set up your own environment and install the WML CE packages yourself so that you have full control. However, you can also use the different versions of WML CE that we have installed on the system.
Run ``ml wml/<version>`` to activate the environment that includes the packages of that WML CE version.
``ml -wml`` will deactivate the environment.
Conda General Commands
----------------------
- $ ``conda create -n <environment name> python=<version>`` to create
  an environment
- $ ``conda env list`` to list all available environments
- $ ``conda activate <environment name>`` to activate an environment

Inside an environment (after activating the environment):

- $ ``conda list`` to list installed packages
- $ ``conda install <package name>`` to install a package
- $ ``conda install <package name>=<version>`` to install a package
  with a specific version
- $ ``conda install -c <channel> <package name>`` to install a package from
  a specific channel (repository)
- $ ``conda remove <package name>`` to uninstall a package
- $ ``conda deactivate`` to deactivate the environment

Please check the `official documentation `__ for details.
References and Additional Resources
-----------------------------------
`Watson Machine Learning Community
Edition `__
`IBM Watson Machine Learning Community Edition Version 1.7.0
documentation `__
`Deep learning and AI on Power Systems technical
resources `__