MapR Distributed Deep Learning QSS combines an enterprise-ready distributed file system with Kubernetes to train and deploy deep learning models in a distributed fashion on a heterogeneous GPU cluster. In this post, we will demonstrate the steps to deploy the Distributed Deep Learning QSS on the MapR Converged Data Platform. On AWS, we set up three g2.2xlarge instances as GPU nodes and one m4.2xlarge instance as the master node. We used Ubuntu 16.04 here, but the steps work just as well on Red Hat and CentOS.
Install MapR on CPU and GPU nodes
First, we install MapR on the cluster. For simplicity, we put all MapR services on the master node and leave the GPU nodes free for computation.
#set up clustershell and passwordless ssh
apt-get install -y clustershell screen
vi /etc/clustershell/groups
all: ip-10-0-0-[226,75,189,121].ec2.internal
cldb: ip-10-0-0-226.ec2.internal
zk: ip-10-0-0-226.ec2.internal
web: ip-10-0-0-226.ec2.internal
nfs: ip-10-0-0-[226,75,189,121].ec2.internal
gpu: ip-10-0-0-[75,189,121].ec2.internal
ssh-keygen -t rsa
for i in ip-10-0-0-226.ec2.internal ip-10-0-0-75.ec2.internal ip-10-0-0-189.ec2.internal ip-10-0-0-121.ec2.internal; do ssh -i /home/ubuntu/mapr-dm.pem $i; done
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-226.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-75.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-189.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-121.ec2.internal 'cat >> .ssh/authorized_keys'
#start to install MapR
clush -a 'apt-get update -y'
clush -a 'apt-get install -y openjdk-8-jdk'
clush -a "echo never > /sys/kernel/mm/transparent_hugepage/defrag"
clush -a "cat >> /etc/security/limits.conf << EOL
mapr soft nofile 64000
mapr hard nofile 64000
mapr soft nproc 64000
mapr hard nproc 64000
EOL"
clush -a "groupadd -g 5000 mapr"
clush -a "useradd -g 5000 -u 5000 mapr"
passwd mapr
clush -a "wget -O - http://package.mapr.com/releases/pub/maprgpg.key | sudo apt-key add -"
clush -a "cat >> /etc/apt/sources.list << EOL
deb http://package.mapr.com/releases/v5.2.1/ubuntu binary trusty
deb http://package.mapr.com/releases/MEP/MEP-3.0/ubuntu binary trusty
EOL"
clush -a 'fdisk -l'
clush -a "cat >> /root/disks.txt << EOL
/dev/xvde
/dev/xvdc
/dev/xvdd
EOL"
clush -a apt-get update -y
clush -g zk apt-get install -y mapr-cldb mapr-zookeeper mapr-webserver
clush -a apt-get install -y mapr-core mapr-fileserver mapr-nfs
clush -a /opt/mapr/server/configure.sh -C `nodeset -S, -e @cldb` -Z `nodeset -S, -e @zk` -N DLcluster -M7 -no-autostart
clush -a "ls /root/disks.txt && /opt/mapr/server/disksetup -F /root/disks.txt"
#make sure the JDK path below exists on each node
clush -a "sed -i 's/#export JAVA_HOME=/export JAVA_HOME=\/usr\/lib\/jvm\/java-1.8.0-openjdk-amd64\/jre/g' /opt/mapr/conf/env.sh"
clush -a mkdir -p /mapr
clush -a 'echo "localhost:/mapr /mapr hard,nolock" > /opt/mapr/conf/mapr_fstab'
clush -a systemctl start rpcbind
sleep 2
clush -g zk systemctl start mapr-zookeeper
sleep 10
clush -g zk systemctl status mapr-zookeeper
clush -a systemctl start mapr-warden
maprcli node cldbmaster
#now register the cluster and mount MapR-FS over NFS
clush -a 'mount -o hard,nolock localhost:/mapr /mapr'
Before mounting the file system, you might want to register your cluster and apply the Enterprise trial license, then restart NFS on each node; you can do that through the MCS web interface. To register the cluster, see: https://community.mapr.com/docs/DOC-1679
At this point, you should have a running MapR cluster; since we didn't install any ecosystem components, it is fairly simple and basic. If the /mapr folder is not mounted to the MapR file system, run "clush -a 'mount -o hard,nolock localhost:/mapr /mapr'". We should also set the MapR subnet in /opt/mapr/conf/env.sh by adding "export MAPR_SUBNETS=10.0.0.0/24".
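For example, the subnet setting can be appended to env.sh on every node with clustershell (adjust the CIDR to match your own VPC subnet):
clush -a "echo 'export MAPR_SUBNETS=10.0.0.0/24' >> /opt/mapr/conf/env.sh"
clush -a "grep MAPR_SUBNETS /opt/mapr/conf/env.sh"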
Install Kubernetes 1.7
Now we want to install the Kubernetes master on the CPU node and the workers on the GPU nodes. With Kubernetes 1.5.2 and earlier, this involves manual procedures. For Kubernetes 1.6 and later, we can use kubeadm to configure and spin up the cluster.
clush -a 'apt-get update && apt-get install -qy docker.io'
clush -a 'apt-get update && apt-get install -y apt-transport-https'
clush -a 'curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -'
clush -a "cat >> /etc/apt/sources.list.d/kubernetes.list << EOL
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOL"
clush -a apt-get update
clush -a "apt-get install -y kubelet kubeadm kubectl kubernetes-cni"
clush -a "cat >> /etc/systemd/system/kubelet.service.d/10-kubeadm.conf << EOL
Environment=\"KUBELET_EXTRA_ARGS=--feature-gates=Accelerators=true\"
EOL"
clush -a "systemctl enable docker && systemctl start docker"
clush -a "systemctl enable kubelet && systemctl start kubelet"
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=10.0.0.226
cp /etc/kubernetes/admin.conf $HOME/
sudo chown $(id -u):$(id -g) $HOME/admin.conf
export KUBECONFIG=$HOME/admin.conf
echo "export KUBECONFIG=$HOME/admin.conf" | tee -a ~/.bashrc
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml
kubectl create -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl taint nodes --all node-role.kubernetes.io/master-
#on each GPU node, join the cluster using the token printed by kubeadm init
kubeadm join --token c44f75.d6a7a3d68d638b50 10.0.0.226:6443
export KUBECONFIG=/etc/kubernetes/kubelet.conf
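At this point, kubectl should list the master and the three GPU workers; once the flannel pods are running, the nodes will report Ready:
kubectl get nodes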
kubectl create -f https://git.io/kube-dashboard
kubectl proxy --port=8005
Then, on your local machine, use an SSH tunnel to access the Kubernetes dashboard:
ssh -N -L 8005:127.0.0.1:8005 UbuntuK
where UbuntuK is defined in your local ~/.ssh/config as:
Host UbuntuK
HostName ip-10-0-0-226.ec2.internal
User ubuntu
Port 22
IdentityFile ~/Documents/AWS/mapr-dm.pem
Then go to http://localhost:8005/ui to access the dashboard.
Install NVIDIA Libraries
Then, to enable deep learning applications, we need to install the NVIDIA driver along with CUDA and cuDNN on all the GPU nodes. The driver version will differ depending on the GPU cards in use.
clush -g gpu 'apt-get -y install build-essential cmake g++'
clush -g gpu "cat >> /etc/modprobe.d/blacklist-nouveau.conf << EOL
blacklist nouveau
options nouveau modeset=0
EOL"
clush -g gpu update-initramfs -u
Then, on each GPU node, download the driver, CUDA, and cuDNN packages:
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run
wget -O cudnn-8.0-linux-x64-v5.1.tgz https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1-tgz
tar -xvf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local
cp /usr/local/cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64/.
Run the two downloaded installers with bash, i.e. the NVIDIA-Linux-x86_64-367.57.run driver installer and the cuda_8.0.61_375.26_linux-run CUDA toolkit installer.
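For unattended installs, something like the following can work on each GPU node; the flags shown are common options for these installers, and it is worth confirming them against each installer's --help output for your versions:
bash NVIDIA-Linux-x86_64-367.57.run --silent --no-opengl-files
bash cuda_8.0.61_375.26_linux-run --silent --toolkit --no-opengl-libs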
#on each GPU node, create a script that loads the NVIDIA kernel modules and creates the device nodes
cat > /root/nvidiastartscript.sh << 'EOL'
#!/bin/bash
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
# Count the number of NVIDIA controllers found.
NVDEVS=`lspci | grep -i NVIDIA`
N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255
else
exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
mknod -m 666 /dev/nvidia-uvm c $D 0
else
exit 1
fi
EOL
Execute this bash script on the GPU nodes to set up the NVIDIA devices.
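For example, from the master node (assuming the script was saved as /root/nvidiastartscript.sh on every GPU node, as above):
clush -g gpu 'bash /root/nvidiastartscript.sh'
#verify that the device nodes were created
clush -g gpu 'ls -l /dev/nvidia*'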
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Running nvidia-smi should now show the GPU info, and we use that info to label the Kubernetes nodes:
kubectl label nodes ip-10-0-0-189 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
kubectl label nodes ip-10-0-0-75 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
kubectl label nodes ip-10-0-0-121 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
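The labels can be checked with kubectl; the -L flag adds a column for the given label key:
kubectl get nodes -L alpha.kubernetes.io/nvidia-gpu-name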
At this point, we have set up our running GPU cluster with MapR and Kubernetes 1.7.
If we use kubectl to describe the nodes, we should be able to see the GPU capacity on each node.
For the CPU node, kubectl describe node ip-10-0-0-226 shows:
Capacity:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 8
memory: 32946584Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 8
memory: 32844184Ki
pods: 110
For a GPU node, kubectl describe node ip-10-0-0-75 shows:
Capacity:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 15399284Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 15296884Ki
pods: 110
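To confirm that Kubernetes can actually schedule work onto a GPU, we can submit a small test pod. The snippet below is only a sketch: the pod name, the nvidia/cuda image tag, and the command are placeholders you can swap out, while the alpha.kubernetes.io/nvidia-gpu resource limit and the nvidia-gpu-name node selector correspond to the capacity and labels set up above.
cat > gpu-test.yaml << EOL
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    alpha.kubernetes.io/nvidia-gpu-name: GRID_K520
  containers:
  - name: cuda
    image: nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04
    command: ["/bin/sh", "-c", "ls /dev/nvidia*"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
EOL
kubectl create -f gpu-test.yaml
#once the pod has completed, its log should list the NVIDIA device nodes exposed to the container
kubectl logs gpu-test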
To summarize, we have installed a MapR cluster with only the file system services to provide the distributed data layer, and installed Kubernetes 1.7 as the orchestration layer. We enabled Kubernetes to manage the GPU, CPU, and memory resources on each node in the cluster. In the next blog, we will configure persistent storage to link the MapR file system with Kubernetes pods and demonstrate simple distributed deep learning training examples.