My adventures with MicroK8s to enable GPU and use MIG on a DGX A100
I recently had the chance to play with an NVIDIA DGX A100 server. The DGX A100 is a beast, with 8x NVIDIA A100 Tensor Core GPUs and a total of 320 GB of GPU memory (40 GB per GPU).
There’s a new NVIDIA feature called MIG (Multi-Instance GPU) that lets you split one GPU into up to 7 slices so that more containers/pods can use GPUs simultaneously. I ended up using MicroK8s on Ubuntu 20.04 to do this in Kubernetes. It wasn’t the easiest thing to do, so I posted my findings on the Kubernetes discuss forum so that others could benefit from my experience.
In summary, the steps were the following:
- Make sure to completely remove the NVIDIA drivers from the host
- Blocklist the nouveau driver
- Install fabric manager
- Install MicroK8s 1.21/beta
- Enable DNS and make sure it works
- Enable GPU and fabric manager
- Enable MIG
- Create the GPU slices
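
A rough sketch of how the steps above might map to commands, written as a Python dry run that only builds and prints a plan and executes nothing. Most specifics here are assumptions rather than details from this post: the `build_plan` helper is hypothetical, MIG profile ID 19 is assumed to be the 1g.5gb profile on a 40 GB A100, GPU index 0 is arbitrary, and the fabric manager step is left without a command because the package name depends on the driver branch. Whether `nvidia-smi` runs on the host or inside the driver container also depends on the setup, so treat the forum post as the reference.

```python
from typing import List, Tuple

# Assumptions for illustration; only the MicroK8s channel comes from the post.
MICROK8S_CHANNEL = "1.21/beta"
MIG_PROFILE_ID = 19   # assumed: the 1g.5gb profile on a 40 GB A100
SLICES_PER_GPU = 7    # MIG allows up to 7 slices per GPU

Step = Tuple[str, List[str]]


def build_plan(gpu_index: int = 0) -> List[Step]:
    """Return (step name, shell commands) pairs. Nothing is executed."""
    profiles = ",".join([str(MIG_PROFILE_ID)] * SLICES_PER_GPU)
    conf = "/etc/modprobe.d/blacklist-nouveau.conf"
    return [
        ("Remove NVIDIA drivers from the host",
         ['sudo apt-get --purge remove "*nvidia*"']),
        ("Blocklist the nouveau driver",
         [f"echo 'blacklist nouveau' | sudo tee {conf}",
          f"echo 'options nouveau modeset=0' | sudo tee -a {conf}",
          "sudo update-initramfs -u"]),
        # Package name depends on the driver branch, so no command is guessed.
        ("Install fabric manager", []),
        ("Install MicroK8s",
         [f"sudo snap install microk8s --classic --channel={MICROK8S_CHANNEL}"]),
        ("Enable DNS and make sure it works",
         ["microk8s enable dns",
          "microk8s kubectl get pods -n kube-system"]),
        # Only the GPU add-on is shown; the fabric manager part is not guessed.
        ("Enable GPU and fabric manager",
         ["microk8s enable gpu"]),
        ("Enable MIG",
         [f"nvidia-smi -i {gpu_index} -mig 1"]),
        ("Create the GPU slices",
         [f"nvidia-smi mig -i {gpu_index} -cgi {profiles} -C",
          "nvidia-smi -L"]),
    ]


if __name__ == "__main__":
    for number, (name, commands) in enumerate(build_plan(), start=1):
        print(f"{number}. {name}")
        for command in commands:
            print(f"   $ {command}")
```

Keeping the plan as data rather than running it directly makes it easy to review each command, and to swap in a different MIG profile or GPU index, before touching a real machine.
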
Check out my post on the Kubernetes discuss forum for more details: https://discuss.kubernetes.io/t/my-adventures-with-microk8s-to-enable-gpu-and-use-mig-on-a-dgx-a100/15366