Fancy hyperparameter tuning with Kubernetes

The use of hyperparameter tuning techniques has helped many machine learning practitioners automate the burden of manual, empirical experiments to find the optimal (or close-to-optimal) combination of hyperparameters for a model. With the first publications on using algorithms for this task dating from the early 2010s, it's safe to say this is a young area that is still improving at high speed. The problem of searching for the best combination of hyperparameters becomes harder as the size of the search space and of the underlying training dataset grows.

The base case for automating hyperparameter tuning is a random (brute-force-ish) search. Much like a password cracker, this kind of approach tries combination after combination of hyperparameters and evaluates your metric of interest to decide which one produces the best result. It may seem that such an approach should be discarded for the hard problems mentioned above, but that's not the case: always weigh effort against reward when deciding which approach to choose, and in terms of effort, a random search is by far the easiest to implement. Some people have found creative ways of early-stopping the search, like setting a minimum goal for the metric in order to reduce the number of iterations. If you're not racing against the clock to release your model, then a random search is your best friend.
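
To make the idea concrete, here is a minimal sketch of such a search with a metric-based early stop. The search space, the trial budget, the 0.95 goal, and the train_and_evaluate() helper are all hypothetical placeholders, not references to any particular library.

```python
import random

# Hypothetical search space: in practice these are your model's hyperparameters.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "num_layers": [2, 3, 4],
}

def train_and_evaluate(**params):
    """Hypothetical helper: train the model with `params` and return the metric of interest."""
    raise NotImplementedError

best_score, best_params = float("-inf"), None
for _ in range(100):  # fixed trial budget
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params
    if best_score >= 0.95:  # early stop once the minimum goal for the metric is reached
        break

print(best_params, best_score)
```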

Upset Monkey


However, even if your goal isn't to prove the infinite monkey theorem, you may have other reasons to release your model earlier. So people started applying more sophisticated algorithms to this task, such as Bayesian and gradient-based optimization, and even specializations of them, like DARTS for architecture search. These approaches reduce the time needed to find a (near-)optimal solution by using smarter heuristics to navigate the search space.

In most cases there are already parallel implementations of these algorithms, which is another factor that speeds things up when you're tuning hyperparameters on a single machine. One can easily automate the task with libraries such as Optuna and Hyperopt. But what do you do when your datasets and models are so big that it's impractical to use a single machine?
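
As an illustration of how little code this takes, here is a rough sketch with Optuna; the scikit-learn dataset, model, and parameter ranges are just illustrative, while the API calls shown (create_study, optimize, suggest_int) are standard Optuna.

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # small illustrative dataset

def objective(trial):
    # Sample a candidate set of hyperparameters for this trial.
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    # The metric of interest: mean cross-validated accuracy.
    return cross_val_score(model, X, y, cv=3).mean()

# Optuna's default sampler is TPE, a Bayesian-style method; n_jobs runs trials in parallel threads.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, n_jobs=4)
print(study.best_params, study.best_value)
```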

Kubernetes Helm IRL

Fortunately, there's a growing community using Kubernetes for machine learning tasks. Kubernetes is, as of today, one of the best solutions for scheduling tasks onto pools of computing instances (or, as old-school engineers like to call them, clusters). With Kubernetes, you can simply configure a job, using a high-level language such as YAML, and let the orchestrator spread it across many instances so it runs smoothly. By doing that, you can easily achieve higher levels of parallelism, and as long as you have a sufficient budget you can pretty much achieve i̶n̶f̶i̶n̶i̶t̶e̶ parallelism.
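
As a taste of what that configuration looks like, the sketch below declares a plain Kubernetes Job that runs a training script in several pods at once; the job name, container image, and command are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparameter-trials        # hypothetical job name
spec:
  parallelism: 4                     # run four pods at the same time
  completions: 20                    # finish after twenty successful runs
  template:
    spec:
      containers:
        - name: trainer
          image: my-registry/trainer:latest   # hypothetical training image
          command: ["python", "train.py"]     # hypothetical entrypoint
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
      restartPolicy: Never
```

The orchestrator then decides which nodes in the cluster have room for those pods and keeps creating pods until the requested number of completions is reached.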

A project called Katib, which is under constant development, makes this process relatively simple. Katib is a cloud-native application that ships an extensive list of ready-to-use distributed implementations of many hyperparameter tuning algorithms (and specializations of them, like Neural Architecture Search, or NAS). By combining Katib with containers (Docker), you can configure a YAML file specifying what needs to be optimized, and Katib takes care of interacting with the Kubernetes APIs to get the job done using large amounts of CPU and memory.

It's important to note that even though Katib does distributed computing at some level, it doesn't help you perform distributed training. The value it brings to the table is running many trials in parallel, so you can speed up the search at a much larger scale. Katib takes care of inter-node communication, abstracting away the work of coordinating the many running pods so that, from an optimization perspective, they form a single task. At Elemeno, we use Katib on a daily basis to help our customers get their models to production faster.
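
Below is a trimmed sketch of what such a Katib Experiment spec looks like, loosely modeled on the random-search MNIST/MXNet example shipped with Katib; treat the image, metric name, and parameter ranges as illustrative rather than canonical.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    goal: 0.99                              # stop the experiment once this accuracy is reached
    objectiveMetricName: Validation-accuracy
  algorithm:
    algorithmName: random                   # other options include bayesianoptimization, hyperband
  parallelTrialCount: 3                     # how many trials run at the same time
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
    trialSpec:                              # each trial is just a Kubernetes Job
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest   # illustrative image
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
            restartPolicy: Never
```

Applying this manifest with kubectl is all it takes; Katib then launches the trials, collects the reported metric, and records the best combination it finds.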

The table below shows the difference parallelism makes when using Katib to tune an MNIST classifier implemented with MXNet. The authors ran the same experiment with users assigned to different Kubernetes namespaces, a typical setup for multi-tenant scenarios: user2 had a resource quota of 6 vCPUs, while user1 was allowed to use 18 vCPUs. The results illustrate the advantage of running this kind of job in a distributed setup.

Parallelism comparison | Credits: A Scalable and Cloud-Native Hyperparameter Tuning System, George J. et al. | https://arxiv.org/pdf/2006.02085.pdf

Hyperparameter tuning is one of the last steps in a machine learning development workflow, and the combination of distributed and parallel computing lets you do it far more efficiently. Although the initial setup of this stack can be tricky, you should ask your ML-Ops team to start looking into it. Elemeno offers an open-source platform that makes this setup easier, as well as a cloud-managed version of it.

If you liked this article, join our Discord channel and let's continue the discussion.

Refs:

  • Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. International Conference on Machine Learning, 199–207.
  • Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2).
  • Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: Differentiable Architecture Search. arXiv preprint arXiv:1806.09055. http://arxiv.org/abs/1806.09055
  • George, J., Gao, C., Liu, R., Liu, H. G., Tang, Y., Pydipaty, R., & Saha, A. K. (2020). A Scalable and Cloud-Native Hyperparameter Tuning System. arXiv preprint arXiv:2006.02085.