MLOps with Kubeflow

MLOps is to machine learning roughly what DevOps is to microservices, but it covers more ML-specific concerns than just the algorithm: data and model management, model versioning, model drift, and so on. That is the promise, at least, and so many tools and services exist in this area, commercial and open source alike, that it's pretty easy to get lost and confused if you don't have some kind of selection criteria.

ML Engineering and MLOps

You need a good data analysis tool or OLAP database, such as Apache Presto, which can query data from several different data stores as well as traditional databases: something resembling Google BigQuery, but in the open-source space. With it, an ML engineer can use a simple SQL query to analyze huge amounts of data, which matters for algorithm selection and further ML development.
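
The kind of exploratory SQL analysis described above can be sketched with the standard library's `sqlite3` as a stand-in for Presto (the table name, columns, and values are invented for illustration; against Presto the same SQL pattern would run over far larger, federated data):

```python
import sqlite3

# Stand-in for a Presto/BigQuery-style analysis: the same SQL pattern
# (aggregate an event table) runs here against an in-memory sqlite3 database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, label TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "click", 0.5), (1, "buy", 2.0), (2, "click", 0.7), (2, "click", 0.1)],
)

# A typical exploratory query an ML engineer might run before model
# selection: class balance and per-label statistics.
rows = conn.execute(
    "SELECT label, COUNT(*), AVG(value) FROM events GROUP BY label ORDER BY label"
).fetchall()
for label, n, avg in rows:
    print(label, n, round(avg, 2))
```

The point is that a single declarative query answers "what does my data look like" questions that would otherwise require custom scripts over raw files.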

Although this is a full-fledged infrastructure, it can in fact be maintained outside of MLOps, as it is really (big) data infrastructure. An ML project may start with data engineering, but that is not always the most important part of the project. Sometimes an ML project can start with less data; however, as more data is gathered and analyzed for better model training, a scalable data analysis setup should be adopted.

Next is the model development environment with its associated languages and frameworks: Python libraries like pandas and scikit-learn, programming languages such as Python and R, ML frameworks like TensorFlow or PyTorch, and data formats ranging from simple CSV to more efficient formats like Apache Parquet. This ties into the development environment, which is usually a Jupyter Notebook. While most data analysis is based on Python and its libraries, for areas like video analysis it could be C++ or Go with related libraries such as GStreamer and OpenCV.
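
As a minimal sketch of the data-loading step, here is the standard library's `csv` module reading a tiny inline dataset (the column names are invented; in practice `pandas.read_csv` does this in one call over a real file, and Parquet would be read via a library such as pyarrow):

```python
import csv
import io
import statistics

# Tiny stand-in for a training data file; in practice this would be a much
# larger CSV or Parquet file on disk or in object storage.
raw = "feature,label\n0.2,0\n0.8,1\n0.5,1\n"

rows = list(csv.DictReader(io.StringIO(raw)))
features = [float(r["feature"]) for r in rows]
labels = [int(r["label"]) for r in rows]

# Quick sanity checks an engineer runs before any modeling.
print("mean feature:", statistics.mean(features))
print(sum(labels), "positive of", len(labels))
```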

MLOps essentials

Model training

Training is done with model frameworks like TensorFlow and PyTorch. There are more complex frameworks like Horovod that carry out distributed training on top of parallel programming frameworks such as MPI, and TensorFlow also provides its own distributed training strategies. It should be noted that for extremely deep neural networks, like those used for image or video analysis, training will require an NVIDIA GPU with CUDA cores, whereas for lighter data analysis a CPU may suffice.
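
The core idea behind Horovod-style data-parallel training can be sketched in plain Python: each worker computes a gradient on its own data shard, the gradients are averaged (what Horovod's allreduce does across processes), and every worker applies the same update. The model and data here are a toy one-parameter linear fit invented for illustration:

```python
# Toy data-parallel SGD: fit y = w * x by minimizing squared error.
# True relationship in the synthetic shards below is w = 2.

def local_gradient(w, shard):
    # dL/dw for L = mean((w*x - y)^2) over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # Horovod/MPI would average across processes; here it is a plain average.
    return sum(grads) / len(grads)

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
lr = 0.05
for _ in range(100):
    grads = [local_gradient(w, s) for s in shards]
    w -= lr * allreduce_mean(grads)

print(round(w, 3))  # converges to 2.0
```

In a real setup each shard lives on a different GPU or node, and the averaging step is where the communication cost (and frameworks like Horovod) come in.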

Experiment tracking

This is useful for historically tracking accuracy, precision, the confusion matrix, or similar metrics together with the data and model, and it links to the model validation phase along with the test set. It is possibly more useful in the CI/CD pipeline for ML than in development.
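
What an experiment tracker provides can be reduced to a simple sketch: record the parameters and metrics of each run to a durable sink, then query the history to pick the best configuration. The run names and metric values below are made up; real trackers (MLflow, Kubeflow's metadata store, etc.) add UIs, artifact storage, and access control on top of this idea:

```python
import json
import io

# In practice the sink is a file, database, or tracking server.
log = io.StringIO()

def log_run(run_id, params, metrics, sink):
    # One JSON line per run keeps the history append-only and easy to query.
    sink.write(json.dumps({"run": run_id, "params": params, "metrics": metrics}) + "\n")

log_run("run-1", {"lr": 0.1}, {"accuracy": 0.81}, log)
log_run("run-2", {"lr": 0.01}, {"accuracy": 0.87}, log)

runs = [json.loads(line) for line in log.getvalue().splitlines()]
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
print(best["run"], best["params"])
```
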
Model serving – TFServing, Seldon, KFServing, or similar. The point is to keep the model separate from the application so it can be independently versioned and updated via CI/CD. Note also that a great deal of data usually needs to be fed into the model for uses such as media analysis; in such cases the model and the business microservice should be hosted on the same Kubernetes node and scaled via a load-balancing layer.
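
As a sketch of the "model separate from the application" idea, a KFServing `InferenceService` manifest looks roughly like this (the service name and storage bucket are hypothetical; the exact schema depends on the KFServing version you run):

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model              # hypothetical service name
spec:
  predictor:
    tensorflow:
      # Hypothetical bucket path: the model artifact is versioned and
      # updated here, independently of the business microservice.
      storageUri: "gs://my-bucket/models/my-model"
```

Updating the model then means pushing a new artifact and rolling the `storageUri`, with no change to the application code.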

Model monitoring

In traditional DevOps, feedback to Ops or developers comes through metrics gathered during deployment via Prometheus or a similar tool and displayed in Grafana dashboards. A similar setup can be used for model monitoring. However, to verify whether the model's predictions are accurate, you need a manual check, operator involvement, or customer feedback integrated into the predict-analyze loop. More advanced still is A/B testing against other models and comparing results using algorithms such as the multi-armed bandit. Validating the model against newly available data is also necessary to identify model drift. Model monitoring is certainly a higher level of MLOps skill, and implementing it requires a very robust infrastructure.
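
The multi-armed bandit approach to A/B testing two deployed models can be sketched with an epsilon-greedy policy: most traffic goes to the model with the best observed reward, while a small fraction keeps exploring. The model names and accuracy numbers below are simulated stand-ins for real feedback signals:

```python
import random

random.seed(0)

# Unknown in practice; here we simulate each model's true success rate.
true_accuracy = {"model_a": 0.70, "model_b": 0.85}
counts = {m: 0 for m in true_accuracy}
rewards = {m: 0.0 for m in true_accuracy}
epsilon = 0.1  # fraction of traffic reserved for exploration

for _ in range(5000):
    if random.random() < epsilon:
        model = random.choice(list(true_accuracy))          # explore
    else:
        model = max(counts,                                  # exploit best estimate
                    key=lambda m: rewards[m] / counts[m] if counts[m] else 1.0)
    counts[model] += 1
    rewards[model] += 1.0 if random.random() < true_accuracy[model] else 0.0

print(max(counts, key=counts.get), counts)  # traffic concentrates on model_b
```

In production the "reward" would come from the predict-analyze feedback loop described above (operator checks or customer feedback), not from a known accuracy.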

Model, data, and experiment sharing – here you need central, access-controlled sharing between different teams, for which a model or data registry is needed.
Model tuning – for which there are popular tools like Optuna or Katib.
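
What tuners like Optuna or Katib automate can be reduced to its simplest form: sample hyperparameters, evaluate an objective, keep the best trial. The objective below is a synthetic stand-in (minimized near `lr = 0.01`); in reality it would train a model and return a validation metric:

```python
import math
import random

random.seed(42)

def objective(lr):
    # Synthetic proxy for validation loss; a real objective would train a
    # model with this learning rate and return its validation score.
    return (math.log10(lr) + 2) ** 2

best_lr, best_loss = None, float("inf")
for _ in range(50):
    lr = 10 ** random.uniform(-4, 0)   # log-uniform sample over [1e-4, 1]
    loss = objective(lr)
    if loss < best_loss:
        best_lr, best_loss = lr, loss

print(best_lr, best_loss)
```

Optuna and Katib replace the random sampling with smarter search strategies (Bayesian optimization, hyperband, etc.) and run trials in parallel, but the trial-and-select loop is the same.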

Kubeflow as MLOps Pipeline component

What is Kubeflow

Kubeflow is an open-source platform, based on Kubernetes, designed to simplify the development and deployment of ML systems. Described in the official documentation as the ML toolkit for Kubernetes, Kubeflow comprises several components covering the various stages of the machine learning development lifecycle: notebook development environments, training, model monitoring, hyperparameter tuning, feature management, model serving, and, of course, machine learning pipelines.

[Figure: the Kubeflow central dashboard UI]
