Data engineers rarely have a say in what flows into the systems we've built. This creates a real challenge: data systems need to be tolerant of unseen events, while extra monitoring or QA processes let a human determine whether an exception actually signals a broader system failure. Machine learning systems take this challenge to a new level. In data pipelines, system failures are mostly deterministic, or at least reproducible once certain conditions are met. Machine learning outputs are stochastic; when exceptions are raised, there are far more probable causes, from the data to the application, and the stochastic behavior does not make investigation any easier.
Over the past year, I've transitioned from building more traditional ETL and data transformation pipelines to building machine learning pipelines. Over the years, I've worked with various distributed technologies (Cassandra, Kafka, Hadoop, Spark, Kubernetes, etc.) and built a few custom distributed data systems myself. Data volume only grows, and with it the computation requirements, whether for a single instance or for scaling out a compute cluster. On the journey to help data science teams experiment faster, more easily, and more safely, I've learned that scaling machine learning infrastructure is quite different from scaling data pipelines, even though they often fall under the same umbrella of data engineering.
Think about it: one can scale workers in two ways: 1) train a worker to do the work more efficiently, or 2) add more workers.
Scaling computation is not much different: the former falls into the category of HPC (High Performance Computing), and the latter is called Distributed Computing. I have long been an advocate for Distributed Computing over HPC, at least for data pipelines. That view has shifted drastically as I work with more and more machine learning algorithms.
A distinct difference between machine learning and data processing algorithms is that machine learning algorithms are, by nature, very complicated. There are great initiatives to train deep learning models on multiple GPUs, and they have shown significant performance improvements. Outside of the deep learning world, where the data accumulated through the training phases is not as ridiculously large, training classification or regression algorithms such as xgboost is often more efficient on a single node than in a distributed fashion.
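To make that concrete, here is a minimal sketch of the "scale up, not out" approach: single-node xgboost training that still uses every core on the machine. The dataset, hyperparameters, and thread count are placeholders for illustration, not a benchmark or a recommendation.

```python
# A minimal sketch of single-node, multi-core xgboost training.
# Dataset shape and hyperparameters are illustrative placeholders.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.1,
    "nthread": 16,  # scale "up": use every core on the box
    "eval_metric": "auc",
}

# All of the work stays on one machine; there is no cluster scheduler,
# shuffle, or cross-node gradient synchronization to reason about.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=200,
    evals=[(dvalid, "validation")],
)
```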
There are two main differences between machine learning and ETL pipeline infrastructure:
It's important to think about who the platform is supporting and how they are going to use it. When building data pipelines, developers usually have a good understanding of the business logic and the expected output. That's not to say developers can't understand machine learning algorithms; it's just that a lot more knowledge transfer is involved than with business logic.
Because building models requires a lot of expertise outside the engineering domain, it's more efficient to have data scientists focus on developing models. On the other hand, serving predictions in real time and retraining models with new data are challenging yet common engineering problems. It's quite common for engineering to work with other disciplines, and a collaborative dynamic between engineering and data science can be the key to actually using models in production (beyond research projects).
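As a flavor of the engineering side of that collaboration, here is a minimal sketch of a real-time scoring service. Flask is just one reasonable choice here, and the model path and payload schema are hypothetical placeholders; validation, batching, and monitoring are omitted.

```python
# A hypothetical real-time scoring endpoint; the model path and request
# schema are placeholders, and production concerns are left out.
import numpy as np
import xgboost as xgb
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup rather than on every request.
booster = xgb.Booster()
booster.load_model("model.bin")  # placeholder path

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[0.1, 0.2, ...], ...]}
    features = np.asarray(payload["features"], dtype=float)
    scores = booster.predict(xgb.DMatrix(features))
    return jsonify({"scores": scores.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```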
The relationship between developers and IT/Ops teams is not so different from that between engineering and data science. Developers know their applications best, so it makes the most sense for them to deploy applications their own way. However, if everyone does things their own way, it becomes very difficult for the IT/Ops team to keep up with demand while maintaining the performance and stability of applications running across the infrastructure.
Fast forward to now: cloud is mainstream for most small tech teams, and enterprises are catching up. All major cloud providers, as well as successful DevOps teams, are building or enhancing platforms and tooling (such as Kubernetes) that let developers manage their own application deployments while abstracting away best practices in networking, security, monitoring, and so on.
It's clear that engineering needs to build for enablement, not create limitations for data scientists. The machine learning ecosystem is expanding and changing at an extremely fast pace. ETL has been around for decades; even though custom workloads can still be introduced, tools such as SQL or Pandas are well established, so it's easier to find common ground with business analysts than between data scientists and data engineers. A platform supporting data science experimentation needs to provide that same kind of flexibility.
Presented with these challenges, our team in Hux at Deloitte Digital had great success with Kubeflow. Things that helped us:
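To give a sense of what that self-service looks like, here is a minimal sketch of a Kubeflow pipeline written with the kfp SDK (v1-style API). The component logic, names, and base image are illustrative placeholders rather than our actual pipeline.

```python
# A minimal Kubeflow Pipelines sketch (kfp v1-style SDK); components,
# names, and the base image are placeholders for illustration.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(rows: int) -> int:
    """Placeholder preprocessing step."""
    print(f"preprocessing {rows} rows")
    return rows

def train(rows: int):
    """Placeholder training step."""
    print(f"training on {rows} rows")

preprocess_op = create_component_from_func(preprocess, base_image="python:3.9")
train_op = create_component_from_func(train, base_image="python:3.9")

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 1000):
    prep = preprocess_op(rows)
    train_op(prep.output)  # runs after preprocessing completes

if __name__ == "__main__":
    # Compile to a workflow spec that Kubeflow can schedule on Kubernetes.
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

Data scientists can define steps like these in Python and let the platform handle containers, scheduling, and retries on Kubernetes.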
Now, this is not to say that Kubeflow and Kubernetes are the antidote to all problems. After running the platform for several months, we are now facing new engineering challenges:
One last note. For those in the market for machine learning platforms or frameworks, there are many more options. The space is still under rapid development and far from convergence. Other tools to look out for:
Here's an even more extensive list of tools on Reddit – link.
Happy data engineering!