Data engineers rarely have a say in what's coming into the systems we've built. This presents a real challenge: data systems often need to be tolerant of unseen events, while also carrying extra monitoring or QA processes that let a human determine whether an exception actually signals a broader system failure. Machine learning systems have brought this challenge to a new level. In traditional data pipelines, system failures are mostly deterministic, or at least reproducible when certain conditions are met; machine learning failures can be statistical and data-dependent, which makes them much harder to reproduce and reason about.
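A minimal sketch of this "tolerate, but surface" pattern might look like the following. Everything here is illustrative and hypothetical (the `transform` callable, the quarantine threshold, the `print`-based alert), not a specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class QuarantineStats:
    processed: int = 0
    quarantined: int = 0
    samples: list = field(default_factory=list)

def run_step(records, transform, stats, alert_threshold=0.05, max_samples=10):
    """Apply `transform` to each record; quarantine failures instead of crashing."""
    good = []
    for record in records:
        stats.processed += 1
        try:
            good.append(transform(record))
        except Exception as exc:
            # Tolerate the unseen event, but keep evidence for a human to review.
            stats.quarantined += 1
            if len(stats.samples) < max_samples:
                stats.samples.append((record, repr(exc)))
    # Surface a signal instead of silently absorbing errors: if too many
    # records are quarantined, a human should decide whether this is a
    # one-off bad input or a broader system failure.
    if stats.processed and stats.quarantined / stats.processed > alert_threshold:
        print(f"ALERT: {stats.quarantined}/{stats.processed} records quarantined; "
              f"samples for review: {stats.samples}")
    return good
```

The design choice is that individual bad records never stop the pipeline, but their rate and a handful of samples are always visible, so the judgment call stays with a human rather than being baked into an exception handler.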
Planning resources for data systems usually involves more than putting a load balancer in front. In many data processing pipelines, some steps are far more resource-demanding than others, some are simple and quick, and some need to run in a specific environment (say, a Spark cluster as opposed to a Linux box with Python installed) while others don't. Here are some things to think about when you are building new data processing pipelines or trying to improve existing ones.
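One way to make that uneven resource profile explicit is to declare per-step requirements and route each step to an appropriate runtime. The `Step` dataclass and `dispatch` function below are a sketch of that idea under assumed names, not any particular orchestrator's interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    fn: Callable
    runtime: str       # e.g. "spark" vs. "local" -- where this step must run
    cpu: int = 1       # rough resource profile, used for capacity planning
    memory_gb: int = 4

# Hypothetical pipeline: one heavy distributed step between two light local ones.
pipeline = [
    Step("extract", fn=lambda data: data, runtime="local"),
    Step("join_and_aggregate", fn=lambda data: data,
         runtime="spark", cpu=64, memory_gb=256),
    Step("publish", fn=lambda data: data, runtime="local"),
]

def dispatch(step: Step, data):
    # A real system would submit to a Spark cluster or run in-process;
    # here we just branch on the declared runtime to show the routing.
    if step.runtime == "spark":
        print(f"submitting {step.name} to cluster "
              f"({step.cpu} cores, {step.memory_gb} GB)")
    else:
        print(f"running {step.name} locally")
    return step.fn(data)

data = {}
for step in pipeline:
    data = dispatch(step, data)
```

Declaring the requirements next to each step keeps the resource decisions reviewable in one place, instead of scattering them across deployment scripts.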