At work, the BI environment is often set up and ready to go. At home, when I need to do data analysis myself, it really helps to have data pipeline and visualization tools ready to go. Over time, I've developed my go-to open source data analytics stack that runs on my local machine. The repo, https://github.com/l1990790120/local-data-stack, is self-contained. In this post, I'll share more details on how it works and how to use it.
Data engineers rarely have a say in what's coming into the systems we've built. This presents a real challenge: data systems often need to be tolerant of unseen events, while at the same time having extra monitoring or QA processes that let a human determine whether an exception actually signals a broader system failure. Machine learning systems have brought this challenge to a new level - in traditional data pipelines, system failures are mostly deterministic, or at least reproducible when certain conditions are met, whereas an ML system's failures can depend on the data itself and may be much harder to reproduce.
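To make "tolerant plus monitored" concrete, here is a minimal sketch of what such a step could look like in Python. All of the field names and record shapes are made up for illustration; the point is that an unexpected record gets parked and flagged for human review rather than crashing the run:

```python
# Sketch of a "tolerant" pipeline step: records that fail validation are
# routed to a dead-letter list for human review instead of failing the job.
# Field names and records below are illustrative, not from a real system.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("pipeline")

REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}

def process(records):
    ok, dead_letter = [], []
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            # Don't fail the whole pipeline on an unseen event shape;
            # park the record and alert so a human can decide whether
            # it's noise or a sign of a broader upstream failure.
            logger.warning("unexpected record, missing %s: %r", missing, record)
            dead_letter.append(record)
        else:
            ok.append(record)
    return ok, dead_letter

good, for_review = process([
    {"user_id": 1, "event_type": "click", "timestamp": 1700000000},
    {"user_id": 2, "event_type": "purchase"},  # unseen shape: no timestamp
])
print(len(good), "processed,", len(for_review), "waiting for human QA")
```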
Planning resources for data systems usually involves more than putting a load balancer in front. In many data processing pipelines, it's common to see that some steps are more resource-demanding while others are simple and quick, and that some need to happen in a specific setup (say, a Spark cluster as opposed to a Linux box with Python installed) while others don't. Here are some things to think about when you are building or trying to improve existing data processing pipelines, starting with a small sketch of what I mean by heterogeneous steps.
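One way to keep that heterogeneity manageable is to make each step's resource needs explicit rather than implicit. Below is a minimal sketch; the step names, CPU/memory numbers, and runtime labels are all made up, and a real scheduler would consume this metadata instead of a print loop:

```python
# Sketch: declare per-step resource needs up front so a scheduler (or a
# human planning capacity) can place each step on the right infrastructure.
# Profiles, names, and numbers here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    runtime: str    # e.g. "spark" for cluster jobs, "local" for a plain Python box
    cpus: int
    memory_gb: int

pipeline = [
    Step("extract",   runtime="local", cpus=1,  memory_gb=2),    # simple and quick
    Step("transform", runtime="spark", cpus=32, memory_gb=128),  # resource demanding
    Step("load",      runtime="local", cpus=2,  memory_gb=4),
]

for step in pipeline:
    target = "Spark cluster" if step.runtime == "spark" else "single Linux box"
    print(f"{step.name}: {step.cpus} CPUs / {step.memory_gb} GB -> {target}")
```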