Data engineers rarely have a say in what comes into the systems we’ve built. This is a real challenge: data systems need to tolerate unseen events while also having extra monitoring or QA processes that let a human determine whether an exception actually signals a broader system failure. Machine learning systems take this challenge to a new level. In data pipelines, system failures are mostly deterministic, or at least reproducible when certain conditions are met. The outputs of machine learning applications are stochastic, so when exceptions are raised there are many more probable causes, anywhere from the data to the application, and the stochastic behavior does not make the investigation any easier.
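To make the "tolerant, but monitored" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the event shape (a dict with a `type` key), the `KNOWN_EVENT_TYPES` set, the `process_events` function, and the `max_unknown_ratio` threshold are hypothetical and not taken from any real pipeline.

```python
from collections import Counter

# Hypothetical set of event types the pipeline knows how to handle.
KNOWN_EVENT_TYPES = {"signup", "purchase"}


def process_events(events, max_unknown_ratio=0.1):
    """Process known events, quarantine unseen ones, and flag the batch for review.

    `events` is a list of dicts with a "type" key (an assumed shape).
    Unseen event types never raise; they are set aside and counted so a
    human can decide whether they signal a broader failure.
    """
    processed, quarantined = [], []
    unknown_counts = Counter()

    for event in events:
        event_type = event.get("type")
        if event_type in KNOWN_EVENT_TYPES:
            processed.append(event)
        else:
            quarantined.append(event)
            unknown_counts[event_type] += 1

    total = len(events)
    unknown_ratio = len(quarantined) / total if total else 0.0
    # The threshold is an arbitrary illustration; tune it per pipeline.
    needs_review = unknown_ratio > max_unknown_ratio
    return processed, quarantined, unknown_counts, needs_review
```

The batch keeps flowing when something unexpected shows up, and the `needs_review` flag is the hook where monitoring or a QA process would come in.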
Planning resources for data systems usually involves more than a load balancer. In many data processing pipelines, it’s common for some steps to be resource-intensive while others are simple and quick, and for some to require a specific setup (say, a Spark cluster as opposed to a Linux box with Python installed) while others don’t.
Here are some things to think about when you are building a data processing pipeline or trying to improve an existing one.
I use this little GitHub site to host Jupyter notebooks for machine learning projects I’ve done, along with some toy examples of cool visualizations with d3, notebooks, and Python. I’ve been using Jekyll for years, but I rarely maintain this site, and because Jekyll is such a flexible and extensible tool, every time I try to update something it gets harder to navigate. I finally pulled the trigger and moved to Hugo recently.
I’ve run a couple of classification ML algorithms on the dataset. What makes this problem interesting is that most of the students did not pass the course. I’ve re-sampled the positive cases multiple times so that the algorithms penalize errors on those cases more severely.
A showcase of what you can do with the IPEDS data API: a choropleth with d3. I’ve also tried this in a Beaker notebook. More details to come!
Using mpld3 for visualization in IPython with Kaggle’s Airbnb data. The first experience is great!
Using ARIMA to forecast college enrollment for 2015 and 2016 at the institution level.
Using ARIMA to forecast college enrollment for 2015 and 2016 at the state level.