Planning resources for data systems usually involves more than a load balancer. In many data processing pipelines, some steps are resource-intensive while others are simple and quick, and some need to run in a specific environment (say, a Spark cluster as opposed to a Linux box with Python installed) while others don't.
Here are some things to think about when you are building a new data processing pipeline or trying to improve an existing one.
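One way to make these differences concrete is to write down a resource profile per step. This is a minimal sketch with made-up step names and numbers, just to show the idea of grouping steps by the environment they need:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One pipeline step with its (hypothetical) resource profile."""
    name: str
    cpus: int        # cores needed at peak
    memory_gb: int   # peak memory
    runtime: str     # "spark" for cluster steps, "python" for a plain box

# A toy pipeline: the heavy join needs a Spark cluster,
# while ingest and reporting run fine on a single machine.
pipeline = [
    Step("ingest", cpus=2, memory_gb=4, runtime="python"),
    Step("heavy_join", cpus=64, memory_gb=256, runtime="spark"),
    Step("report", cpus=1, memory_gb=2, runtime="python"),
]

# Group steps by runtime so each group can be scheduled on matching hardware.
spark_steps = [s.name for s in pipeline if s.runtime == "spark"]
python_steps = [s.name for s in pipeline if s.runtime == "python"]
print(spark_steps)
print(python_steps)
```

Even a rough table like this makes it obvious which steps justify expensive hardware and which can share a cheap box.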
I use this little GitHub site to host Jupyter notebooks for machine learning projects I've done, plus some toy examples of cool visualizations combining d3 with notebooks and Python. I had been using Jekyll for years, but I reached a point where I rarely maintained the site, and because Jekyll is such a flexible and extensible library, every time I tried to update something it became harder to navigate. So I recently pulled the trigger and moved to Hugo.
I've run a couple of classification ML algorithms on the dataset. What makes this problem interesting is that most of the students did not pass the course. To handle the class imbalance, I re-sampled the positive cases multiple times so the algorithms penalize misclassified positive cases (false negatives) more severely.
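The re-sampling step can be sketched like this, with toy data standing in for the student records (the real dataset and features are not reproduced here):

```python
import random

random.seed(0)

# Toy imbalanced dataset: label 1 = passed (minority), 0 = did not pass.
data = [(x, 0) for x in range(90)] + [(x, 1) for x in range(90, 100)]

positives = [row for row in data if row[1] == 1]
negatives = [row for row in data if row[1] == 0]

# Oversample the minority class with replacement until the classes balance.
# Duplicating positives means each misclassified positive (a false negative)
# contributes more to the training loss, so the model is punished harder for it.
oversampled_positives = random.choices(positives, k=len(negatives))
balanced = negatives + oversampled_positives

print(len(balanced))  # 90 negatives + 90 resampled positives
```

The same effect can be had with class weights in most scikit-learn classifiers; oversampling is just the most transparent way to do it.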
A showcase of what you can do with the IPEDS data API: a choropleth with d3. I've also tried this in Beaker Notebook. More details to come!
Using mpld3 for visualization in IPython with Kaggle's Airbnb data. The first experience was great!
Using ARIMA to forecast college enrollment for 2015 and 2016 at the institution level.
Using ARIMA to forecast college enrollment for 2015 and 2016 at the state level.