I have been working professionally for close to 10 years across seven companies (not counting part-time jobs or internships). Disclaimer: I am no workplace expert. I am an average cog in the wheel who reads and thinks a lot about the subject (and perhaps has been through a few not-so-great workplaces myself). This might or might not be writing therapy, reflecting on my work life over the years and most recently.
As some of you may know, with the help of many very talented and kind people, I, along with others, have been building out a teeny tiny SaaS platform, apiobuild.com, over the past several months - a low-code platform enabling everyone to create automation software with ease. We started the project as the pandemic hit, when people were getting creative with their careers (out of necessity or exploration). We saw healthy demand from people who really needed some form of very niche software yet had no clue how to get started.
At almost every AI or machine learning conference I’ve been to lately, there’s a track dedicated to biases or “injustices” in algorithmic decisions. Books have been published (Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor; Algorithms of Oppression: How Search Engines Reinforce Racism; etc.) and fear has been spread (Elon Musk says AI development should be better regulated, even at Tesla). The fear of the unknown is, perhaps, more persuasive than a realistic survey of the state of AGI (Artificial General Intelligence) development.
Companies that can afford in-house engineering teams often tend to build custom in-house solutions no matter how prevalent off-the-shelf options already are. To highlight the absurdity, let me give an example: Jupyter Notebook. The space is extremely competitive - Google has Colab, Amazon has SageMaker, and Azure, Databricks, Domino, and Binder all offer similar services, and the list goes on - yet Product or Technology still cannot resist the urge to build one (if not multiple) in-house.
At work, the BI environment is often set up and ready to go. At home, when I need to do data analysis myself, it really helps to have data pipelines and visualization tools ready to go as well. Over time, I’ve developed a go-to open-source data analytics stack that runs on my local machine. The repo, https://github.com/l1990790120/local-data-stack, is self-contained. In this post, I’ll share more details on how it works and how to use it.
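To give a flavor of the kind of local analysis such a stack supports, here is a minimal sketch (not code from the repo itself; the file names and columns are hypothetical) of an ingest-store-visualize loop on a single machine, using pandas, SQLite, and matplotlib:

```python
# Minimal local pipeline sketch: ingest -> store -> visualize.
# File and column names are hypothetical; pandas and matplotlib assumed installed.
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

# Ingest: read a raw CSV export (hypothetical file).
df = pd.read_csv("events.csv", parse_dates=["created_at"])

# Store: keep a queryable local copy in SQLite.
with sqlite3.connect("local.db") as conn:
    df.to_sql("events", conn, if_exists="replace", index=False)
    daily = pd.read_sql(
        "SELECT date(created_at) AS day, COUNT(*) AS n FROM events GROUP BY day",
        conn,
    )

# Visualize: a quick daily-volume chart.
daily.plot(x="day", y="n", kind="bar", legend=False, title="Events per day")
plt.tight_layout()
plt.savefig("events_per_day.png")
```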
About two days ago, one of my coworkers got so fed up with Jenkins that they decided to try GitHub Actions. I’ve been thinking about automating publishing of this GitHub site since … the day I set it up. At work, if I have to set up a CD pipeline, I usually put it on Jenkins. But at home, I just want to sit back and relax; I don’t want to spend my Netflix time fixing Jenkins (which, unfortunately, breaks all the time at work).
Data engineers rarely have a say in what comes into the systems we’ve built. This presents a real challenge: data systems often need to be tolerant of unseen events, while also having extra monitoring or QA processes so a human can determine whether an exception actually signals a broader system failure. Machine learning systems have brought this challenge to a new level - in data pipelines, system failures are mostly deterministic, or at least reproducible when certain conditions are met; in ML systems, failures can be silent, with the pipeline running fine while the outputs quietly degrade.
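As an illustration (a minimal sketch, not from any particular system; field names and the 5% threshold are made up), a guardrail like the one below tolerates unseen values instead of crashing, but counts them and raises a warning so a human can decide whether the anomaly signals something broader:

```python
# Sketch of a tolerant pipeline step: unknown values don't crash the run,
# but they are counted and surfaced for human review. Names and the
# alert threshold are hypothetical.
import logging
from collections import Counter

logger = logging.getLogger("pipeline.qa")

KNOWN_EVENT_TYPES = {"click", "view", "purchase"}

def process(records: list[dict]) -> list[dict]:
    unseen: Counter[str] = Counter()
    kept = []
    for rec in records:
        event = rec.get("event_type")
        if event in KNOWN_EVENT_TYPES:
            kept.append(rec)
        else:
            unseen[str(event)] += 1  # tolerate, but remember

    # QA signal: if unseen events dominate, something upstream likely changed.
    if records and sum(unseen.values()) / len(records) > 0.05:
        logger.warning("unseen event types above 5%%: %s", dict(unseen))
    return kept
```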
Planning resources for data systems usually involves more than a load balancer. In many data processing pipelines, it’s common to see that some steps are resource-demanding while others are simple and quick, and that some need to happen in a specific setup (say, a Spark cluster as opposed to a Linux box with Python installed) while others don’t. Here are some things to think about when you are building data processing pipelines or trying to improve existing ones.
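One simple way to make those differences explicit (a sketch with made-up step names and profiles, not tied to any particular orchestrator) is to declare a resource profile per step, so a scheduler can route each step to an appropriate environment:

```python
# Sketch: tag each pipeline step with its resource needs so heavy steps
# can be routed to the right environment. Step names, profiles, and
# dispatch targets are all hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceProfile:
    cpus: int
    memory_gb: int
    needs_spark: bool = False

STEPS = {
    "extract": ResourceProfile(cpus=1, memory_gb=2),  # simple and quick
    "join_and_aggregate": ResourceProfile(cpus=16, memory_gb=64, needs_spark=True),
    "publish_report": ResourceProfile(cpus=1, memory_gb=1),
}

def dispatch(step: str) -> str:
    """Pick an execution target from a step's declared profile."""
    profile = STEPS[step]
    if profile.needs_spark:
        return "spark-cluster"       # submit as a Spark job
    if profile.memory_gb > 16:
        return "high-memory-worker"  # beefier single machine
    return "default-worker"          # plain Linux box with Python

for step in STEPS:
    print(f"{step} -> {dispatch(step)}")
```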