Or Both. But Mostly Latter..

Once upon a time, to build a website, you had to buy a computer. One day, someone had the brilliant idea to buy many computers that nerdy people like myself could pay to use — we called them servers. Almost 20 years ago, AWS had an even more brilliant idea to buy many servers and let less nerdy people pay to use them — we called it cloud compute. Today, everyone can host a website with a few clicks or sentences. Sound familiar?

As some of you may know, with the help of many very talented and kind people, I alone with others have been building out a teeny tiny SaaS platform apiobuild.com in the past several months - A low code platform enabling everyone to create automation software at ease. We started the project as pandemic hit, people were getting creative with their career (out of necessity or exploration), we saw a healthy demand where people really needs some form of very niche software yet have no clue how to get started.

Almost in every AI or Machine Learning conferences I’ve been to lately, there’s a track dedicating to biases or “injustices” in algorithmic decisions. Books have been published (Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, Algorithms of Oppression: How Search Engines Reinforce Racism etc.) and fear has been spread (Elon Musk says AI development should be better regulated, even at Tesla ).

The fear of unknown is, perhaps, more persuasive than a realistic survey of the state of AGI (Artificial General Intelligence) development. Admittedly, from the very beginning of my career, I despised those who lacks the imagination of how data and algorithms can improve the quality of human decisions - I have always believed that human intelligence could be drastically improved when augmented with the right information at the right time.

Companies who could afford in-house engineering teams often have tendencies to build custom in-house solutions no matter how prevalent that solution already is. To highlight the absurdity, let me give an example: Jupyter Notebook. Despite the space being extremely competitive, Google has Colab, Amazon has Sagemaker, Azure, Databricks, Domino, Binder all offer similar services and the list goes on, Product or Technology still cannot resist the urge to build one (if not multiple) in-house. Not to mention some notable companies spent fortunes on building fully integrated notebooks on their platforms.

At work, the BI environment is often setup and ready to go. At home, when I need to do data analysis myself, it really helps if there’s data pipeline and visualization tools ready to go. Over time, I’ve developed my go-to open source data analytics stack that runs on my local machine. The repo: https://github.com/l1990790120/local-data-stack is self-contained. In this post, I’ll share a bit more details on how it works and how to use it.

About two days ago, one of my coworkers was so fed up by Jenkins and decided to try Github Action. I’ve been thinking about automating publishing this github site since … the day I set it up. At work, if I have to setup a CD pipeline, I’d usually put it on Jenkins. But at home, I just want to sit back and relax, I don’t want to spend my Netflix time fixing Jenkins (which unfortunately it breaks all the time at work). So, I decided to find a way to setup automated static content publishing process (or the fancy term – CD) in Github Action for this site (repo: https://github.com/l1990790120/l1990790120.github.io).

Data engineers rarely have a say in what’s coming in the systems we’ve built. This presents great challenges where data systems often need to be tolerant about unseen events and at the same time have extra monitoring or QA processes to allow human to determine if the exception actually signals a broader system failure. Machine learning systems have brought this challenge to a new level - in data pipelines, system failures are mostly deterministic or at least reproducible when certain conditions are met. Machine learning applications outputs are stochastic, when exceptions are raised, there are way more probable causes from data to application where stochastic behavior does not make investigation any easier.