Lulu Cheng

New York, NY, 10013
lulu.cheng90@gmail.com
l1990790120.github.io/about

Technology

Development: Scala, Python, R, Java, Go
Data: Redshift, Cassandra, Kafka, MQ, Airflow, Superset, Spark, Hive, Presto, Couchbase, MongoDB, Oracle
Infrastructure: Kubernetes, Helm, Argo, Docker, Cloud Run, GKE, AWS, GCP, Jenkins, Cloud Build, EMR, Lambda, Batch, Hadoop, Bash
Visualization/Web App: Python (Django, Flask), HTML, CSS, JavaScript (d3.js, dc.js, crossfilter.js)
Machine Learning: MLflow, Kubeflow, XGBoost, LightGBM
BI: Tableau, Alteryx

Work Experience

Machine Learning Engineer, Reddit, New York, NY 2025-04 to Present

Founding member of Reddit’s indexing platform team. Scaled from 0 to 100+ production pipelines within the first year, supporting safety, ads, and content understanding teams across the organization. Led cross-functional onboarding of many customer teams and improved platform usability, accelerating adoption across product teams. The indexes support core ranking and relevance signals across Reddit’s feeds, ads, and safety systems, driving user engagement and content quality at scale.
Built an online and offline LLM inference system serving 200k+ predictions/sec across pipelines. Scaled the corresponding offline batch infrastructure to process 100M+ text, image, and video assets for full-history content understanding.
Designed core platform abstractions and a config-driven indexing framework that enable product teams to define a single pipeline specification that deploys both online and offline inference workloads, each with independent scaling characteristics. Improved developer velocity and enabled product teams to deploy production-grade pipelines without taking on infrastructure overhead such as permissions, security, and scalability.

Software Engineer/Engineering Manager, Block, New York, NY 2021-11 to 2025-04

Led the modernization and consolidation initiative for Block’s data streaming infrastructure, serving as both an IC and EM. Streamlined and unified 7+ disparate data streaming platforms across BUs, reducing vendor and maintenance costs, operational complexity, and security risks from inconsistent implementations. Over 1.5 years, the Data Streaming Infrastructure team consolidated 4+ separate platforms, improving development efficiency and accelerating the lifecycle from event streaming to downstream data applications. Implemented company-wide generic data governance and security abstractions and frameworks to strengthen Block’s overall data ecosystem.
Streamlined Block’s governance, privacy, and security processes. Developed a shared layer of abstractions and frameworks spanning from event streams to downstream applications (data lake, feature stores, AI/ML applications, etc.). Led the Data Streaming Infrastructure team in developing tooling, monitoring solutions, and processes that empower internal teams to manage their own data products effectively. Significantly improved Block’s overall engineering efficiency, governance protocols, and security posture.
Contributed to Block’s overall data infrastructure, governance, and security strategy. Introduced a software development strategy across BUs and product teams that balances flexibility and speed by implementing layers of governance, privacy, and security abstractions aligned with the organizational structure. Empowered product teams to build with security, privacy, and safety by default, in line with Block’s commitment to robust and responsible software development standards.
Led a team of 7 for 2+ years. Stepped in as interim manager for the larger Data Streaming Platform organization of 13. Experienced as a hands-on EM in day-to-day team operations, people management, process improvement, and higher-level organizational and cross-team alignment, setting and communicating broader technical direction both internally and externally.

Master Data Engineer/Manager, Data Engineering, Capital One, New York, NY
2020-03 to 2021-11

Machine Learning Platform
- Contribute to the platform’s model building and training module. Create an abstraction layer for hyperparameter tuning packages (Optuna, SparkML) to allow easy integration with a variety of workflow and experimentation tools (Airflow, MLflow, Dask, Argo).
- Productionize the document vulnerability scan model pipeline. Experiment with Hugging Face, TensorFlow, and Spark XGBoost against the full dataset (all of Capital One Retail Bank’s S3 documents) to reduce training time.
- Mentor on internal and external data and AI/ML projects. Work with teams of interns and software engineers to research and prototype new AI/ML packages on the enterprise platform.
Identity and Fraud
- Lead cross-team efforts with product, data science, and business operations to “modernize” Capital One’s Identity and Fraud tech and data stack in Retail Banking. Re-architect backend APIs and pipeline to enable real-time analytics, experimentation, dashboard monitoring, and machine learning model development.

Senior Data Engineer, Deloitte Digital, New York, NY
2019-05 to 2020-03

Lead the adoption and migration process that moves group-wide data science and analytics teams to a self-service, open-source AI/ML stack. Improve the overall stability, quality, and performance of AI/ML applications across the organization. Streamline and shorten the cycle time from data science experimentation to production.
Research, evaluate, contribute to, and deploy the AI/ML stack. Integrate and internalize popular modern AI/ML frameworks (Kubeflow, MLflow, H2O, XGBoost, BERT, Spark) into:
1. YAML configurable pipelines for both ad-hoc experiments and production workflows.
2. Reusable components with unified CI/CD processes to validate, build, and share across experiments easily.
3. Kubernetes optimizations to scale hyperparameter search, with automated deployment of scoring services.
4. Integration with external enterprise data science platforms.

Senior Software Engineer (Tech Lead), PeerIQ, New York, NY
2018-01 to 2019-04

Leading the effort to transform existing data systems to be highly distributed and scalable while maintaining flexibility. Transitioning the legacy monolithic ETL application to a microservice architecture leveraging serverless container-based infrastructure, a message bus, and on-demand function calls (Lambda) with minimal business interruption and code changes.
Designing and building a distributed, serverless ETL system that processes drastically heterogeneous data volumes (GBs up to < 5 TB):
1. Reduce run time from days or longer, with dozens of datasets unable to finish at all, to < 30 mins
2. Without touching a line of the business logic code base (10,000+ lines)
3. Up to 50% cost savings on infrastructure (ability to scale from 1 to 1,000 instances only when needed in < 5 mins)
4. Increase developer productivity. The serverless setup allows users to run any version at any scale at the same time for quick validation. Users can debug one specific record and step through it with minimal local environment setup.
Designing and building ML models that recommend system parameters based on system logs, offering true zero configuration (other than what needs to be processed) on a complex distributed system.
Leading and coaching a team of data engineers. Managing dynamic, competing client requests and internal projects to improve the data system’s scalability and reliability, and supporting the engineering of our data products with limited resources.

Software Engineer, PeerIQ, New York, NY
2017-02 to 2018-01

Design and develop big data infrastructure and Spark (Scala) ETL jobs on AWS EMR that process 20+ years of consumer credit records. Maintain and support a wide range of query engines (Hive, Presto), scheduling tools (Airflow), and notebook tools (Jupyter, Zeppelin) to support analytics query needs. Design a serverless big data warehouse architecture with S3 as the data store and AWS Athena (Presto) as the query engine.
Design a highly scalable, production-grade machine learning environment to support internal data science needs.
Develop highly scalable, 15~20x faster valuation/projection microservices in Python and Go, communicating through Kafka and managed with Kubernetes.
Design and develop internal/external data APIs for ETL automation to provide a continuous, automated data quality monitoring tool for the ETL pipeline.

Data Analyst, McGraw Hill Education, New York, NY
2015-05 to 2017-02
Work with management and leadership to develop analytics and dashboards around company strategy and operations

Implement classification algorithms (SVM, Logistic Regression, Decision Tree, Random Forest, KNN and other ensemble methods) to predict fraudulent orders from 400+ million transaction records
Text mining on customer service inquiries to tag and identify high-demand issues
Develop a forecasting approach that uses unsupervised algorithms (EM and K-means) to group similar time-series trends and an ARIMA model to forecast group-aggregate trends
- Forecast enrollment at 7,000 US colleges over the next three years
- Cluster sales patterns on 1.5+ million titles and use classification algorithms to predict the sales patterns of new titles using only non-transactional features
Apply unsupervised algorithm on 16+ million student records to create segmentation and analyze usage behavior
Work with technical and business teams to develop ETL in Python that parses text data from a legacy system into a CSV feed for the Oracle Supply and Demand Planning system
Develop dashboards and tracking applications for customer service and inventory

Statistical Analyst, Radius Global Market Research, New York, NY
2014-04 to 2015-05
Work with major brands in eCommerce, retail and technology:

Manage and execute analytics requirements of market research projects
Design and develop front-end experiments to collect user data (HTML, CSS, JavaScript, jQuery)
Run algorithms on customer segmentation, price elasticity, shelf display optimization
Develop dashboards as Excel VBA-based GUI tools and web applications

Data Analyst, Baldwin Richardson Foods, Rochester, NY
2013-10 to 2014-03

Integrate legacy system into SAP application
Design statistical metrics for real-time reporting in R and Excel to monitor production line

Education

Master of Science, Computer Science, Georgia Institute of Technology
2015-01 to 2017-04

Specialization: Machine Learning
Teaching Assistant for Educational Technology

Master of Arts, Economics, Syracuse University, Syracuse, NY
2012-07 to 2013-05

Bachelor of Arts, Political Science, National Chengchi University, Taipei, Taiwan
2008-09 to 2012-07

Volunteer

Data Expert, Datakind, New York, NY
2019-04 to 2019-05

Work with Plentiful to analyze their pantry and survey data

Data Expert, Datakind, New York, NY
2018-06 to 2018-12

Work with Commit (education) to set up their big data infrastructure on Azure to support data science efforts

Data Expert, Datakind, New York, NY
2016-04 to 2016-10

Work with Threshold (health care) to set up their analytical data warehouse in MongoDB and develop a dashboard in Flask + D3

Volunteer, Humanitarian Data Exchange
2014-05 to 2014-11

Volunteer, Statistics Without Borders
2013-11 to 2017-08

Professional Development

coursera.org

Deep Learning Specialization by deeplearning.ai
- Sequence Models
- Convolutional Neural Networks
- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- Neural Networks and Deep Learning
- Structuring Machine Learning Projects
An Introduction to Interactive Programming in Python, Rice University
Practical Machine Learning, Johns Hopkins University
Machine Learning, Stanford University

udacity

Grow with Google Challenge Scholarship
TensorFlow free course