🤓 facts about me:
- passionate about technologies that change people's lives
- coding since I was 15 years old
- 12+ years of programming, 7+ with Python, 7+ in data
- my main interests are in open-source, self-service data platforms, MLOps, DDD, and TDD
💼 I have many years of experience building data platforms and dev tools in modern tech organizations. From small startups (< 50 people) to corporations of 10k+, I know how to operate at different growth stages.
📚 I studied at the Federal University of São Paulo (UNIFESP), where I earned a bachelor's degree in Science and Technology and another in Computer Science. I also recently finished a Master of Science in intelligent systems; my research over the last 4 years focused on automated anomaly detection for data quality, exploring novel AutoML and Metrics Repository architectures. UNIFESP is one of the most prestigious universities in Brazil and is fully funded by the Brazilian government. While I was there, it was ranked among the top 5 in all of LATAM by Times Higher Education.
If you're passionate about data quality like me, you'll definitely like my publication, which can be read here.
languages | Python (main language), Shell, SQL |
dev/data ops | Git, GitHub Actions, Drone CI/CD, CircleCI; Docker, Kubernetes, Helm; Datadog |
data oss | dbt, Apache Spark (PySpark), Databricks, Airflow, Airbyte; PostgreSQL, Cassandra, MySQL, MongoDB, DynamoDB, Redis, DuckDB; Kafka, NATS.io |
cloud | AWS: S3, EMR, ECR, Athena, RDS, Redshift, Glue, Lambda, SNS, SQS, EC2; GCP: Composer, Cloud Storage, BigQuery, Datastore, Cloud Run, Compute Engine, Kubernetes Engine, Artifact Registry |
OS | Linux, macOS |
🐍 libs I ❤️ | aiohttp, fastapi, pydantic, typer, scrapy, streamlit, tenacity; Test and Quality: pytest, mypy, flake8, isort, black; ML/AI/DS: langchain, scikit-learn, prophet, merlion, jupyter, pandas, numpy, matplotlib, seaborn |
- biar: batteries-included async requests tool for Python
- thoth: Python tool for profiling-based anomaly monitoring on ETL data pipelines, leveraging ML and Apache Spark
I'm also the co-creator of butterfree, a tool for feature engineering and building feature stores. We created it when I was on the first MLOps squad at @quintoandar. It's used for most ML data pipelines there and has 260+ stars on GitHub.
I've also made contributions to the following awesome open-source libraries:
- airflow: the biggest open-source orchestration framework, created by Airbnb
- aws-sdk-pandas: easy data integration with AWS services, created by AWS
- merlion: a time series forecasting library for Python, created by Salesforce
- sageintacct-sdk-py: a Python SDK created by the open-source community for Sage Intacct (a market leader in accounting, payroll, and payments solutions)
I have a bunch of data engineering test cases that landed me senior positions at competitive tech companies. So before asking me to do a take-home assignment, please check these out instead 👇
- strider-challenge: a simple typer and sqlmodel application developed with DDD and TDD
- pyspark-pipeline: an implementation of a PySpark data aggregation pipeline with automated tests
- legiti-challenge: a solution for building and running feature store pipelines
- meli-challenge: a solution for the character interactions problem using graphs and Spark
Here's an archive of old college projects (don't judge me 😅):
- ntsa: repository for code, reports, and projects from the Nonlinear Time Series Analysis class of the Computer Science master's program at the Federal University of São Paulo (UNIFESP)
- neural-networs: repository for the projects of the 2019 Neural Networks class at the National Institute for Space Research (INPE)
- software-testing: repository for the projects of the 2020 Software Testing class at the Federal University of São Paulo (UNIFESP)
@rafaelleinio on Discord