šŸ’» Code Repository

Production-grade implementations demonstrating data engineering, predictive modeling, and analytics infrastructure across Python, R, SQL, and Stata. Each system showcases technical work deployed in high-stakes production environments, serving billions of dollars in decision-making and published in peer-reviewed research.

<aside> šŸŽÆ

Repository Highlights

āœ… Production systems supporting multi-billion dollar decisions āœ… Emphasis on scalability, reliability, and documentation āœ… ETL/MLOps pipelines, predictive models, data platforms āœ… Battle-tested in mission-critical environments

</aside>


šŸ Python Projects

<aside> šŸƒ

Garmin Running Shoe Performance: Causal Inference for Equipment Optimization

Status: Complete

The Challenge: Determining the true performance impact of running shoes by isolating causal effects from confounding variables. Simple averages are fundamentally flawed—a faster pace could reflect the shoe, or it could reflect higher effort, lower fatigue, or different run types. The goal was to answer: ceteris paribus (all else held equal), which shoes deliver measurable performance gains?

Technical Approach: Built end-to-end causal inference pipeline analyzing 299 runs from Garmin wearable data. Engineered multi-source JSON merge linking activities to gear metadata. Solved critical confounding variable problems through sophisticated feature engineering:

Implemented dual regression framework: Ridge regression (α=1.0) for coefficient stability and OLS regression (statsmodels) for interpretable p-values and statistical inference. Standardized continuous features to enable coefficient comparison across scales.

Key Findings:

Business Impact: Delivered quantifiable, statistically rigorous equipment recommendations backed by causal inference—moving from correlation to causation. Framework demonstrates how personal wearable analytics can optimize performance decisions when confounding variables are properly controlled.

Tools & Methods: Python (pandas, numpy, statsmodels, scikit-learn), OLS & Ridge Regression, Causal Inference, Time-Series Feature Engineering, Rolling Aggregations, Multicollinearity Resolution, StandardScaler, Statistical Hypothesis Testing

Code Highlights:

# Solving multicollinearity: efficiency ratio as single effort proxy
df_features['efficiency'] = df_features.apply(
    lambda row: row['avgPower'] / row['avgHr'] if row['avgHr'] > 0 else 0,
    axis=1
)

# Disentangling fitness vs. fatigue via rolling time-series
df_features['fatigue_7_day_km'] = df_features['total_distance_km'].rolling('7D').sum()
df_features['fitness_42_day_km'] = df_features['total_distance_km'].rolling('42D').sum()

# Cumulative shoe mileage tracking (groupby with cumsum)
df_features['shoe_mileage_at_run'] = df_features.groupby('displayName')['total_distance_km'].cumsum()

# Dual regression framework for stability + inference
ridge_model = Ridge(alpha=1.0).fit(X_scaled, y_scaled)
model_sm = sm.OLS(y_ols, sm.add_constant(X_ols)).fit()

Regression Model Specification:

$$ \text{Speed}{it} = \beta_0 + \sum{j=1}^{5}\beta_j \text{Shoe}j + \beta_6\text{Efficiency}{it} + \beta_7\text{Fitness}{42d,it} + \beta_8\text{Fatigue}{7d,it} + \beta_9\text{Mileage}{it} + \gamma X{it} + \epsilon_{it} $$

where standardized coefficients enable direct effect-size comparison (R² = 0.728, p < 0.001)

View Full Analysis on GitHub

</aside>


šŸ“Š R Projects

<aside> šŸ”„

Production ETL Automation: U.S. Food Import Data Pipeline (2018)

The Challenge: USDA Economic Research Service required automated, reproducible data products for policy-critical agricultural trade analysis. Manual processing of 14 commodity categories across 20 years (1999-2018) and hundreds of trading partners created bottlenecks—taking days per release, introducing human error, and preventing timely response to executive queries. Raw data contained interleaved value/quantity columns, inconsistent HS code structures, and required sophisticated pattern-matching for subcategory classification.

Technical Approach: Engineered comprehensive automated ETL pipeline processing 14 agricultural commodity categories from master Excel workbook into analysis-ready aggregated datasets:

Pipeline Architecture:

  1. Data Extraction: Automated read_excel() ingestion from multi-sheet workbook (importfoodmaster2018.xlsx) with 47 columns per category
  2. Standardization Layer: Systematic column renaming, removal of metadata rows, consistent schema enforcement across all 14 categories
  3. Value/Quantity Split: Engineered separation of interleaved columns (2018 vs 2018__1) into distinct dataframes using dplyr::select() and full_join()—solving data structure complexity
  4. Feature Engineering via Pattern Matching: Created 70+ binary classification flags using HS code ranges (%in% operators) and regex text matching (grepl()) to categorize subcommodities:
  5. Hierarchical Aggregation: Multi-level aggregate() operations by partner country and commodity type using FUN=sum, filtered by Type (Value/Quantity) and Unit of Measure
  6. Data Integration: Chained full_join() operations across all 14 categories to create unified country-level import totals
  7. Export Layer: Automated CSV generation for downstream analytics

Scale & Complexity:

Code Highlights:

# Solving interleaved value/quantity column structure via split-join pattern
animalsv <- animals %>% 
  select('Partner','HS_Code','Product','UOM',
         '1999','2000','2001',...,'2018',
         'Reporter_Code','Partner_Code','Product_Code') %>%
  mutate(Type='Value')

animalsq <- animals %>% 
  select('Partner','HS_Code','Product','UOM',
         '1999__1','2000__1',...,'2018__1',
         'Reporter_Code','Partner_Code','Product_Code') %>%
  mutate(Type='Quantity')

# Recombine into tidy format
animals <- full_join(animalsq, animalsv, by=NULL)

# Feature engineering: Subcategory classification via HS codes + regex
meats$fresh <- ifelse(meats$HS_Code %in% 20000:20699,
                      ifelse(grepl("r/c", meats$Product, [ignore.case](<http://ignore.case>)=TRUE), 1, 0), 0)
meats$frozen <- ifelse(meats$HS_Code %in% 20000:20699,
                       ifelse(grepl("frz|froz", meats$Product, [ignore.case](<http://ignore.case>)=TRUE), 1, 0), 0)

# Hierarchical aggregation by partner country
meats_fresh <- meats %>% 
  filter(Type=="Value" & fresh==1)
meats_fresh <- aggregate(meats_fresh[,5:24], 
                         by=list(Partner=meats_fresh$Partner), 
                         FUN=sum)

# Final integration: Chaining 14 full_joins for unified country totals
totalcountrytotals <- animal %>% 
  full_join(meats, by=NULL) %>% 
  full_join(vegetables, by=NULL) %>%
  full_join(fish, by=NULL) %>% 
  # ... 10 more categories
  full_join(sweets, by=NULL)

totcountry <- totalcountrytotals %>% 
  filter(Type=="Value") %>%
  aggregate(.[,5:24], by=list(Partner=.$Partner), FUN=sum)

write.csv(totcountry, file="totcountry.csv")

Business Impact:

Production Reliability Features:

Tools & Methods: R (dplyr, tidyr, reshape2, readxl), Pattern Matching (grepl, regex), Hierarchical Aggregation, Multi-Source Data Integration (full_join chains), Feature Engineering, Automated ETL Pipeline, CSV Export for Downstream Analytics

</aside>

<aside> šŸŒ¶ļø

Data Quality Engineering: AMS Bell Pepper Auction Standardization

The Challenge: Raw USDA Agricultural Marketing Service auction data was unusable for analysis—containing inconsistent formats, missing values, non-standardized units, and encoding errors across multiple markets nationwide, rendering the dataset effectively worthless.

Technical Approach: Built comprehensive data cleaning and standardization pipeline that transforms raw, heterogeneous market data into analysis-ready structured datasets. Implements advanced string parsing, regex pattern matching, statistical outlier detection, and business-rule validation. Handles unit conversions, date/time standardization, and categorical variable encoding at scale.

Business Impact: Transformed previously unusable raw data into clean, reliable datasets—unblocking market trend analysis and price forecasting capabilities that directly support agricultural market intelligence.

Tools & Methods: R (tidyverse, stringr), Regular Expressions, Statistical Outlier Detection, Data Validation Frameworks, Unit Standardization

</aside>

<aside> šŸ“¦

Methodology Analytics: State Agricultural Export Attribution Analysis

The Challenge: Two competing methodologies for measuring state-level agricultural exports—Origin of Movement vs. Cash Receipt—produced significantly divergent results, creating stakeholder confusion and undermining confidence in data product reliability.

Technical Approach: Designed comparative analytical framework to rigorously quantify methodology differences, identify systematic bias patterns, and provide data-driven guidance on appropriate methodology selection by use case. Built statistical testing infrastructure, geospatial analysis, and visualization dashboards to communicate findings.

Business Impact: Delivered clear, evidence-based methodology guidance to data users—improving stakeholder confidence in export statistics and enabling informed methodology selection for downstream analytics.

Tools & Methods: R (tidyverse, ggplot2), Comparative Statistical Analysis, Data Visualization, Hypothesis Testing, Geospatial Analysis

</aside>


šŸ—„ļø Enterprise Data Wrangling & Pipeline Readiness (SQL Mastery)

<aside> šŸŽÆ

The Challenge: Federal economic models at EIA and USITC required production-grade ETL/ELT pipelines capable of ingesting, transforming, and validating data from heterogeneous sources—including time-series trade flows, static geospatial datasets, demographic Census tables, and real-time market feeds. The complexity lay not just in querying individual tables, but in orchestrating multi-source joins across systems with inconsistent schemas, missing keys, and varying temporal granularities while maintaining data governance standards and ensuring reproducibility for high-stakes policy analysis.

Technical Approach: Built production SQL workflows leveraging advanced techniques to ensure pipeline reliability and data quality:

Training & Knowledge Transfer: As database manager and technical lead, designed and delivered a comprehensive SQL curriculum for PhD/Master-level analysts at USDA Economic Research Service, covering database schema navigation, advanced JOIN strategies (INNER, LEFT, FULL OUTER, CROSS, self-joins), CTEs vs. subqueries, window functions, and query performance optimization.

Impact: The SQL infrastructure and training program became foundational to mission-critical analytics at multiple federal agencies. Analysts gained independence to extract and transform complex datasets without senior engineering support, accelerating research timelines by weeks and ensuring reproducible, auditable data pipelines that met government data governance standards. This work directly supported the EIA STEO forecasting system and USITC trade analysis.

Tools & Methods: SQL (PostgreSQL, SQL Server), ETL/ELT Pipeline Design, Window Functions, CTEs, Complex Joins, Query Optimization, Data Governance, Technical Training & Mentorship

Access SQL Course Materials

</aside>


šŸ“ˆ Stata Projects

<aside> šŸŽ“

</aside>


<aside> šŸ’”

</aside>