Keval Rakholiya
About Me
Data professional with cross-functional expertise across data science, analytics, business intelligence, and data engineering. Currently working in marketing analytics at Genesis Motor America, delivering insights across customer behavior, campaign performance, and revenue optimization. AWS Certified Data Engineer – Associate, with hands-on experience designing cloud-based data pipelines and analytics architectures. Brings a hybrid mindset combining data science modeling, business analysis, and engineering execution to solve complex business problems and deliver measurable impact.
Work Experience
Where I've worked and the impact I've made.

Leading cross-functional analytics initiatives across the full marketing stack — integrating Adobe Analytics, GA4, CRM, and vehicle sales data into a unified measurement framework. Engineered scalable SQL pipelines and Python ETL workflows to consolidate multi-source marketing data, enabling consistent KPI tracking across owned, earned, and paid channels. Built propensity and attribution models using XGBoost and scikit-learn to predict customer purchase intent and allocate budget with greater precision. Designed A/B testing frameworks with statistical significance validation to evaluate creative and media strategies, directly informing multi-million dollar campaign decisions. Delivered executive-facing Power BI and Tableau dashboards surfacing real-time revenue trends, customer funnel metrics, and audience segmentation insights to VP-level stakeholders.
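To make the significance validation concrete, here is a minimal two-proportion z-test of the kind used to check whether a creative variant truly outperforms control; the conversion counts and function name are illustrative, not the production framework.

```python
# Minimal sketch of a two-proportion z-test for an A/B creative test.
# Counts below are hypothetical, not real campaign data.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for rates conv/n."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical example: control vs. new creative
z, p = two_proportion_z_test(conv_a=420, n_a=10_000, conv_b=505, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # declare significance only if p < alpha
```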

Owned end-to-end reporting infrastructure for operations and field performance metrics across 50+ internal stakeholders. Architected Power BI dashboards connected to SQL Server data warehouses — replacing fragmented Excel workflows and cutting report generation time from days to minutes. Wrote advanced SQL queries and stored procedures to clean, transform, and aggregate data from multiple source systems using SSIS ETL pipelines. Applied time-series analysis and predictive modeling in Python to surface trends in incident frequency, resource utilization, and project cycle times, enabling proactive risk management. Built a fully automated reporting suite using Python and Azure-scheduled jobs, eliminating 20+ hours of weekly manual effort and improving data freshness from weekly to daily.
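A small sketch of the trend-surfacing step in pandas, assuming a hypothetical incident export and column names; the real pipeline ran against SQL Server via SSIS.

```python
# Illustrative trend analysis on a hypothetical incident log.
# File name and columns ("opened_at") are assumptions.
import pandas as pd

incidents = pd.read_csv("incidents.csv", parse_dates=["opened_at"])

daily = (
    incidents.set_index("opened_at")
    .resample("D")
    .size()
    .rename("incident_count")
)

# A 28-day rolling mean smooths weekly seasonality and exposes the trend
trend = daily.rolling(window=28, min_periods=7).mean()
report = pd.DataFrame({"daily": daily, "trend_28d": trend})
report.to_csv("incident_trend_report.csv")  # picked up by a scheduled job
```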

Supported health science faculty research by designing reproducible data pipelines in Python (pandas, NumPy) to standardize and clean clinical and survey datasets across multiple concurrent studies. Applied inferential statistics — including multivariate regression, ANOVA, chi-square tests, and survival analysis — to extract clinically meaningful patterns from 10,000+ patient records with 99% data accuracy. Built statistical models in R and SPSS to analyze population health trends, behavioral risk factors, and intervention outcomes. Created publication-quality visualizations using matplotlib and ggplot2 to communicate findings to both technical collaborators and non-technical faculty audiences. Contributed to peer-reviewed publication pipelines across health behavior and chronic disease domains, delivering analysis-ready datasets and statistical summaries that supported 3 published papers.
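For illustration, a minimal version of two of these tests (chi-square and one-way ANOVA) with scipy, run against an assumed analysis-ready survey file; column names are placeholders.

```python
# Sketch of two inferential tests on a hypothetical cleaned survey DataFrame.
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

df = pd.read_csv("survey_clean.csv")  # hypothetical analysis-ready dataset

# Chi-square: is smoking status independent of intervention group?
table = pd.crosstab(df["intervention_group"], df["smoking_status"])
chi2, p_chi, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_chi:.4f}")

# One-way ANOVA: does mean systolic BP differ across study arms?
groups = [g["systolic_bp"].dropna() for _, g in df.groupby("study_arm")]
f_stat, p_anova = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```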
Education
Degrees, programs, and certifications.
Skills
My full tech stack across data, AI, and engineering.
Data Science & Analytics
Machine Learning & AI
Generative AI & LLM
BI & Analytics Tools
Cloud & Data Engineering
Projects
End-to-end data science and ML projects — predictive analytics, forecasting, and BI dashboards.
A workforce analytics solution that combines interactive dashboards and machine learning to identify attrition patterns and predict employee turnover risk. It helps HR teams act proactively by highlighting high-risk employees and the key drivers of attrition.
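A hedged sketch of the risk-scoring core, assuming a hypothetical HR dataset with Yes/No attrition labels; the deployed model and feature set differ.

```python
# Illustrative turnover-risk scoring: train a classifier, rank employees
# by predicted probability. File and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

hr = pd.read_csv("hr_attrition.csv")
X = pd.get_dummies(hr.drop(columns=["attrition"]), drop_first=True)
y = (hr["attrition"] == "Yes").astype(int)  # assumed Yes/No label column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_tr, y_tr)

risk = model.predict_proba(X_te)[:, 1]               # per-employee risk score
print("AUC:", roc_auc_score(y_te, risk))
top_drivers = pd.Series(model.feature_importances_, index=X.columns).nlargest(5)
print(top_drivers)  # candidate drivers surfaced for the dashboard
```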
A hybrid time-series forecasting model that uses LSTM to capture market trends and ARIMA to correct prediction errors. The combined approach improves accuracy and stability compared to using a single model for stock price prediction.
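Conceptually, the hybrid reduces to the sketch below: the LSTM models the series, and an ARIMA fitted on the LSTM's residuals corrects its errors. Shapes, orders, and the synthetic series are illustrative, not the tuned project configuration.

```python
# Hybrid LSTM + ARIMA sketch: ARIMA models what the LSTM missed.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from tensorflow import keras

def make_windows(series, lookback=30):
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., None], series[lookback:]

prices = np.cumsum(np.random.randn(500)) + 100       # stand-in for real prices
X, y = make_windows(prices)

lstm = keras.Sequential([
    keras.Input(shape=(30, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X, y, epochs=5, verbose=0)

lstm_pred = lstm.predict(X, verbose=0).ravel()
residuals = y - lstm_pred                            # what the LSTM missed
arima = ARIMA(residuals, order=(2, 0, 1)).fit()
correction = arima.predict(start=0, end=len(residuals) - 1)

hybrid_pred = lstm_pred + correction                 # trend + error correction
```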
Performed exploratory and statistical analysis on vehicle sales data to uncover market trends, pricing behavior, and customer segmentation, laying the foundation for predictive modeling and demand forecasting in the automotive industry.
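An illustrative slice of that EDA in pandas, with assumed file and column names.

```python
# Sketch of pricing and trend exploration on a hypothetical sales export.
import pandas as pd

sales = pd.read_csv("vehicle_sales.csv", parse_dates=["sale_date"])

# Pricing behavior: how far do transaction prices sit from MSRP by segment?
sales["discount_pct"] = 1 - sales["sale_price"] / sales["msrp"]
print(sales.groupby("segment")["discount_pct"].describe())

# Market trend: monthly unit volume by body style
monthly = (
    sales.groupby([pd.Grouper(key="sale_date", freq="MS"), "body_style"])
    .size()
    .unstack(fill_value=0)
)
print(monthly.tail())
```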
I Build at Hackathons
I have attended 7+ hackathons across California — from Stanford to San Diego — building data-driven solutions in 24–48 hours alongside some of the most driven people in tech.

LA Hacks 2025
📍 Los Angeles, California
Benchmarked six machine learning classifiers — XGBoost, Random Forest, SVM, Logistic Regression, KNN, and a shallow Neural Network — on a large gaming behavior dataset to predict player churn and session drop-off. Engineered features from raw event logs including session frequency, in-game purchase history, level progression rate, and social interactions. Handled severe class imbalance using SMOTE oversampling and class-weighted loss functions. Evaluated all models on AUC-ROC, precision-recall curves, and F1-score across 5-fold stratified cross-validation. XGBoost outperformed all baselines by 11% on AUC and was selected as the final model. Wrapped findings into an interactive Streamlit report with feature importance charts and player risk segmentation by cohort.
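A condensed sketch of the benchmarking loop, showing three of the six candidates, with SMOTE placed inside the cross-validation pipeline so oversampling never leaks into validation folds; data loading and the feature table are hypothetical.

```python
# Class-imbalanced model benchmark with SMOTE inside stratified 5-fold CV.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("player_events_features.csv")       # engineered feature table (assumed)
X, y = df.drop(columns=["churned"]), df["churned"]

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=300),
    "xgb": XGBClassifier(eval_metric="logloss"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in models.items():
    pipe = Pipeline([("smote", SMOTE(random_state=42)), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```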

Hacktech 2025
📍 Pasadena, California
Built a heart disease risk prediction system using clinical tabular data from the UCI Heart Disease dataset. Performed comprehensive EDA including correlation heatmaps, distribution plots, and chi-square feature selection tests. Identified chest pain type, max heart rate, ST depression, and number of major vessels as the top predictive signals. Trained and compared Logistic Regression, Decision Tree, and Random Forest models across multiple hyperparameter configurations. The final stacked ensemble achieved 89% accuracy and 0.91 AUC on the holdout test set. Applied SHAP values to generate model explainability reports, making predictions interpretable for a non-technical medical audience. Packaged the final output as a physician-facing risk summary PDF with per-patient risk scores and contributing factor breakdowns.
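A hedged sketch of the stacking-plus-SHAP step; the configuration shown is illustrative rather than the tuned competition model, and the dataset path is an assumption.

```python
# Stacked ensemble on the UCI heart data plus SHAP explainability (sketch).
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("uci_heart.csv")                    # assumed local copy
X, y = df.drop(columns=["target"]), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=4)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
).fit(X_tr, y_tr)
print("holdout accuracy:", stack.score(X_te, y_te))

# Explain the random-forest member: which features drive predicted risk?
rf = stack.named_estimators_["rf"]
sv = shap.TreeExplainer(rf).shap_values(X_te)
# older shap returns a per-class list; newer returns an (n, features, classes) array
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]
shap.summary_plot(sv_pos, X_te, show=False)          # embedded in the PDF report
```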
HackDavis 2025
📍 Davis, California
Built a generative AI assistant for community health workers using LangChain, OpenAI embeddings, and a retrieval-augmented generation (RAG) pipeline grounded in CDC datasets, California Department of Public Health reports, and county-level health surveys. Designed a vector store using ChromaDB to index and retrieve relevant health statistics based on semantic similarity. Users could ask plain-English questions about disease prevalence, food insecurity rates, vaccination coverage, and healthcare access — and receive accurate, cited, data-backed summaries. Implemented prompt chaining to handle multi-turn conversations and context retention. Focused on accessibility for non-technical public health staff in underserved regions, with a clean Streamlit front-end requiring no technical knowledge to operate. Won recognition in the Social Good track.
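A minimal version of the retrieve-then-answer loop, here calling the ChromaDB and OpenAI clients directly rather than the project's LangChain chains; the indexed document, model names, and prompts are illustrative, and an OPENAI_API_KEY environment variable is assumed.

```python
# RAG sketch: ChromaDB retrieval grounded answers via OpenAI chat completion.
import os
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI

embed = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
chroma = chromadb.PersistentClient(path="./health_index")
docs = chroma.get_or_create_collection("county_health", embedding_function=embed)

# Indexing step (run once): chunks of public-health report text with sources
docs.add(
    ids=["cdc-diabetes-0"],
    documents=["Diabetes prevalence in County X was 11.2% in 2022 (CDC BRFSS)."],
    metadatas=[{"source": "CDC BRFSS 2022"}],
)

def answer(question: str) -> str:
    hits = docs.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the context; cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is the diabetes prevalence in County X?"))
```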

TreeHacks 2025
📍 Stanford, California
Developed a real-time exercise activity recognition system at the intersection of computer vision and machine learning. Used MediaPipe Pose to extract 33 skeletal landmarks per video frame and OpenCV for live webcam frame capture and preprocessing. Engineered angular joint features from landmark coordinates, including elbow, knee, hip, and shoulder angles, to represent motion patterns rather than raw positions. Trained a lightweight LSTM network on a labeled video dataset of 8 common exercise types, achieving 93% classification accuracy on held-out clips. Added a repetition counter using angle-threshold state machines per exercise type. Demoed a live webcam interface that classifies exercise type and counts reps in real time, which was well received at the Stanford Healthcare Innovation track showcase.
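The angular-feature and rep-counting ideas reduce to something like the sketch below; the thresholds, landmark choices, and sample coordinates are illustrative.

```python
# Joint angle from three landmarks plus a threshold state machine for reps.
import numpy as np

def joint_angle(a, b, c):
    """Angle at point b (degrees) formed by points a-b-c, each an (x, y) pair."""
    ba, bc = np.array(a) - np.array(b), np.array(c) - np.array(b)
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

class RepCounter:
    """Counts a rep each time the angle dips below `down` then rises above `up`."""
    def __init__(self, down=70.0, up=160.0):
        self.down, self.up = down, up
        self.state, self.reps = "up", 0

    def update(self, angle):
        if self.state == "up" and angle < self.down:
            self.state = "down"
        elif self.state == "down" and angle > self.up:
            self.state, self.reps = "up", self.reps + 1
        return self.reps

# Per frame: elbow angle from shoulder/elbow/wrist coordinates (toy values)
counter = RepCounter()
frames = [((0, 0), (1, 0), (2, 0.1)),   # arm extended
          ((0, 0), (1, 0), (0.2, 0.5)),  # arm curled
          ((0, 0), (1, 0), (2, 0.05))]   # extended again -> one rep
for shoulder, elbow, wrist in frames:
    print(counter.update(joint_angle(shoulder, elbow, wrist)))
```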
SoCal Tech Week 2024
📍 Los Angeles, California
Built a civic data intelligence platform to surface community concerns in underserved Los Angeles neighborhoods using social media and public records. Scraped thousands of posts from Reddit (r/LosAngeles, neighborhood subreddits) and Twitter/X using their APIs, targeting discussions around housing affordability, rent increases, transit access, and public safety. Preprocessed and cleaned text data with spaCy — removing noise, normalizing slang, and extracting named entities for neighborhood tagging. Applied VADER sentiment analysis to score posts at both neighborhood and topic level, then aggregated trends over time. Visualized findings in an interactive Streamlit dashboard featuring choropleth heatmaps by zip code, time-series sentiment trends, and keyword frequency breakdowns. Placed in the top 5 teams in the Social Impact track out of 80+ submissions.
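The scoring-and-aggregation step, sketched with VADER on a hypothetical post table; column names and the export file are assumptions.

```python
# VADER compound sentiment per post, aggregated by neighborhood and month.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

posts = pd.read_csv("scraped_posts.csv", parse_dates=["created_at"])  # assumed schema
analyzer = SentimentIntensityAnalyzer()

posts["sentiment"] = posts["text"].apply(
    lambda t: analyzer.polarity_scores(t)["compound"]
)

trend = (
    posts.groupby(["neighborhood", pd.Grouper(key="created_at", freq="MS")])["sentiment"]
    .mean()
    .unstack(level=0)
)
print(trend.tail())  # feeds the time-series panel in the Streamlit dashboard
```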

UC Berkeley AI Hackathon 2024
📍 Berkeley, California
Designed and built an end-to-end automotive sales intelligence platform during a 24-hour AI-focused hackathon at UC Berkeley. Architected SSIS ETL pipelines to ingest raw dealership data from CSV exports, CRM APIs, and inventory management systems into a centralized SQL Server staging area. Applied star schema dimensional modeling with fact tables for sales transactions and dimension tables for vehicles, dealerships, regions, and time. Built interactive Power BI dashboards tracking monthly revenue by region, sales velocity by model, customer lifetime value, and inventory turnover rate. Incorporated an AI-assisted anomaly detection module using Python's scikit-learn to flag unusual pricing and discount patterns. The solution reduced manual reporting time by an estimated 70% and surfaced pricing inefficiencies across three vehicle segments, earning the Best Data Engineering award at the event.
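A minimal version of the anomaly module using scikit-learn's IsolationForest; the staged table, column names, and contamination rate are assumptions.

```python
# Flag deals whose price/discount pattern deviates from the norm.
import pandas as pd
from sklearn.ensemble import IsolationForest

deals = pd.read_csv("sales_fact_export.csv")         # staged from SQL Server (assumed)
features = deals[["sale_price", "discount_pct", "days_on_lot"]]

iso = IsolationForest(contamination=0.02, random_state=7).fit(features)
deals["anomaly"] = iso.predict(features)             # -1 = flagged, 1 = normal

flagged = deals[deals["anomaly"] == -1]
print(flagged[["deal_id", "sale_price", "discount_pct"]].head())
```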
DataHacks 2024
📍 San Diego, California
Predicted telecom customer churn by following the full CRISP-DM data science lifecycle — from business understanding through model deployment planning. Explored a real-world telecom dataset covering contract type, monthly charges, tenure, service bundle usage, and support call frequency. Performed targeted EDA to surface churn patterns by segment and applied label encoding and scaling for categorical and numeric features. Trained and compared Decision Tree, Naive Bayes, and KNN classifiers across stratified train/test splits, running grid search hyperparameter optimization for each. Surfaced that contract type and customer tenure were the dominant churn drivers, followed by whether the customer had tech support enabled. Achieved a final model F1-score of 0.84 on the holdout set. Delivered both a technical Jupyter Notebook and a non-technical retention strategy memo with targeted recommendations for reducing churn in high-risk customer segments.
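The model-comparison step, condensed into a grid-search sketch; parameter grids and the input table are illustrative stand-ins for the full search.

```python
# Per-classifier grid search over a stratified split, scored on F1.
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("telecom_churn.csv")                # assumed encoded/scaled table
X, y = df.drop(columns=["churn"]), df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

candidates = {
    "tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [5, 11, 21]}),
    "nb": (GaussianNB(), {}),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, (est, grid) in candidates.items():
    search = GridSearchCV(est, grid, scoring="f1", cv=cv).fit(X_tr, y_tr)
    print(name, search.best_params_, f"holdout F1 = {search.score(X_te, y_te):.2f}")
```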
Let's Connect
I'm actively looking for data science, analytics, and AI engineering roles. Whether you have an opportunity, a question, or just want to say hi, my inbox is open.

