Hey, I'm Rohan Pujari, and welcome to my portfolio

Data Storyteller | Automobile Enthusiast | Kaggle Contributor

New Jersey, USA | Open to Work

Let's Connect

About Me

I am Rohan Pujari, a dynamic Data Scientist with over 5 years of experience at the forefront of designing and deploying advanced machine learning solutions across the financial services and education sectors. My passion lies in transforming complex, often unstructured, data into actionable intelligence that drives measurable business impact and strategic decision-making.

My expertise spans the entire data lifecycle, from high-performance ETL pipeline optimization and robust data preprocessing to the development and deployment of sophisticated models. I specialize in Natural Language Processing (NLP), including cutting-edge applications of Large Language Models (LLMs) for financial document processing and text summarization, where I've achieved significant efficiency gains (e.g., reducing analyst effort by 70%). My toolkit includes Python, AWS (S3, Glue, Bedrock), Snowflake, and SQL, along with a deep command of ML frameworks such as TensorFlow and scikit-learn.

I thrive on solving real-world problems, whether it's forecasting customer churn and risk with predictive analytics, optimizing customer engagement policies through reinforcement learning, or enhancing institutional reputation by extracting insights from academic data. I am adept at cross-functional collaboration, translating technical complexities into clear, data-driven strategies. My commitment extends beyond model building; I focus on creating scalable, production-ready solutions that deliver real-time insights and tangible value.


Skills & Technologies

Programming & ML

  • Python
  • R
  • SQL
  • TensorFlow
  • PyTorch
  • Scikit-Learn

Data Engineering

  • Apache Airflow
  • Snowflake
  • Databricks
  • Apache Spark
  • Hadoop
  • ETL Pipelines

Cloud & DevOps

  • AWS
  • Google Cloud
  • MLOps
  • Docker
  • Kubernetes
  • CI/CD Pipelines

Data Analysis

  • Tableau
  • Power BI
  • A/B Testing
  • Statistical Analysis
  • Data Visualization
  • Time Series Analysis

Education

Master of Science in Data Science

New Jersey Institute of Technology (NJIT)
September 2022 - May 2024

  • Developed strong foundations in machine learning, natural language processing, and data engineering, with hands-on projects simulating real-world challenges in business, education, and research.
  • Gained practical experience building interactive dashboards and automated data pipelines, working with both structured and unstructured data using tools like Python, SQL, Pandas, and Power BI.
  • Completed multiple course-based projects involving predictive modeling, text analytics, and data visualization, applying statistical concepts and ML algorithms to real datasets.
  • Participated in collaborative coursework involving data storytelling, ETL workflows, and decision-making under uncertainty, preparing for data-driven roles in industry.
Relevant Coursework:
Big Data, Machine Learning, Deep Learning, Web System Development, Data Visualization, Data Analysis for Managers, Applied Statistics, Natural Language Processing

Post Graduation Diploma in Data Science

International Institute of Information Technology (IIIT-B)
January 2021 - February 2022

  • Gained in-depth understanding and practical intuition of core machine learning algorithms including regression, classification and clustering.
  • Gained hands-on experience with deep learning models like CNNs, RNNs, and Transformers for tasks in computer vision, time-series forecasting, and NLP.
  • Developed strong analytical skills, working on real-world datasets & use cases across various domains.
  • Realized the critical importance of data preprocessing, exploratory analysis, and feature engineering in successful model building.
  • Built hands-on experience through multiple academic projects, both individually and in teams, applying data science techniques from end-to-end.
Relevant Coursework:
Excel, Machine Learning, Tree Models, Advanced Regression, Model Deployment, Neural Networks, Time Series Analysis, Math for Data Analysis

Work Experience

Data Scientist @ Comnatix LLC

Sept 2024 - Present

  • Developed and deployed an AI-powered NLP solution to extract insights from financial documents, increasing portfolio-review speed by 25% and accuracy by 15%.
  • Engineered scalable, automated data pipelines to convert unstructured financial documents into structured datasets, improving data accessibility and downstream analysis.
  • Leveraged AWS Bedrock (Claude v2, Llama 3.1) to create automated text summarization models, reducing analyst effort by 70% and accelerating fund evaluation cycles.
  • Integrated financial document processing with databases such as SQLite to enable seamless data ingestion into portfolio optimization and performance tracking dashboards.
  • Partnered with cross-functional teams to integrate AI-driven insights into investment decision workflows, enhancing model reliability and stakeholder trust.

Data Scientist @ New Jersey Institute of Technology

Nov 2022 - May 2024

  • Developed and deployed an end-to-end machine learning pipeline to preprocess and analyze large-scale educational datasets, improving the predictive accuracy of university rankings by 30%.
  • Automated data workflows using Python and SQL, reducing manual effort by 30% and increasing data-handling efficiency.
  • Designed and implemented a Snowflake data warehouse to store and manage website and institutional datasets, enabling scalable querying and integration with downstream analytics tools.
  • Connected Snowflake to MicroStrategy to deliver interactive dashboards, integrating predictive analytics and real-time insights that guided university leadership in strategic decision-making.
  • Enhanced NLP models to perform sentiment analysis and text summarization on institutional surveys and academic papers, improving data-driven decision-making.
  • Employed supervised learning models, including Random Forest and Decision Trees, to classify institutions, directly supporting strategic initiatives.
  • Applied Principal Component Analysis (PCA) on research datasets to uncover critical insights, contributing to a measurable improvement in the university's R1 ranking.

Data Scientist @ Larsen and Toubro Infotech

Dec 2019 - July 2022

  • Designed and optimized high-performance ETL data pipelines for BFSI projects, improving reporting efficiency and data accessibility by 20%.
  • Developed and deployed predictive models using Python, Random Forest, and Logistic Regression to forecast customer churn and risk, enhancing decision-making for lending and insurance portfolios.
  • Applied NLP techniques to perform sentiment analysis and keyword extraction from customer feedback, resulting in data-driven product improvements.
  • Automated data ingestion and reporting processes, reducing manual effort by 60% and increasing data update frequency from weekly to daily.
  • Designed interactive Power BI dashboards to deliver actionable insights and implemented Linux scripts for server monitoring, ensuring system reliability.
  • Developed SOPs for system migration and automation while managing user permissions, SLA compliance, and deployments to ensure seamless operations and security.
  • Mentored a team of 5 junior data scientists, resulting in a 50% improvement in project delivery times and overall team efficiency.

Academic Projects

Heart Disease Risk Prediction & Deployment

Real-time heart disease risk prediction using logistic regression deployed on AWS SageMaker with an interactive web frontend.

  • Tech stack: Python, Jupyter Notebook, pandas, scikit-learn; data prep and analysis in Excel/CSV formats.
  • Built a logistic regression model on structured health data (symptoms, vitals, lifestyle factors) to predict heart disease risk; packaged the model as a .tar.gz artifact and deployed it as a real-time endpoint on AWS SageMaker.
  • Created a clean Streamlit web app to collect user inputs and invoke the SageMaker endpoint. Applied conditional styling to reflect diagnosis confidence: vibrant UI for risk alerts, muted design for low risk.
  • Resolved deployment issues including cold starts, model artifact structure, JSON input formatting, and health checks.
  • Demonstrated production readiness through attention to SageMaker internals, cost scaling, latency optimization, and autoscaling; a strong showcase of full-stack ML deployment on the cloud.
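The invocation step can be sketched as follows; this is a minimal illustration, and the endpoint name, JSON field names, and helper functions are hypothetical stand-ins for the production values, which depend on the model's inference script:

```python
import json


def build_payload(age, resting_bp, cholesterol, max_heart_rate, smoker):
    """Format user inputs as the JSON body the endpoint expects.
    Field names are illustrative, not the real schema."""
    return json.dumps({
        "age": age,
        "resting_bp": resting_bp,
        "cholesterol": cholesterol,
        "max_heart_rate": max_heart_rate,
        "smoker": int(smoker),
    })


def parse_prediction(body: bytes) -> float:
    """Extract the risk probability from the endpoint's JSON response."""
    return float(json.loads(body)["risk_probability"])


def predict_risk(payload: str, endpoint_name: str = "heart-risk-endpoint") -> float:
    """Invoke the real-time SageMaker endpoint (name is hypothetical)."""
    import boto3  # imported lazily so the helpers above stay testable offline
    client = boto3.client("sagemaker-runtime")
    resp = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return parse_prediction(resp["Body"].read())
```

In the app, the Streamlit form collects the inputs, calls `predict_risk`, and styles the result card based on the returned probability.
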
AI-Powered News Summarization

Automated real-time news scraping & summarization using Transformer-based models.

  • Tech stack: Python, Flask, HTML/CSS, HuggingFace Transformers, Pegasus (NLP), newspaper3k, feedparser.
  • Used feedparser to ingest RSS feeds (e.g., NYT), scraped full articles via newspaper3k, and applied Google's PEGASUS model for abstractive summarization.
  • Built a Flask-based web app with custom HTML/CSS to display a scrollable, inshorts-style feed—each card showing a headline, image, and AI-generated summary.
  • Optimized Pegasus inference using pre-tokenization and batching to reduce load latency; ensured visual coherence with automated image extraction and fallback handling.
  • Enhanced user experience by hiding backend logic (like URLs and RSS controls) and maintaining continuous content refreshes based on feed updates.
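The ingestion step can be sketched with the standard library alone; this stand-in uses `xml.etree` in place of feedparser, and the sample feed is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented two-item RSS 2.0 feed, standing in for a real feed such as NYT's.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><title>Markets rally</title><link>https://example.com/a</link></item>
  <item><title>New chip unveiled</title><link>https://example.com/b</link></item>
</channel></rss>"""


def extract_entries(feed_xml: str) -> list[tuple[str, str]]:
    """Pull (title, link) pairs out of an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```

In the deployed app, each extracted link is fetched and parsed with newspaper3k, then summarized with Hugging Face's summarization pipeline (e.g. `pipeline("summarization", model="google/pegasus-xsum")`) before rendering.
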
Alcohol Craving Prediction & Intervention

Real-time craving prediction using wearables and ML, triggering timely behavioral interventions.

  • Tech stack: Python, pandas; ML models: Random Forest and LSTM; model interpretability: SHAP.
  • Collected sensor data (heart rate, sleep, stress) paired with self-reported mood, engineered time-series features to predict alcohol craving episodes.
  • Built Random Forest and LSTM models achieving ~82% accuracy; used SHAP to interpret outputs and identify top risk factors such as elevated heart rate and poor sleep.
  • Developed an app that triggers personalized strategies (mindfulness, alerts, peer support, journaling, activity suggestions) when craving risk is high.
  • Impact in simulation: demonstrated a potential 15% reduction in relapse rates and provided insights valuable for behavioral healthcare and further addiction analytics research.
Neural Networks Project - Gesture Recognition

Smart-TV control via webcam-based recognition of five hand gestures using neural network architectures.

  • Tech stack: Python, OpenCV, TensorFlow/Keras; architectures: 3D Convolutional Network (Conv3D) and CNN + RNN (LSTM/GRU)
  • Built dataset of 2-3 sec videos (30 frames) across five gestures (thumbs up/down, left/right swipe, stop); preprocessed frames for training
  • Implemented and compared two models: Conv3D (spatiotemporal video analysis) and CNN+RNN (frame-wise feature extraction + sequence modeling)
  • Impact: Enabled real-time, remote-free control of TV functions like volume and playback through gesture commands
Telecom Churn Prediction Case Study

Predictive modeling to identify high-risk telecom customers and uncover churn drivers.

  • Tech stack: Python, pandas, scikit-learn (with feature engineering, class-balancing techniques like oversampling/SMOTE)
  • Analyzed customer-level data from a leading telecom provider; focused on high-value customers (70th percentile of recharge amounts)
  • Engineered features, tagged churners based on zero usage criteria, removed churn-phase data; built classification models (e.g., Logistic Regression, Random Forest, XGBoost)
  • Identified top churn indicators such as drop in incoming calls and recharge patterns; recommended targeted retention strategies
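The filtering and tagging rules above can be sketched in plain Python; the nearest-rank percentile here is a simple stand-in for `numpy.percentile`, and the column choices are illustrative:

```python
def tag_churner(total_calls_mou: float, total_data_mb: float) -> int:
    """Tag a customer as a churner when they show zero usage
    (no call minutes, no data) in the churn phase."""
    return int(total_calls_mou == 0 and total_data_mb == 0)


def high_value_cutoff(recharge_amounts: list[float]) -> float:
    """70th-percentile recharge cutoff used to keep only high-value
    customers (nearest-rank method, standing in for numpy.percentile)."""
    ordered = sorted(recharge_amounts)
    idx = int(0.7 * (len(ordered) - 1))
    return ordered[idx]
```

Rows at or above the cutoff are kept for modeling; churn-phase columns are then dropped so the models only see pre-churn behavior.
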
Lead Scoring Case Study

Logistic-regression-based scoring to rank hot leads for an ed-tech company and boost conversions.

  • Tech stack: Python, pandas, scikit-learn; focused on feature engineering, balancing classes, and model calibration
  • Dataset: ~9,000 leads from X Education with ~30% baseline conversion; goal was to raise to ~80% through prioritization
  • Built a logistic regression model to assign each lead a score (0-100); used train/test split, handled missing values and categorical encodings, tuned threshold for sensitivity/specificity
  • Outcome: ~80% accuracy, ~72% precision; key predictors included time spent on site, total visits, lead origin and occupation; recommended using top-scoring leads for sales outreach
Spotify Data ETL Pipeline with AWS

Automated pipeline to extract, transform & query Spotify playlist data using AWS.

  • Tech stack: Python, Spotipy, AWS Lambda, S3, Glue, Athena (CloudWatch/EventBridge scheduling)
  • Built a serverless ETL workflow: fetched JSON data from Spotify API → stored raw data in S3 → triggered Lambda to transform it → cataloged via Glue → enabled SQL querying in Athena
  • Impact: Enabled automated, scalable music-data analysis—ideal for exploring trends in playlists (e.g. Top 50 India) and powering analytics dashboards
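The Lambda transform step can be sketched as a flattening function; the field names follow the shape of Spotify's playlist-items response, and the sample payload is invented for illustration:

```python
def transform_tracks(playlist_json: dict) -> list[dict]:
    """Flatten the Spotify playlist-items payload into tabular rows
    suitable for writing to S3 as CSV/Parquet for Glue and Athena."""
    rows = []
    for item in playlist_json.get("items", []):
        track = item["track"]
        rows.append({
            "track_id": track["id"],
            "track_name": track["name"],
            "artist": track["artists"][0]["name"],  # primary artist only
            "album": track["album"]["name"],
            "popularity": track["popularity"],
            "added_at": item["added_at"],
        })
    return rows


# Invented single-item payload mimicking the API response shape.
SAMPLE = {"items": [{"added_at": "2024-01-05T10:00:00Z",
                     "track": {"id": "abc123", "name": "Track A",
                               "popularity": 81,
                               "artists": [{"name": "Artist X"}],
                               "album": {"name": "Album Y"}}}]}
```

In the deployed pipeline this logic ran inside the Lambda handler, triggered when raw JSON landed in the S3 bucket, with the flattened output written back to a transformed prefix for Glue to catalog.
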
Bike Sharing Demand Analysis

Predicting daily bike-sharing demand using multiple linear regression.

  • Tech stack: Python, pandas, scikit-learn, AWS SageMaker, AWS IAM, Streamlit, boto3, joblib.
  • Dataset & goal: Analyzed a bike-sharing provider's historical data to model demand variations; key focus on predicting usage based on date, weather, and seasonality.
  • Modeling: Built multiple linear regression models to forecast daily bike rentals; engineered features like temperature, humidity, day-of-week, and seasonal trends.
  • Impact: Identified significant demand drivers (weather patterns, month, season) to inform better bike allocation strategies and operational planning.

Get in Touch

Let's build what doesn't exist yet!