Summary

I build ML infrastructure that runs on billions of devices.
I currently lead teams at Qualcomm working on LLM quantization and on-device AI deployment.

Previously:

  • Built AI Hub (SaaS platform) to enable model optimization and deployment for edge devices.
  • Helped shape Apple's CoreML stack, which powers all on-device ML use cases in the Apple ecosystem.
  • Worked on NVIDIA's optimizing and code-generation compiler that powers ML training and inference via CUDA.

I enjoy working at the intersection of compilers and machine learning: making models smaller, faster, and accessible to developers worldwide.

Work

Senior Engineering Manager, ML Platform

Nov 2023 - Present

Qualcomm

  • Leading AIMET, a state-of-the-art quantization tool; bringing in advanced LLM quantization techniques
  • Shipping LLM quantization recipes with AIMET; making them accessible to developers worldwide through AI Hub
  • Improving the Qualcomm developer workflow: quantization, compilation, debugging, and deployment on Snapdragon
  • Performance optimizations: latency, memory footprint, graph infrastructure for GenAI

Founding Machine Learning Engineer

Aug 2022 - Nov 2023

Tetra AI (acquired by Qualcomm)

  • Launched AI Hub - first model zoo focused on on-device optimized models and deployment
  • Led Microsoft Teams AI use cases: segmentation, audio, and video codec
  • Developed graph infrastructure and graph-to-graph transformations for the iOS platform on CoreML Tools
  • Launched SaaS platform for model optimization and deployment

Senior Machine Learning Engineer, ML Platform

Jun 2019 - Aug 2022

Apple

  • Designed the MIL intermediate language, core to all CoreML model deployment
  • Led the ONNX-CoreML converter; core contributor to CoreML Tools
  • Enabled Stable Diffusion and GenAI models on-device
  • Led the auto-upgrade tool for model format migration; onboarded Vision Pro

Intern, SPIR-V Compiler

May 2018 - Aug 2018

NVIDIA

  • Developed compiler optimization controller for phase ordering and parameter tuning

System Software Engineer, Compiler

Jun 2015 - Jul 2017

NVIDIA

  • LLVM compiler optimizations for Tegra Graphics and CUDA
  • DWARF 2.0 debug frame support for CUDA 9.0

Intern, Compiler

Jun 2014 - Apr 2015

NVIDIA

  • PBQP register allocator - improved 98% of graphics/compute use cases

Projects

Talks

Education

MS Computer Science

Stony Brook University

2017 - 2019

BTech Computer Engineering

VIT Pune

2011 - 2015

Recognition

F8 Hackathon Finalist

Presented to Mark Zuckerberg

Hugging Face Shoutout

DistilGPT-2 via onnx-coreml

GenAI Hackathon Judge

GenLab hackathon

Skills

C++, Python, PyTorch, LLVM, ONNX, CoreML, TensorFlow, Quantization, Compilers, On-device ML