Summary

I build ML infrastructure across the entire stack - from apps to compilers.
My work runs on billions of devices worldwide.

I lead quantization at Qualcomm, powering on-device GenAI via AI Hub Workbench.
Previously,
    Built AI Hub - first on-device model zoo, SaaS platform, and apps.
    Shaped Apple's on-device ML stack - conversion tools, CoreML Framework, ML Runtime.
    Built GPU compilers at NVIDIA - CUDA and SPIR-V.

Excited about ML efficiency across the spectrum - on-device, edge, and data center scale.

Work

Senior Engineering Manager, ML Platform

Nov 2023 - Present

Qualcomm

  • Leading AIMET, a state-of-the-art quantization tool; bringing in advanced LLM quantization techniques
  • Shipping LLM quantization recipes with AIMET; making them accessible to developers worldwide through AI Hub
  • Improving the Qualcomm developer workflow: quantization, compilation, debugging, and deployment on Snapdragon
  • Performance optimizations: latency, memory footprint, graph infrastructure for GenAI

Founding Machine Learning Engineer

Aug 2022 - Nov 2023

Tetra AI (acquired by Qualcomm)

  • Launched AI Hub - first model zoo focused on on-device optimized models and deployment
  • Led Microsoft Teams AI use cases: segmentation, audio, and video codec
  • Developed graph infrastructure and graph-to-graph transformations for the iOS platform on CoreML Tools
  • Launched SaaS platform for model optimization and deployment

Senior Machine Learning Engineer, ML Platform

Jun 2019 - Aug 2022

Apple

  • Designed MIL intermediate language - core to all CoreML model deployment
  • Led ONNX-CoreML converter; core contributor to CoreML Tools
  • Enabled Stable Diffusion and GenAI models on-device
  • Led auto-upgrade tool for model format migration; onboarded Vision Pro

Intern, SPIR-V Compiler

May 2018 - Aug 2018

NVIDIA

  • Developed compiler optimization controller for phase ordering and parameter tuning

System Software Engineer, Compiler

Jun 2015 - Jul 2017

NVIDIA

  • LLVM compiler optimizations for Tegra Graphics and CUDA
  • DWARF 2.0 debug frame support for CUDA 9.0

Intern, Compiler

Jun 2014 - Apr 2015

NVIDIA

  • PBQP register allocator - improved 98% of graphics/compute use cases

Talks

Education

MS Computer Science

Stony Brook University

2017 - 2019

BTech Computer Engineering

VIT Pune

2011 - 2015

Recognition

F8 Hackathon Finalist

Presented to Mark Zuckerberg

Hugging Face Shoutout

DistilGPT-2 via onnx-coreml

GenAI Hackathon Judge

GenLab hackathon

Skills

C++, Python, PyTorch, LLVM, ONNX, CoreML, TensorFlow, Quantization, Compilers, On-device ML