ONNX

This is my blog on how a technology that Meta introduced for interoperability between PyTorch and Caffe2 models has now started to take over the planet.

Initial Days

Meta introduced ONNX as a way to achieve interoperability between their PyTorch and Caffe2 models. Later on, it was adopted by other ML frameworks such as TensorFlow, MXNet, and more.

Next Stage

Even though it was introduced for interoperability, its course changed when Microsoft introduced ONNX Runtime, a runtime environment for ONNX models. Since then, ONNX models have been used on mobile phones, laptops, web browsers, and embedded devices.
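As a rough sketch of what that looks like in JavaScript with the onnxruntime-web package: the model URL and the input name `input` below are placeholders, since input names are defined by whatever model you export.

```javascript
// Minimal sketch: run an ONNX model with onnxruntime-web.
// Assumes the package is installed; the model URL and the
// feed key 'input' are placeholders defined by your model.
async function runModel(modelUrl, inputData, dims) {
  const ort = await import('onnxruntime-web');
  // Create an inference session (backends include wasm, webgl, webgpu).
  const session = await ort.InferenceSession.create(modelUrl);
  // Wrap the raw Float32Array in an ONNX Runtime tensor.
  const tensor = new ort.Tensor('float32', inputData, dims);
  // Run inference; feed keys must match the model's input names.
  return session.run({ input: tensor });
}
```

The same session API is what lets one exported model run on laptops, phones, and browsers alike.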

Why ONNX stands out from the rest

1. Ability to use WebGPU, which gives access to native graphics APIs: Vulkan (Linux & Android), DirectX (Windows), and Metal (iOS & macOS).
2. Ability to do on-device training
    This feature helps avoid sending sensitive data to a cloud server.
3. Compatibility with billion-parameter models such as Phi-4, Whisper, Llama, SAM 2, Gemma, etc.
4. Support for TensorRT (NVIDIA GPUs) & OpenVINO (Intel hardware)
5. APIs for Python, JavaScript, Rust, C++, C#, Java, Kotlin, Swift, etc.
6. Multi-threading support (using Web Workers in web browsers)
7. Graph-level optimization and quantization
8. Major open-source projects such as Cornerstone.js and Transformers.js have already adopted ONNX on their platforms.
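To make point 7 concrete, here is a toy sketch of the symmetric int8 quantization idea that ONNX quantization tooling applies to weight tensors. This is an illustration of the concept, not ONNX Runtime's actual implementation:

```javascript
// Toy sketch of symmetric int8 quantization (the idea behind
// ONNX quantization tooling, not its actual implementation).
function quantizeInt8(weights) {
  // The scale maps the largest absolute weight onto the int8 range [-127, 127].
  const maxAbs = Math.max(...weights.map(Math.abs));
  const scale = maxAbs / 127 || 1; // avoid division by zero
  const q = weights.map((w) => Math.round(w / scale));
  return { q, scale };
}

function dequantizeInt8({ q, scale }) {
  // Recover approximate float weights from the int8 values.
  return q.map((v) => v * scale);
}

const demo = quantizeInt8([0.5, -1.0, 0.25]);
// demo.q → [64, -127, 32]; storing int8 instead of float32
// cuts the weight size to a quarter, at a small accuracy cost.
```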

WebGPU vs WebGL

Both are browser graphics technologies. WebGPU is the successor to WebGL, enabling direct use of compute shaders for advanced graphics and general-purpose parallel processing.

What is WebGL

WebGL is a JavaScript API used to render 2D and 3D graphics in your browser. Released in 2011, it has become the backbone of almost all graphical rendering on the web. Most major web-based rendering frameworks, such as Three.js, Babylon.js, and Cornerstone.js, are built on top of WebGL. Until the last few years, the phrase "rendering on the web" was synonymous with WebGL.

Drawbacks of WebGL

WebGL 2.0 is powered by OpenGL ES 3.0, which was released in 2012. This means that the capabilities of WebGL are rooted in a graphics API that is over a decade old.

There has been significant development in consumer-level GPUs, starting with the GeForce 200 series released in 2008. Since then, the line has progressed through the 300, 400, ..., 900, 10, 20, and now the 40 series. Alongside this, modern graphics APIs like DirectX 12, Vulkan, and Metal have emerged, enabling advanced capabilities in desktop and mobile gaming. However, browser-based games still rely on WebGL, which is based on OpenGL ES 3.0 (released in 2012). This means web-based rendering lags significantly behind desktop and mobile rendering. As a result, web browsers face performance and frame-rate limitations, and many advanced features of modern GPUs remain inaccessible because of WebGL's reliance on outdated technology.

What is WebGPU

WebGPU is the successor to WebGL. Its development started in 2017. It enables us to use the latest graphics card features, and it is supported on Chrome, Edge, Firefox, Safari, etc.
Combined with Web Workers, WebGPU work can run off the main thread, which is helpful for downloading and running a model without blocking the UI.
WebGPU is powered by native graphics APIs such as DirectX 12, Vulkan, and Metal, depending on the platform.
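As a sketch of the Web Worker pattern, the snippet below downloads a model file inside a worker so the main thread stays responsive. The worker body is kept as a string (loaded via a Blob URL) so the example is self-contained; the model URL is a placeholder you would supply:

```javascript
// Sketch: download a model off the main thread with a Web Worker.
// The worker body lives in a string so it can be loaded from a Blob;
// the URL passed to downloadInWorker() is a placeholder.
const workerSource = `
  self.onmessage = async (e) => {
    // fetch() runs here, inside the worker, not on the main thread.
    const res = await fetch(e.data.url);
    const buffer = await res.arrayBuffer();
    // Transfer the buffer back to the main thread (zero-copy).
    self.postMessage({ buffer }, [buffer]);
  };
`;

function downloadInWorker(url) {
  const blob = new Blob([workerSource], { type: 'application/javascript' });
  const worker = new Worker(URL.createObjectURL(blob));
  return new Promise((resolve) => {
    worker.onmessage = (e) => resolve(e.data.buffer);
    worker.postMessage({ url });
  });
}
```

The transferred ArrayBuffer can then be handed to ONNX Runtime Web for inference.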

WebGPU for Machine Learning

ML frameworks such as TensorFlow.js and Transformers.js use WebGPU under the hood. With the WebGL backend, any computation requires converting input tensors into texture or vertex data and then converting the result back into tensor data. WebGPU, on the other hand, supports compute shaders: general-purpose programs that run on the GPU. When a framework can use compute shaders directly, there is no need to convert input data into vertices or textures; the computation runs directly in the shaders. Because of this lower overhead, developers can now run encoder-decoder ML models such as SAM 2, Whisper, Llama, and Gemma on the client side.
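As an illustration, a compute shader is just a small program dispatched across many GPU threads. A minimal WGSL kernel that adds two arrays elementwise might look like this, held in a JavaScript string as it would be before being handed to the WebGPU API:

```javascript
// A minimal WGSL compute shader: elementwise addition of two buffers.
// This is the kind of general-purpose kernel WebGPU exposes that
// WebGL's vertex/fragment pipeline does not.
const addKernel = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> a : array<f32>;
  @group(0) @binding(1) var<storage, read> b : array<f32>;
  @group(0) @binding(2) var<storage, read_write> out : array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    // Each GPU invocation handles one array element.
    out[id.x] = a[id.x] + b[id.x];
  }
`;

// In the browser, this string would be compiled with
// device.createShaderModule({ code: addKernel }) and dispatched
// through a compute pipeline; that part requires navigator.gpu.
```

Tensor ops like matrix multiplication follow the same pattern, just with more arithmetic per invocation.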

Advantages of running ML models on the client side

  • User data stays private on the device.
  • Inference works offline once the model is downloaded.
  • Since no cloud round trip is involved, latency is lower once the model is downloaded.
  • The cost of running the model is zero, since no OpenAI, Anthropic, or AWS API is involved.

Transformers.js

Transformers.js is a powerful library that allows developers to run ML models directly in the browser, leveraging the capabilities of modern web technologies.
Transformers.js is designed to be functionally equivalent to Hugging Face's Python Transformers library, enabling users to run the same pretrained models with a similar API.
This allows for seamless integration of advanced ML tasks without server-side processing.
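As a sketch of that API (assuming the @huggingface/transformers package is installed; the first call downloads the model, and later calls reuse the cached copy):

```javascript
// Sketch: the Transformers.js pipeline API, which mirrors the
// Python Transformers library. Assumes @huggingface/transformers
// is installed; the model downloads on first use.
async function classify(text) {
  const { pipeline } = await import('@huggingface/transformers');
  const classifier = await pipeline('sentiment-analysis');
  return classifier(text); // e.g. [{ label: 'POSITIVE', score: ... }]
}
```

The task string ('sentiment-analysis' here) selects a default model, or you can pass a specific model id as a second argument.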

@huggingface/transformers vs @xenova/transformers

@huggingface/transformers

The official package, published by Hugging Face. Supports a wide range of pretrained models from the Hugging Face Model Hub.

@xenova/transformers

The original package, developed by Xenova, the creator of Transformers.js.
Focuses on providing lightweight models optimized for in-browser execution.

Pre-trained models

Pre-trained models are ML models that have been previously trained on large datasets and can be fine-tuned or used directly for various tasks, e.g., Llama, LLaVA, Phi-3, SAM 2, DeepSeek, etc.

Popular Transformers Pipelines

text2text-generation: LaMini-Flan-T5-783M
translation: nllb-200-distilled-600M, Xenova/m2m100_418M, mbart-large-50-many-to-many-mmt
text-generation (for next word prediction): distilgpt2, codegen-350M-mono
automatic-speech-recognition (for speech to text): whisper-tiny.en, whisper-small
image-to-text: vit-gpt2-image-captioning, trocr-small-handwritten
image-segmentation: detr-resnet-50-panoptic
object-detection: detr-resnet-50
document-question-answering (answering questions about document images): donut-base-finetuned-docvqa
text-to-speech: speecht5_tts, mms-tts-fra
image-to-image: swin2SR-classical-sr-x2-64
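For instance, the translation pipeline above can be pointed at one of the listed checkpoints. The sketch below uses the Xenova/nllb-200-distilled-600M id with NLLB-style language codes; treat the exact codes and options as illustrative:

```javascript
// Sketch: the translation pipeline with an NLLB checkpoint.
// Model id and language codes follow the NLLB convention;
// assumes @huggingface/transformers is installed.
async function translateToFrench(text) {
  const { pipeline } = await import('@huggingface/transformers');
  const translator = await pipeline(
    'translation',
    'Xenova/nllb-200-distilled-600M'
  );
  return translator(text, { src_lang: 'eng_Latn', tgt_lang: 'fra_Latn' });
}
```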

Popular Transformers.js Models

Whisper (OpenAI)
GPT2 (OpenAI)
Llava
Llama (Meta)
SAM (Meta)
Cohere
Gemma2 (Google)
Qwen2
Phi3 (Microsoft)
Mistral
Falcon
MusicGen
YolosObjectDetection
VisionEncoderDecoder