This article was originally posted on our company website. Lakera’s developer platform enables ML teams to ship fail-safe computer vision models.
Deploying state-of-the-art machine learning models often leads to a myriad of issues because of their dependencies on heavyweight packages, most commonly PyTorch and TensorFlow. At Lakera, we have released an implementation of OpenAI’s CLIP model that completely removes the need for PyTorch, enabling you to quickly and seamlessly install this fantastic model in production and on edge devices.
CLIP (Contrastive Language-Image Pre-Training) is powering some of the most exciting image-to-text applications out there right now. It’s a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3. Three main components make up this model (a minimal usage sketch follows the list):
- The text tokeniser, which converts the given natural language into a sequence of token IDs.
- The image preprocessor, which resizes, crops, and normalises the given image into a tensor the model can consume.
- The CLIP model itself, which encodes the tokens and the preprocessed image into embeddings and outputs the cosine similarities between them.
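To make that flow concrete, here is a minimal sketch of how the three pieces fit together when using OpenAI’s reference (PyTorch-based) implementation. The image path and text prompts are placeholders:

```python
import torch
import clip  # OpenAI's reference implementation (PyTorch-based)
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 1. Tokenise the candidate text snippets into token IDs.
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# 2. Preprocess the image into a normalised tensor.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

# 3. Run the model: it encodes both inputs and scores their similarity.
with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # relative match of each text snippet to the image
```

Every one of these steps pulls in PyTorch, which is exactly the dependency we want to drop.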
The main issue we have found is that all three of these pieces utilise PyTorch — so we decided to simplify things for you!
We achieved this with the following steps:
- We rewrote the text tokeniser in NumPy.
- We wrote our own image preprocessor, which mimics the functionality of CLIP’s preprocessor.
- We exported the CLIP model to the ONNX format, meaning that we have essentially swapped the PyTorch dependency for the lightweight onnxruntime package. A sketch of the resulting PyTorch-free pipeline is shown below.
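As a rough illustration of what a PyTorch-free pipeline looks like, here is a minimal sketch built directly on onnxruntime. The file name "clip.onnx", the input names "IMAGE" and "TEXT", the output ordering, and the zero-filled token placeholder (standing in for the NumPy tokeniser) are illustrative assumptions, not the exact artefacts the package ships:

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# CLIP's published image normalisation constants.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Resize, centre-crop and normalise an image, mimicking CLIP's preprocessor."""
    image = image.convert("RGB")
    # Resize so the shorter side equals `size`, then centre-crop to size x size.
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))
    pixels = (np.asarray(image, dtype=np.float32) / 255.0 - MEAN) / STD
    return pixels.transpose(2, 0, 1)[np.newaxis]  # shape (1, 3, size, size)


# "clip.onnx" and the input names below are illustrative placeholders.
session = ort.InferenceSession("clip.onnx", providers=["CPUExecutionProvider"])

image_batch = preprocess(Image.open("example.jpg"))
# Token IDs for the candidate texts, as produced by the NumPy tokeniser (step 1);
# shown here as a zero-filled placeholder with CLIP's context length of 77.
text_tokens = np.zeros((2, 77), dtype=np.int64)

# Assumes the exported graph exposes two outputs, with the image-text logits first.
logits_per_image, _ = session.run(None, {"IMAGE": image_batch, "TEXT": text_tokens})

# Numerically stable softmax over the text candidates.
exp = np.exp(logits_per_image - logits_per_image.max(axis=-1, keepdims=True))
print(exp / exp.sum(axis=-1, keepdims=True))
```

The only runtime dependencies left are NumPy, Pillow, and onnxruntime, which is what makes the model practical to run in production and on edge devices.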
Try it out! Don’t forget to give it a star and reach out if you have any feedback!
Written by Daniel Timbrell