Will we see the GPT-3 moment for computer vision?
It really is the age of large models. Each new model is larger and more advanced than the previous one. Take, for example, GPT-3 – when it was introduced in 2020, it was the largest language model, trained on 175 billion parameters. Fast forward a year, and we already have the GLaM model, which scales to over a trillion parameters. Transformer models like GPT-3 and GLaM are transforming natural language processing. There are active conversations about whether these models will render the roles of writers and even programmers obsolete. While such claims can be dismissed as speculation for now, it cannot be denied that these large language models have truly transformed the field of NLP.
Could this innovation be extended to other fields, such as computer vision? Can we have a GPT-3 moment for computer vision?
GPT for computer vision
OpenAI recently released GLIDE, a text-to-image generator, where researchers applied guided diffusion to the problem of text-conditional image synthesis. For GLIDE, the researchers trained a 3.5 billion parameter diffusion model that uses a text encoder. Then, they compared CLIP (Contrastive Language-Image Pre-training) guidance against classifier-free guidance. They found that samples generated with classifier-free guidance were more photorealistic and reflected a wider range of world knowledge. The most striking feature of this model is that it achieves performance comparable to that of DALL·E with less than a third of its parameters.
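The classifier-free guidance mentioned above has a simple core idea: the diffusion model produces both a text-conditional and an unconditional noise estimate, and the two are blended with a guidance scale that controls how strongly the text prompt steers the sample. The sketch below is illustrative only (not OpenAI's code), and the function name and toy inputs are invented:

```python
# Illustrative sketch of classifier-free guidance. The model's
# unconditional and text-conditional noise predictions are blended;
# the scale s controls how strongly the prompt steers the sample.

def classifier_free_guidance(eps_uncond, eps_cond, s):
    """Blend unconditional and conditional noise estimates element-wise."""
    return [eu + s * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]

# s = 1.0 recovers the plain conditional prediction;
# s > 1.0 pushes the sample further toward the text condition.
print(classifier_free_guidance([0.0, 1.0], [1.0, 1.0], 2.0))
```

Setting `s` above 1 over-emphasises the conditional signal, which is what trades sample diversity for the photorealism the GLIDE researchers observed.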
For the uninitiated, OpenAI had released two models inspired by GPT-3 – DALL·E and CLIP – in early 2021. Trained on 12 billion parameters, DALL·E can render an image from scratch and modify aspects of an existing image using text prompts. CLIP, on the other hand, is a neural network trained on 400 million pairs of images and text. The OpenAI blog said that CLIP, like the GPT family, can perform a variety of tasks, such as optical character recognition and action recognition. In particular, the OpenAI team expected CLIP's main applications to be image classification and generation. However, a year later, CLIP has found various applications, such as content moderation, image search, image similarity, image ranking, object tracking and robotic control.
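Many of the CLIP applications listed above boil down to one operation: the image encoder and text encoder map their inputs into the same embedding space, and cosine similarity ranks candidate captions for an image. The toy embeddings below are invented purely for illustration:

```python
# Hedged sketch of how a CLIP-style model scores image-text pairs.
# Both encoders map to a shared embedding space; cosine similarity
# ranks candidate captions. The embeddings here are made up.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

image_embedding = [0.9, 0.1, 0.0]          # hypothetical image-encoder output
caption_embeddings = {                      # hypothetical text-encoder outputs
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
}
best = max(caption_embeddings,
           key=lambda c: cosine_similarity(image_embedding, caption_embeddings[c]))
print(best)
```

This ranking step is the same whether the downstream task is zero-shot classification, image search or content moderation; only the set of candidate texts changes.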
Even before GLIDE, DALL·E, and CLIP, OpenAI researchers had released Image GPT. According to the team, the motivation behind this research was that, just as a large transformer model can be trained on language, similar models can be trained on pixel sequences to generate coherent image completions and samples.
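The pixel-sequence idea behind Image GPT can be made concrete in a few lines: an image is flattened into a 1-D sequence in raster order, so the same next-token objective used for text can predict the next pixel. A minimal sketch, with a toy 2x2 grayscale "image":

```python
# Sketch of the core idea behind Image GPT: flatten a 2-D image into a
# 1-D pixel sequence so an autoregressive model can predict each pixel
# from the ones before it, exactly like next-token prediction on text.

def flatten_raster(image):
    """Flatten a 2-D grid of pixel values into a 1-D sequence, row by row."""
    return [pixel for row in image for pixel in row]

image = [[0, 255],
         [128, 64]]
sequence = flatten_raster(image)
print(sequence)
```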
Although the GLIDE and DALL·E models are much smaller than GPT-3, they can be considered precursors to similarly large models for computer vision.
"GPT-3 has enabled building models out of the box, on top of powerful natural language models trained on probably the largest linguistic datasets, which allows it to be used for different tasks with just additional tuning and no training from scratch. Computer vision and image processing are ready for a similar leap in highly generalised and applicable AI models – the image datasets, use cases and computational capabilities to train them are readily available. OpenAI itself is working on Image GPT, tackling the generation and completion of images," said Arvind Saraf, Engineering Manager, Drishti Technologies.
He added, "While these highly generalised models have great potential to extend the reach and use of computer vision technology, like any standard neural network architecture, they are likely to have limitations. The long-term ethical implications and the potential for issues such as identity theft have yet to be explored. However, the potential applications are immense, such as scene reconstruction, event detection and motion estimation in video analysis, and 3D scene modelling."
Having GPT-like models for computer vision comes with its own challenges. Speaking about these, Kyle Fernandes, Co-Founder of Memechat, said, "Companies like EleutherAI and OpenAI are working on image GPTs. These models use both natural language and computer vision concepts. Models like GPT require a lot of data; even smaller models like Ada are 25 GB. Loading 25 GB onto a GPU is huge – imagine that for a model with 175 billion parameters. That is why, while these models are great, you need huge resources to run them. Large companies may be able to afford these kinds of resources, but small businesses can struggle to get their hands on this technology. One possible solution might be to create smaller models and tune them for a specific purpose."
Other Transformer Models for Computer Vision
One of the most popular Transformer models for computer vision is Google's aptly named Vision Transformer (ViT). It adapts the Transformer architecture from natural language processing by representing input images as sequences of patches. It predicts class labels for the image and allows the model to independently learn the structure of the image. Google has claimed that ViT can outperform state-of-the-art CNNs with four times fewer computational resources when trained on sufficient data.
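The "images as sequences" step that ViT relies on is simply patchification: the image is cut into fixed-size patches, each patch is flattened into a vector, and the resulting sequence is fed to a standard Transformer encoder. A pure-Python toy version (ViT itself also adds a linear projection and position embeddings, omitted here):

```python
# Sketch of ViT-style patchification: split a 2-D image into fixed-size
# patches and flatten each patch, yielding the token sequence a
# Transformer encoder would consume.

def image_to_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened patch vectors."""
    patches = []
    for i in range(0, len(image), patch_size):
        for j in range(0, len(image[0]), patch_size):
            patch = [image[i + di][j + dj]
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            patches.append(patch)
    return patches

image = [[1,  2,  3,  4],
         [5,  6,  7,  8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
patches = image_to_patches(image, 2)
print(patches)
```

A 4x4 image with 2x2 patches yields a sequence of four tokens, which is why ViT's sequence length grows with image size divided by patch size.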
As an improvement over Google's ViT, Facebook released the Data-efficient Image Transformer (DeiT). DeiT can be trained on 1.2 million images, compared with the hundreds of millions of images ViT required. It is based on a transformer-specific knowledge-distillation procedure that reduces the amount of data needed for training.
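The knowledge distillation behind DeiT trains the student against both the true label and a teacher network's prediction. The sketch below shows a simplified hard-distillation variant (the teacher's argmax used as a second label), with a toy cross-entropy on probability vectors; the function names, weights and numbers are invented for illustration and are not DeiT's actual loss code:

```python
# Simplified hard-label distillation in the spirit of DeiT: the student's
# loss averages cross-entropy against the true label and against the
# teacher's predicted label. All values here are toy examples.
import math

def cross_entropy(probs, label):
    return -math.log(probs[label])

def hard_distillation_loss(student_probs, true_label, teacher_probs):
    teacher_label = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return 0.5 * cross_entropy(student_probs, true_label) + \
           0.5 * cross_entropy(student_probs, teacher_label)

student = [0.7, 0.2, 0.1]   # student's predicted class probabilities
teacher = [0.6, 0.3, 0.1]   # teacher's predicted class probabilities
print(hard_distillation_loss(student, true_label=0, teacher_probs=teacher))
```

Because the teacher supplies an extra training signal per image, the student can reach good accuracy from far less data, which is the data-efficiency DeiT is named for.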
The Facebook researchers also built the DEtection TRansformer (DETR), which combines a set-based global loss that forces unique predictions via bipartite matching with a transformer encoder-decoder architecture. DETR uses a CNN to compute representations of the input images, adds positional encodings, and then passes the result to a transformer encoder.
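The bipartite matching at the heart of DETR's set-based loss assigns each predicted box to at most one ground-truth object so that the total matching cost is minimal. DETR itself uses the Hungarian algorithm; the brute-force sketch below, with an invented cost matrix, is fine at toy scale and shows the same idea:

```python
# Minimal sketch of DETR-style bipartite matching: find the one-to-one
# assignment of predictions to ground-truth objects with minimal total
# cost. Brute force over permutations stands in for the Hungarian
# algorithm used in the real model.
from itertools import permutations

def best_matching(cost):
    """cost[i][j] is the cost of matching prediction i to ground-truth j."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = list(perm), total
    return best_perm, best_total

cost = [[1, 9, 8],
        [9, 2, 9],
        [8, 9, 3]]
print(best_matching(cost))
```

Forcing a unique match per object is what lets DETR drop hand-designed components such as non-maximum suppression.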