September 24, 2023

Transformers are a type of neural network architecture that began gaining popularity a few years ago.

If you enjoy articles about A.I. at the intersection of breaking news, join AiSupremacy here. I cannot continue to write without community support (follow the link below).

The meta-trend here is that self-attention is spreading beyond NLP to become a hot general-purpose architecture for machine learning.

What is a Transformer in A.I. Anyways?

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It has been used primarily in the fields of natural language processing (NLP) and computer vision (CV). However, is that changing?

Before getting into Transformers, let’s understand why researchers were interested in building something like them despite already having MLPs, CNNs and RNNs. Think about it: Transformers were originally designed to help with language translation. So how did they become something that’s being adopted in so many places?

For almost a decade, convolutional neural networks (CNNs) dominated computer vision research all around the globe. However, a new method is being proposed which harnesses the power of transformers to make sense out of images.

Transformers enable modelling long dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM).

Image: Alexey Dosovitskiy

Transformers – More than Meets the Eye

Transformers are revolutionizing the field of natural language processing with an approach known as attention. That’s just the beginning for this new type of neural network.

Everyone wants a universal model that solves different tasks with accuracy and speed. Transformers use the concept of an attention mechanism. Complementary to other neural architectures like convolutional neural networks and recurrent neural networks, the transformer architecture brings new capability to machine learning.

A.I.’s toolbox has been growing over the last few years, and Transformers are a brilliant new addition. That versatile new hammer is a kind of artificial neural network — a network of nodes that “learn” how to do some task by training on existing data — called a transformer. It was originally designed to handle language, but has recently begun impacting other AI domains.

Self Attention

The self-attention mechanism is a type of attention that allows every element of a sequence to interact with every other element and determine which ones it should pay more attention to.

The idea behind it is that there might be relevant information in every word of a sentence. So for the decoding to be precise, it needs to take every word of the input into account, using attention.
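To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The matrix names (`Wq`, `Wk`, `Wv`) and the toy sizes are illustrative assumptions, not taken from any particular implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (seq_len x d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # every element scores every other element
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per row
    return weights @ V                         # weighted sum of the value vectors

# toy example: a "sentence" of 4 elements, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one context-aware vector per input element
```

Note how each output row mixes information from the entire sequence at once — this is the all-to-all interaction described above, with no notion of locality built in.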

Transformers Were a Bitcoin Moment for A.I., 8 Years Later

In some ways “Transformers” is like a Bitcoin moment for A.I. Research. The transformer first appeared in 2017 in a paper that cryptically declared that “Attention Is All You Need.” In other approaches to AI, the system would first focus on local patches of input data and then build up to the whole.

You might be aware of the Elon Musk Satoshi (2009) meme that’s trending this week. This meme shows that the pseudonym ‘Satoshi Nakamoto’ is an assemblage of letters from the names of four electronic companies: Samsung, Toshiba, Nakamichi and Motorola. 

Obviously Transformers have certain advantages in A.I.

In a language model, for example, nearby words would first get grouped together. The transformer, by contrast, runs processes so that every element in the input data connects, or pays attention, to every other element. Researchers refer to this as “self-attention.” This means that as soon as it starts training, the transformer can see traces of the entire data set.

Back in the day pre 2017 you could even say that NLP was, in a sense, behind computer vision. Transformers changed that.

Transformers quickly became the front-runner for applications like word recognition that focus on analyzing and predicting text. It led to a wave of tools, like OpenAI’s Generative Pre-trained Transformer 3 (GPT-3), which trains on hundreds of billions of words and generates coherent new text to an unsettling degree. Now even OpenAI has a lot of competition.

Transformers Are Spreading in the 2020s

While you could argue some aspects of deep learning did hit a wall in the 2015 to 2020 timeframe, Transformers changed that to some extent. You could almost even say that Transformers are a kind of augmented deep learning.

Researchers in the early 2020s already report that transformers are proving surprisingly versatile. In some vision tasks, like image classification, neural nets that use transformers have become faster and more accurate than those that don’t. Emerging work in other AI areas — like processing multiple kinds of input at once, or planning tasks — suggests transformers can handle even more.

Transformers went from helping NLP catch up to powering forward computer vision. Transformers really do seem to be quite transformational across many problems in machine learning, including computer vision as of 2021 and now into 2022.

So what?

Once the attention scores are calculated using the comparisons above, the results are sent through simple feed-forward layers, with some normalization at the end. During training, the weights that produce these attention vectors are learnt.
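As a rough sketch of that post-attention step — with made-up shapes, and randomly initialized weights standing in for learned ones — a simplified encoder block applies a position-wise feed-forward network and a normalization on top of the attention output:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and unit variance across its features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward net: applied independently to each sequence element."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear layers

rng = np.random.default_rng(1)
attn_out = rng.normal(size=(4, 8))                # stand-in for the attention output vectors
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)

# residual connection around the feed-forward block, then a normalization
block_out = layer_norm(attn_out + feed_forward(attn_out, W1, b1, W2, b2))
print(block_out.shape)
```

Real transformer blocks stack many of these layers and place normalizations around the attention step as well; this is just the skeleton of one pass.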

Think about it: just 10 years ago, disparate subfields of AI had little to say to each other. Deep learning’s upgrade has some serious potential.

Just as regular transformers use words to learn about sentences, the Vision Transformer uses pixels to achieve a similar result for images. However, there is a catch. In contrast to words, individual pixels do not convey any meaning by themselves, which is one of the reasons we shifted to convolutional filters that operate on groups of pixels.

Transformers Might be a Convergence Ceremony for Deep Learning

Today in 2022, we have good reason to want to try transformers for the entire spectrum of AI tasks.

This self-attention has mysteries to unlock. It allows capturing long-term information and dependencies between sequence elements.

Another development demonstrating the flexibility of transformers was made by researchers at UC Berkeley, Facebook AI, and Google, who showed you don’t have to fine-tune the core parameters of a pre-trained language transformer to obtain very strong performance on a different task.

The computer scientist Alexey Dosovitskiy helped create a neural network called the Vision Transformer, which applied the power of a transformer to visual recognition tasks.

From Language to Vision

One of the most promising steps toward expanding the range of transformers began just months after the release of “Attention Is All You Need.” Alexey Dosovitskiy, a computer scientist then at Google Brain Berlin, was working on computer vision, the AI subfield that focuses on teaching computers how to process and classify images.

From Moscow to Berlin, Alexey really has contributed a gem to us.

  • Dosovitskiy was working on one of the biggest challenges in the field, which was to scale up CNNs to train on ever-larger data sets representing images of ever-higher resolution without piling on the processing time.
  • But then he watched transformers displace the previous go-to tools for nearly every AI task related to language. “We were clearly inspired by what was going on,” he said. “They were getting all these amazing results. We started wondering if we could do something similar in vision.” The idea made a certain kind of sense — after all, if transformers could handle big data sets of words, why not pictures?

The rest is history and Transformers are the new kid on the block.

I’m pretty sure it was actually the spirit of Optimus Prime that incarnated into Alexey during those crucial months.

More Than Meets the Eye – Computer Vision Application

The eventual result was a network dubbed the Vision Transformer, or ViT, which the researchers presented at a conference in May 2021. The architecture of the model was nearly identical to that of the first transformer proposed in 2017, with only minor changes allowing it to analyze images instead of words. “Language tends to be discrete,” said Rumshisky, “so a lot of adaptations have to discretize the image.”

This felt like cutting edge stuff in 2020, how quickly the world changes. Now we talk about Transformers in AI all the time.

The ViT team knew they couldn’t exactly mimic the language approach since self-attention on every pixel would be prohibitively expensive in computing time. Instead, they divided the larger image into square units, or tokens. Tokens you say? You see how my Bitcoin analogy is holding up?

But by processing pixels in groups, and applying self-attention to each, the ViT could quickly churn through enormous training data sets, spitting out increasingly accurate classifications.
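A minimal sketch of that patching step, assuming a toy 32×32 RGB image and 8×8 patches (the function name and the sizes are illustrative, not ViT’s actual configuration):

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an H x W x C image into flattened square patches ("tokens")."""
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    return (img[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)   # carve the grid of patches
            .transpose(0, 2, 1, 3, 4)               # group the two patch axes together
            .reshape(rows * cols, patch * patch * C))  # one flat vector per patch

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = image_to_patches(img, patch=8)
print(tokens.shape)  # 16 patch tokens, each flattened to 8*8*3 = 192 values
```

Each of those 16 token vectors then plays the role a word embedding plays in NLP: the self-attention layers operate on patches, not on individual pixels, which keeps the cost manageable.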

Transformers are kind of a glorious “win” for Deep Learning.

It is common to train large versions of these models and fine-tune them for different tasks, so they are useful even when data is scarce. Performance of these models, even with billions of parameters, does not seem to saturate.

The Future of Deep Learning

Self-supervised learning and Transformers changed the trajectory of AI in the 2020s, as, thanks to the pandemic, Google, Microsoft and Facebook have a lot more money to dedicate to A.I.-related R&D. People don’t easily realize this as a main element of the digital transformation that took place due to and because of the pandemic. Meta’s very future in the Metaverse might depend upon the work that Meta AI is doing.

ViT’s success suggested that maybe convolutions aren’t as fundamental to computer vision as researchers believed. It was the death knell for CNNs in some respect.

“I think it is quite likely that CNNs will be replaced by vision transformers or derivatives thereof in the midterm future,” said Neil Houlsby of Google Brain Zurich, who worked with Dosovitskiy to develop ViT.

How Transformers and SSL evolve in the 2020s could be among the most impactful things occurring in artificial intelligence at scale. This will likely lead to A.I. becoming more able to process information simultaneously across domains, in a truly parallel, multi-sensory sense of the term.

Researchers routinely test their models for image classification on the ImageNet database, and at the start of 2022, an updated version of ViT was second only to a newer approach that combines CNNs with transformers. CNNs without transformers, the longtime champs, barely reached the top 10.

This suggests that Transformers, which arrived in 2017, have become more or less dominant in certain tasks in just five years. I have a feeling Optimus Prime will strike again soon.

Maithra Raghu, a computer scientist at Google Brain, analyzed the Vision Transformer to determine exactly how it “sees” images. The story of Transformers continues.

CNNs vs. Transformers

“In CNNs, you start off being very local and slowly get a global perspective,” said Raghu. A CNN recognizes an image pixel by pixel, identifying features like corners or lines by building its way up from the local to the global.

But in transformers, with self-attention, even the very first layer of information processing makes connections between distant image locations (just as with language). If a CNN’s approach is like starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.

Now that it was clear transformers processed images fundamentally differently from convolutional networks, researchers only grew more excited.

An AI-Generated World

You may have heard of companies selling synthetic data for A.I. training purposes. Well.

Now researchers want to apply transformers to an even harder task: inventing new images. Language tools such as GPT-3 can generate new text based on their training data.

In a paper presented last year, Wang combined two transformer models in an effort to do the same for images, a much harder problem. When the double transformer network trained on the faces of more than 200,000 celebrities, it synthesized new facial images at moderate resolution.

A Transformer-based GPT-4 era might just bring in the zeitgeist of the Deepfake era, if you think about it.

Read Paper on TransGAN

Wang argues that the transformer’s success in generating images is even more surprising than ViT’s prowess in image classification. “A generative model needs to synthesize, needs to be able to add information to look plausible,” he said. And as with classification, the transformer approach is replacing convolutional networks.

Incredibly, this is already a year old.

What Convergence of Transformers Eh?

Raghu and Wang see potential for new uses of transformers in multimodal processing — a model that can simultaneously handle multiple types of data, like raw images, video and language.

This will functionally mean A.I. could create its own YouTube channel in the not-so-distant future. You might actually think it was made by a human.

Google seems to have taken the lead in the use of Transformers.

Emerging work suggests a spectrum of new uses for transformers in other AI domains, including teaching robots to recognize human body movements, training machines to discern emotions in speech, and detecting stress levels in electrocardiograms. Another program with transformer components is AlphaFold, which made headlines last year for its ability to quickly predict protein structures — a task that used to require a decade of intensive analysis.

Anyways you get the gist of what a Transformer is now right?

We need more computational power to bring the convergence of Transformers into reality though. A transformer requires a higher outlay of computational power in the pre-training phase before it can beat the accuracy of its conventional competitors. So with every Optimus Prime moment, there’s a new bottleneck to overcome.

Many A.I. researchers remain optimistic that new hybrid architectures drawing on the strengths of transformers, in ways today’s researchers can’t predict, are highly likely. What do you think? Share it with me in a comment below.
