The field of digital image creation has recently emerged as a playground for artificial intelligence. The interface typically relies on textual input: a user enters keywords or a detailed description into the computer and the AI then generates an image that meets the criteria. The request can be as detailed as "a green husky riding a unicycle on the freeway, drawn in a comic book style" or as general as "a fantasy castle".
Artificial intelligence today can generate intricate and accurate drawings, respond to a range of requests, and emulate a specific artistic style, such as oil painting, in a matter of minutes. There are dozens of different AI programs available, each with its own unique focus, offering users different options and showcasing a diverse range of approaches to image creation.
Some programs of this type have even expanded to video creation. However, AI-generated art still has its limitations and elements that it struggles to create correctly.
One natural question that arises is how artificial intelligence can generate original images, and whether these images can be considered art. While the former question has a clear answer, the latter is more elusive and subject to interpretation, and thus will not be addressed here.
More precisely, the challenge faced by current AI systems is well defined and limited in scope: to create images based on verbal descriptions provided to them, such as "two cats in a basket". Ultimately, the quality of the product is judged by how well it fits the verbal description.
To understand how this is done, we will focus here on two of the most advanced AI programs available today: DALL-E 2 by OpenAI and Imagen by Google. These are of course just two examples out of a wide range of AI artists that are currently available on the market, each offering a vast array of interfaces and options.
Since the task is defined as "creating an image based on a verbal description," it can be broken down into two main components: understanding the verbal requirements and creating an image that meets them.
As humans, we are very accustomed to verbal communication, and the first part seems straightforward to us. But in fact, developing the software to accomplish this is a crucial step. The part of the AI system dedicated to understanding language is called a text encoder. DALL-E 2 and Imagen perform this task differently.
DALL-E 2 uses a model called Contrastive Language-Image Pre-training (CLIP). In this method, the AI is exposed to a vast database of images - hundreds of millions of them - each paired with a verbal description of its contents.
The AI learns the connections between the verbal content of the descriptions and the visual features of the images through a process of training. As a result, the AI is also able to determine how well the images it generates match the user’s input text.
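To illustrate the idea behind this contrastive training, here is a minimal Python sketch that scores how well each image in a small batch matches each caption. The embeddings below are random toy vectors standing in for the learned encoders; this is an illustration of the principle, not OpenAI's actual model.

```python
import numpy as np

def contrastive_scores(image_embeddings, text_embeddings):
    """Cosine similarity between every image and every caption.

    In CLIP-style training, matching pairs (the diagonal of the matrix)
    should score higher than all mismatched pairs.
    """
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return img @ txt.T  # shape: (number of images, number of captions)

# Toy batch: three images and their three captions, embedded in a shared 4-D space.
# Real systems use learned encoders and hundreds of dimensions.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(3, 4))
text_embeddings = image_embeddings + 0.1 * rng.normal(size=(3, 4))  # each caption lands near its image

scores = contrastive_scores(image_embeddings, text_embeddings)
print(scores.round(2))  # the diagonal should dominate each row
```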
In contrast, Imagen relies on a pre-trained and frozen language model called T5-XXL. T5 stands for "Text-To-Text Transfer Transformer," and XXL denotes the largest version of the model. T5-XXL was trained on text datasets containing questions and answers (such as "What do people do at night? They sleep"), categorizations (such as "Cats are animals"), and translations from one language to another.
The training aims to teach the model to complete passages accurately: to answer questions, translate text, or classify entities. Because the model does not rely on images, the text corpus available for its training is far larger than the paired image-caption data available to the CLIP approach.
Imagen was then trained on a dataset of images with captions, part of which came from Google's internal dataset and another part from public sources. Google's engineers chose not to train Imagen's language encoding - the way text is translated into a representation the AI can work with - on their own. Instead, they relied on the encoding produced by T5-XXL and kept it frozen.
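As a rough sketch of what a "frozen" text encoder means in practice, the snippet below uses the open-source Hugging Face transformers library with a small public T5 checkpoint (t5-small) as a stand-in for the far larger T5-XXL. It turns a prompt into a fixed numerical representation without ever updating the encoder's weights.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# A small public checkpoint stands in for T5-XXL, which works the same way but is far larger.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()
for parameter in encoder.parameters():
    parameter.requires_grad = False  # "frozen": no training signal ever changes the encoder

prompt = "two cats in a basket"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embedding = encoder(**tokens).last_hidden_state

print(text_embedding.shape)  # (1, number of tokens, embedding dimension)
```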
Once the artificial intelligence has been equipped with a model of language comprehension, the process of training it to create images begins. There are several methods to train artificial intelligence for such tasks. Both DALL-E 2 and Imagen use a technique called "diffusion models," in which they start the training with a known image, and gradually add noise to it.
The noise in this case is not auditory, but rather random distortions in the color and intensity of each dot (pixel) in the image. In general, any signal transmitted through a communication channel can contain various types of noise - in color, in an electrical signal, or in speech. Noise appears as random signals that overlap the desired signal, in the same channel and at the same time.
In the auditory case, for example, the noise we are familiar with from everyday life consists of sound waves that emanate from nearby objects and processes in the immediate environment and interfere with the reception of the desired signal.
Such signals will be considered noise if they have no meaning from the listener's point of view (white noise), or if they have significance but are simply not the signal that the listener is trying to pick up - for example, loud music that interrupts a conversation.
For our purpose, noise has two important characteristics. The first is its intensity, which refers to the magnitude of the fluctuations that can be expected. The second is the average value around which these fluctuations occur. The intensity of noise is a well-known property and it is easy to find many everyday examples related to sound.
The average value of the noise, on the other hand, is not apparent in the auditory context, because sound is carried by waves - symmetric oscillations around the air's resting state - so the mean value of the oscillation is zero.
With colors, by contrast, noise can have an average value different from zero. One of the ways in which computers represent color is by assigning each pixel in an image three numerical values that correspond to the amounts of red, green, and blue present in that pixel. Different combinations of these values can reproduce virtually all the colors that the human eye is able to see.
Therefore, noise in color can appear, for example, as a disturbance that randomly changes the amount of green in a pixel, increasing it on average compared to the original value.
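A minimal sketch of such color noise: the toy image below is uniformly green, and Gaussian noise with a non-zero mean is added to its green channel, so the pixels fluctuate randomly but become greener on average. The sizes and values here are illustrative only.

```python
import numpy as np

# A 4 x 4 RGB image: three values per pixel (red, green, blue), each between 0 and 255.
image = np.zeros((4, 4, 3))
image[..., 1] = 100.0  # a uniform, mid-intensity green

rng = np.random.default_rng(1)
mean, strength = 20.0, 5.0  # the average value of the noise and the size of its fluctuations
noise = rng.normal(loc=mean, scale=strength, size=image.shape[:2])

noisy = image.copy()
noisy[..., 1] = np.clip(noisy[..., 1] + noise, 0, 255)

print(image[..., 1].mean(), noisy[..., 1].mean())  # the green channel is brighter on average
```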
Artificial intelligence training begins with a clear image that the AI is familiar with or that the programmers have identified for it, to which noise is gradually added in a controlled manner. The noise is added in small increments, so that at each step the AI is able to recognize the change caused by the noise and learn from it.
The process continues until only noise remains in the image, with known and uniform characteristics that are not dependent on the original image or the AI model applied. An example of uniform noise is the "snow" that used to appear on old television screens when they were turned on without receiving any channel.
This noising process teaches the AI exactly what kind of noise must be applied to a given image in order to ultimately reach uniform noise, as well as the intermediate steps along the way: how much noise was added at each step, what the average value of the resulting image was, and how strong the noise fluctuations around that value were.
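The following sketch shows this forward half of the process on a toy grayscale "image": a little zero-mean Gaussian noise is mixed in at every step, and the intermediate snapshots are exactly what the model learns from. The step count and mixing factor are illustrative assumptions, not the values used by DALL-E 2 or Imagen.

```python
import numpy as np

def add_noise_gradually(image, steps=200, beta=0.02, seed=0):
    """Forward diffusion sketch: mix the image with a little Gaussian noise
    at every step, until only (near-)uniform noise remains."""
    rng = np.random.default_rng(seed)
    x = image.astype(float).copy()
    snapshots = [x.copy()]
    for _ in range(steps):
        noise = rng.normal(size=x.shape)                 # zero-mean random fluctuations
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        snapshots.append(x.copy())
    return snapshots                                     # the intermediate steps the AI learns from

# Toy 8 x 8 grayscale image with values between -1 and 1.
toy_image = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
snapshots = add_noise_gradually(toy_image)
print(snapshots[0].std(), snapshots[-1].std())  # the last snapshot is essentially pure noise
```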
In the next stage, where the artificial intelligence is asked to create an image on demand, it performs the same process in reverse. It starts from the final result of the noise process - uniform noise - and cleans it in a series of small steps until it reaches a clean image that meets the demand.
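The reverse direction can be sketched the same way. Here predict_noise is only a placeholder for the trained network that estimates the noise present in the current image, guided by the text embedding; real samplers also use a learned noise schedule and re-inject small amounts of fresh noise at each step, details omitted in this simplified version.

```python
import numpy as np

def generate_image(predict_noise, text_embedding, shape=(8, 8), steps=200, beta=0.02, seed=0):
    """Reverse-diffusion sketch: start from pure noise and clean it step by step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)                      # uniform noise, like television "snow"
    for step in reversed(range(steps)):
        noise_estimate = predict_noise(x, step, text_embedding)  # the trained model's guess
        x = (x - np.sqrt(beta) * noise_estimate) / np.sqrt(1.0 - beta)
    return x

# A dummy predictor (it "sees" no noise) just demonstrates the plumbing;
# a real model's predictions are what steer the result toward the prompt.
dummy_predictor = lambda x, step, embedding: np.zeros_like(x)
result = generate_image(dummy_predictor, text_embedding=None, steps=10)
print(result.shape)
```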
This process is reminiscent of the approach expressed by the Renaissance artist Michelangelo towards the craft of sculpting: "The finished sculpture already exists in the block of marble. I just need to remove the excess material with a chisel," he explained.
Imagen needs three steps in order to create high-resolution images. First, it takes a "picture" of noise and generates from it a low-resolution version of the desired image in a series of small steps.
Once a clean image is obtained, it enhances the resolution by turning each pixel into a group of pixels at an intermediate resolution, and again removes the noise until a clean image is obtained. In the third step, it repeats the previous step at a higher and final resolution.
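Schematically, the three stages can be written as the short cascade below. Every model argument is a placeholder for one of Imagen's trained diffusion models, the resolutions are examples, and the naive pixel duplication stands in for the step of turning each pixel into a group of pixels before noise is removed again.

```python
import numpy as np

def enlarge(image, factor):
    """Turn each pixel into a factor x factor block of identical pixels."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def imagen_style_cascade(base_model, super_res_1, super_res_2, text_embedding):
    """Three-stage sketch: a low-resolution image, then two rounds of
    'enlarge, then remove noise again' to reach the final resolution."""
    low = base_model(text_embedding)                       # e.g. 64 x 64 pixels
    mid = super_res_1(enlarge(low, 4), text_embedding)     # e.g. 256 x 256 pixels
    high = super_res_2(enlarge(mid, 4), text_embedding)    # e.g. 1024 x 1024 pixels
    return high

# Dummy stand-ins just to show the shapes flowing through the cascade:
base = lambda embedding: np.zeros((64, 64))
super_res = lambda img, embedding: img
print(imagen_style_cascade(base, super_res, super_res, text_embedding=None).shape)  # (1024, 1024)
```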
In a paper Google released in 2022, the company claims that its choice to focus on improving the language encoder, rather than the image creation mechanism, gave Imagen an advantage over competing AIs, including DALL-E 2, VQ-GAN, and LDM. The comparison was made on a benchmark collection of text prompts, for which both Imagen and the competitors generated images.
At this stage, human judges examined the images and were asked to determine which of the images better met the demands. According to Google, Imagen exhibited significantly better performance than its competitors.
DALL-E 2's model was trained on filtered databases that are relatively free of offensive or harmful content. In contrast, Imagen's model was trained on unfiltered databases. On the one hand, this is an advantage, since larger datasets allow for more comprehensive training. The problem is that it also increases the likelihood that the artificial intelligence will produce offensive content in response to a user's request.
In addition, images of people may sometimes reflect social stereotypes that may be offensive, due to the nature of the data on which the artificial intelligence is trained. Therefore, Google decided not to open its interface to the general public at this stage. On the other hand, DALL-E 2 is open to all.
Despite the recent success of artificial intelligence in generating images, these systems still have weak points when it comes to certain objects. For example, AI still has difficulty accurately depicting hands, human body proportions, and legible text within images.
It is possible that there are additional flaws in the processing of the images that are not immediately apparent to us, and that the weak points mentioned are simply more noticeable to us because we as humans have an inherent ability to recognize when something is amiss with the human form.
Already at a very early stage in the study of artificial intelligence, researchers focused their efforts on the field of image processing. This choice was largely due to the nature of the task, which requires the identification of patterns in visual information.
This is a task that the human brain performs easily and naturally, as do the brains of many other animals. For traditional computer programs, however, which had to be supplied in advance with the distinguishing characteristics of the items they were asked to identify in images, the same task required complex and sophisticated processing.
Over the years, technology companies have enhanced AI's ability to identify diverse objects in images, and these systems can now perform the opposite, and far more complex, task of creating an image based on a description.
We will not delve into the question of whether artificial intelligence demonstrates creativity and originality - this is a philosophical question that falls outside the realm of technical or scientific discussions. What is clear is that this is a noteworthy achievement for this developing technology. This new capability marks a fundamental difference from the challenges that AI faced only a few years ago.