The most significant new disruptor of photography is now charging straight at us. It is text-to-image artificial intelligence: T2I AI.
Why this, why now?
It’s a subject that’s ballooning as we stand here. I expect to be talking a lot about it over the next several months. The reaction – especially amongst photographers – very much reminds me of the reactions when digital photography first appeared.
Dismissal, scorn, followed by horror, dismay and fear.
Just for this first mention on this Substack, I want to keep it practical and tell you about my explorations in the swirling waters of T2I AI. I’ve enjoyed every minute of it. Oh, the delight of opening one surprise image after another! It was more creative fun than I’ve had in years, and it kept me occupied and out of everyone’s way for months.
My journey
I used AI-generated images to start little visual journeys. I would prompt the T2I with some ideas, then choose one of the images it generated to explore the hints and ideas it held. Sometimes I generated dozens of images before deciding one was interesting enough to work with. Sometimes I was stunned by the first image offered.
The key was in learning how to communicate with a machine that I knew nothing about. Fortunately, I received sponsorship from OpenAI’s DALL-E that gave me the space to practise the new art of prompt engineering.
Prompt engineering
T2I AI (text-to-image artificial intelligence) systems such as DALL-E, Midjourney and Motionleap are essentially computer applications sitting at the end of an internet connection, waiting for instructions – the input of text prompts – to get to work.
When they receive a text prompt, e.g. 'a red rose floating on the surface of a lake with mountains in the background', they dig into their database to start constructing an image. It's important to understand that this process does not work by finding images in the database to match your prompt. In fact, there are no images in the database as such; there are generalised 'notions' of things. If the AI had to construct images by searching and combining existing pictures, it would take a great deal of computing. Instead, results usually arrive in a few seconds.
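To make that ‘text in, image out’ exchange concrete, here is a minimal Python sketch, assuming OpenAI’s client library. The exact function names and response fields vary between SDK versions, and the API key is obviously a placeholder – treat this as illustration, not a recipe.

```python
# Minimal sketch: send a text prompt to an image-generation service.
# Exact call names depend on the SDK version you have installed.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder - use your own key

prompt = "a red rose floating on the surface of a lake with mountains in the background"

# Ask the service for one generated image; the response contains a URL
# to the finished picture rather than the picture itself.
response = openai.Image.create(prompt=prompt, n=1, size="1024x1024")
print(response["data"][0]["url"])
```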
Imagine I ask you to draw a red rose. If you first had to find a suitable picture to copy, that could take you quite some time. Instead, as soon as you hear the words 'a red rose', some general notions of a red rose come instantly to mind. You can immediately start drawing. As you do so, you will be critical of what you're doing. It may not look quite right, so you might scrub out some parts and make corrections to bring the picture closer to your idea of a red rose.
All the time you’re comparing what you draw with the picture in your mind that was prompted by the words ‘red rose’. That is very much how T2I works.
There are two parts (the techies call them 'neural networks') to the AI: the generator, i.e. the bit that builds the image, and the discriminator, the bit that judges how well the generated image matches your prompt. So the generator offers various examples of 'rose', building on its 'knowledge' of roses, having been trained on vast numbers of low-resolution images of roses to 'learn' their general characteristics (the ones that also separate them from other flowers like dandelions, lilies or orchids). The discriminator then compares what the generator delivers with the prompt, asking for new options if it is not 'satisfied'; otherwise it accepts the image and sends it to you, the prompter.
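Purely as an illustration of that generator-versus-discriminator push and pull (not a description of how DALL-E is actually built), here is a toy PyTorch sketch that plays the two roles against each other on random stand-in data.

```python
# Toy sketch of the generator / discriminator idea described above.
# It trains on random "images" (flat vectors) purely to show the two roles;
# real text-to-image systems are vastly larger and more sophisticated.
import torch
import torch.nn as nn

latent_dim, image_dim = 16, 64

# Generator: turns random noise into a candidate "image".
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, image_dim))

# Discriminator: scores how plausible a candidate looks (0..1).
discriminator = nn.Sequential(nn.Linear(image_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

real_images = torch.randn(32, image_dim)  # stand-in for a batch of training images

for step in range(100):
    # 1. Discriminator learns to tell real examples from generated ones.
    fake_images = generator(torch.randn(32, latent_dim)).detach()
    d_loss = loss_fn(discriminator(real_images), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake_images), torch.zeros(32, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2. Generator learns to produce examples the discriminator accepts.
    fake_images = generator(torch.randn(32, latent_dim))
    g_loss = loss_fn(discriminator(fake_images), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```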
How they learn
I’ll write more on this another time, but it’s worth spending a moment to understand the broad lines of machine learning – not least because I already see a great deal of misunderstanding spreading. Machine learning is modelled on the way we humans learn.
Let’s say you’re learning bird-watching. You don’t get to see the birds up close and in detail, so you have to learn what different species look like from a distance – a white patch on the wing, the shape of the bill. You learn their shape in flight. You learn their shape and colour when perched on a branch. When you come across a bird in the wild, you check it against your bird ID book, or you ask an expert. At first you can’t tell different but similar species apart, but after training you can be confident about your ID even if you only get a half-second’s glimpse against the light. You can distinguish a sparrow from a thrush from the slimmest of cues.
Machine learning is fairly similar. These systems are given low-resolution images (because even now enormous amounts of computing power are needed) and the system has to tell, say, sparrow pictures from thrush pictures. The training images are made more and more blurred or low in resolution until no-one could tell the difference between a sparrow and a thrush. One such system is called ‘Contrastive Language-Image Pre-training’ – CLIP – which, by now, is fairly self-explanatory.
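CLIP itself has been released openly, and you can try the ‘which caption best matches this image?’ test directly. A minimal sketch using the Hugging Face transformers library; the model name is the public CLIP checkpoint, and the image filename is just a placeholder.

```python
# Ask CLIP which of two captions best matches a photo.
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("mystery_bird.jpg")  # placeholder filename
captions = ["a photo of a sparrow", "a photo of a thrush"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = CLIP judges that caption a better match for the image.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```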
Now, what makes AI smart is that modern systems 'know' how things should look even if you've not been specific. For example, a rose floating on water will be reflected in it: the image will show a reflection even if you don't specifically ask for one. In fact, if you do ask for a reflection, those words push the rest of the prompt further back, and later parts may be given lower prominence or even ignored. This can cause problems: the reflection could have been taken for granted, while the mountains in the background – the part that matters to you – are treated as an optional extra. So a prompt asking for reflections and mountains may deliver the reflection but no mountains. There's a whole lot more to writing prompts, and different systems respond differently, e.g. to spelling mistakes. Thus arises the fancy new skill of 'prompt engineering'.
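One practical way to feel this ordering effect is simply to run the same ingredients in different orders and compare what comes back. A hypothetical little experiment, using the same client as in the earlier sketch (again, adapt to whichever system and SDK version you use):

```python
# Hypothetical experiment: same prompt ingredients, different orderings,
# to see which parts the system honours and which it drops.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

variants = [
    "a red rose floating on a lake, mountains in the background, with reflections",
    "a red rose floating on a lake, with reflections, mountains in the background",
    "mountains in the background, a red rose floating on a lake",
]

for i, prompt in enumerate(variants):
    response = openai.Image.create(prompt=prompt, n=1, size="512x512")
    print(i, prompt, "->", response["data"][0]["url"])
```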
We're just at the beginning of learning what these systems can and cannot do, and they are developing all the time. Below you can see three attempts at 'a red rose floating on the surface of a rippling lake with fujiyama in the background': two from DeepAI and one from DALL-E. None is necessarily better than the others; you could say they differ ‘stylistically’. Further, it’s very unlikely they’re based on any actual images of a rose floating on a lake.

[Three generated images of the prompt above: two from DeepAI, one from DALL-E]
Indeed, trickles of different styles and genres of text-to-image work are already emerging. Artists are finding that one particular AI will consistently produce the kinds of image they like, and they stick with it. For example, I prefer the art-like rendering of DALL-E to the more ‘real’ textures of Midjourney.
It’s all in the education
As you might expect, much depends on the specifics of a given AI’s training. I found, for instance, that DALL-E 'knows' the difference between the painting styles of Sonia Delaunay and Robert Delaunay (wife-and-husband abstract painters). But to my disappointment, Vieira da Silva hasn’t been on its syllabus. On the other hand, it has a very good grasp of the painting style of Hilma af Klint, though mainly her ‘spiritualist’ paintings. DALL-E generates convincing Max Ernst and Hundertwasser (I nearly said it ‘loves’ those painters). Results with Modigliani and Chagall are variable, though. Yet some systems can't even distinguish between digital art and cartoon. It takes a while to learn each machine’s strengths and weaknesses.
You’re welcome to pour cold water or boiling water (your choice) on the whole exercise. Like any tool, it can be used with honesty and integrity. Or it can be used for bad purposes. That choice is ours.
Me? Cheerio until the next newsletter. I’m going off to play with this new tool.
May the Light be with you!
Tom


This is adapted from text that appeared originally (and still appears) on my website.
For examples of NFTs based on text-to-image generation, please check out my work on Clubrare. All my work started from T2I generation. I have selected and manipulated 49 images to compile my first NFT book, ‘49 Syllables on Life’, available (soon) from Published NFT.