In the space of only a few months, a spectre has come to dominate debates in photography. Is it a new camera? Is it about fantastic lenses? Is it about composition? No; it’s about artificial intelligence.
Specifically, the talk is about the threat to photography from text-to-image generation (TTIG) by artificial intelligence, often referred to as ‘AI photography’. Sloppy use of this term, as we’ll see, is half the reason for the headless running around.
As of early 2023, the photography world finds itself at that chaotic stage in which fear and loathing face off against excitement and curiosity. At the same time, the gap between those who know what is going on and those who can’t conceive how it all works grows steadily wider. And in that chasm lies plenty of room for hype, fear and misunderstanding.
It’s all in the spectrum
Twenty-five years ago, digital technology appeared on the photographic horizon like an armed horde about to sweep through our peaceful valley. I heard exclamations like ‘It’s the end of photography!’ ‘Our livelihoods are under threat!’ ‘It’ll never be as good!’ ‘It’s not art!’
We hear much the same wailing today: same words, same meaning.
But I’m not going to say it’s just a repeat. It isn’t. Text-to-image generation is certainly disruptive and a big, mess-making new kid on the block. But compared with digital photography, which had been held back by computing power, TTIG enters a world now thoroughly digitised, with 25 years’ experience of internet and digital issues. Above all, it’s reaching a public that is digital to, well yes, its fingertips.
Allow me to give you some historical context. This will show that we’re working our way along a spectrum of interactions. We know quite a lot about one end; but what’s unclear is what awaits us at the other.
1983
Believe it or not, 1983 saw the first glimmerings of artificial intelligence in photography, but its roots go back further, to 1977, when the ‘Matrix Metering’ project team was set up within Nippon Kogaku. Its objective was ‘making the exposure compensation or AE lock operation unnecessary’. Its work was first implemented in the Nikon FA. By any standards a beautiful picture-making machine, its outstanding feature was its control of automatic exposure.
For technology that’s 40 years old, the FA’s metering was impressively sophisticated, and it illustrates the main features of task-specific or limited artificial intelligence:
It does one job only (evaluates a scene to determine the camera’s exposure value setting);
It uses information gathered in preparation for its job (it compares the exposure values at various places on the metering sensor with a map of exposure values stored in its memory to find a best match; see the sketch after this list);
It doesn’t gather new information while working (the database memory is fixed, so there is no means to learn to improve performance);
It uses the information to do the job in a way that direct or mechanical responses cannot achieve (the information gathered by the metering sensor is many orders of magnitude more complex than any mechanical process can handle).
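To make that best-match idea concrete, here is a minimal, purely illustrative sketch in Python. The five-segment readings, the stored scene patterns and the compensation values are all invented for illustration; the FA’s actual firmware is not public, so this only mirrors the general principle of matching a sensor reading against a fixed library.

```python
# A toy sketch in the spirit of matrix metering: match the current sensor
# reading against a small library of stored scene patterns and apply that
# pattern's exposure compensation. All numbers here are invented.

SCENE_LIBRARY = {
    "evenly lit":                  ([12.0, 12.0, 12.0, 12.0, 12.0], 0.0),
    "bright sky, dark foreground": ([15.0, 15.0, 12.0, 9.0, 9.0], +1.0),
    "backlit subject":             ([14.0, 10.0, 8.0, 10.0, 14.0], +1.5),
    "spotlit subject":             ([7.0, 7.0, 13.0, 7.0, 7.0], -1.0),
}

def best_match_compensation(segment_evs):
    """Return the scene name and compensation of the closest stored pattern."""
    def distance(pattern):
        return sum((a - b) ** 2 for a, b in zip(segment_evs, pattern))
    name, (pattern, compensation) = min(
        SCENE_LIBRARY.items(), key=lambda item: distance(item[1][0]))
    return name, compensation

reading = [14.5, 14.0, 11.5, 9.5, 9.0]   # EVs from five metering segments
scene, comp = best_match_compensation(reading)
print(f"Closest scene: {scene}, apply {comp:+.1f} EV")
```

The real metering system was of course far more elaborate, but every item on the list above is visible even in this cartoon: one job, a fixed library, and no learning while it works.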
There, we’ve just covered one definition of the first level of artificial intelligence. How many instances of task-specific AI in photography can you think of? You may be surprised how long the list is; in fact, I’m going to have to group them.
In-camera AI: exposure metering, white balance, Smile Shutter, Eye Focus, 3D autofocus, highlight control, auto High Dynamic Range, composition …
In management software: face recognition, eye/face focus detection, movement-blur detection, optimal exposure detection, white balance detection …
In image manipulation software: content-aware repair, panorama stitching, auto-anything (levels, white balance, noise correction, etc.), AI sharpening, AI enlargement, lens correction, projection correction, sky replacement …
Limited memory
The next level of AI also works from a store of ‘learned’ information to perform certain tasks. Note that the information is learned: it’s not just a big lake of data, but a lot of data processed and ordered.
Think about a library of books that are shelved neatly, sorted by subject matter and alphabetically by author. To access its information, you walk up and down the dimly lit aisles checking off Dewey classification numbers, then looking for individual titles. (Ah, for those days when we had the leisure to spend an afternoon in the library researching just one paragraph of an essay! Anyway, back to earth …)
Now suppose the library consists of books that summarise other books, and books that condense the summaries. It’s much easier to find information in a single summary of 50 books than by searching through all 50. That was the strategy Encyclopedia Britannica used for some years. Of course, the summary will not be fully detailed, but it will point you to the more detailed sources if you need them. AI uses data structured in somewhat the same way.
Another way to look at it: the training process for TTIG is analogous to the way the littlies learn. Your kid points to a dog and declares ‘cat’, but you correct her and say ‘dog’. The next animal she points to is a cat, but she says ‘dog’ and you correct her again. With more experience (deep learning) she gets ‘cat’ right 100% of the time. She’s not copying or holding images of any particular cat in her head, but she is storing the notion of ‘cat-ness’, having built a concept comprising the qualities that make a thing a cat. After a while, with continuing refinement of her concept, she can recognise that a lion is also a cat.
So when you ask for a cat in text-to-image generation, the AI looks up all the cat-like qualities it has learned: the shape of ears and eyes, the furriness, the overall shape of the body. When you ask for a black cat sleeping, the computer ignores all the cat qualities (white, ginger, standing, running, and so on) that don’t correspond to your prompt. But if you ask for a Devon Rex sleeping, it looks up the Devon Rex breed and ignores the Birmans and tabbies in its model.
If you ask for a watercolour style, it ignores all the photo-real elements and keeps the blurred, low-tonality data. And so on: it then builds an image that conforms to those filters, keeping shape and proportion and limiting tonality, while omitting the background as a watercolour painting usually does.
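For anyone who wants to see this prompt-driven selection in action, here is a minimal sketch using the open-source diffusers library to run Stable Diffusion locally. It assumes you have diffusers and torch installed and a GPU available; the checkpoint name is just one commonly used example and may need swapping for whatever model you have access to.

```python
# A minimal sketch of prompt-driven generation with Stable Diffusion via the
# diffusers library. The checkpoint id is illustrative, not a recommendation.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt tells the model which learned qualities to keep
# (black, sleeping, watercolour) and, implicitly, which to ignore.
image = pipe(
    "a black cat sleeping, watercolour style, soft washes, plain background",
    negative_prompt="photograph, photo-realistic",
    guidance_scale=7.5,
).images[0]

image.save("black_cat_watercolour.png")
```

Change a word or two in the prompt (Devon Rex instead of black, oil painting instead of watercolour) and the same model draws on a different slice of what it has learned.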
High-power
So the system uses its stored model to inform its decisions, rather than merely following pre-programmed rules. Whether it’s to spot someone loitering with intent in a town square, diagnose a clinical condition, or produce a digital art rendering of a pink rabbit dancing in Times Square, these AI systems are all task-specific. They’re trained to do one job, and just that one job. A chess-playing system isn’t expected to predict air traffic flows – even if many of the computations are similar.
But there are two big differences.
Compared with the 30-odd thousand photos analysed (by hand!) for the Nikon FA, the datasets of today’s AI comprise several billion items. In some ways, that merely reflects vastly improved computing power and increased machine-learning skills.
(There is the issue of whether it’s right for billions of images to be used freely for the training of the AI. That’s a topic for another time.)
The other difference is conceptually far more important, as it’s a step up in paradigm. Today’s AI systems can learn as they work. They do so in just the way we do: if we get something right, we get a good mark and a pat on the back. For AI, a good mark is more practical: you download the results, you give them a ‘thumbs up’, or you send a report that the response didn’t match your request. All that feedback from us is given to the systems to refine their responses, i.e. to improve the chances that we will accept the response as matching our prompt.
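To make that feedback loop tangible, here is a toy sketch. No vendor publishes its actual pipeline, so this is only a cartoon of the idea: tally thumbs-up and thumbs-down per prompt tag, then use the tallies to rank which candidate images to show first. The tags, file names and scoring rule are all invented for illustration.

```python
# A toy sketch (not any vendor's actual pipeline) of folding user feedback
# back into the system: approval counts per prompt tag re-rank candidates.
from collections import defaultdict

approval = defaultdict(lambda: {"up": 0, "down": 0})

def record_feedback(tags, liked):
    """Record a thumbs-up or thumbs-down against each tag of a prompt."""
    for tag in tags:
        approval[tag]["up" if liked else "down"] += 1

def score(tags):
    """Higher when past users approved images carrying these tags."""
    total_score = 0.0
    for tag in tags:
        counts = approval[tag]
        total = counts["up"] + counts["down"]
        if total:
            total_score += counts["up"] / total
    return total_score

# Simulated feedback from earlier users
record_feedback(["cat", "watercolour"], liked=True)
record_feedback(["cat", "photoreal"], liked=False)

# Rank two candidate renderings of the same prompt
candidates = [
    ("candidate_a.png", ["cat", "watercolour"]),
    ("candidate_b.png", ["cat", "photoreal"]),
]
candidates.sort(key=lambda c: score(c[1]), reverse=True)
print([name for name, _ in candidates])
```

Real services fold this kind of signal into further model training rather than a simple tally, but the principle is the same: our reactions become part of the data.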
What’s scary
There’s much fearful talk of AI becoming conscious and doing frightful things (well, it’ll only do what it’s trained to do. Anyway …). It’s not impossible, and it may happen sooner rather than later, but at present it looks like only a theoretical possibility.
What is stunning the art and photography world is how rapid the advance to images that easily deceive has been. It’s analogous to the disdain for digital photography, when early adopters were told it would never resolve as much as film could. In the case of text-to-image generation, however, there’s been hardly any time between scoffing at the first clumsy generations, with their squiffy faces and weird hands, and seeing images that look convincingly real even under close examination.
In fact, that’s a bit of a chimera. Thanks to the levelling effect of virtualisation, which removes fine differences between images, artificially generated images easily look like real photographs. Most people consume images on feeds like Instagram or Facebook, so at least 90% of the images they look at are very low resolution: typically less than 1,000 pixels wide.
Although it’s surprisingly tricky to make images look like real photographs by hand, the large image models such as DALL-E, Midjourney, Stable Diffusion and Adobe Firefly can all do a very good job. The trick is to combine a smoothness of tonality with a mix of sharp and blurred details within a limited colour palette. But close, informed scrutiny of the images quickly causes them to fall apart. In fact, those experienced with text-to-image models can often spot the ‘look’ of a generated image, or what I call a synthogram.
Scrambling for new vocabulary
Words and their meanings are always scrambling to keep up with changes in society and in technology. We can’t expect meanings to be universally shared, but we can’t operate unless there’s majority agreement. In the case of AI, the field is so new to the public that there’s great potential for confusion.
So it doesn’t help to use ‘AI photography’ to refer to text-to-image generation, which doesn’t actually use a camera or light and thereby fails entirely to meet any basic definition of photography.
Of course, text-to-image models are trained on images. But to say that using cameras to make the training set renders text-to-image a type of photography is like saying hamburgers are made of grass. Sure, without grass feeding the cattle that are then slaughtered and minced, eventually to be fried and flipped, we’d have no hamburgers. Likewise, the large image models used for text-to-image generation are trained on billions of 224 x 224 pixel images paired with keywords, all mashed up together and sitting in a supercomputer waiting to be ground out when we send in a text prompt.
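As a rough illustration of what a single item in such a training set might look like, here is a small sketch: an image scaled down to 224 x 224 pixels and stored alongside its caption text. The file name and caption are invented, and real pipelines do far more preprocessing; this only shows the image-plus-keywords pairing described above.

```python
# A minimal sketch of one image-text training pair: a photo resized to
# 224 x 224 pixels stored with its caption. File name and caption are invented.
from PIL import Image

def make_training_pair(path, caption):
    img = Image.open(path).convert("RGB").resize((224, 224))
    return {"pixels": img, "caption": caption}

pair = make_training_pair("tabby_on_sofa.jpg", "a tabby cat asleep on a red sofa")
print(pair["caption"], pair["pixels"].size)   # -> ... (224, 224)
```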
Synthogram
I propose the term ‘synthogram’ to refer to creations generated by text-to-image models such as DALL-E, Midjourney, Firefly, etc.
What text-to-image AI generates is made without a camera at the production stage, so it’s more of a ‘-gram’ (the difference between a photograph and a photogram is whether a camera is used). But as text-to-image generation doesn’t involve light either, it’s obviously best to drop the prefix ‘photo-’ altogether.
Some writers have suggested 'synthograph' but I think that leaves too much room for confusion as it glances back to photography. I prefer 'synthogram'.
Further, the way a text-to-image image is generated is closer to writing – building up more and more detail – than to a camera’s capture: it’s more ‘-gram’ than ‘-graph’. Here’s another example: a hologram is a 3-D image constructed without a camera whereas a holograph is a handwritten document.
So we can ask: Is a synthogram a photograph? Answer: No; it's not. Does it look like a photograph? Some synthograms do; yes. Some don’t.
A photograph can be captured to look like a watercolour, but have we captured a watercolour? No; we haven't. A fortiori if we turn a photograph into a charcoal drawing-like image, it's not a charcoal drawing.
So, a synthogram may be built up with the help of photos. But a billion photo-text data-pairs do not a photo make.
Incidentally, text-to-image models also train on paintings, illustrations, sculpture, architecture and the like: synthograms can be built from any of those elements. Indeed, synthograms can be built from combinations of elements.
Next instalment
I’ll be making some first steps towards tackling the intellectual property issues around machine learning for text-to-image models in a later posting. I’m confident I won’t get everything right, if only because there are high-profile cases awaiting trial. And the whole merry mess is complicated by the different statutory regimes operating in the US, the UK and the European Union.
A Substack reader asked where to draw the line between what is and isn’t acceptable when it comes to using AI in photography. That’s worth a whole other posting (if not a book). So that’s on the roadmap too.
Any other ideas to keep me out of mischief? Let me know in the comments or via the Chat.
Looking for more photography reading? Check out the e-books on my website.
If you’re a beginner, the Travel Photographer’s Handbook may be best.
To help learn about judging pictures, Picture Editing is for the general reader, while Photography Judging is great for the salon photographer or judge.
The Photo Insights book is great dipping fun for all photographers.
All books are also available in hardcopy from Barnes & Noble, Amazon, and other online bookstores.