Looking at the AI chatbots (ChatGPT), it was very apparent that they were pretty good at ideation and could generate creative output if prompted in fine detail - for example, a reasonable 3,000-word short story could be generated from a mere 2,000 words of prompting.
Image generation is quite a different matter. First off, it is slow, really damn slow. Working with square 512px images, it can take 30 steps (or iterations) to create something that starts to look decent - with each step on my moderately powerful PC taking a second and a half. Add some upscaling and it's definitely 'have a cigarette or cup of coffee while it works' territory.
Secondly, you don't get to have a drawn-out conversation setting the full context, like you do with the generative text AIs. You get a couple of prompts (one positive, for what you want, and one negative, for what you don't want) of maybe 100 words, and you see what you get.
It's remarkable how, with such limited control, you can get the kinds of images you want. But it's definitely a collaborative relationship; there's a strong degree of 'you get what you're given'.
To be fair, there's a lot more to generative art than simply seeing what you get with a 100-word prompt (and this series of articles will work through it all) - but simple prompting is definitely the starting point.
And there are oodles of online services that will allow you to use generative image AIs. Some are all about the image (e.g. https://dezgo.com/), while others try to build a 'social community' around the act of image creation (e.g. https://creator.nightcafe.studio). Then there are attempts at creating 'image-centred solutions' - make a logo, create an Instagram post, etc. - which seem to me a little pointless and usually poorly implemented.
It's worth checking out sites like Dezgo (or Nightcafe) to get a feel for how generative image AIs behave, but they all have some common pitfalls:
They want to make money. They want you to subscribe or buy credits, with the true power of the AI hampered unless you do!
They don't want to get sued or imprisoned. The AIs are perfectly capable of sticking LeoDi's fizzog onto a picture of Fred West driving his ice-cream truck... so these services use other AIs to prevent that kind of thing* - which means you don't get full personal creative freedom.
They're limited in scope. They can only create things out of what they have been shown. As a visual artist, you will have all kinds of things you want to develop that these services just can't draw for you. I've spent days trying to create a human-ferret hybrid - but all the ferrets look like dogs!
[*Yes! Down with that kind of thing. But careful now, because these are the kinds of use cases that validate over-puritanical censorship]
Exactly what the AI is capable of drawing depends on the 'model' it is given. There are 'anime' models that have been trained on anime-style pictures and can only draw those kinds of images. Stable Diffusion v1.5 has had an enormous amount of training on images of all kinds, at a resolution of 512px. V2.1 has been trained at a higher resolution (768px), but has had a lot of NSFW-type content omitted from its training (go to Dezgo and ask it to use Stable Diffusion 2.1 to draw 'a big fat penis'). Other people have supplemented the training of these base models so they can draw ultra-realistic people (e.g. Realistic Vision), or to add some (better) NSFW content back in (e.g. YiffMix's furry-porn model, which you won't find on Dezgo or any other online offering). Some online services go so far as to censor the prompt itself - try asking Nightcafe to draw 'a big fat penis' and it just won't draw anything!
Not to say I think you should be drawing big fat penises - but if you want full creative control of your endeavours, if you want to draw something other than just exactly what everyone else can draw, then you will need to create your own AI at home. Which is what I'll look at in my next article.
It's interesting, though, that despite all the efforts to control what these AIs can do, they remain inherently sexualised - because they have been trained on what humans did before them! For my lead image for this article (Blue Ranger, Red Ranger) I didn't prompt for 'naked torso with rippling muscles and bulging crotch' or 'bikini-clad buxom lady' - but that's what it gave me!!! (I'm fine with it...)
Some Configuration Details
We'll see a lot more about how to configure a generative AI in my next article about making one at home - but here's a run-down on the main configuration you'll be faced with when using the online services. They're all pretty much of a muchness...
Prompt - I'll do a deep-dive on prompting in a later article, but for now the key points are:
Keep them as short as possible to avoid confusing the poor thing.
Compose the prompt as Subject + Scene + Style, e.g. Young lady, blue cape, blue boots. Outer space, star-field. 16mm lens, wide-angle view.
Remember that AIs have no understanding, so they are really bad at responding to prompts for 'large' or 'small', or for a given number of things*; unless the model happens to have been trained on the specific concept you are looking for.
When writing the prompt you are trying to tap into 'visual concepts it may have been taught', more than a 'florid description of the scene'**.
[*When asked for '3', Dezgo gave me 4 butterflies. '3 men' produced three and a half men at first, then two and a half on the second go... It did manage to draw 'two ladies, one cup']
[**It doesn't matter how many extra words you add to a prompt, if the AI has never met the specific 'visual concept' it will never manage to draw it.]
Negative prompt - list the things you don't want to see. Commonly this includes things like 'out of frame, extra limbs, blurry' - extend it to include any other atrocities the AI conjures up for you.
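To make the positive/negative split concrete, here's a minimal sketch of how those two boxes look if you drive the AI directly from Python with the open-source Hugging Face diffusers library - the model name is the widely shared Stable Diffusion v1.5 checkpoint and the prompts are just the examples from above, so treat it as an illustration rather than a recipe:

```python
# A minimal sketch, assuming the 'diffusers' library and the public
# Stable Diffusion v1.5 checkpoint; prompts are the examples from above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # a 512px-trained base model
    torch_dtype=torch.float16,
).to("cuda")                            # needs an NVIDIA GPU; "cpu" works, slowly

result = pipe(
    prompt="Young lady, blue cape, blue boots. Outer space, star-field. "
           "16mm lens, wide-angle view.",
    negative_prompt="out of frame, extra limbs, blurry",
)
result.images[0].save("young_lady_in_space.png")
```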
Resolution or rather, Aspect Ratio - typically with online services you get square 512px results, with an option to make the image in a (smaller) landscape or portrait format.
CFG or Guidance or Prompt Weight defines how much freedom the AI gets in deciding how to fill the canvas. It will pay attention to your prompt but may decide it needs to add a chair because the person it is drawing is in a sitting pose. With less freedom, it will mangle what it is drawing to fit in with the prompt rather than adjusting the image to make sense of the prompt... so the person it was drawing in a sitting position may end up in a weird squat. At high CFG the image contains pretty much what the prompt asked for, but may be mangled. At low CFG the image may contain all kinds of stuff that wasn't even mentioned.
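If you're driving the AI from code rather than a web form (continuing the diffusers sketch above), CFG is just one more number to pass in - diffusers calls it guidance_scale. The prompt and values here are arbitrary, purely to show the knob:

```python
# Continues from the earlier sketch (re-uses 'pipe'); prompt and CFG values
# are arbitrary examples.
for cfg in (3.0, 7.5, 15.0):            # low, middling and high guidance
    image = pipe(
        prompt="a man sitting in an armchair, reading a newspaper",
        negative_prompt="out of frame, extra limbs, blurry",
        guidance_scale=cfg,             # how strictly to follow the prompt
    ).images[0]
    image.save(f"cfg_{cfg}.png")
```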
Sampler - without worrying about what the sampler actually is (read up more on samplers if you care to), the key concerns are:
Quality: are the images it produces any good!?
Cost: in terms of how long it takes to create an image, and how much VRAM it uses up.
Convergence: How repeatable the results are. As an artist, you probably want highly repeatable results, so that you know you can introduce tweaks and retry the generation of the image without it suddenly becoming something very different.
The common default seems to be DPM++ 2M Karras (typically requiring 20-30 steps to produce a good image) as it offers a good balance of speed, stability (convergence) and quality. Personally, I prioritise quality (which might actually be nonsense in this context, but I do anyway) so I tend to stick with the simple 'Euler' sampler.
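When you run the AI yourself the sampler is just as easy to swap. Here's a sketch using diffusers (which calls samplers 'schedulers'), continuing from the earlier example; the prompt and step counts are only illustrative:

```python
# Continues from the earlier sketch (re-uses 'pipe').
from diffusers import DPMSolverMultistepScheduler, EulerDiscreteScheduler

# DPM++ 2M Karras: quick and convergent, usually fine at 20-30 steps
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
dpm_image = pipe("red barn in a snowy field", num_inference_steps=25).images[0]

# Plain Euler: the simple option I tend to stick with
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
euler_image = pipe("red barn in a snowy field", num_inference_steps=30).images[0]
```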
Steps or Runtime
The image is generated progressively (see How it works, below). The more the AI is allowed to work on it (defined by the number of steps it can go through, or else how long it gets to work on the image) the better the result - well, to a point, you can overcook images.
Limiting the number of steps (or charging for more steps) is one of the key ways in which the online services entice you to subscribe. The free tiers typically undercook images, often leaving them with ugly, deformed artifacts - although you can get lucky on occasion.
If you do have a choice, 30 steps should be workable as a rule of thumb.
Seed
The generative process begins with a pseudo-random number, called the seed. This can be used on further generations and if nothing else is changed, virtually the same image can be re-generated; especially if a convergent (or stable) sampler was chosen. You can then make small changes (to the prompt, or number of steps, for example) to cause the AI to refine its initial stab at the image.
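Here's a sketch of that workflow in code (again with diffusers, re-using the pipeline from above); the seed value and prompts are arbitrary:

```python
import torch

# First attempt: fix the seed so the result can be reproduced later
generator = torch.Generator("cuda").manual_seed(1234)
first = pipe(
    prompt="stone cottage by a lake, autumn, morning mist",
    num_inference_steps=30,
    generator=generator,
).images[0]

# Same seed, small tweaks: the composition should stay recognisably the same,
# especially with a convergent sampler
generator = torch.Generator("cuda").manual_seed(1234)
second = pipe(
    prompt="stone cottage by a lake, autumn, morning mist, golden light",
    num_inference_steps=40,
    generator=generator,
).images[0]
```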
Model
I mentioned models briefly (above) and I will look at them in more detail in a later article. They completely determine what the AI is capable of drawing, in terms of the 'visual concepts' it has learnt during its training. Some models are really good at drawing photorealistic faces, but struggle to create detailed landscapes. Your vision for a work will be the most important reason for choosing a given model - but models have other important characteristics that will need to be borne in mind:
Training Resolution, models will have been trained with 512, 768 or 1024-pixel images. If you ask a 512px-trained AI to draw a 1024px image there is a strong likelihood that it will repeat itself as it tries to fill the canvas.
Conceptual knowledge, the more the model knows about what you are trying to create, the easier life will be!
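As a rough sketch of how the model and its training resolution go hand in hand, here's how the v1.5 and v2.1 base models mentioned earlier would be loaded with diffusers - the repository names are the publicly available ones at the time of writing, so check what's current before copying:

```python
import torch
from diffusers import StableDiffusionPipeline

# Stable Diffusion v1.5: trained at 512px, so ask for ~512px images
sd15 = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
img_512 = sd15("lighthouse on a cliff at dusk", width=512, height=512).images[0]

# Stable Diffusion v2.1: trained at 768px - match that, or risk the AI
# repeating the subject as it tries to fill the larger canvas
sd21 = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
img_768 = sd21("lighthouse on a cliff at dusk", width=768, height=768).images[0]
```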
Style
Most online services let you request a pre-canned style, or you can use the name of a well-known artist and the AI will ape their style for you. This all feels like 'crossing a line' or even 'cheating' to me. With pre-canned styles, you're just going to churn out works that look like what everyone else is doing. Referencing art styles, especially of living artists, just feels cheap to me.
Upscaling
You can typically upscale results by 2, maybe 4, times. This makes generating the image much more costly (either in time or, with online services, in 'credits'). You can also use independent scaling methods such as Photoshop or Topaz Labs' Gigapixel AI; with illustration-type generations these can be very effective. I can fairly reliably end up with A4-sized images - nowhere near the resolution of a decent digital camera, but certainly good enough for typical online use.
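If you'd rather upscale within the AI tooling itself, here's a sketch using the public Stable Diffusion x4 upscaler via diffusers. It's very VRAM-hungry, so the input is deliberately shrunk first; the file names are just placeholders:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("young_lady_in_space.png").resize((256, 256))  # keep VRAM sane
big = upscaler(
    prompt="young lady, blue cape, star-field",  # a prompt guides the upscale
    image=low_res,
).images[0]
big.save("young_lady_in_space_x4.png")           # 1024px result
```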
Early Samples of Generated Images




How It Works
The AI doesn't just magic up images from its own understanding of the words you give it - far from it.
What it does do is draw approximations of things it has seen before that approximately relate to the things you are asking for.
So there are really two key questions: how does it know how to draw stuff, and how does it know what you want it to draw - given that it's an AI with zero understanding of anything?
The answer in both cases is because it's been told.
The AI needs to be taught both the visual concepts that appear in an image, and how to draw such concepts.
So every single image given to the AI to train it also has a list of the concepts that are in the image. Given enough images tagged 'sky' it can eventually spot the pattern that they all have a patch of blue at the top of the images. Ask it for a blue sky and it can draw one for you. Same with red skies. But if you ask it for a 'green sky', it (probably) isn't going to manage to do that - because nobody has ever bothered to show it a bunch of images all tagged 'green sky'.
Certainly the base Stable Diffusion v1.5 model will not draw me a green sky. It doesn't know what 'green' is, it doesn't even know what 'sky' is (it 'knows' nothing). All it has learnt is that images containing the visual concept 'sky' have a patch of blue at the top.
Sometimes it seems as though it understands more than it does, just because of the amount of images it has been trained with.
So by looking for commonalities between lots of images that share the same tag it can learn the visual meaning of that tag. But then, it needs to know how to draw something similar to all those skies it has been shown.
What it doesn't do, is simply store all those training images and then sample from them when called upon to draw a sky; it magics up a whole new sky for you. A sky that has never existed before.
It does this by learning how to draw each of the training images it has been given. But that's really difficult. So first of all, it learns to draw a rough approximation, a very rough approximation.
The AI starts with an image of pseudo-random noise (generated from the pseudo-random seed number) and compares this with a very noisy version of the original image. It then adjusts its image of random noise until it is 'close enough' to the provided noisy version of the original. It ends up with a bunch of calculations that can reliably transform the seed noise pattern into something close enough to the noisy version of the original image.
It is then given a slightly less noisy version of the original image, and in like manner learns how to adjust its own image to end up close enough to this new, less noisy, training image.
This whole process repeats, step after step, until the AI has a set of calculations that can derive something close to the original image from a pure noise starting point. There may be hundreds of thousands (or many more) steps involved in this training.
It doesn't know anything about what colours are, or what shapes need to be rendered. It only knows how to draw a pure mess of noise, and how to adjust that mess so that it matches progressively less messy versions of the training image.
“How do you make a statue of an elephant? Get the biggest granite block you can find and chip away everything that doesn’t look like an elephant.”
"Why are elephants?", The Plain Dealer (Cleveland, Ohio) 1963
Once it has those calculations it can use them, against any starting mess of noise, to draw something similar in the future.
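If it helps, here's a toy illustration of that idea - emphatically not the real diffusion maths, just the shape of it: start from seeded noise and, at each step, nudge the image a little towards whatever the 'denoiser' predicts (here the prediction is faked with a blue-patch-at-the-top 'sky'):

```python
# Toy illustration only - a stand-in for the real process.
import numpy as np

rng = np.random.default_rng(1234)            # the pseudo-random seed
target = np.zeros((64, 64, 3))               # what a trained model might 'predict'
target[:32, :, 2] = 1.0                      # a patch of blue at the top: 'sky'

image = rng.standard_normal((64, 64, 3))     # step 0: a pure mess of noise
steps = 30
for step in range(steps):
    predicted_clean = target                 # a real model predicts this from the noisy image
    # remove a little of the remaining mess at each step
    image = image + (predicted_clean - image) / (steps - step)

# 'image' has now been refined, step by step, into the predicted picture
```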
So, from tagging the AI knows a number of visual concepts. From pattern matching, it knows where those concepts may appear in images. And from noisy training it knows how to draw something similar to those concepts.
And that's all it knows. You can't tell it to draw "an elephant standing on its head", unless it has previously seen pictures with elephants doing head-stands that have been so tagged. You can ask it for "an elephant drinking whisky in a bar" since it knows about bars, whisky and elephants - but the elephant won't actually be drinking the whisky! They'll just all be there.
Hopefully, understanding how it all works should help in getting the results you want. Ultimately it seems quite limited, but you will find existing models that have more varied visual concepts baked in to get you closer to what you want. And of course, there are many more techniques that help complete the generative journey...