Edit: didn't finish typing before the window expired:
Your description of it isn't wrong, but it's very simplified, and it's a mistake to then assume real life is as simple as the description. There are layers and layers of further optimisation and tweaks applied to both the input text and the output image.
One example is the nice --tile function in Midjourney v3, which effectively wraps the image around on itself during generation so that opposite edges become neighbours. Since each pixel's probability distribution is affected by the colour of its neighbours, the result is an image that tiles coherently.
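A minimal sketch of that wrap-around trick (my assumption about the mechanism, not Midjourney's actual code) is circular padding, so each pixel's neighbourhood wraps across the image edges:
[code]
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64)             # stand-in for an RGB image/latent being denoised
kernel = torch.ones(3, 1, 3, 3) / 9.0    # simple 3x3 averaging kernel, one per channel

# mode="circular" wraps the padding around, so the left edge "neighbours" the right edge
x_wrapped = F.pad(x, (1, 1, 1, 1), mode="circular")
blurred = F.conv2d(x_wrapped, kernel, groups=3)

print(blurred.shape)   # torch.Size([1, 3, 64, 64]) - edges now blend with the opposite side
[/code]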
Similar tricks are used for inpainting and outpainting. When I was having trouble generating a 'black punk rocker with dandelion hair' on an early version of Midjourney, it was because the dataset reflected the overwhelming number of photos of white American 2000s-era pop punk online. I took the image, masked out the white face and told it to inpaint a black face, which worked well enough. OpenAI's current dataset is built on a larger number of images with, I think, some adversarial training to try to reduce the biases of our society. I suspect inpainting and outpainting are being used under the hood by MJ to carve up images following traditional composition rules, to help produce pleasing public images. The latest version definitely has specialist sub-trained models for faces and hands, which humans are hardwired to notice mistakes in.
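The same masked-inpainting workflow exists in the open-source diffusers library, which is a reasonable sketch of the idea (not what Midjourney or OpenAI actually run; the image file names are placeholders):
[code]
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("punk_rocker.png").convert("RGB")   # placeholder file names
mask_image = Image.open("face_mask.png").convert("RGB")     # white = area to repaint

# only the masked region is regenerated; the rest of the image is kept
result = pipe(
    prompt="portrait of a Black punk rocker with dandelion hair",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
[/code]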
I've not even touched on the complexity of the natural language prompt processing, but I'll note there is an interaction there with your suggestion that "it can't draw what's not in the dataset". Given how huge the tagged image datasets are now, if you can describe it in English, chances are it exists in the dataset and has enough related concepts to pinpoint it in latent space. I suppose we should define that term.
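On the prompt side, open models like Stable Diffusion embed the text with something like CLIP, so a prompt lands near related image concepts in a shared embedding space (a sketch; I'm assuming the closed models do something broadly similar):
[code]
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["black punk rocker with dandelion hair",
           "white pop punk singer from 2000",
           "dandelion seed head macro photo"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)

emb = emb / emb.norm(dim=-1, keepdim=True)   # unit vectors
print(emb @ emb.T)                           # cosine similarities between the prompts
[/code]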
Latent space doesn't exist if you only train on a single picture, but if you train on two, then the latent space is all the possible images between the two, including ones where pixels are extrapolated from their neighbours. It's a huge, high-dimensional space. If we assume a 20x20 pixel image with only RGB channels, each pixel has an xy coordinate in the picture (2 dimensions) and an RGB coordinate in colour space (another three dimensions). For two images, the pixel at the same xy coordinate could be described as a vector from the first image's RGB point to the second's. The path you take through colour space to get from one to the other need not be a smooth straight line (and shouldn't be for RGB, as its coordinate space doesn't map to human colour-difference perception; the corrected CIELAB space is better for that). For a pixel's colour, you could talk about a simple probability density that is 50% one end image and 50% the other, or a more nuanced density that includes multiple places along that vector path (say, green between yellow and blue). But pixels are also diffused, so the space you're computing in includes all the other vectors of nearby pixels, and the intermediate iterations. So there are lots of pixel patterns it can reasonably draw that are not included in the original dataset (such as dandelion-leaf punk hair).
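A toy illustration of that path-through-colour-space point, interpolating two hypothetical 20x20 images along a straight line in RGB versus in CIELAB (this just shows the geometry, not how a diffusion model actually samples):
[code]
import numpy as np
from skimage import color

img_a = np.random.rand(20, 20, 3)   # two made-up 20x20 RGB images
img_b = np.random.rand(20, 20, 3)

def interpolate_rgb(a, b, t):
    # straight line in RGB coordinates
    return (1 - t) * a + t * b

def interpolate_lab(a, b, t):
    # straight line in CIELAB, then back to RGB - a different path through colour space
    lab = (1 - t) * color.rgb2lab(a) + t * color.rgb2lab(b)
    return np.clip(color.lab2rgb(lab), 0, 1)

halfway_rgb = interpolate_rgb(img_a, img_b, 0.5)
halfway_lab = interpolate_lab(img_a, img_b, 0.5)
print(np.abs(halfway_rgb - halfway_lab).max())  # the two paths give different midpoints
[/code]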
In terms of practical use, I know a lot of architects who are using it for rapid sketching and moodboard research. After all, the resulting image from a text prompt tells you a lot about society's associations with that prompt (as filtered by the dataset). I asked for 'half baroque church, half tank' once, and in the range of images generated I also got tank tops, oil tankers and industrial process tanks. It is a good way to inspire the creative process.
I know a few architects who went further and trained a local copy on images of brutalist buildings only, in order to explore the latent space between the images and see whether the human architect's pattern-spotting found new patterns in how, for example, a set of windows shifted into a set of columns. The talk is on YouTube:
https://www.youtube.com/watch?v=ZMlt-ht ... gQ94AaABAg They actually built one of the results for a client.
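Exploring "the latent space between the images" can be sketched with Stable Diffusion's open VAE: encode two images, interpolate the latents, decode the midpoints (the brutalist file names are placeholders, and this isn't necessarily the exact method in the talk):
[code]
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
prep = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])

def encode(path):
    img = prep(load_image(path)).unsqueeze(0) * 2 - 1      # scale pixels to [-1, 1]
    with torch.no_grad():
        return vae.encode(img).latent_dist.mean

lat_a = encode("brutalist_windows.jpg")    # placeholder file names
lat_b = encode("brutalist_columns.jpg")

for t in (0.25, 0.5, 0.75):
    mixed = (1 - t) * lat_a + t * lat_b                    # straight line between the two latents
    with torch.no_grad():
        decoded = vae.decode(mixed).sample                 # back to pixel space, values in [-1, 1]
    print(t, decoded.shape)
[/code]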
Personally, for sketching I've found stylised images helpful, as I interpret them into useful concepts for my work:
https://bakefoldprint.wordpress.com/202 ... d-dall-e2/
Another architect I know has married Stable Diffusion to a second ML-trained program that turns flat 2D images into depth maps to get 3D effects:
https://www.youtube.com/watch?v=Q_3QVsETjLc&t=1s
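The depth-map step on its own can be reproduced with an off-the-shelf monocular depth model (a sketch using a public DPT model via transformers, not necessarily the exact program in the video):
[code]
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("render.png")          # placeholder: any flat 2D image, e.g. an SD output
result = depth_estimator(image)

# result["depth"] is a PIL image of per-pixel depth, usable as a displacement/depth map
result["depth"].save("render_depth.png")
[/code]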
A third architect has been using Grasshopper/Rhino to generate rough building shapes and key views, exporting those as images and using overpainting to create a detailed image. This is analogous to various animation techniques, and helps provide something consistent. I am interested in combining the two effects to be able to reimport the image and generate a more detailed 3D mesh.
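That rough-geometry-in, detailed-image-out step maps fairly directly onto img2img in open tooling; a sketch with diffusers (the model id, exported view file name and strength value are just example assumptions):
[code]
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

rough = load_image("rhino_massing_view.png")   # placeholder: exported key view of the massing model

detailed = pipe(
    prompt="weathered brick facade, overcast light, detailed architectural photograph",
    image=rough,
    strength=0.6,          # how far the model may drift from the rough geometry
    guidance_scale=7.5,
).images[0]
detailed.save("detailed_view.png")
[/code]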
Beyond that, and more into the technical weeds, the latent space representation of an image has been found to be an incredibly effective lossy image compression approach -
https://towardsai.net/p/l/stable-diffus ... mpresssion
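The size arithmetic is easy to sanity-check: Stable Diffusion's VAE turns a 512x512 RGB image into a 4x64x64 latent, so storing the latent (plus the shared decoder) acts as a lossy codec (rough numbers only; actual quality depends on the image):
[code]
# Rough size arithmetic for the "latents as lossy compression" idea (Stable Diffusion v1 VAE)
pixels = 512 * 512 * 3          # RGB values in the original image
latent = 4 * 64 * 64            # values in the latent (4 channels, 8x downsampled)

print(pixels, latent, pixels / latent)                       # 786432 16384 48.0 -> ~48x fewer values
# Compare bytes: uint8 pixels vs float16 latents
print(pixels * 1, latent * 2, (pixels * 1) / (latent * 2))   # ~24x smaller, before any entropy coding
[/code]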
Likewise, what I've learned about embedded vectors and the cosine similarity between them in latent space applies to all sorts of ML techniques. I'm currently classifying historic bricks based on a few dozen measures (dimensions), using the same techniques to measure the bricks' relative difference/distance from each other, and therefore grouping them by probable commonalities.
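The same distance machinery works on that kind of tabular data; a sketch of the approach (the measurements below are made-up illustrative numbers, not the real dataset):
[code]
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# Hypothetical brick measurements: length, width, height, mass (a few dozen columns in practice)
bricks = np.array([
    [228.0, 108.0, 68.0, 3.1],
    [230.0, 110.0, 70.0, 3.2],
    [215.0, 102.5, 65.0, 2.9],
    [250.0, 120.0, 75.0, 3.8],
])

# Standardise so no single dimension dominates, then compare directions with cosine similarity
X = StandardScaler().fit_transform(bricks)
sim = cosine_similarity(X)
print(np.round(sim, 2))

# Group bricks by relative distance (1 - similarity) into probable common types
labels = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average").fit_predict(X)
print(labels)
[/code]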