Tuesday, June 18, 2024

Stable Diffusion 3 and the possible why

I will begin by citing:

"5.3.1. DATA PREPROCESSING
Pre-Training Mitigations Training data significantly impacts a generative model’s abilities. Consequently, data filtering is effective at constraining undesirable capabilities (Nichol, 2022). Before training at sale, we filter our data for the following categories: (i) Sexual content: We use NSFW-detection models to filter for explicit content. (ii) Aesthetics: We remove images for which our rating systems predict a low score. (iii) Regurgitation: We use a cluster-based deduplication method to remove perceptual and semantic duplicates from the training data; see Ap pendix E.2."

Source: https://arxiv.org/pdf/2403.03206
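
To make the quoted mitigations a bit more concrete, such a filtering pass could be sketched roughly as below. This is purely my own illustration of the three steps; the detector, the aesthetic scorer, the embedding function and all thresholds are hypothetical placeholders and not what was actually used for SD3.

```python
# Hypothetical sketch of the three pre-training mitigations quoted above.
# nsfw_model, aesthetic_model and embed() are placeholder components,
# not the actual SD3 tooling; the thresholds are made up.
import numpy as np

def filter_dataset(images, nsfw_model, aesthetic_model, embed,
                   nsfw_threshold=0.5, aesthetic_threshold=5.0, dedup_threshold=0.95):
    kept, kept_embeddings = [], []
    for image in images:
        # (i) Sexual content: drop images the NSFW detector flags.
        if nsfw_model(image) > nsfw_threshold:
            continue
        # (ii) Aesthetics: drop images with a low predicted rating.
        if aesthetic_model(image) < aesthetic_threshold:
            continue
        # (iii) Regurgitation: drop near-duplicates by embedding similarity
        # (a crude stand-in for the cluster-based deduplication the paper mentions).
        emb = np.asarray(embed(image), dtype=float)
        emb = emb / np.linalg.norm(emb)
        if any(float(np.dot(emb, e)) > dedup_threshold for e in kept_embeddings):
            continue
        kept.append(image)
        kept_embeddings.append(emb)
    return kept
```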


While I do not personally agree or disagree on the matter, in science the most important thing is for research to be conducted free from any external pressure that may affect its outcome. Otherwise we may be crossing the border between science and other areas - whether that is a concern, a social matter or anything else that ends up holding science back.


My questions would be:

- What is the responsibility of the parties involved in pressuring researchers?

- Is there any form of cooperation between researchers and the pressuring entities?

- Do the pressuring parties bear any responsibility towards a research experiment that fails because of the pressure exerted (be it financial support, expertise or any other form of help that would guarantee the success of the given research)?


While the release of Stable Diffusion 3 couldn't entirely be called a "failed experiment" per se, the AI struggles to properly represent human subjects while having far fewer or no issues in other areas of image generation. Trying to make the AI safe most likely left it underfit on the proper representation of human subjects, which ultimately renders it unsafe for the viewer in certain circumstances. From accidental "nakedness" we shifted towards accidental, horrible mutations whenever a human subject is represented in anything other than a casual standing position. This could well be seen as a regression.


If research is conducted by a company that was put under any form of pressure, and that company is known to publish research papers, what protection does such a company have against any issues the pressuring party has thereby caused? And what is the liability of the pressuring entity if the pressure it exerted caused problems of any nature for the research company in question?


My own research in the area has shown me that an AI such as Stable Diffusion can perform zero-shot tasks in answering various questions. When the SD 1.5 model is asked what a hand should look like, it usually creates a hand containing 8 fingers. The result changes depending on how the question about the "hand" is phrased (a "single" hand can have 4 fingers), but it almost never renders a hand with 5 fingers. However, when asked specifically what a "hand's silhouette" should look like, the AI, locked to the same seed, responded by generating the most correct hand a human subject should have: a hand with five fingers and a proper structure, albeit a black, shadow-like one. Various questions about a hand that included the word "silhouette" kept confirming this outcome for me.
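
For context, this kind of seed-locked comparison can be sketched roughly as follows with the diffusers library. This is not my exact setup; the model identifier, seed and prompt wording are only illustrative placeholders.

```python
# Minimal sketch: ask SD 1.5 the same kind of question with a locked seed
# and only vary the wording around "hand" vs. "hand's silhouette".
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

questions = [
    "how should a hand look like?",
    "how should a hand's silhouette look like?",
]

for q in questions:
    # Re-create the generator each time so every question starts from the same seed.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(q, generator=generator, num_inference_steps=30).images[0]
    image.save(q.replace(" ", "_").replace("'", "").replace("?", "") + ".png")
```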

Furthermore, if one watches very closely how Stable Diffusion generates a hand, there is a clear sign of the AI confusing reflections and shadows with additional finger structures. Zooming in closely on every image sample will reveal that, even if the hand seemingly has 4 or 5 fingers, there may be opaque fingers still visible near it, or a whole hand structure generating itself underneath the initial one. My conclusion was that the AI struggles to differentiate a normal hand from:

- clasped hands,

- hand shadows (imagine a hand's finger shadows on a cloth), and,

- hand reflections (imagine a well polished desk that reflects parts of the hand back).


Using the same methodology I conducted a small experiment with SD 3.0. The initial question I asked, using the T5XXL text encoder, was "how a laying down on the floor subject should look like?" The AI responded by generating a dog subject, which may give some insight that the AI is overfit on animal subjects (specifically dogs) and underfit on human subjects in this area of its knowledge - a shift in the training set's data distribution most likely caused by the removal of data that would otherwise guarantee the success of the research (there might have been unexpected false positives in the detection as well, removing valid data that might have helped the AI better learn what it was meant to learn).


Furthermore, when asked the same question but specifying that the subject is a "dog" and a "person", the AI answered in the following way:

(samples generated in ComfyUI, using the recommended selection of nodes with the separate SD3 CLIP models and the FP16 version of T5XXL - seed locked)

While a dog subject is rendered without much trouble, it is very noticeable that the result on the right is hardly the body of a human. Interestingly, asking for "human" renders a much closer representation of the pose a human subject would supposedly take when lying down on the floor. Looking at the dimensions of the subject in the example on the right, we can reasonably assume that the AI tries to "fit" the dimensions of a human body into the dimensions of a typical dog lying down on the floor by whatever means necessary. If the body of a human has to be completely deformed, causing absolutely horrific results, SD3 will gladly do so - maybe in the name of being "safe", or maybe because of something else that went wrong.
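
For anyone who wants to repeat the comparison, a rough equivalent of that workflow with diffusers might look like the sketch below. It is only an approximation of the ComfyUI setup; the seed, step count and guidance value are placeholders, not the exact settings I used.

```python
# Approximate re-creation of the ComfyUI experiment with diffusers:
# the same question, a locked seed, and only the subject word changing.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

for subject in ["subject", "dog", "person", "human"]:
    prompt = f"how a laying down on the floor {subject} should look like?"
    # Placeholder seed; re-created per prompt so every variant starts identically.
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(
        prompt,
        generator=generator,
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save(f"laying_down_{subject}.png")
```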

Note that we didn't provide any very specific prompt here, just those two simple questions. While some people advise being more descriptive, if descriptive input were actually a must for SD3, the AI wouldn't render the dog properly either, or render a STOP sign properly when asked for just that. One could argue that a dog lying on the floor is very common, but that's not how one should look at a dataset. We are not putting in anything we can find and calling it a day; it has to be properly balanced and its reproduction confirmed. Any anomalies in verification samples need to be noted down and observed over the course of training the AI. If the situation doesn't improve or, in the worst case, degrades, the dataset requires urgent verification - it may well turn out that it got accidentally corrupted and it's not yet too late to fix it.
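
What I mean by observing verification samples could be sketched like this; generate_with_checkpoint, the prompt list and the seed are hypothetical placeholders standing in for whatever a given training setup actually exposes.

```python
# Hypothetical sketch: render a fixed set of prompts with a fixed seed at every
# checkpoint and keep the outputs side by side for review.
import os

VERIFICATION_PROMPTS = [
    "a dog laying down on the floor",
    "a person laying down on the floor",
    "a hand resting on a table",
    "a STOP sign",
]
FIXED_SEED = 1234

def render_verification_grid(checkpoint_path, step, generate_with_checkpoint,
                             out_dir="verification"):
    os.makedirs(out_dir, exist_ok=True)
    for i, prompt in enumerate(VERIFICATION_PROMPTS):
        # Same prompt and same seed at every checkpoint, so any drift or
        # degradation in these samples comes from training, not from randomness.
        image = generate_with_checkpoint(checkpoint_path, prompt, seed=FIXED_SEED)
        image.save(os.path.join(out_dir, f"step{step:07d}_prompt{i}.png"))
```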

The need to be descriptive in regard to human subjects means either that the AI cannot guide itself towards some correct generic/default result for humans, or that it is heavily confused/polluted/poisoned by other data, causing a serious bias towards rendering poses human bodies cannot acceptably find themselves in, to the point of generating extreme deformities. We could also theorize that there were no human subjects lying on anything left in the dataset after its pruning, so the AI has to guess what such a subject could look like - confident enough to try, and confident enough that human bodies can undergo a "very uncomfortable shapeshifting."

Obviously, picking the correct resolution, and in turn aspect ratio - as some suggest - does play a major role in SD, as observed with older models (e.g. SD 1.5). It's good that in most cases the AI doesn't feel confident enough to shrink or stretch bodies too much. However, it should be more confident about cropping the rendered image and showing the body only partially, rather than rendering overly unrealistic "morphed subject" situations that are hard to look at - something the AI (probably) sees as a form of creativity and does even when not explicitly asked for.

While the AI needs creativity to keep itself from being too repetitive and boring, it cannot cross the borders of sanity too often. Presenting a strange, "insane" example every now and then is not bad in itself - see it as a serious person occasionally telling a joke. However, once prompting is incapable of correcting such an outcome at all, or cannot do so without becoming too tedious, it may be a sign that something went horribly wrong in how the AI was trained and that its knowledge is lacking or incorrect.

We could still argue that the anomalies are caused by the inference code being in an immature state and still needing improvements. That said, I would be very surprised if such results were caused by a bug of sorts. Though I wouldn't rule out something like certain weights not being processed, or being processed incorrectly - possibly those responsible for certain details. That is only a theory though, as I have no proof that this is actually the case.

Finally, I myself was guessing that maybe the major problem is that the 2B model cannot handle things the same way the 8B model would, and possibly still requires some babysitting to overcome its limitations. It could theoretically have too few parameters to pay proper attention to rarer examples in the provided dataset, filling itself up with what it could and being incapable of changing its understanding later on to, for example, reconstruct proper human poses, the correct number of limbs, and so on. To me it's clear that a lower number of parameters may reduce certain details, but it's unclear what those details may actually be.

 

While I'm not claiming whether, or to what extent, the issues with SD3 Medium are related to the below, my overall personal opinion on general-purpose AIs - those meant to replicate results not tied to a specific task - is that they must always be trained on an unattended (but balanced or weighted) dataset. In the worst-case scenario it could be a subtly modified dataset that is proven not to have any significant impact on the final outcome. However, the first method should be the preferred one.

I believe artificial intelligence has proven to be very much comparable to any other form of intelligence - at least to the extent that we're trying to replicate human-like intelligence by artificial means. Human intelligence is based on associating one piece of information with another to produce a desired thought, which can then be passed on and/or kept to oneself. If such intelligence lacks some piece of critical information, the produced thought may therefore seem incoherent to the recipients of that thought. By recipient we do not necessarily mean another human, but anything that receives such a thought (a person, an animal, a situation, an inanimate object, etc. - a cause-and-effect action in general).

Humans are further taught to behave in a particular way while still retaining their knowledge. Similar methods should be utilized in the field of AI. This could be post-filtering, fine-tuning or any approach that doesn't change the AI's initial understanding of the matter but still ensures that the AI "refrains" from passing on its "artificial thought" in concerning ways.
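
As a rough illustration of what output-side post-filtering could mean in practice, the sketch below keeps the generative model untouched and only screens what it produces. The generate and safety_classifier callables are hypothetical placeholders, not any particular library's API.

```python
# Hypothetical sketch of output-side post-filtering: the generative model keeps
# its full understanding, and a separate checker decides whether a generated
# image may be passed on.

def generate_safely(prompt, generate, safety_classifier, threshold=0.5, max_tries=3):
    """Return a generated image that passes the safety check, or None."""
    for _ in range(max_tries):
        image = generate(prompt)
        # The checker only filters what gets shown; it does not alter what the
        # model learned, which is the point being argued above.
        if safety_classifier(image) < threshold:
            return image
    return None
```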

 

I urge people to calmly debate things, know when to be sorry and work on finding the right solution, rather than putting pressure on each other and forcing one's way onto another. It's so easy to scream at someone; it's so hard to calmly walk the path of mutual understanding.


With that I will end this post, as there is nothing more to say. Please, by all means, do not see this post as an opinion that is meant to be forced upon anyone. I merely share my concerns and understanding of the matter.