NatViS: Natural Vision

Name: NatViS: Natural Vision - v1.0
Rating: 0 (0 reviews)
Author: nDimensional

CHECKPOINT

Original

Updated: Sep 8, 2024 11:42 AM

Please Read Description

NatViS (Natural Vision) is a photorealistic full-parameter fine-tune of SDXL that uses natural language prompting to generate high-quality SFW/NSFW images. It was trained on 1M+ image-caption pairs from a dataset that has been expanded and refined for over a year.
Note: NatViS is still being trained. V1 (epoch 68) wrapped up training on July 19th, 2024.

I’ve never been a fan of e-begging, but SDXL fine-tunes at this scale are becoming expensive to train. So I will begrudgingly ask: if you like what I do and would like to support my models, consider donating on Ko-Fi 💗
I will begin posting updates, answering questions, taking feedback, and releasing early access (NOT EXCLUSIVE) models to supporters.

Questions/Feedback/Support

Visit my thread on the Unstable Diffusion Discord

Buy me a coffee ❤

https://ko-fi.com/ndimensional

All donations will be used to fund the creation of new Stable Diffusion fine-tunes and open-source AI tools.

Usage Tips

Note: These are simply recommendations, feel free to experiment.

Prompting

NatViS leverages SDXL’s bigG text-encoder to allow for Natural Language prompting.

What is Natural Language Prompting?
Since the release of Stable Diffusion v1.4, people have become accustomed to comma-delimited lists of visually descriptive tags/phrases. This was a necessity for early Stable Diffusion models due to the architecture and choice of text-encoder. With SDXL’s dual text-encoder/tokenizer architecture, we are able to write more naturally descriptive prompts.

Simply describe the image you want to generate, just as you would describe the image to a person.

For example:
Comma-delimited list: a woman, standing, outdoors, sun beams, dappled light, apple tree, wearing denim jeans, flannel shirt, brown hair, long hair, looking at viewer, highest quality, atmospheric, 35mm, masterpiece

Natural Language: A masterpiece, 35mm-style photo of a woman with long brown hair, standing outdoors in dappled sunlight beneath an apple tree. She wears denim jeans and a flannel shirt, gazing directly at the viewer with an atmospheric quality.

Note: This is just an example to highlight how to write a natural language prompt. For better examples, see the sample images.

Will NatViS Understand Everything I tell it?
Absolutely not.
Due to various limitations in both the architecture and the size of the data I’m able to fine-tune on as one person, there will be instances where the model simply will not generate what you want. Often, you can experiment with different wording, change the placement of tokens (i.e., move a sentence or individual token closer to the start or end of the prompt), remove potentially conflicting tokens, etc. There really is no definitive solution I can offer, as it varies from prompt to prompt. Unfortunately, there will be times when no solution/workaround is successful.
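One way to try the "move a sentence closer to the start or end" idea systematically is to generate reordered variants of a prompt and test each one with the same seed. The sketch below is only an illustration: the function name `sentence_orderings` and the `limit` parameter are made up for this example, and the sentence splitting is deliberately naive.

```python
import re
from itertools import permutations


def sentence_orderings(prompt: str, limit: int = 6) -> list[str]:
    """Return up to `limit` reorderings of the sentences in a prompt.

    Hypothetical helper for illustration; the first variant is the
    original sentence order.
    """
    # Naive split: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", prompt.strip())
    sentences = [p.strip() for p in parts if p.strip()]

    variants = []
    for order in permutations(sentences):
        variants.append(" ".join(order))
        if len(variants) >= limit:
            break
    return variants


if __name__ == "__main__":
    for variant in sentence_orderings(
        "A 35mm-style photo of a woman beneath an apple tree. "
        "She wears denim jeans and a flannel shirt."
    ):
        print(variant)
```

Keeping the seed and all other parameters fixed while swapping variants makes it easier to see whether the ordering itself changed the result.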

Can I still use Tags?
Short answer: Yes
SDXL’s dual text-encoder/tokenizer architecture can process tokens/sequences with both encoders in parallel, meaning you don’t have to use natural language prompting.

Note: Since the training data was captioned purely with natural language descriptions, not all the common descriptive tags people are familiar with will be understood by the model, especially Booru or Booru-style tags.

I found a hybrid system works well, as seen in many of the sample images.

For example:
Say you tried your natural language prompt but want to make the results a bit more cinematic. Instead of modifying the entire prompt, you can simply append cinematic lighting, harmonious, film still, etc. to the end of your prompt.
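As a rough illustration of the hybrid approach, the following Python sketch appends a short list of style tags to an existing natural language prompt. The helper name `append_style_tags` is hypothetical and the tags are the ones from the example above; nothing here is specific to NatViS beyond building the prompt string.

```python
def append_style_tags(prompt: str, tags: list[str]) -> str:
    """Append comma-separated style tags to the end of a prompt.

    Hypothetical helper for illustration only.
    """
    cleaned = [t.strip() for t in tags if t.strip()]
    # Remove surrounding whitespace and any dangling trailing comma.
    base = prompt.strip().rstrip(",").rstrip()
    if not cleaned:
        return base
    if not base:
        return ", ".join(cleaned)
    # After a finished sentence a space is enough; otherwise use a comma.
    separator = " " if base.endswith((".", "!", "?")) else ", "
    return base + separator + ", ".join(cleaned)


if __name__ == "__main__":
    print(
        append_style_tags(
            "A photo of a woman standing beneath an apple tree.",
            ["cinematic lighting", "harmonious", "film still"],
        )
    )
```

This keeps the descriptive sentence untouched, so you can add or drop style tags between generations without rewriting the prompt.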

Quality Tags/Classifiers? (score_up_x)
Blasphemy.
You can use quality rank/classifier tags if you want, but they are not part of the training data.

Negative Prompt
Similar to other SDXL models: use tags separated by commas and keep it short. Add or remove tokens from the negative prompt as needed.

Generation Parameters

CFG:

Recommended: 5-7
7+ to enforce a specific style/medium

Sampler/Sampling Steps:
This can be quite subjective, so I will just share what I typically use instead of giving direct recommendations.

Sampler - DPM++ 2M SDE
Scheduler - Karras
Steps - 55
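For anyone running the checkpoint outside a web UI, the settings above could be approximated with the Hugging Face diffusers library roughly as follows. This is a sketch under assumptions, not the setup used for the sample images: the checkpoint path is hypothetical, a CUDA GPU is assumed, the CFG value of 6.0 is just one point in the recommended range, and output will not match web UI results seed for seed.

```python
# Sketch only: maps the settings listed above onto diffusers.
SCHEDULER_KWARGS = {
    "algorithm_type": "sde-dpmsolver++",  # diffusers counterpart of DPM++ 2M SDE
    "use_karras_sigmas": True,            # Karras scheduler
}
SAMPLING_KWARGS = {
    "guidance_scale": 6.0,       # CFG, inside the recommended 5-7 range
    "num_inference_steps": 55,   # sampling steps
}


def generate(checkpoint_path, prompt, negative_prompt="", seed=0):
    """Load an SDXL checkpoint file and render one image with the settings above."""
    # Imported lazily so the constants can be inspected without the GPU stack.
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_single_file(
        checkpoint_path, torch_dtype=torch.float16
    ).to("cuda")
    # Swap the default scheduler while keeping the checkpoint's own config.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, **SCHEDULER_KWARGS
    )
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        generator=generator,
        **SAMPLING_KWARGS,
    )
    return result.images[0]
```

A call such as `generate("natvis_v1.safetensors", "A 35mm-style photo of a woman beneath an apple tree.").save("out.png")` would write one image; the filename is a placeholder for wherever the downloaded checkpoint lives.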

ADetailer: (Extension)
Again, subjective so I’ll just share my settings.

Model - mediapipe_face_full (use mediapipe for photorealism)
Confidence - 0.45
Everything else is default.

CFG Rescale: (Extension)
I forgot that I had this installed, and I’m not quite sure whether it was enforcing zero terminal SNR on the noise schedule. Since the parameter was null, it shouldn’t have been.

Phi - 0
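To see why a Phi of 0 should leave the guidance untouched, here is a toy Python sketch of the rescale step as it is commonly described: the guided prediction is scaled to match the standard deviation of the conditional prediction, then blended back with the plain CFG result using Phi. This assumes the extension follows that formula, works on flat lists instead of real latent tensors, and says nothing about whether the extension also alters the noise schedule.

```python
from statistics import pstdev


def cfg_with_rescale(cond, uncond, cfg_scale, phi):
    """Toy classifier-free guidance with an optional rescale blend.

    `cond` and `uncond` stand in for the conditional and unconditional
    model predictions; the function name is hypothetical.
    """
    # Plain classifier-free guidance.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

    std_guided = pstdev(guided)
    if std_guided == 0:
        return guided  # nothing to rescale

    # Rescale so the guided prediction matches the conditional std.
    factor = pstdev(cond) / std_guided
    rescaled = [g * factor for g in guided]

    # phi = 0 keeps plain CFG; phi = 1 uses the fully rescaled result.
    return [phi * r + (1 - phi) * g for r, g in zip(rescaled, guided)]


if __name__ == "__main__":
    cond = [1.0, 2.0, 4.0]
    uncond = [0.5, 0.5, 0.5]
    print(cfg_with_rescale(cond, uncond, 6.0, 0.0))
```

With Phi at 0 the blend reduces to the ordinary CFG output, so under this assumption the rescale itself would have had no effect on the sample images.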

Important

If you struggle to replicate the sample images, even with the exact seed and parameters, it’s likely because of the noise scheduler. I enabled the fix for this in Webui, but have since reinstalled Webui and forgot to re-enable it. This only applies to V1 of NatViS.

Training Info

TO-DO
This will take a while to write up, so in the meantime:
TL;DR: 1M+ images, processed/cleaned via a personal Dataset Toolkit I’m developing and captioned via a Multimodal Large Language Model (MLLM) with a unified feature space (part of the Dataset Toolkit, not GPT). Training data, configs, and custom scripts will be made available and open-sourced when the final version is released. The Dataset Toolkit has no announced release date.