Ink Drop
This is a collaboration with META’s internal team and Dartmouth ILIXR to develop an AI-powered drawing tool leveraging Stable Diffusion technology. The focus is on the human-in-the-loop design experience and AI integration.
In our study, we concentrate on methods that manipulate the denoising diffusion process, aiming to expand the scope of control over both the generated outcomes and image editing [1]. Crucially, our objective is to involve artists directly in the process, turning it into a more transparent and interactive 'white box' system.
Guided Image Synthesis
Using Diffusion-Based Editing Techniques
The fundamental idea is to modify the diffusion process in Stable Diffusion so that it becomes transparent and controllable. This is done by visually guiding the process and making it modular. By controlling the scheduler and intervening at specific points in the diffusion timeline, artists can achieve more precise results. Furthermore, incorporating drawing software such as Photoshop for editing the input lets it act as a memory unit, preserving the latent states in Photoshop's editing history. This approach mimics a recurrent neural network (RNN): the latent states are stored and the output of one diffusion pass becomes the input of the next, leading to highly consistent outcomes.
Visualization of the Denoising Scheduler
In this project, we use ComfyUI for prototyping. Whereas the standard KSampler always adds noise to the latent and then denoises it completely, the KSampler Advanced node exposes control over this process through its start_at_step and end_at_step settings. This makes it possible, for example, to hand a partially denoised latent over to a separate KSampler Advanced node to finish the process. The familiar denoise strength relates to these step controls roughly as:
denoise = (steps - start_at_step) / steps
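As a sanity check of this relationship, here is a minimal helper; the function name split_schedule is our own and not part of the ComfyUI API, and it simply applies the assumed equivalence above.

def split_schedule(steps: int, denoise: float) -> tuple[int, int]:
    # Assumed equivalence from above: denoise = (steps - start_at_step) / steps,
    # i.e. a given denoise strength corresponds to skipping the first
    # steps * (1 - denoise) steps of the schedule.
    start_at_step = round(steps * (1.0 - denoise))
    return start_at_step, steps

# Example: 30 steps at denoise 0.4 starts at step 18 and runs to step 30.
print(split_schedule(30, 0.4))  # (18, 30)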
Extensive research, including the findings in [7][10], has analyzed the influence of noise schedules on model performance. These factors likewise shape the backward (denoising) diffusion process, which we examine more closely in the sections below. The diagram below gives a straightforward depiction of how noise schedules affect the outcome. Interestingly, the ByteDance team found that the noise schedule and sample steps commonly used with the original Stable Diffusion release are flawed: the schedule never reaches zero signal-to-noise ratio at the last timestep, so the model cannot faithfully follow simple, explicit prompts such as a purely bright or purely dark scene [9].
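Recent versions of the diffusers library expose scheduler-side corrections based on [9]. The sketch below, assuming a recent diffusers release and the runwayml/stable-diffusion-v1-5 checkpoint, inspects the terminal signal-to-noise ratio of the stock schedule and switches to a DDIM schedule with zero-terminal-SNR rescaling and trailing timestep spacing; it is an illustration, not our production setup.

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Terminal SNR of the stock schedule: alphas_cumprod[-1] should be 0 for a
# true zero-SNR last timestep, but the default SD 1.x schedule leaves it > 0.
a_last = pipe.scheduler.alphas_cumprod[-1]
snr_last = (a_last / (1 - a_last)).item()
print(f"terminal SNR of default schedule: {snr_last:.6f}")

# Apply the scheduler-side corrections from [9]: rescale betas to zero
# terminal SNR and use "trailing" spacing so sampling starts at the last step.
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config,
    rescale_betas_zero_snr=True,
    timestep_spacing="trailing",
)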
By employing different denoising schedulers, we notice varied outcomes using an identical sampling method and seed.
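The same comparison can be reproduced outside ComfyUI. The sketch below, assuming the diffusers library, the runwayml/stable-diffusion-v1-5 checkpoint, and a placeholder prompt, swaps only the scheduler while holding the seed fixed.

import torch
from diffusers import (StableDiffusionPipeline, DDIMScheduler,
                       EulerDiscreteScheduler, DPMSolverMultistepScheduler)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "an ink drop dispersing in water, studio lighting"  # placeholder prompt
schedulers = {
    "ddim": DDIMScheduler,
    "euler": EulerDiscreteScheduler,
    "dpmpp": DPMSolverMultistepScheduler,
}

for name, cls in schedulers.items():
    # Same model, same prompt, same seed; only the denoising scheduler changes.
    pipe.scheduler = cls.from_config(pipe.scheduler.config)
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save(f"scheduler_{name}.png")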
Visualization of the Denoising Process
The schedule's step count is crucial: allowing the sampler more steps generally improves accuracy. In this context, we construct a custom node that terminates the denoising diffusion process at a predetermined step, sketched below.
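A minimal sketch of such a node follows. It assumes ComfyUI's standard custom-node registration pattern and that the built-in nodes.KSamplerAdvanced keeps its usual input names; the node name StopAtStep and its input list are our own, and the node simply closes the sampling window early and returns the latent with its leftover noise.

import comfy.samplers
import nodes

class StopAtStep:
    # Hypothetical custom node: run the built-in advanced sampler but halt the
    # denoising diffusion process at stop_at_step, returning the partially
    # denoised latent (leftover noise included) for a downstream node to finish.
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "model": ("MODEL",),
            "positive": ("CONDITIONING",),
            "negative": ("CONDITIONING",),
            "latent_image": ("LATENT",),
            "seed": ("INT", {"default": 0, "min": 0, "max": 0xffffffffffffffff}),
            "steps": ("INT", {"default": 30, "min": 1, "max": 10000}),
            "stop_at_step": ("INT", {"default": 15, "min": 0, "max": 10000}),
            "cfg": ("FLOAT", {"default": 7.5, "min": 0.0, "max": 100.0}),
            "sampler_name": (comfy.samplers.KSampler.SAMPLERS,),
            "scheduler": (comfy.samplers.KSampler.SCHEDULERS,),
        }}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "sample"
    CATEGORY = "sampling/inkdrop"

    def sample(self, model, positive, negative, latent_image, seed, steps,
               stop_at_step, cfg, sampler_name, scheduler):
        # Delegate to KSampler Advanced with end_at_step = stop_at_step and keep
        # the leftover noise so another sampler node can resume from this latent.
        return nodes.KSamplerAdvanced().sample(
            model, add_noise="enable", noise_seed=seed, steps=steps, cfg=cfg,
            sampler_name=sampler_name, scheduler=scheduler,
            positive=positive, negative=negative, latent_image=latent_image,
            start_at_step=0, end_at_step=stop_at_step,
            return_with_leftover_noise="enable")

NODE_CLASS_MAPPINGS = {"StopAtStep": StopAtStep}
NODE_DISPLAY_NAME_MAPPINGS = {"StopAtStep": "Stop Denoising At Step"}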
Modularizing the diffusion process
By controlling the scheduler and the diffusion process, we observe the trade-off between creative freedom and input consistency in the two plots below [3].
Having control of the scheduler and the denoising steps, we can use the latent result of the first diffusion process as the input to the second; a sketch of this handoff follows below.
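The sketch below illustrates the handoff outside ComfyUI using the diffusers library. The helper names embed and run_window, the prompts, and the split point are our own choices; the first pass stops partway through the schedule and the second resumes from the same latent with a different negative prompt.

import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# DDIM keeps no state between steps, so the loop can be split cleanly.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

@torch.no_grad()
def embed(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to("cuda")
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def run_window(latents, prompt, negative_prompt, steps, start_at_step, end_at_step, cfg=7.5):
    # Run only the [start_at_step, end_at_step) slice of the denoising schedule.
    cond, uncond = embed(prompt), embed(negative_prompt)
    pipe.scheduler.set_timesteps(steps, device="cuda")
    for i, t in enumerate(pipe.scheduler.timesteps):
        if i < start_at_step or i >= end_at_step:
            continue
        latent_in = pipe.scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise_uncond, noise_cond = pipe.unet(
            latent_in, t, encoder_hidden_states=torch.cat([uncond, cond])
        ).sample.chunk(2)
        noise = noise_uncond + cfg * (noise_cond - noise_uncond)
        latents = pipe.scheduler.step(noise, t, latents).prev_sample
    return latents

steps = 30
generator = torch.Generator(device="cuda").manual_seed(42)
latents = torch.randn((1, 4, 64, 64), generator=generator,
                      dtype=torch.float16, device="cuda") * pipe.scheduler.init_noise_sigma

# First diffusion process: stop partway and keep the latent as a checkpoint.
latents = run_window(latents, "an ink drop dispersing in water", "", steps, 0, 15)
# Second diffusion process: resume from that latent, redirected by a new negative prompt.
latents = run_window(latents, "an ink drop dispersing in water", "blurry, monochrome", steps, 15, steps)

with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
image = ((image / 2 + 0.5).clamp(0, 1) * 255).to(torch.uint8).permute(0, 2, 3, 1).cpu().numpy()[0]
Image.fromarray(image).save("chained.png")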
Redirecting the generation process by giving different negative prompts
Merging Concepts from Different LoRAs
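This kind of concept merging can be prototyped with the multi-adapter LoRA API in a recent diffusers release with the PEFT backend; the LoRA file names and adapter weights below are placeholders, not the LoRAs used in the project.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load two LoRAs as named adapters (file paths here are placeholders).
pipe.load_lora_weights("loras/style_a.safetensors", adapter_name="style_a")
pipe.load_lora_weights("loras/style_b.safetensors", adapter_name="style_b")

# Activate both adapters at once; the weights control how strongly each
# LoRA's concept contributes to the merged result.
pipe.set_adapters(["style_a", "style_b"], adapter_weights=[0.8, 0.5])

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe("an ink drop dispersing in water", num_inference_steps=30,
             generator=generator).images[0]
image.save("merged_loras.png")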
Connecting the pipeline in Photoshop
We can use Photoshop as a memory block for the modularized backward diffusion process, tracking positions in latent space. Additionally, leveraging Photoshop's layer mechanism can yield more consistent results from the model across various inputs; a conceptual sketch of this memory block follows.
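The sketch below is purely conceptual and does not touch Photoshop's scripting interface; the class LatentHistory and its methods are our own names, standing in for the way layers and history states checkpoint positions in latent space so any earlier state can seed a new diffusion pass.

import torch

class LatentHistory:
    # Illustrative stand-in for Photoshop's history/layer stack: each named
    # "layer" stores a latent checkpoint that a later diffusion pass can resume from.
    def __init__(self):
        self._layers: dict[str, torch.Tensor] = {}
        self._order: list[str] = []

    def commit(self, name: str, latent: torch.Tensor) -> None:
        # Store a detached copy so later edits cannot mutate the checkpoint.
        self._layers[name] = latent.detach().clone()
        self._order.append(name)

    def checkout(self, name: str) -> torch.Tensor:
        # Retrieve an earlier position in latent space to seed the next pass.
        return self._layers[name].clone()

    def history(self) -> list[str]:
        return list(self._order)

# Hypothetical usage with the run_window helper sketched earlier:
# history = LatentHistory()
# latents = run_window(latents, prompt, "", steps, 0, 15)
# history.commit("sketch_pass", latents)
# refined = run_window(history.checkout("sketch_pass"), prompt, "blurry", steps, 15, steps)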
References
[1] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[3] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
[4] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[6] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.
[7] Ziyi Chang, George Koulieris, and Hubert P. H. Shum. On the design fundamentals of diffusion models: A survey. arXiv preprint, 2023.
[8] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[9] S. Lin, B. Liu, J. Li, and X. Yang. Common diffusion noise schedules and sample steps are flawed. arXiv preprint arXiv:2305.08891, 2023.
[10] Kamil Deja et al. On analyzing generative and denoising capabilities of diffusion-based deep generative models. arXiv preprint arXiv:2206.00070, 2022.