Pictorama: Text-based Image Editing using Diffusion Models
Keywords:
Image Processing, Computer Vision, PyTorch, Stable Diffusion Models, Python, Machine Learning
Abstract
This research aims to pioneer text-based image modification, integrating natural language descriptions with advanced computer vision and NLP techniques. The primary objective is to bridge human language and image editing, empowering users to convey their creative visions effortlessly and advancing the field of image modification. The study employs stable diffusion models implemented in PyTorch and Python. It builds on prior works such as Imagic, LEDITS, and InstructPix2Pix, and integrates a Vector Quantized Diffusion (VQ-Diffusion) model. The model is trained on a 436 GB dataset in which each sample contains three features: an input image, an editing instruction, and an output edited image. Test samples consist of real images subjected to diverse text prompts for image edits, and the disentanglement properties of the model are explored. The approach combines textual inversion with Box-Constrained Diffusion (BoxDiff) for personalized and conditional image synthesis. The results show that stable diffusion models exhibit disentanglement properties that enable effective modifications without extensive fine-tuning. The introduced BoxDiff and VQ-Diffusion models demonstrate superior performance in spatially constrained and complex scene synthesis, outperforming traditional methods. The output images show higher quality and strong cohesiveness across the image. Running the model with a greater number of inference steps improves quality; here, 100 steps are used for higher image quality, and the effect of the step count on inference time is also studied. Because inference requires a large amount of video memory, a GPU with more than 11 GB of VRAM is recommended. The study adds value by addressing biases, achieving higher speeds, and enhancing image quality, contributing to the evolving landscape of text-to-image synthesis. This research introduces novel approaches in disentanglement, spatially constrained synthesis, and rapid image generation, pushing text-to-image synthesis beyond existing limitations.
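For concreteness, the sketch below illustrates the inference loop described above: a pretrained instruction-following diffusion editor takes an input image and a natural-language instruction and produces an edited image at 100 denoising steps. It uses the Hugging Face diffusers implementation of InstructPix2Pix purely as a stand-in; the checkpoint name, guidance scales, and timing code are illustrative assumptions rather than the exact Pictorama pipeline, which additionally integrates textual inversion, BoxDiff, and VQ-Diffusion components.

```python
# Minimal sketch: text-instruction image editing with a pretrained diffusion
# pipeline. Assumes the Hugging Face diffusers library and the public
# "timbrooks/instruct-pix2pix" checkpoint; the paper's own pipeline may differ.
import time

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the instruction-following editing pipeline (fp16 on GPU helps fit ~11 GB VRAM).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# One test sample: a real input image plus a text editing instruction.
image = Image.open("input.jpg").convert("RGB")
instruction = "turn the sky into a sunset"

start = time.perf_counter()
edited = pipe(
    prompt=instruction,
    image=image,
    num_inference_steps=100,    # more steps -> higher quality, longer inference
    image_guidance_scale=1.5,   # fidelity to the input image
    guidance_scale=7.5,         # adherence to the text instruction
).images[0]
elapsed = time.perf_counter() - start

edited.save("edited.jpg")
print(f"100-step edit took {elapsed:.1f} s on {device}")
```

Varying num_inference_steps in such a loop is one way to reproduce the step-count versus inference-time trade-off reported above.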
References
Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., ... & Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6007-6017) (2023). Available from: https://doi.org/10.48550/arXiv.2210.09276
Tsaban, L., & Passos, A. LEDITS: Real image editing with DDPM inversion and semantic guidance. arXiv preprint arXiv:2307.00522 (2023). Available from: https://doi.org/10.48550/arXiv.2307.00522
Brooks, T., Holynski, A., & Efros, A. A. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18392-18402) (2023). Available from: https://doi.org/10.48550/arXiv.2211.09800
Miyake, D., et al. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807 (2023). Available from: https://doi.org/10.48550/arXiv.2305.16807
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … & Simonyan, K. Flamingo: A visual language model for few-shot learning (2022). Available from: https://doi.org/10.48550/arXiv.2204.14198
Dhariwal, P., & Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794 (2021). Available from: https://doi.org/10.48550/arXiv.2105.05233
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022). Available from: https://doi.org/10.48550/arXiv.2208.01618
Elharrouss, O., Almaadeed, N., Al-Maadeed, S., & Akbari, Y. Image inpainting: A review. Neural Processing Letters, 51, 2007-2028 (2020). Available from: https://doi.org/10.48550/arXiv.1909.06399
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., ... & Tang, J. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 19822-19835 (2021). Available from: https://doi.org/10.48550/arXiv.2105.13290
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (pp. 89-106). Cham: Springer Nature Switzerland (2022). Available from: https://doi.org/10.48550/arXiv.2203.13131
Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., ... & Chang, S. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1900-1910) (2023). Available from: https://doi.org/10.48550/arXiv.2212.08698
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M. Z. BoxDiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7452-7461) (2023). Available from: https://doi.org/10.48550/arXiv.2307.10816
Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., ... & Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10696-10706) (2022). Available from: https://doi.org/10.48550/arXiv.2111.14822
Nguyen, T., Li, Y., Ojha, U., & Lee, Y. J. Visual instruction inversion: Image editing via visual prompting. arXiv preprint arXiv:2307.14331 (2023). Available from: https://doi.org/10.48550/arXiv.2307.14331
Li, J., Lu, W., Yang, M., Zhou, Y., & Yu, W. Text to image generation with semantic-spatial aware GAN. arXiv preprint arXiv:2104.00567 (2021). Available from: https://doi.org/10.48550/arXiv.2104.00567
Gu, X., Yang, Y., Xu, H., Zhou, C., & Wang, Y. Text-guided neural image inpainting. arXiv preprint arXiv:2004.03212 (2020). Available from: https://doi.org/10.1145/3394171.3414017
Dataset Link: https://instruct-pix2pix.eecs.berkeley.edu/clip-filtered-dataset/