Advancements and Challenges in Text-to-Image Synthesis: A Comprehensive Review

Authors

  • Khushboo Patel, Parth Shah

Keywords:

Text-to-image, GAN, Computer Vision, Virtual Reality

Abstract

Text-to-image synthesis, a subfield of generative adversarial networks (GANs), is an exciting area of research that aims to bridge the gap between natural language understanding and computer vision. With recent advancements in deep learning techniques and the availability of large-scale datasets, significant progress has been made in generating realistic and diverse images from textual descriptions. Generating high-fidelity, complex images from text is a challenging task. The ability to generate realistic images from textual descriptions has profound implications in various domains, including computer vision, multimedia, and virtual reality. This paper provides an in-depth study of state-of-the-art techniques and methodologies for text-to-image synthesis, and discusses the various architectural enhancements, models, and evaluation metrics. Finally, the paper concludes by identifying open research issues and future directions that can enhance the performance and capabilities of text-to-image synthesis systems.
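Among the evaluation metrics the abstract refers to, the Inception Score (IS) is one of the most widely used for GAN-generated images. As an illustrative sketch only (not taken from the paper): in practice the per-image class probabilities come from a pretrained Inception-v3 classifier, whereas the toy matrices below are hand-made stand-ins.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, K) matrix of per-image class
    probabilities (e.g. softmax outputs of an Inception classifier).

    IS = exp( E_x [ KL( p(y|x) || p(y) ) ] )
    """
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0)                      # marginal label distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))   # higher = sharper and more diverse

# Identical, uninformative predictions give the minimum score of 1.0 ...
uniform = np.full((4, 10), 0.1)
print(round(inception_score(uniform), 3))  # 1.0

# ... while confident, varied predictions score higher.
sharp = np.eye(10)[np.arange(8) % 10] * 0.9 + 0.01  # valid distributions, rows sum to 1
print(inception_score(sharp) > 1.0)  # True
```

The score rewards both sharpness (each image is confidently classified) and diversity (the marginal label distribution is spread out), which is why it rises only when both conditions hold.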


Published

05.06.2024

How to Cite

Khushboo Patel. (2024). Advancements and Challenges in Text-to-Image Synthesis: A Comprehensive Review. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 4228–4237. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6137

Issue

Section

Research Article