
Hybrid Encoding Method for Scene Text Recognition in Low-Resource Uyghur

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15037)

Abstract

Current state-of-the-art methods for scene text recognition are predominantly based on the Transformer architecture and focus primarily on resource-rich languages such as Chinese and English. However, Transformer-based architectures rely heavily on annotated data, and their performance on low-resource data is unsatisfactory. This paper proposes a Hybrid Encoding Method (HEM) for scene text recognition in low-resource Uyghur, which aims to equip the network with both the Transformer's capacity for modeling long-range, global image context and the CNN's strength in capturing local detail. By combining CNN and Transformer encodings, the model's learning capacity in low-resource settings is enhanced, strengthening its image understanding while reducing its dependence on annotated data. In addition, we construct two Uyghur scene text datasets, U1 and U2. Experimental results demonstrate that the proposed hybrid encoding method achieves outstanding performance on low-resource Uyghur scene text recognition, improving accuracy by 15% over baseline methods.

This work was supported by the Joint Funds of the National Natural Science Foundation of China (Grant No. U1603262), the National Natural Science Foundation of China (Grant No. 62137002), and the Shenzhen Municipal Science and Technology Innovation Committee Project (Grant No. GJGJZD20210408092806017).
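
The abstract describes an encoder that pairs a CNN branch (local detail such as strokes and character shapes) with a Transformer branch (long-range, global context) and fuses the two representations. Below is a minimal PyTorch sketch of that general idea; the module names, channel sizes, token layout, and concatenation-based fusion are illustrative assumptions, not the paper's actual HEM implementation.

```python
# A minimal sketch of a hybrid CNN + Transformer encoder (assumed design,
# not the paper's HEM architecture).
import torch
import torch.nn as nn


class HybridEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # CNN branch: captures local detail (strokes, character shapes).
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Transformer branch: models global context over the flattened
        # CNN feature map.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Fusion: concatenate local (CNN) and global (Transformer)
        # features per position, then project back to `dim`.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                          # x: (B, 3, H, W)
        local = self.cnn(x)                        # (B, dim, H/4, W/4)
        tokens = local.flatten(2).transpose(1, 2)  # (B, N, dim)
        global_ctx = self.transformer(tokens)      # (B, N, dim)
        fused = self.fuse(torch.cat([tokens, global_ctx], dim=-1))
        return fused                               # (B, N, dim) for a decoder


# Example: encode a batch of 32x128 text-line crops.
enc = HybridEncoder()
feats = enc(torch.randn(2, 3, 32, 128))
print(feats.shape)  # torch.Size([2, 256, 256]) -> (B, N = 8*32, dim)
```

Concatenation followed by a linear projection is the simplest fusion choice; gating or cross-attention between the two branches are common alternatives in hybrid encoders.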



Author information

Authors and Affiliations

  1. College of Computer Science and Technology, Xinjiang University, Urumqi, Xinjiang, China

    Miaomiao Xu, Jiang Zhang, Lianghui Xu, Yanbing Li & Wushour Silamu

  2. Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, Urumqi, Xinjiang, China

    Miaomiao Xu, Yanbing Li & Wushour Silamu

  3. Xinjiang Multilingual Information Technology Research Center, Xinjiang University, Urumqi, Xinjiang, China

    Miaomiao Xu, Yanbing Li & Wushour Silamu

Authors
  1. Miaomiao Xu

  2. Jiang Zhang

  3. Lianghui Xu

  4. Yanbing Li

  5. Wushour Silamu

Corresponding authors

Correspondence to Yanbing Li or Wushour Silamu.

Editor information

Editors and Affiliations

  1. Peking University, Beijing, China

    Zhouchen Lin

  2. Nankai University, Tianjin, China

    Ming-Ming Cheng

  3. Chinese Academy of Sciences, Beijing, China

    Ran He

  4. Xinjiang University, Ürümqi, Xinjiang, China

    Kurban Ubul

  5. Xinjiang University, Ürümqi, China

    Wushouer Silamu

  6. Peking University, Beijing, China

    Hongbin Zha

  7. Tsinghua University, Beijing, China

    Jie Zhou

  8. Chinese Academy of Sciences, Beijing, China

    Cheng-Lin Liu

Rights and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xu, M., Zhang, J., Xu, L., Li, Y., Silamu, W. (2025). Hybrid Encoding Method for Scene Text Recognition in Low-Resource Uyghur. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_7
