Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15037)
Included in the following conference series: Pattern Recognition and Computer Vision (PRCV)
Abstract
Current advanced methods for scene text recognition are predominantly based on the Transformer architecture and focus primarily on resource-rich languages such as Chinese and English. However, Transformer-based architectures rely heavily on annotated data, and their performance on low-resource data is unsatisfactory. This paper proposes a Hybrid Encoding Method (HEM) for scene text recognition in low-resource Uyghur, aiming to equip the network with both the Transformer's capacity for long-range association over the global image context and the CNN's capacity for capturing local detail. By combining the strengths of CNN and Transformer encodings, the model's learning capacity is enhanced in low-resource settings, strengthening its ability to comprehend images while reducing its reliance on annotated data. In addition, we construct two Uyghur scene text datasets, U1 and U2. Experimental results demonstrate that the proposed hybrid encoding method achieves outstanding performance in low-resource Uyghur scene text recognition, improving accuracy by 15% over baseline methods.
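To make the dual-encoding idea concrete, below is a minimal sketch of such a hybrid encoder in PyTorch, reconstructed from the abstract alone. The two-branch layout, layer sizes, and concatenation-based fusion are illustrative assumptions, not the paper's actual HEM architecture; all names (e.g. `HybridEncoder`) are hypothetical.

```python
# Hypothetical sketch of a hybrid CNN + Transformer encoder for scene text
# recognition. Reconstructed from the abstract only: layer sizes and the
# concatenation-based fusion are assumptions, not the authors' HEM design.
import torch
import torch.nn as nn


class HybridEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # CNN branch: captures local stroke/detail features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer branch: models long-range global context over the
        # flattened CNN feature map.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Fuse the local and global encodings (here: concatenate + project).
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, images):                      # images: (B, 3, H, W)
        local = self.cnn(images)                    # (B, C, H', W')
        tokens = local.flatten(2).transpose(1, 2)   # (B, H'*W', C)
        global_ctx = self.transformer(tokens)       # (B, H'*W', C)
        return self.fuse(torch.cat([tokens, global_ctx], dim=-1))


if __name__ == "__main__":
    enc = HybridEncoder()
    feats = enc(torch.randn(2, 3, 32, 128))  # typical text-line crop size
    print(feats.shape)  # torch.Size([2, 256, 256]): (batch, tokens, d_model)
```

The intuition the sketch encodes is that the CNN tokens retain local character-stroke detail while the Transformer output carries global context; fusing both gives the recognizer richer features than either branch alone, which is what the abstract credits for the gains in low-resource settings.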
This work was supported by the Joint Funds of the National Natural Science Foundation of China (Grant No. U1603262), the National Natural Science Foundation of China (Grant No. 62137002), and the Shenzhen Municipal Science and Technology Innovation Committee Project (Grant No. GJGJZD20210408092806017).
Author information
Authors and Affiliations
College of Computer Science and Technology, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Jiang Zhang, Lianghui Xu, Yanbing Li & Wushour Silamu
Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Yanbing Li & Wushour Silamu
Xinjiang Multilingual Information Technology Research Center, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Yanbing Li & Wushour Silamu
- Miaomiao Xu
- Jiang Zhang
- Lianghui Xu
- Yanbing Li
- Wushour Silamu
Corresponding authors
Correspondence to Yanbing Li or Wushour Silamu.
Editor information
Editors and Affiliations
Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, Xinjiang, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xu, M., Zhang, J., Xu, L., Li, Y., Silamu, W. (2025). Hybrid Encoding Method for Scene Text Recognition in Low-Resource Uyghur. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8510-0
Online ISBN: 978-981-97-8511-7
eBook Packages: Computer Science, Computer Science (R0)