Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15037)
Included in the following conference series: Pattern Recognition and Computer Vision (PRCV)
Abstract
Current advanced methods for scene text recognition are predominantly based on the Transformer architecture and focus primarily on resource-rich languages such as Chinese and English. However, Transformer-based architectures rely heavily on annotated data, and their performance on low-resource data is unsatisfactory. This paper proposes a Hybrid Encoding Method (HEM) for scene text recognition in low-resource Uyghur, aiming to equip the network with both the Transformer's capacity for long-range association over the global image context and the CNN's capacity for capturing local detail. By combining the strengths of CNN and Transformer encodings, the model's learning capacity is enhanced in low-resource settings, strengthening its ability to comprehend images while reducing its reliance on annotated data. In addition, we construct two Uyghur scene text datasets, U1 and U2. Experimental results demonstrate that the proposed hybrid encoding method achieves outstanding performance in low-resource Uyghur scene text recognition, improving accuracy by 15% over baseline methods.
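To make the dual-encoding idea concrete, below is a minimal sketch of such a hybrid encoder in PyTorch, reconstructed from the abstract alone. The two-branch layout, layer sizes, and concatenation-based fusion are illustrative assumptions, not the paper's actual HEM architecture; all names (e.g. `HybridEncoder`) are hypothetical.

```python
# Hypothetical sketch of a hybrid CNN + Transformer encoder for scene text
# recognition. Reconstructed from the abstract only: layer sizes and the
# concatenation-based fusion are assumptions, not the authors' HEM design.
import torch
import torch.nn as nn


class HybridEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # CNN branch: captures local stroke/detail features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer branch: models long-range global context over the
        # flattened CNN feature map.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Fuse the local and global encodings (here: concatenate + project).
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, images):                      # images: (B, 3, H, W)
        local = self.cnn(images)                    # (B, C, H', W')
        tokens = local.flatten(2).transpose(1, 2)   # (B, H'*W', C)
        global_ctx = self.transformer(tokens)       # (B, H'*W', C)
        return self.fuse(torch.cat([tokens, global_ctx], dim=-1))


if __name__ == "__main__":
    enc = HybridEncoder()
    feats = enc(torch.randn(2, 3, 32, 128))  # typical text-line crop size
    print(feats.shape)  # torch.Size([2, 256, 256]): (batch, tokens, d_model)
```

The intuition the sketch encodes is that the CNN tokens retain local character-stroke detail while the Transformer output carries global context; fusing both gives the recognizer richer features than either branch alone, which is what the abstract credits for the gains in low-resource settings.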
This work was supported by the Joint Funds of the National Natural Science Foundation of China (Grant No. U1603262), the National Natural Science Foundation of China (Grant No. 62137002), and the Shenzhen Municipal Science and Technology Innovation Committee Project (Grant No. GJGJZD20210408092806017).
Author information
Authors and Affiliations
College of Computer Science and Technology, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Jiang Zhang, Lianghui Xu, Yanbing Li & Wushour Silamu
Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Yanbing Li & Wushour Silamu
Xinjiang Multilingual Information Technology Research Center, Xinjiang University, Urumqi, Xinjiang, China
Miaomiao Xu, Yanbing Li & Wushour Silamu
- Miaomiao Xu
- Jiang Zhang
- Lianghui Xu
- Yanbing Li
- Wushour Silamu
Corresponding authors
Correspondence to Yanbing Li or Wushour Silamu.
Editor information
Editors and Affiliations
Peking University, Beijing, China
Zhouchen Lin
Nankai University, Tianjin, China
Ming-Ming Cheng
Chinese Academy of Sciences, Beijing, China
Ran He
Xinjiang University, Ürümqi, Xinjiang, China
Kurban Ubul
Xinjiang University, Ürümqi, Xinjiang, China
Wushouer Silamu
Peking University, Beijing, China
Hongbin Zha
Tsinghua University, Beijing, China
Jie Zhou
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xu, M., Zhang, J., Xu, L., Li, Y., Silamu, W. (2025). Hybrid Encoding Method for Scene Text Recognition in Low-Resource Uyghur. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8510-0
Online ISBN: 978-981-97-8511-7
eBook Packages: Computer Science, Computer Science (R0)