1 Shanghai Univ. of Electric Power (China)
2 Univ. of Chinese Academy of Sciences (China)
3 Univ. of Moratuwa (Sri Lanka)
*Address all correspondence to Minglei Tong, tongminglei@shiep.edu.cn
- 1 Introduction
- 2 Related Work
- 2.1 Transformer-Free Methods
- 2.2 Hybrid Methods
- 2.3 Purely Transformer Methods
- 3 Methodology
- 3.1 Tokenization
- 3.2 Bidirectional Semantic Fusion Framework
- 3.3 Prediction and Selection Strategy
- 4 Experiment
- 4.1 Datasets and Implementation Details
- 4.2 Ablation Studies
- 4.2.1 Subword-based tokenization
- 4.2.2 Bidirectional semantic fusion framework
- 4.2.3 Selection strategy
- 4.3 Comparison with the State-of-the-Art
- 4.4 Further Analysis and Insights
- 4.5 Limitations
- 5 Conclusion
Most current approaches in the scene text recognition literature train the language model on text data far sparser than the corpora used in natural language processing, resulting in inadequately trained language priors. We therefore propose a simple transformer encoder–decoder model, the multilingual semantic fusion network (MSFN), that leverages prior linguistic knowledge to learn robust language features. First, we label the text dataset with forward sequences, backward sequences, and subwords, the latter extracted by tokenization with linguistic information. Then we introduce a multilingual model into the decoder, with three channels corresponding to the three labelings of the dataset. The final output fuses the three channels to yield more accurate results. In experiments, MSFN achieves cutting-edge performance across six benchmark datasets, and extensive ablation studies demonstrate the effectiveness of the proposed method. Code is available at https://github.com/lclee0577/MLViT.
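To make the labeling scheme concrete, the sketch below shows how a single text sample could be expanded into the three label channels named above: forward characters, backward characters, and subwords. It is a minimal illustration only; the helper `make_labels` and the choice of a generic WordPiece tokenizer from the HuggingFace `transformers` library are assumptions, not the tokenizer or API actually used by MSFN.

```python
# Minimal sketch of building the three label channels for one sample.
# NOTE: make_labels and the "bert-base-uncased" WordPiece tokenizer are
# illustrative stand-ins, not the paper's actual tokenization pipeline.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def make_labels(text: str):
    """Return the forward, backward, and subword labelings of `text`."""
    forward = list(text)                  # characters, read left to right
    backward = forward[::-1]              # the same characters, right to left
    subwords = tokenizer.tokenize(text)   # linguistically informed subword units
    return forward, backward, subwords

fwd, bwd, sub = make_labels("reading")
print(fwd)  # ['r', 'e', 'a', 'd', 'i', 'n', 'g']
print(bwd)  # ['g', 'n', 'i', 'd', 'a', 'e', 'r']
print(sub)  # ['reading']; a rarer word splits, e.g. ['read', '##ing']
```

Each channel would then supervise its own decoder branch, and the three predictions are fused following the prediction and selection strategy of Sec. 3.3.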