Bases: Dataset
General OCR dataset class.
The OCR dataset supports two types of image-to-text datasets. One is for tokenizer-based models, in which the labels are tokens. The other is for char-level models, in which the labels are split by character and then converted to ids. This behavior is specified by text_split_type in the config, which can be either tokenize or char_split.
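To make the char-level mode concrete, here is a minimal sketch of how a text label could be split by character and converted to ids. It is not the library's implementation: the function char_split_to_ids, the pad_id argument, and the toy id2label mapping are all hypothetical, and the padding and truncation behavior is an assumption.

```python
from typing import Dict, List


def char_split_to_ids(
    text: str,
    label2id: Dict[str, int],
    max_length: int,
    pad_id: int = 0,  # hypothetical padding id, not taken from the docs
) -> List[int]:
    """Split text into characters, map each to an id, then pad or truncate."""
    # Characters missing from the mapping are skipped (an assumption).
    ids = [label2id[ch] for ch in text if ch in label2id]
    ids = ids[:max_length]
    ids += [pad_id] * (max_length - len(ids))
    return ids


# Toy vocabulary for illustration only.
id2label = {0: "<pad>", 1: "a", 2: "b", 3: "c"}
label2id = {label: idx for idx, label in id2label.items()}

labels = char_split_to_ids("cab", label2id, max_length=5)
print(labels)
```

In the tokenizer-based mode, the same text would instead be passed to the model's tokenizer, which produces the token ids directly.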
Bases: DatasetConfig
Configuration class for OCR datasets.
path (str) – Path to the dataset.
text_split_type (TextSplitType) – Type of text splitting (CHAR_SPLIT or TOKENIZE).
id2label (Dict[int, str]) – Mapping of label IDs to characters.
text_column (str) – Column name for text in the dataset.
images_paths_column (str) – Column name for image paths in the dataset.
max_length (int) – Maximum length of text.
invalid_characters (list) – List of invalid characters.
reverse_digits (bool) – Whether to reverse the digits in text.
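The parameters above can be pictured with a small stand-in config. This is only a sketch: SketchOCRConfig and clean_text are hypothetical names, the default values are invented, and the exact meaning of reverse_digits (read here as reversing each run of digits) is an assumption rather than documented behavior.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class TextSplitType(str, Enum):
    # The two values named in the docs; the enum layout itself is assumed.
    CHAR_SPLIT = "char_split"
    TOKENIZE = "tokenize"


@dataclass
class SketchOCRConfig:
    """Hypothetical stand-in that mirrors the documented fields."""
    path: str
    text_split_type: TextSplitType = TextSplitType.TOKENIZE
    id2label: Dict[int, str] = field(default_factory=dict)
    text_column: str = "text"                 # assumed default
    images_paths_column: str = "image_path"   # assumed default
    max_length: int = 64                      # assumed default
    invalid_characters: List[str] = field(default_factory=list)
    reverse_digits: bool = False


def clean_text(text: str, config: SketchOCRConfig) -> str:
    """Drop invalid characters, optionally reverse digit runs, then truncate."""
    for ch in config.invalid_characters:
        text = text.replace(ch, "")
    if config.reverse_digits:
        # Assumed interpretation: reverse each contiguous run of digits.
        text = re.sub(r"\d+", lambda m: m.group(0)[::-1], text)
    return text[: config.max_length]


config = SketchOCRConfig(
    path="data/ocr.csv",  # placeholder path
    text_split_type=TextSplitType.CHAR_SPLIT,
    invalid_characters=["#"],
    reverse_digits=True,
    max_length=16,
)
cleaned = clean_text("plate #123 ab", config)
print(cleaned)
```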
OCRDataset
OCRDatasetConfig
OCRDatasetConfig.id2label
OCRDatasetConfig.images_paths_column
OCRDatasetConfig.invalid_characters
OCRDatasetConfig.max_length
OCRDatasetConfig.name
OCRDatasetConfig.path
OCRDatasetConfig.reverse_digits
OCRDatasetConfig.reverse_text
OCRDatasetConfig.task
OCRDatasetConfig.text_column
OCRDatasetConfig.text_split_type
TextSplitType