This model was released on 2021-09-21 and added to Hugging Face Transformers on 2021-10-13.
TrOCR
TrOCR is an end-to-end text recognition model that handles both image understanding and text generation with a single Transformer. It doesn't require separate models for image processing or character generation.
You can find all the original TrOCR checkpoints under the Microsoft organization.
TrOCR architecture. Taken from the original paper. This model was contributed by nielsr.
Click on the TrOCR models in the right sidebar for more examples of how to apply TrOCR to different image and text tasks.
The example below demonstrates how to perform optical character recognition (OCR) with the VisionEncoderDecoderModel class.
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
Quantization
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to 8-bits.
```python
# pip install bitsandbytes accelerate
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, BitsAndBytesConfig
import requests
from PIL import Image

# Set up the quantization configuration
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Use a large checkpoint for a more noticeable impact
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained(
    "microsoft/trocr-large-handwritten", quantization_config=quantization_config
)

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
Notes
- TrOCR wraps ViTImageProcessor/DeiTImageProcessor and RobertaTokenizer/XLMRobertaTokenizer into a single instance of TrOCRProcessor to handle images and text (see the sketch after this list).
- TrOCR is always used within the VisionEncoderDecoder framework.
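The sketch below, assuming the microsoft/trocr-base-handwritten checkpoint, shows how the wrapped image processor and tokenizer are exposed on the processor and how they can be recombined explicitly.

```python
from transformers import TrOCRProcessor

# the pretrained processor bundles the checkpoint's image processor and tokenizer
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
print(type(processor.image_processor))  # the wrapped ViT/DeiT image processor
print(type(processor.tokenizer))        # the wrapped RoBERTa/XLM-RoBERTa tokenizer

# the same components can also be passed explicitly when constructing a processor
custom_processor = TrOCRProcessor(
    image_processor=processor.image_processor,
    tokenizer=processor.tokenizer,
)
```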
Resources
- A blog post on Accelerating Document AI with TrOCR.
- A blog post on Document AI with TrOCR.
- A notebook on how to finetune TrOCR on the IAM Handwriting Database using Seq2SeqTrainer.
- An interactive demo on TrOCR handwritten character recognition.
- A notebook on inference with TrOCR and a Gradio demo.
- A notebook on evaluating TrOCR on the IAM test set.
TrOCRConfig
class transformers.TrOCRConfig
( vocab_size = 50265, d_model = 1024, decoder_layers = 12, decoder_attention_heads = 16, decoder_ffn_dim = 4096, activation_function = 'gelu', max_position_embeddings = 512, dropout = 0.1, attention_dropout = 0.0, activation_dropout = 0.0, decoder_start_token_id = 2, init_std = 0.02, decoder_layerdrop = 0.0, use_cache = True, scale_embedding = False, use_learned_position_embeddings = True, layernorm_embedding = True, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, **kwargs )
Parameters
- vocab_size (int, optional, defaults to 50265) — Vocabulary size of the TrOCR model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling TrOCRForCausalLM.
- d_model (int, optional, defaults to 1024) — Dimensionality of the layers and the pooler layer.
- decoder_layers (int, optional, defaults to 12) — Number of decoder layers.
- decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- decoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
- activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
- max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings and pooler.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- decoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the decoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
- scale_embedding (bool, optional, defaults to False) — Whether or not to scale the word embeddings by sqrt(d_model).
- use_learned_position_embeddings (bool, optional, defaults to True) — Whether or not to use learned position embeddings. If not, sinusoidal position embeddings will be used.
- layernorm_embedding (bool, optional, defaults to True) — Whether or not to use a layernorm after the word + position embeddings.
This is the configuration class to store the configuration of a TrOCRForCausalLM. It is used to instantiate a TrOCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TrOCR microsoft/trocr-base-handwritten architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
```python
>>> from transformers import TrOCRConfig, TrOCRForCausalLM

>>> # Initializing a TrOCR-base style configuration
>>> configuration = TrOCRConfig()

>>> # Initializing a model (with random weights) from the TrOCR-base style configuration
>>> model = TrOCRForCausalLM(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
TrOCRProcessor
class transformers.TrOCRProcessor
( image_processor = None, tokenizer = None, **kwargs )
Parameters
- image_processor ([ViTImageProcessor/DeiTImageProcessor], optional) — An instance of [ViTImageProcessor/DeiTImageProcessor]. The image processor is a required input.
- tokenizer ([RobertaTokenizer/XLMRobertaTokenizer], optional) — An instance of [RobertaTokenizer/XLMRobertaTokenizer]. The tokenizer is a required input.
Constructs a TrOCR processor which wraps a vision image processor and a TrOCR tokenizer into a single processor.
TrOCRProcessor offers all the functionalities of [ViTImageProcessor/DeiTImageProcessor] and [RobertaTokenizer/XLMRobertaTokenizer]. See the __call__() and decode() for more information.
__call__
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None, text: typing.Union[str, list[str], list[list[str]], NoneType] = None, **kwargs: typing_extensions.Unpack[transformers.models.trocr.processing_trocr.TrOCRProcessorKwargs] )
When used in normal mode, this method forwards all its arguments to AutoImageProcessor's __call__() and returns its output. If used in the context ~TrOCRProcessor.as_target_processor, this method forwards all its arguments to TrOCRTokenizer's ~TrOCRTokenizer.__call__. Please refer to the docstring of the above two methods for more information.
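A minimal sketch of calling the processor directly, assuming a hypothetical local image file: image inputs are routed to the wrapped image processor, while text passed to the wrapped tokenizer is handy for building training labels.

```python
from PIL import Image
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

# image inputs go through the wrapped image processor
image = Image.open("line_image.png").convert("RGB")  # hypothetical local file
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# target text goes through the wrapped tokenizer (useful for building labels)
labels = processor.tokenizer("a handwritten line of text", return_tensors="pt").input_ids
```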
from_pretrained
( pretrained_model_name_or_path: typing.Union[str, os.PathLike], cache_dir: typing.Union[str, os.PathLike, NoneType] = None, force_download: bool = False, local_files_only: bool = False, token: typing.Union[str, bool, NoneType] = None, revision: str = 'main', **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to both from_pretrained() and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.
Instantiate a processor associated with a pretrained model.
This class method is simply calling the feature extractor from_pretrained(), image processor ImageProcessingMixin and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.
save_pretrained
( save_directory, push_to_hub: bool = False, **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (the directory will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer, …) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling the feature extractor's save_pretrained() and the tokenizer's save_pretrained(). Please refer to the docstrings of the methods above for more information.
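As a quick illustration, the sketch below saves the processor to a local directory and reloads it; the directory name is arbitrary.

```python
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

# writes the image processor config and tokenizer files to disk
processor.save_pretrained("./my_trocr_processor")  # hypothetical directory

# the same files can be loaded back with from_pretrained
reloaded_processor = TrOCRProcessor.from_pretrained("./my_trocr_processor")
```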
batch_decode
( *args, **kwargs )
This method forwards all its arguments to PreTrainedTokenizer's batch_decode(). Please refer to the docstring of this method for more information.
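A small usage sketch, reusing the processor, model, and pixel_values variables from the OCR example earlier on this page, to show that batch_decode returns one string per generated sequence.

```python
# generate token ids for the whole batch, then decode them to strings in one call
generated_ids = model.generate(pixel_values, max_new_tokens=64)
texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(texts)  # list with one decoded string per image in the batch
```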
TrOCRForCausalLM
class transformers.TrOCRForCausalLM
( config )
Parameters
- config (TrOCRConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The TrOCR Decoder with a language modeling head. Can be used as the decoder part of EncoderDecoderModel and VisionEncoderDecoderModel.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, encoder_hidden_states: typing.Optional[torch.FloatTensor] = None, encoder_attention_mask: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- encoder_hidden_states (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
- encoder_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TrOCRConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
The TrOCRForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import (
...     TrOCRConfig,
...     TrOCRProcessor,
...     TrOCRForCausalLM,
...     ViTConfig,
...     ViTModel,
...     VisionEncoderDecoderModel,
... )
>>> import requests
>>> from PIL import Image

>>> # TrOCR is a decoder model and should be used within a VisionEncoderDecoderModel
>>> # init vision2text model with random weights
>>> encoder = ViTModel(ViTConfig())
>>> decoder = TrOCRForCausalLM(TrOCRConfig())
>>> model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)

>>> # If you want to start from the pretrained model, load the checkpoint with `VisionEncoderDecoderModel`
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> text = "industry, ' Mr. Brown commented icily. ' Let us have a"

>>> # training
>>> model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
>>> model.config.pad_token_id = processor.tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size

>>> labels = processor.tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model(pixel_values, labels=labels)
>>> loss = outputs.loss
>>> round(loss.item(), 2)
5.30

>>> # inference
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> generated_text
'industry, " Mr. Brown commented icily. " Let us have a'
```