This model was released on 2021-09-21 and added to Hugging Face Transformers on 2021-10-13.
TrOCR
TrOCR is an end-to-end text recognition model that handles both image understanding and text generation with a single Transformer. It doesn't require separate models for image processing or character generation.
You can find all the original TrOCR checkpoints under the Microsoft organization.
TrOCR architecture. Taken from the original paper. This model was contributed by nielsr.
Click on the TrOCR models in the right sidebar for more examples of how to apply TrOCR to different image and text tasks.
The example below demonstrates how to perform optical character recognition (OCR) with the VisionEncoderDecoderModel class.
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
Quantization
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to 8-bits.
```python
# pip install bitsandbytes accelerate
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, BitsAndBytesConfig
import requests
from PIL import Image

# Set up the quantization configuration
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Use a large checkpoint for a more noticeable impact
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained(
    "microsoft/trocr-large-handwritten", quantization_config=quantization_config
)

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
Notes
- TrOCR wraps ViTImageProcessor/DeiTImageProcessor and RobertaTokenizer/XLMRobertaTokenizer into a single instance of TrOCRProcessor to handle images and text (see the sketch after this list).
- TrOCR is always used within the VisionEncoderDecoder framework.
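The sketch below, assuming the microsoft/trocr-base-handwritten checkpoint, shows how the wrapped image processor and tokenizer are exposed on the processor and how they can be recombined explicitly.

```python
from transformers import TrOCRProcessor

# the pretrained processor bundles the checkpoint's image processor and tokenizer
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
print(type(processor.image_processor))  # the wrapped ViT/DeiT image processor
print(type(processor.tokenizer))        # the wrapped RoBERTa/XLM-RoBERTa tokenizer

# the same components can also be passed explicitly when constructing a processor
custom_processor = TrOCRProcessor(
    image_processor=processor.image_processor,
    tokenizer=processor.tokenizer,
)
```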
Resources
- A blog post on Accelerating Document AI with TrOCR.
- A blog post on Document AI with TrOCR.
- A notebook on how to finetune TrOCR on the IAM Handwriting Database using Seq2SeqTrainer.
- An interactive demo on TrOCR handwritten character recognition.
- A notebook on inference with TrOCR and a Gradio demo.
- A notebook on evaluating TrOCR on the IAM test set.
TrOCRConfig
class transformers.TrOCRConfig
( vocab_size = 50265, d_model = 1024, decoder_layers = 12, decoder_attention_heads = 16, decoder_ffn_dim = 4096, activation_function = 'gelu', max_position_embeddings = 512, dropout = 0.1, attention_dropout = 0.0, activation_dropout = 0.0, decoder_start_token_id = 2, init_std = 0.02, decoder_layerdrop = 0.0, use_cache = True, scale_embedding = False, use_learned_position_embeddings = True, layernorm_embedding = True, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, **kwargs )
Parameters
- vocab_size (int, optional, defaults to 50265) — Vocabulary size of the TrOCR model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling TrOCRForCausalLM.
- d_model (int, optional, defaults to 1024) — Dimensionality of the layers and the pooler layer.
- decoder_layers (int, optional, defaults to 12) — Number of decoder layers.
- decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- decoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
- activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
- max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings and pooler.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- decoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the decoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
- scale_embedding (bool, optional, defaults to False) — Whether or not to scale the word embeddings by sqrt(d_model).
- use_learned_position_embeddings (bool, optional, defaults to True) — Whether or not to use learned position embeddings. If not, sinusoidal position embeddings will be used.
- layernorm_embedding (bool, optional, defaults to True) — Whether or not to use a layernorm after the word + position embeddings.
This is the configuration class to store the configuration of a TrOCRForCausalLM. It is used to instantiate a TrOCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TrOCR microsoft/trocr-base-handwritten architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
```python
>>> from transformers import TrOCRConfig, TrOCRForCausalLM

>>> # Initializing a TrOCR-base style configuration
>>> configuration = TrOCRConfig()

>>> # Initializing a model (with random weights) from the TrOCR-base style configuration
>>> model = TrOCRForCausalLM(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
TrOCRProcessor
class transformers.TrOCRProcessor
( image_processor = None, tokenizer = None, **kwargs )
Parameters
- image_processor ([ViTImageProcessor/DeiTImageProcessor], optional) — An instance of [ViTImageProcessor/DeiTImageProcessor]. The image processor is a required input.
- tokenizer ([RobertaTokenizer/XLMRobertaTokenizer], optional) — An instance of [RobertaTokenizer/XLMRobertaTokenizer]. The tokenizer is a required input.
Constructs a TrOCR processor which wraps a vision image processor and a TrOCR tokenizer into a single processor.
TrOCRProcessor offers all the functionalities of [ViTImageProcessor/DeiTImageProcessor] and [RobertaTokenizer/XLMRobertaTokenizer]. See the __call__() and decode() for more information.
__call__
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None, text: typing.Union[str, list[str], list[list[str]], NoneType] = None, **kwargs: typing_extensions.Unpack[transformers.models.trocr.processing_trocr.TrOCRProcessorKwargs] )
When used in normal mode, this method forwards all its arguments to AutoImageProcessor's __call__() and returns its output. If used in the context ~TrOCRProcessor.as_target_processor, this method forwards all its arguments to TrOCRTokenizer's ~TrOCRTokenizer.__call__. Please refer to the docstring of the above two methods for more information.
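A minimal sketch of calling the processor directly, assuming a hypothetical local image file: image inputs are routed to the wrapped image processor, while text passed to the wrapped tokenizer is handy for building training labels.

```python
from PIL import Image
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

# image inputs go through the wrapped image processor
image = Image.open("line_image.png").convert("RGB")  # hypothetical local file
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# target text goes through the wrapped tokenizer (useful for building labels)
labels = processor.tokenizer("a handwritten line of text", return_tensors="pt").input_ids
```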
from_pretrained
( pretrained_model_name_or_path: typing.Union[str, os.PathLike], cache_dir: typing.Union[str, os.PathLike, NoneType] = None, force_download: bool = False, local_files_only: bool = False, token: typing.Union[str, bool, NoneType] = None, revision: str = 'main', **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to both from_pretrained() and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.
Instantiate a processor associated with a pretrained model.
This class method is simply calling the feature extractor from_pretrained(), image processor ImageProcessingMixin and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.
save_pretrained
( save_directory, push_to_hub: bool = False, **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (the directory will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer, …) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling the feature extractor's save_pretrained() and the tokenizer's save_pretrained(). Please refer to the docstrings of the methods above for more information.
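As a quick illustration, the sketch below saves the processor to a local directory and reloads it; the directory name is arbitrary.

```python
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")

# writes the image processor config and tokenizer files to disk
processor.save_pretrained("./my_trocr_processor")  # hypothetical directory

# the same files can be loaded back with from_pretrained
reloaded_processor = TrOCRProcessor.from_pretrained("./my_trocr_processor")
```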
batch_decode
( *args, **kwargs )
This method forwards all its arguments to PreTrainedTokenizer's batch_decode(). Please refer to the docstring of this method for more information.
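A small usage sketch, reusing the processor, model, and pixel_values variables from the OCR example earlier on this page, to show that batch_decode returns one string per generated sequence.

```python
# generate token ids for the whole batch, then decode them to strings in one call
generated_ids = model.generate(pixel_values, max_new_tokens=64)
texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(texts)  # list with one decoded string per image in the batch
```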
TrOCRForCausalLM
class transformers.TrOCRForCausalLM
( config )
Parameters
- config (TrOCRConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The TrOCR Decoder with a language modeling head. Can be used as the decoder part of EncoderDecoderModel and VisionEncoderDecoderModel.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, encoder_hidden_states: typing.Optional[torch.FloatTensor] = None, encoder_attention_mask: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- encoder_hidden_states (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
- encoder_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TrOCRConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
The TrOCRForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import (
...     TrOCRConfig,
...     TrOCRProcessor,
...     TrOCRForCausalLM,
...     ViTConfig,
...     ViTModel,
...     VisionEncoderDecoderModel,
... )
>>> import requests
>>> from PIL import Image

>>> # TrOCR is a decoder model and should be used within a VisionEncoderDecoderModel
>>> # init vision2text model with random weights
>>> encoder = ViTModel(ViTConfig())
>>> decoder = TrOCRForCausalLM(TrOCRConfig())
>>> model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)

>>> # If you want to start from the pretrained model, load the checkpoint with `VisionEncoderDecoderModel`
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> text = "industry, ' Mr. Brown commented icily. ' Let us have a"

>>> # training
>>> model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
>>> model.config.pad_token_id = processor.tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size

>>> labels = processor.tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model(pixel_values, labels=labels)
>>> loss = outputs.loss
>>> round(loss.item(), 2)
5.30

>>> # inference
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> generated_text
'industry, " Mr. Brown commented icily. " Let us have a'
```