Transformers documentation

TextNet

This model was released on 2021-11-03 and added to Hugging Face Transformers on 2025-01-08.

Overview

The TextNet model was proposed in FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation by Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, Tong Lu. TextNet is a vision backbone useful for text detection tasks. It is the result of a neural architecture search (NAS) over backbones that uses performance on the text detection task as the reward function, so that the resulting backbone provides powerful features for text detection.

TextNet backbone as part of FAST. Taken from the original paper.

This model was contributed by Raghavan, jadechoghari and nielsr.

Usage tips

TextNet is mainly used as a backbone network whose architecture was searched for text detection. Each stage of the backbone network consists of a stride-2 convolution and searchable blocks. Specifically, the paper defines a layer-level candidate set, {conv3×3, conv1×3, conv3×1, identity}. Because the 1×3 and 3×1 convolutions have asymmetric kernels and oriented structure priors, they may help to capture the features of extreme aspect-ratio and rotated text lines.
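To make the candidate set more concrete, the sketch below counts the weights of each candidate layer for a given channel count. The helper `conv_weight_count` and the channel count of 64 are illustrative assumptions, not part of the library or the paper, and bias terms are ignored.

```python
# Illustrative only: weight counts for the layer-level candidates named above.
# `conv_weight_count` is a hypothetical helper, not a Transformers function.

def conv_weight_count(kernel_h: int, kernel_w: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a dense 2D convolution, ignoring bias."""
    return kernel_h * kernel_w * in_channels * out_channels

# Kernel shapes are (height, width); the identity candidate has no weights.
candidates = {"conv3x3": (3, 3), "conv1x3": (1, 3), "conv3x1": (3, 1)}

channels = 64  # assumed channel count, chosen only for this example
weight_counts = {
    name: conv_weight_count(kh, kw, channels, channels)
    for name, (kh, kw) in candidates.items()
}
weight_counts["identity"] = 0

for name, count in weight_counts.items():
    print(f"{name}: {count} weights")
```

Each asymmetric kernel uses a third of the weights of a 3×3 kernel while still covering a horizontal or vertical strip of three pixels.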

TextNet is the backbone for FAST, but it can also be used as an efficient image classifier. TextNetForImageClassification is provided so that an image classifier can be trained on top of the pre-trained TextNet weights.

TextNetConfig

class transformers.TextNetConfig

(stem_kernel_size = 3, stem_stride = 2, stem_num_channels = 3, stem_out_channels = 64, stem_act_func = 'relu', image_size = [640, 640], conv_layer_kernel_sizes = None, conv_layer_strides = None, hidden_sizes = [64, 64, 128, 256, 512], batch_norm_eps = 1e-05, initializer_range = 0.02, out_features = None, out_indices = None, **kwargs)

Parameters

stem_kernel_size (int, optional, defaults to 3) — The kernel size for the initial convolution layer.
stem_stride (int, optional, defaults to 2) — The stride for the initial convolution layer.
stem_num_channels (int, optional, defaults to 3) — The number of input channels for the initial convolution layer.
stem_out_channels (int, optional, defaults to 64) — The number of output channels for the initial convolution layer.
stem_act_func (str, optional, defaults to "relu") — The activation function for the initial convolution layer.
image_size (tuple[int, int], optional, defaults to [640, 640]) — The size (resolution) of each image.
conv_layer_kernel_sizes (list[list[list[int]]], optional) — A list of stage-wise kernel sizes. If None, defaults to: [[[3, 3], [3, 3], [3, 3]], [[3, 3], [1, 3], [3, 3], [3, 1]], [[3, 3], [3, 3], [3, 1], [1, 3]], [[3, 3], [3, 1], [1, 3], [3, 3]]].
conv_layer_strides (list[list[int]], optional) — A list of stage-wise strides. If None, defaults to: [[1, 2, 1], [2, 1, 1, 1], [2, 1, 1, 1], [2, 1, 1, 1]].
hidden_sizes (list[int], optional, defaults to [64, 64, 128, 256, 512]) — Dimensionality (hidden size) at each stage.
batch_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the batch normalization layers.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
out_features (list[str], optional) — If used as backbone, list of features to output. Can be any of "stem", "stage1", "stage2", etc. (depending on how many stages the model has). If unset and out_indices is set, will default to the corresponding stages. If unset and out_indices is unset, will default to the last stage.
out_indices (list[int], optional) — If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how many stages the model has). If unset and out_features is set, will default to the corresponding stages. If unset and out_features is unset, will default to the last stage.
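As a rough illustration of how the default stem_stride and conv_layer_strides interact with image_size, the sketch below multiplies the strides stage by stage. It assumes that each stride-s layer divides the spatial size exactly by s (which depends on padding and may not hold for every input) and that the stages are named "stem" and "stage1" to "stage4"; both are assumptions made for this example, not a statement about the library's internals.

```python
# Sketch: cumulative downsampling implied by the default strides listed above.
# Assumption: each stride-s layer divides the spatial size exactly by s.
from math import prod

stem_stride = 2
conv_layer_strides = [[1, 2, 1], [2, 1, 1, 1], [2, 1, 1, 1], [2, 1, 1, 1]]
image_size = (640, 640)

# Overall stride after the stem and after each subsequent stage.
cumulative_strides = [stem_stride]
for stage_strides in conv_layer_strides:
    cumulative_strides.append(cumulative_strides[-1] * prod(stage_strides))

feature_map_sizes = [
    (image_size[0] // s, image_size[1] // s) for s in cumulative_strides
]

stage_names = ["stem", "stage1", "stage2", "stage3", "stage4"]  # assumed names
for name, stride, size in zip(stage_names, cumulative_strides, feature_map_sizes):
    print(f"{name}: overall stride {stride}, feature map {size}")
```

Under these assumptions the last stage has an overall stride of 32, so a 640×640 input yields a 20×20 feature map.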

This is the configuration class to store the configuration of a TextNetModel. It is used to instantiate a TextNet model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of czczup/textnet-base. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Examples:

>>> from transformers import TextNetConfig, TextNetBackbone

>>> # Initializing a TextNetConfig
>>> configuration = TextNetConfig()

>>> # Initializing a model (with random weights)
>>> model = TextNetBackbone(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
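When the configuration is used for a backbone, the defaulting rules described for out_features and out_indices can be summarized with a small stand-alone sketch. The function `resolve_outputs` and the `STAGE_NAMES` list are hypothetical and written only for illustration; the library has its own internal implementation, which may also validate the inputs.

```python
# Sketch of the defaulting rules described for `out_features` / `out_indices`.
# `resolve_outputs` and `STAGE_NAMES` are hypothetical, not library code.
from typing import Optional

STAGE_NAMES = ["stem", "stage1", "stage2", "stage3", "stage4"]  # assumed: stem + 4 stages

def resolve_outputs(
    out_features: Optional[list[str]] = None,
    out_indices: Optional[list[int]] = None,
) -> tuple[list[str], list[int]]:
    if out_features is None and out_indices is None:
        # Neither is set: default to the last stage.
        return [STAGE_NAMES[-1]], [len(STAGE_NAMES) - 1]
    if out_features is None:
        # Only indices are set: look up the corresponding stage names.
        return [STAGE_NAMES[i] for i in out_indices], list(out_indices)
    if out_indices is None:
        # Only names are set: look up the corresponding indices.
        return list(out_features), [STAGE_NAMES.index(name) for name in out_features]
    # Both set: returned as given (consistency checks omitted in this sketch).
    return list(out_features), list(out_indices)

print(resolve_outputs())
print(resolve_outputs(out_indices=[1, 3]))
print(resolve_outputs(out_features=["stage2", "stage4"]))
```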

TextNetImageProcessor

class transformers.TextNetImageProcessor

(do_resize: bool = True, size: typing.Optional[dict[str, int]] = None, size_divisor: int = 32, resample: Resampling = <Resampling.BILINEAR: 2>, do_center_crop: bool = False, crop_size: typing.Optional[dict[str, int]] = None, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, list[float], NoneType] = [0.485, 0.456, 0.406], image_std: typing.Union[float, list[float], NoneType] = [0.229, 0.224, 0.225], do_convert_rgb: bool = True, **kwargs)

Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
size (dict[str, int], optional, defaults to {"shortest_edge": 640}) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
size_divisor (int, optional, defaults to 32) — Ensures height and width are rounded to a multiple of this value after resizing.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
do_center_crop (bool, optional, defaults to False) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
crop_size (dict[str, int], optional, defaults to 224) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
image_mean (float or list[float], optional, defaults to [0.485, 0.456, 0.406]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to [0.229, 0.224, 0.225]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.

Constructs a TextNet image processor.
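The interaction between size and size_divisor can be illustrated with a small arithmetic sketch. It assumes that the shortest edge is scaled to size["shortest_edge"], that the other edge keeps the aspect ratio (truncated to an integer), and that each side is then rounded up to the next multiple of size_divisor. The helper `resized_shape` is hypothetical, and the library's exact rounding rule may differ.

```python
# Sketch of the resize arithmetic described by `size` and `size_divisor`.
# Assumption: sides are rounded UP to the next multiple of `size_divisor`.

def resized_shape(
    height: int, width: int, shortest_edge: int = 640, size_divisor: int = 32
) -> tuple[int, int]:
    short, long = min(height, width), max(height, width)
    new_short = shortest_edge
    new_long = shortest_edge * long // short  # keep the aspect ratio
    if height <= width:
        new_height, new_width = new_short, new_long
    else:
        new_height, new_width = new_long, new_short

    def round_up(value: int) -> int:
        # Ceiling division, then back to a multiple of the divisor.
        return -(-value // size_divisor) * size_divisor

    return round_up(new_height), round_up(new_width)

output_shape = resized_shape(480, 640)  # (height, width) of the input image
print(output_shape)
```

Keeping both sides divisible by 32 matches the overall stride implied by the default configuration, so the deepest feature map has integer dimensions.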

preprocess

(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], do_resize: typing.Optional[bool] = None, size: typing.Optional[dict[str, int]] = None, size_divisor: typing.Optional[int] = None, resample: typing.Optional[PIL.Image.Resampling] = None, do_center_crop: typing.Optional[bool] = None, crop_size: typing.Optional[int] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs)

Parameters

images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (dict[str, int], optional, defaults to self.size) — Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
size_divisor (int, optional, defaults to 32) — Ensures height and width are rounded to a multiple of this value after resizing.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
crop_size (dict[str, int], optional, defaults to self.crop_size) — Size of the center crop. Only has an effect if do_center_crop is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.
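The rescaling and normalization steps amount to simple per-channel arithmetic. The sketch below applies the documented default rescale_factor, image_mean and image_std to a single RGB pixel. The helper `rescale_and_normalize` is hypothetical and only illustrates the arithmetic; it is not the library's implementation.

```python
# Sketch of the per-pixel arithmetic behind `do_rescale` and `do_normalize`,
# using the default values documented above.

RESCALE_FACTOR = 1 / 255
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]

def rescale_and_normalize(pixel: list[int]) -> list[float]:
    # Step 1: map 0..255 integer values to the 0..1 range.
    rescaled = [value * RESCALE_FACTOR for value in pixel]
    # Step 2: subtract the per-channel mean and divide by the per-channel std.
    return [(v - mean) / std for v, mean, std in zip(rescaled, IMAGE_MEAN, IMAGE_STD)]

white = rescale_and_normalize([255, 255, 255])
black = rescale_and_normalize([0, 0, 0])
print(white)
print(black)
```

This also shows why do_rescale=False is needed for inputs already in the 0 to 1 range: rescaling them a second time would shrink every value by another factor of 255 before normalization.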

TextNetImageProcessorFast

class transformers.TextNetImageProcessorFast

(**kwargs: typing_extensions.Unpack[transformers.models.textnet.image_processing_textnet.TextNetImageProcessorKwargs])

Constructs a fast TextNet image processor.

preprocess

(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], **kwargs: typing_extensions.Unpack[transformers.models.textnet.image_processing_textnet.TextNetImageProcessorKwargs]) → transformers.image_processing_base.BatchFeature

Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
do_resize (bool, optional) — Whether to resize the image.
size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional) — Whether to rescale the image.
rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional) — Whether to normalize the image.
image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
do_center_crop (bool, optional) — Whether to center crop the image.
data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
size_divisor (int) — The size by which to make sure both the height and width can be divided.

Returns

transformers.image_processing_base.BatchFeature

data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

TextNetModel

class transformers.TextNetModel

(config)

Parameters

config (TextNetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare TextNet Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(pixel_values: Tensor, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using TextNetImageProcessor. See TextNetImageProcessor.__call__() for details (processor_class uses TextNetImageProcessor for processing images).
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TextNetConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state after a pooling operation on the spatial dimensions.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

The TextNetModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
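The relationship between last_hidden_state and pooler_output can be pictured with a toy, pure-Python example. The sketch assumes global average pooling; the description above only says "a pooling operation on the spatial dimensions", so the choice of averaging and the helper `global_average_pool` are assumptions made for illustration.

```python
# Sketch: reducing a (batch, channels, height, width) feature map to
# (batch, channels) by pooling over the spatial dimensions.
# Assumption: global AVERAGE pooling, chosen only for this illustration.

def global_average_pool(feature_map: list) -> list:
    pooled = []
    for sample in feature_map:          # iterate over the batch
        pooled_sample = []
        for channel in sample:          # each channel is a height x width grid
            values = [v for row in channel for v in row]
            pooled_sample.append(sum(values) / len(values))
        pooled.append(pooled_sample)
    return pooled

# A toy "last_hidden_state" with batch_size=1, num_channels=2, height=2, width=2.
last_hidden_state = [[[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]]
pooler_output = global_average_pool(last_hidden_state)
print(pooler_output)
```

The spatial dimensions disappear and one value per channel remains, which is why pooler_output has shape (batch_size, hidden_size).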

TextNetForImageClassification

class transformers.TextNetForImageClassification

(config)

Parameters

config (TextNetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

TextNet Model with an image classification head on top (a linear layer on top of the pooled features), e.g. for ImageNet.

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

(pixel_values: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using TextNetImageProcessor. See TextNetImageProcessor.__call__() for details (processor_class uses TextNetImageProcessor for processing images).
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss). If config.num_labels > 1, a classification loss is computed (Cross-Entropy).
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
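The rule stated for labels (mean-squared error when config.num_labels == 1, cross-entropy otherwise) can be sketched for a single example in pure Python. The helper `classification_loss` is hypothetical and is not the library's code, which operates on batched tensors.

```python
# Sketch of the loss selection described for `labels`, for one example.
import math

def classification_loss(logits: list[float], label: float, num_labels: int) -> float:
    if num_labels == 1:
        # Regression: squared error between the single score and the target.
        return (logits[0] - label) ** 2
    # Classification: negative log-softmax probability of the target class.
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[int(label)]

regression_loss = classification_loss([2.0], label=1.0, num_labels=1)
cross_entropy = classification_loss([0.0, 0.0], label=1, num_labels=2)
print(regression_loss, cross_entropy)
```

With two equal logits the predicted distribution is uniform, so the cross-entropy equals log(2) regardless of which class is the target.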

Returns

transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TextNetConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.

The TextNetForImageClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> import torch
>>> import requests
>>> from transformers import TextNetForImageClassification, TextNetImageProcessor
>>> from PIL import Image

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> processor = TextNetImageProcessor.from_pretrained("czczup/textnet-base")
>>> model = TextNetForImageClassification.from_pretrained("czczup/textnet-base")

>>> inputs = processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.logits.shape
torch.Size([1, 2])
