This model was released on 2021-11-03 and added to Hugging Face Transformers on 2025-01-08.
TextNet
Overview
The TextNet model was proposed in FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation by Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, and Tong Lu. TextNet is a vision backbone designed for text detection. It is the result of a neural architecture search (NAS) over backbones that uses text detection performance as the reward function, so that the resulting network provides powerful features for text detection.
TextNet backbone as part of FAST. Taken from the original paper.
This model was contributed by Raghavan, jadechoghari and nielsr.
Usage tips
TextNet is mainly used as a backbone network for text detection, and its architecture was found by architecture search. Each stage of the backbone network consists of a stride-2 convolution followed by searchable blocks. Specifically, the authors define a layer-level candidate set, {conv3×3, conv1×3, conv3×1, identity}. Because the 1×3 and 3×1 convolutions have asymmetric kernels and oriented structure priors, they may help to capture the features of extreme aspect-ratio and rotated text lines.
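As a rough illustration of the candidate set above, the following sketch compares the four layer types by kernel shape, by the padding that keeps the spatial size unchanged at stride 1, and by the number of convolution weights. The helper name describe_candidates and the channel count of 64 are assumptions chosen for illustration. They are not part of the Transformers API or of the paper's search code.

```python
# Hypothetical helper: summarizes the layer-level candidate set
# {conv3x3, conv1x3, conv3x1, identity} for a given channel count.
CANDIDATES = {
    "conv3x3": (3, 3),
    "conv1x3": (1, 3),
    "conv3x1": (3, 1),
    "identity": None,  # no kernel, passes the input through
}


def describe_candidates(in_channels, out_channels):
    summary = {}
    for name, kernel in CANDIDATES.items():
        if kernel is None:
            summary[name] = {"kernel": None, "padding": None, "weights": 0}
            continue
        k_h, k_w = kernel
        summary[name] = {
            "kernel": kernel,
            # "same" padding for an odd kernel at stride 1
            "padding": ((k_h - 1) // 2, (k_w - 1) // 2),
            # weight count of a plain convolution, bias excluded
            "weights": k_h * k_w * in_channels * out_channels,
        }
    return summary


summary = describe_candidates(64, 64)
for name, info in summary.items():
    print(name, info)
```

Under these assumptions, each asymmetric kernel uses a third of the weights of the 3×3 kernel while still spanning three pixels along one axis.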
TextNet is the backbone for FAST, but it can also be used as an efficient image classifier. TextNetForImageClassification is provided so that an image classifier can be trained on top of the pre-trained TextNet weights.
TextNetConfig
class transformers.TextNetConfig
(stem_kernel_size = 3, stem_stride = 2, stem_num_channels = 3, stem_out_channels = 64, stem_act_func = 'relu', image_size = [640, 640], conv_layer_kernel_sizes = None, conv_layer_strides = None, hidden_sizes = [64, 64, 128, 256, 512], batch_norm_eps = 1e-05, initializer_range = 0.02, out_features = None, out_indices = None, **kwargs)
Parameters
- stem_kernel_size (int, optional, defaults to 3) — The kernel size for the initial convolution layer.
- stem_stride (int, optional, defaults to 2) — The stride for the initial convolution layer.
- stem_num_channels (int, optional, defaults to 3) — The number of input channels for the initial convolution layer.
- stem_out_channels (int, optional, defaults to 64) — The number of output channels for the initial convolution layer.
- stem_act_func (str, optional, defaults to "relu") — The activation function for the initial convolution layer.
- image_size (tuple[int, int], optional, defaults to [640, 640]) — The size (resolution) of each image.
- conv_layer_kernel_sizes (list[list[list[int]]], optional) — A list of stage-wise kernel sizes. If None, defaults to: [[[3, 3], [3, 3], [3, 3]], [[3, 3], [1, 3], [3, 3], [3, 1]], [[3, 3], [3, 3], [3, 1], [1, 3]], [[3, 3], [3, 1], [1, 3], [3, 3]]].
- conv_layer_strides (list[list[int]], optional) — A list of stage-wise strides. If None, defaults to: [[1, 2, 1], [2, 1, 1, 1], [2, 1, 1, 1], [2, 1, 1, 1]].
- hidden_sizes (list[int], optional, defaults to [64, 64, 128, 256, 512]) — Dimensionality (hidden size) at each stage.
- batch_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the batch normalization layers.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- out_features (list[str], optional) — If used as backbone, list of features to output. Can be any of "stem", "stage1", "stage2", etc. (depending on how many stages the model has). If unset and out_indices is set, will default to the corresponding stages. If unset and out_indices is unset, will default to the last stage.
- out_indices (list[int], optional) — If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how many stages the model has). If unset and out_features is set, will default to the corresponding stages. If unset and out_features is unset, will default to the last stage.
This is the configuration class to store the configuration of a TextNetModel. It is used to instantiate a TextNet model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of czczup/textnet-base. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
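To see what the default strides imply, the sketch below multiplies them out stage by stage and reports the resulting feature map size for the default 640×640 input. It is plain arithmetic on the documented defaults, not a call into the library. The helper name feature_map_sizes is hypothetical, and the code assumes that each stride-s layer divides the spatial size exactly by s.

```python
# Default strides taken from the TextNetConfig parameter list above.
STEM_STRIDE = 2
CONV_LAYER_STRIDES = [[1, 2, 1], [2, 1, 1, 1], [2, 1, 1, 1], [2, 1, 1, 1]]


def feature_map_sizes(image_size, stem_stride, stage_strides):
    """Return (name, total_stride, height, width) for the stem and each stage.

    Assumes every stride-s layer divides the spatial size exactly by s,
    which holds when the input size is a multiple of the total stride.
    """
    height, width = image_size
    total = stem_stride
    rows = [("stem", total, height // total, width // total)]
    for index, strides in enumerate(stage_strides, start=1):
        for stride in strides:
            total *= stride
        rows.append((f"stage{index}", total, height // total, width // total))
    return rows


rows = feature_map_sizes((640, 640), STEM_STRIDE, CONV_LAYER_STRIDES)
for name, total, height, width in rows:
    print(f"{name}: stride {total}, {height}x{width}")
```

With the defaults, the total stride doubles at every step, from 2 after the stem (320×320) to 32 after the last stage (20×20). That total is consistent with the image processor's default size_divisor of 32.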
TextNetImageProcessor
class transformers.TextNetImageProcessor
(do_resize: bool = True, size: typing.Optional[dict[str, int]] = None, size_divisor: int = 32, resample: Resampling = <Resampling.BILINEAR: 2>, do_center_crop: bool = False, crop_size: typing.Optional[dict[str, int]] = None, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, list[float], NoneType] = [0.485, 0.456, 0.406], image_std: typing.Union[float, list[float], NoneType] = [0.229, 0.224, 0.225], do_convert_rgb: bool = True, **kwargs)
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
- size (dict[str, int], optional, defaults to {"shortest_edge": 640}) — Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.
- size_divisor (int, optional, defaults to 32) — Ensures height and width are rounded to a multiple of this value after resizing.
- resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
- do_center_crop (bool, optional, defaults to False) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
- crop_size (dict[str, int], optional, defaults to 224) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
- rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
- do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
- image_mean (float or list[float], optional, defaults to [0.485, 0.456, 0.406]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
- image_std (float or list[float], optional, defaults to [0.229, 0.224, 0.225]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
- do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
Constructs a TextNet image processor.
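The interaction between size and size_divisor can be sketched with a little arithmetic. The helper below, resized_shape, is hypothetical and only mirrors the documented behaviour. It assumes that the long edge is truncated to an integer and that both edges are rounded up to the next multiple of size_divisor, so the library's exact rounding may differ.

```python
def resized_shape(height, width, shortest_edge=640, size_divisor=32):
    """Hypothetical helper mirroring the documented resize behaviour.

    Assumptions: the long edge is truncated to an integer, and both edges
    are rounded UP to the next multiple of size_divisor.
    """
    short, long = (height, width) if height <= width else (width, height)
    new_short = shortest_edge
    new_long = int(shortest_edge * long / short)  # keep the aspect ratio
    new_height, new_width = (
        (new_short, new_long) if height <= width else (new_long, new_short)
    )

    def round_up(value):
        remainder = value % size_divisor
        return value if remainder == 0 else value + size_divisor - remainder

    return round_up(new_height), round_up(new_width)


print(resized_shape(480, 640))   # landscape image
print(resized_shape(1000, 500))  # portrait image
```

Under these assumptions a 480×640 image becomes 640×864: the long edge scales to 853 and is then padded up to the next multiple of 32.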
preprocess
(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], do_resize: typing.Optional[bool] = None, size: typing.Optional[dict[str, int]] = None, size_divisor: typing.Optional[int] = None, resample: typing.Optional[PIL.Image.Resampling] = None, do_center_crop: typing.Optional[bool] = None, crop_size: typing.Optional[int] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, list[float], NoneType] = None, image_std: typing.Union[float, list[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs)
Parameters
- images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
- size (dict[str, int], optional, defaults to self.size) — Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.
- size_divisor (int, optional, defaults to 32) — Ensures height and width are rounded to a multiple of this value after resizing.
- resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
- crop_size (dict[str, int], optional, defaults to self.crop_size) — Size of the center crop. Only has an effect if do_center_crop is set to True.
- do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
- rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
- image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
- return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of np.ndarray.
  - TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  - TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
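The rescaling and normalization steps amount to simple per-channel arithmetic, sketched here for a single RGB pixel using the documented defaults. The function rescale_and_normalize is a hypothetical stand-in written for illustration, not the processor's own code. It also shows why do_rescale=False is needed when the input is already in the 0 to 1 range.

```python
IMAGE_MEAN = [0.485, 0.456, 0.406]  # defaults from the parameter list
IMAGE_STD = [0.229, 0.224, 0.225]
RESCALE_FACTOR = 1 / 255


def rescale_and_normalize(pixel, do_rescale=True):
    """Apply rescaling and per-channel normalization to one RGB pixel."""
    values = [c * RESCALE_FACTOR for c in pixel] if do_rescale else list(pixel)
    return [(v - m) / s for v, m, s in zip(values, IMAGE_MEAN, IMAGE_STD)]


# A white pixel given as 0-255 integers and as 0-1 floats.
white_uint8 = rescale_and_normalize((255, 255, 255))
white_float = rescale_and_normalize((1.0, 1.0, 1.0), do_rescale=False)
black = rescale_and_normalize((0, 0, 0))
print(white_uint8)
print(white_float)
print(black)
```

Both white pixels map to the same normalized values. Rescaling the 0 to 1 input a second time would instead shrink it by another factor of 255.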
TextNetImageProcessorFast
class transformers.TextNetImageProcessorFast
(**kwargs: typing_extensions.Unpack[transformers.models.textnet.image_processing_textnet.TextNetImageProcessorKwargs])
Constructs a fast TextNet image processor.
preprocess
(images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']], **kwargs: typing_extensions.Unpack[transformers.models.textnet.image_processing_textnet.TextNetImageProcessorKwargs]) → transformers.image_processing_base.BatchFeature
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
- do_resize (bool, optional) — Whether to resize the image.
- size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
- crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
- resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional) — Whether to rescale the image.
- rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool, optional) — Whether to normalize the image.
- image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
- pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
- do_center_crop (bool, optional) — Whether to center crop the image.
- data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
- input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  - "none" or ChannelDimension.NONE: image in (height, width) format.
- device (Annotated[str, None], optional) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
- size_divisor (int) — The size by which to make sure both the height and width can be divided.
Returns
transformers.image_processing_base.BatchFeature
- data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
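The idea behind the disable_grouping option, batching together images that share a size, can be sketched in plain Python. The helper group_by_shape is hypothetical and only illustrates the bookkeeping. The library's own implementation operates on tensors and is not reproduced here.

```python
from collections import defaultdict


def group_by_shape(shapes):
    """Hypothetical sketch: map each (height, width) to the image indices sharing it."""
    groups = defaultdict(list)
    for index, shape in enumerate(shapes):
        groups[shape].append(index)
    return dict(groups)


# Four images, two of which have the same size and could be processed as one batch.
shapes = [(640, 864), (640, 640), (640, 864), (1280, 640)]
groups = group_by_shape(shapes)
print(groups)
```

Keeping the original indices makes it possible to restore the input order after each group has been processed.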
TextNetModel
class transformers.TextNetModel
(config)
Parameters
- config (TextNetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare TextNet model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(pixel_values: Tensor, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using TextNetImageProcessor. See TextNetImageProcessor.__call__() for details (processor_class uses TextNetImageProcessor for processing images).
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TextNetConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state after a pooling operation on the spatial dimensions.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
The TextNetModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
TextNetForImageClassification
class transformers.TextNetForImageClassification
(config)
Parameters
- config (TextNetConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
TextNet model with an image classification head on top (a linear layer on top of the pooled features), e.g. for ImageNet.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(pixel_values: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None) → transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using TextNetImageProcessor. See TextNetImageProcessor.__call__() for details (processor_class uses TextNetImageProcessor for processing images).
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss). If config.num_labels > 1, a classification loss is computed (Cross-Entropy).
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)
A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (TextNetConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.
The TextNetForImageClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
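The rule for labels described above (a mean-square loss when config.num_labels is 1, a cross-entropy loss otherwise) can be written out in plain Python. This is a minimal sketch under that description. The function classification_loss is hypothetical and works on Python lists, whereas the model itself computes the loss on tensors.

```python
import math


def classification_loss(logits, labels, num_labels):
    """Hypothetical sketch of the loss choice described for labels.

    logits: list of per-example score lists; labels: list of targets.
    Returns the mean loss over the batch.
    """
    if num_labels == 1:
        # Regression: mean squared error between the single score and the target.
        errors = [(row[0] - target) ** 2 for row, target in zip(logits, labels)]
        return sum(errors) / len(errors)
    # Classification: cross-entropy = log-sum-exp(scores) - score of the true class.
    losses = []
    for row, target in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(score - peak) for score in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


print(classification_loss([[0.0, 0.0]], [1], num_labels=2))  # two equal scores
print(classification_loss([[2.5]], [2.0], num_labels=1))     # regression case
```

With two equal scores the cross-entropy is log(2), about 0.693. The regression case gives a squared error of 0.25.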
Examples:
>>> import torch
>>> import requests
>>> from transformers import TextNetForImageClassification, TextNetImageProcessor
>>> from PIL import Image

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> processor = TextNetImageProcessor.from_pretrained("czczup/textnet-base")
>>> model = TextNetForImageClassification.from_pretrained("czczup/textnet-base")

>>> inputs = processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.logits.shape
torch.Size([1, 2])
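Because the logits are scores before SoftMax, turning them into a prediction takes one more step. The sketch below applies a softmax and picks the highest-probability class using only the standard library. The logit values are made up for illustration and are not real model output.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for one image and two classes (not real model output).
example_logits = [0.3, -1.2]
probabilities = softmax(example_logits)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(probabilities, predicted_class)
```

With real tensors the same result would usually be obtained with the tensor library's own softmax and argmax operations. The list-based version here only spells out the arithmetic.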