This PR contains the following updates:

transformers: ==4.38.0 -> ==4.50.0

GitHub Vulnerability Alerts

CVE-2024-11392

Hugging Face Transformers MobileViTV2 Deserialization of Untrusted Data Remote Code Execution Vulnerability. This vulnerability allows remote attackers to execute arbitrary code on affected installations of Hugging Face Transformers. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file.
The specific flaw exists within the handling of configuration files. The issue results from the lack of proper validation of user-supplied data, which can result in deserialization of untrusted data. An attacker can leverage this vulnerability to execute code in the context of the current user. Was ZDI-CAN-24322.

CVE-2024-11394

Hugging Face Transformers Trax Model Deserialization of Untrusted Data Remote Code Execution Vulnerability. This vulnerability allows remote attackers to execute arbitrary code on affected installations of Hugging Face Transformers. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file.
The specific flaw exists within the handling of model files. The issue results from the lack of proper validation of user-supplied data, which can result in deserialization of untrusted data. An attacker can leverage this vulnerability to execute code in the context of the current user. Was ZDI-CAN-25012.

CVE-2024-11393

Hugging Face Transformers MaskFormer Model Deserialization of Untrusted Data Remote Code Execution Vulnerability. This vulnerability allows remote attackers to execute arbitrary code on affected installations of Hugging Face Transformers. User interaction is required to exploit this vulnerability in that the target must visit a malicious page or open a malicious file.
The specific flaw exists within the parsing of model files. The issue results from the lack of proper validation of user-supplied data, which can result in deserialization of untrusted data. An attacker can leverage this vulnerability to execute code in the context of the current user. Was ZDI-CAN-25191.

CVE-2024-12720

A Regular Expression Denial of Service (ReDoS) vulnerability was identified in the huggingface/transformers library, specifically in the file tokenization_nougat_fast.py. The vulnerability occurs in the post_process_single() function, where a regular expression processes specially crafted input. The issue stems from the regex exhibiting exponential time complexity under certain conditions, leading to excessive backtracking. This can result in significantly high CPU usage and potential application downtime, effectively creating a Denial of Service (DoS) scenario. The affected version is v4.46.3.

CVE-2025-1194

A Regular Expression Denial of Service (ReDoS) vulnerability was identified in the huggingface/transformers library, specifically in the file tokenization_gpt_neox_japanese.py of the GPT-NeoX-Japanese model. The vulnerability occurs in the SubWordJapaneseTokenizer class, where regular expressions process specially crafted inputs. The issue stems from a regex exhibiting exponential complexity under certain conditions, leading to excessive backtracking. This can result in high CPU usage and potential application downtime, effectively creating a Denial of Service (DoS) scenario. The affected version is v4.48.1 (latest).

CVE-2025-2099

A vulnerability in the preprocess_string() function of the transformers.testing_utils module in huggingface/transformers version v4.48.3 allows for a Regular Expression Denial of Service (ReDoS) attack. The regular expression used to process code blocks in docstrings contains nested quantifiers, leading to exponential backtracking when processing input with a large number of newline characters. An attacker can exploit this by providing a specially crafted payload, causing high CPU usage and potential application downtime, effectively resulting in a Denial of Service (DoS) scenario.
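All three ReDoS advisories describe the same failure mode: a regular expression with nested quantifiers that backtracks exponentially on input that almost matches. The snippet below is a minimal, self-contained illustration of that mechanism using a deliberately toy pattern; it is not the regex shipped in transformers.

```python
import re
import time

# Toy nested-quantifier pattern -- the classic ReDoS shape. This is NOT the
# expression used in transformers; it only demonstrates the mechanism.
pattern = re.compile(r"^(a+)+$")

for n in (16, 20, 24):
    payload = "a" * n + "!"  # almost matches, forcing the engine through ~2^n backtracking paths
    start = time.perf_counter()
    pattern.match(payload)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.3f}s")  # runtime roughly doubles with each extra 'a'
```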
Release Notes

huggingface/transformers (transformers)
Compare Source
Release v4.50.0
New Model Additions
Model-based releases
Starting with version v4.49.0, we have been doing model-based releases in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.
Unlike our software releases, these are not pushed to PyPI; they are kept on our GitHub. Each release has a tag attributed to it, such as:
v4.49.0-Gemma-3
v4.49.0-AyaVision
⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.
Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.
For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:
```
          o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
         /
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
         \
          o---- v4.49.0-AyaVision
```
We strive to merge model specific fixes on their respective branches as fast as possible!
Gemma 3

Gemma 3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed the given aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. The model also interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
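For orientation, here is a minimal inference sketch through the image-text-to-text pipeline. The checkpoint name (google/gemma-3-4b-it) and the test image URL are illustrative choices rather than anything mandated by this release.

```python
from transformers import pipeline

# Any Gemma 3 instruction-tuned checkpoint should work here; the 4B variant is used as an example.
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=40)
print(outputs)
```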
ShieldGemma 2
ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:
- No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault).
- No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).
- No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death).
We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.
Aya Vision
AyaVision is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.

The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as the language model.
Key features of Aya Vision include:
- Multimodal capabilities in 23 languages
- Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
- High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
- Seamless integration of visual and textual information in 23 languages.
Mistral 3.1
Mistral 3.1 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.
Smol VLM 2
SmolVLM2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
- It uses SmolLM2 for the text model.
- It supports multi-image and video inputs
SigLIP-2
SigLIP-2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relative to that model.

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin,
Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen,
Andreas Steiner and Xiaohua Zhai.
The model comes in two variants:
- FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
- NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)
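For context, a SigLIP 2 checkpoint can be exercised through the existing zero-shot image classification pipeline. The checkpoint name below is an assumption based on the Hub naming of the FixRes variants; swap in whichever SigLIP 2 checkpoint you intend to use.

```python
from transformers import pipeline

# Assumed FixRes SigLIP 2 checkpoint name; any siglip2 checkpoint should slot in here.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)
result = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["two cats resting on a couch", "a dog in a park", "an empty room"],
)
print(result)
```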
Prompt Depth Anything
PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language models (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

New tool: attention visualization
We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix="What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me")  # we should have sliding and non-sliding side by side
```

Deprecating transformers.agents in favor of smolagents
We are deprecating transformers.agents in favour of the smolagents library. Read more about smolagents here.
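For anyone migrating, the hello-world below follows the smolagents quickstart as it looked around this release; class names such as HfApiModel have been renamed in later smolagents versions, so treat this as a sketch rather than a pinned recipe.

```python
# pip install smolagents
from smolagents import CodeAgent, HfApiModel  # HfApiModel was the default model wrapper at the time

# add_base_tools=True equips the agent with smolagents' default toolbox.
agent = CodeAgent(tools=[], model=HfApiModel(), add_base_tools=True)
agent.run("How many seconds are there in a leap year?")
```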
Quantization
We support adding custom quantization methods by using the @register_quantization_config and @register_quantizer decorators:

```python
@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
    pass


@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
    pass


quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
```
AMD is developing its in-house quantizer named Quark, released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the Quark library:

```python
# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
```
Torchao is augmented with autoquant support, CPU quantization, as well as new AOBaseConfig object instances for more advanced configuration.
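As a rough sketch of the torchao path (assuming torchao is installed): the string-based quant_type below has been supported for a while, and on recent torchao versions an AOBaseConfig instance can be passed in its place for the more advanced configuration mentioned above.

```python
# pip install torchao
from transformers import AutoModelForCausalLM, TorchAoConfig

# int4 weight-only quantization generally targets GPU inference; an AOBaseConfig
# object can be passed instead of the string on recent torchao releases.
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=quant_config, torch_dtype="auto", device_map="auto"
)
```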
Tensor Parallelism implementation changes
At loading time, the parallelization is now applied module by module, so that no memory overhead is incurred beyond what the final weight distribution requires.
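A minimal sketch of what this looks like from the user side, assuming a multi-GPU node and a checkpoint that ships a tensor-parallel plan; the script would be launched with torchrun, e.g. `torchrun --nproc-per-node 4 tp_demo.py` (tp_demo.py being a hypothetical file name).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative checkpoint; any model with a TP plan works

# tp_plan="auto" shards each module across the torchrun processes as it is created,
# which is where the "no memory overhead at load time" property comes from.
model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallel loading shards modules as they are created.", return_tensors="pt")
outputs = model(inputs.input_ids.to(model.device))
print(outputs.logits.shape)
```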
Generation
This release includes two speed upgrades to generate:
- Assisted generation now works with ANY model as an assistant, even with do_sample=True;

```python
from transformers import pipeline
import torch

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

pipe = pipeline("text-generation", model=checkpoint, assistant_model=assistant_checkpoint, do_sample=True)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
```
- Beam search was vectorized, and should be significantly faster with a large num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time (see the sketch after this list).
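For reference, the vectorized beam search sits behind the same generate() arguments as before; a call of the following shape (small model, relatively large num_beams) is where the speedup is most visible. The checkpoint choice is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # a small model, where model.forward does not dominate the run time
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Alice and Bob", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=16, num_return_sequences=4, max_new_tokens=32)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```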
Documentation
A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation, making it much easier to navigate. Let us know what you think!
Notable repo maintenance
The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/
We have updated our flex attention support to bring it on par with our Flash Attention 2 support.
More models support flex attention now thanks to @qubvel:
- Refactor Attention implementation for ViT-based models by @qubvel in #36545
First integration of hub kernels for deformable detr!
Bugfixes and improvements
- EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
- LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
- /generation by @gante in #36235
- test_export_to_onnx by @gante in #36241
- test_fast_is_faster_than_slow by @ydshieh in #36240
- Speech2TextFeatureExtractor API by @KarelVesely84 in #34638
- pt_tf equivalence tests by @gante in #36253
- test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
- GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
- main by @ydshieh in #36375
- is_causal fail with compile by @Cyrilvallez in #36374
- benchmark.yml by @ydshieh in #36402
- CandidateGenerator by @keyboardAnt in #35029
- contents: write by @ydshieh in #36445
- torch.distributed-compatible DynamicCache by @gante in #36373
- src/transformers/image_utils.py by @hmellor in #36435
- hub_retry by @ydshieh in #36449
- TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
- input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
- DataCollatorForLanguageModeling by @capemox in #36457
- [HybridCache] disable automatic compilation by @gante in #36620
- make fix-copies by @gante in #36664
- from_pretrained by @Cyrilvallez in #36033
- meta device by @gante in #36543
- gc.collect() if only 1 shard is used by @gante in #36721
- test_eager_matches_sdpa_inference by @gante in #36650
- generation_config, overwrite default values with the model's base generation_config by @gante in #36684

Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Never, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.