Commit 5a4a08a

Merge branch 'main' into kernel_mapping_error_resolve

2 parents: 04e27cb + d08b98b

File tree

1,157 files changed: +30,524 −65,638 lines changed


.github/workflows/get-pr-info.yml

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ on:
         description: "The sha of the merge commit for the pull request (created by GitHub) in the base repository"
         value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_SHA }}
       PR_MERGE_COMMIT_BASE_SHA:
-        description: "The sha of the parent commit of the the merge commit on the target branch in the base repository"
+        description: "The sha of the parent commit of the merge commit on the target branch in the base repository"
         value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_BASE_SHA }}
       PR_HEAD_COMMIT_DATE:
         description: "The date of the head sha of the pull request branch in the head repository"

.github/workflows/self-comment-ci.yml

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ env:
 jobs:
   get-pr-number:
     name: Get PR number
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap", "3outeille"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
     uses: ./.github/workflows/get-pr-number.yml

   get-pr-info:
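The `if:` expression above is dense. As a rough Python rendering of the same gate (illustrative only, not code from the repository), the condition amounts to:

```python
# Hypothetical Python rendering of the workflow's gating expression, for illustration only.
ALLOWED_ACTORS = {
    "ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik",
    "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu",
    "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap", "3outeille",  # "3outeille" is the new entry
}


def should_trigger(issue_state: str, actor: str, comment_body: str) -> bool:
    """Mirror of the workflow condition: open issue, allowlisted actor, run-slow style command."""
    return (
        issue_state == "open"
        and actor in ALLOWED_ACTORS
        and comment_body.startswith(("run-slow", "run slow", "run_slow"))
    )
```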

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions

@@ -125,9 +125,9 @@ If you're contributing a **vision-language model** (or any multimodal model that
 All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

 - Use the CLI, [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started
-- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well. [Modular guide](./modular_transformers#implementing-a-modular-file) shows a quick way to set up a modular file.
+- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well. [Modular guide](./docs/source/en/modular_transformers.md#implementing-a-modular-file) shows a quick way to set up a modular file.
 - Reuse existing patterns from similar models as much as possible
-- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](./transformers_as_backend#multimodal-models)
+- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](./docs/source/en/transformers_as_backend.md#multimodal-models)

 To verify your modular file is correct, run:
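For context on what this checklist asks for: a modular file is typically a thin subclassing layer that the converter expands into full modeling code. A minimal hypothetical sketch, assuming Llama as the donor architecture (the model name is invented for illustration):

```python
# modular_mymodel.py -- hypothetical minimal modular file, assuming Llama as the base.
# The modular converter expands this into a full modeling_mymodel.py.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaModel


class MyModelConfig(LlamaConfig):
    model_type = "mymodel"


class MyModelModel(LlamaModel):
    pass


class MyModelForCausalLM(LlamaForCausalLM):
    pass
```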
MIGRATION_GUIDE_V5.md

Lines changed: 485 additions & 0 deletions

Large diffs are not rendered by default.

README.md

Lines changed: 1 addition & 1 deletion

@@ -134,7 +134,7 @@ pipeline("the secret to baking a really good cake is ")
 To chat with a model, the usage pattern is the same. The only difference is you need to construct a chat history (the input to `Pipeline`) between you and the system.

 > [!TIP]
-> You can also chat with a model directly from the command line.
+> You can also chat with a model directly from the command line, as long as [`transformers serve` is running](https://huggingface.co/docs/transformers/main/en/serving).
 > ```shell
 > transformers chat Qwen/Qwen2.5-0.5B-Instruct
 > ```
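The chat-history input mentioned in the context line above can also be driven from the regular `pipeline` API; a minimal sketch reusing the checkpoint from the tip (the prompt text is invented for illustration):

```python
from transformers import pipeline

# Text-generation pipelines accept a chat history (a list of role/content messages) directly.
chat = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the secret to baking a really good cake?"},
]

result = chat(history, max_new_tokens=64)
# For chat inputs, generated_text holds the full conversation, ending with the assistant's reply.
print(result[0]["generated_text"][-1]["content"])
```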

benchmark_v2/framework/benchmark_config.py

Lines changed: 37 additions & 37 deletions

@@ -2,9 +2,10 @@
 import itertools
 import json
 import logging
+from functools import lru_cache
 from typing import Any

-from transformers.utils.import_utils import is_flash_attn_2_available
+from transformers.utils.import_utils import is_flash_attn_2_available, is_kernels_available


 KERNELIZATION_AVAILABLE = False
@@ -18,17 +19,36 @@
 logger = logging.getLogger(__name__)


+@lru_cache
+def is_fa2_or_kernel_available() -> bool:
+    """Returns True if the flash_attn_2 or a fallback kernel is available"""
+    # Early return if flash_attn_2 is available
+    if is_flash_attn_2_available():
+        return True
+    # Early return if kernels is not available
+    if not is_kernels_available():
+        logger.warning(
+            "flash_attention_2 is not available. kernels is not installed. Benchmarking flash_attention_2 will not "
+            "be possible."
+        )
+        return False
+    # If kernels is available, try to get the flash_attn_2 kernel
+    try:
+        from kernels import get_kernel
+
+        get_kernel("kernels-community/flash-attn")
+    except Exception as _:
+        logger.warning(
+            "flash_attention_2 is not available. kernels is installed, but the flash_attn kernel is not available."
+            "Benchmarking flash_attention_2 will not be possible."
+        )
+        return False
+
+
 class BenchmarkConfig:
     """Configuration for a single benchmark scenario."""

-    all_attn_implementations = [
-        ("flash_attention_2", None),
-        ("eager", None),
-        ("sdpa", "math"),
-        ("sdpa", "flash_attention"),
-        ("flex_attention", None),
-    ]
-
+    all_attn_implementations = ["flash_attention_2", "eager", "sdpa", "flex_attention"]
     all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]

     def __init__(
@@ -41,7 +61,6 @@ def __init__(
         sequence_length: int = 128,
         num_tokens_to_generate: int = 128,
         attn_implementation: str = "eager",
-        sdpa_backend: str | None = None,
         compile_mode: str | None = None,
         compile_options: dict[str, Any] | None = None,
         kernelize: bool = False,
@@ -59,7 +78,6 @@ def __init__(
         self.num_tokens_to_generate = num_tokens_to_generate
         # Generation parameters
         self.attn_implementation = attn_implementation
-        self.sdpa_backend = sdpa_backend
         # Optimization parameters
         self.compile_mode = compile_mode
         self.compile_options = compile_options if compile_options is not None else {}
@@ -75,34 +93,21 @@ def check_validity(self, skip_validity_check: bool = False) -> None:
         if skip_validity_check:
             return
         # Check FA is installed
-        if self.attn_implementation == "flash_attention_2" and not is_flash_attn_2_available():
-            logger.warning(
-                "Flash attention does not support compile mode. Defaulting to SDPA w/ flash attention backend."
-            )
+        is_fa = self.attn_implementation == "flash_attention_2"
+        if is_fa and not is_fa2_or_kernel_available():
+            logger.warning("Flash attention is not available. Defaulting to SDPA.")
             self.attn_implementation = "sdpa"
-            self.sdpa_backend = "flash_attention"
         # Flash attention does not support compile mode, so we turn it off  # FIXME: it would be better to support it
-        is_fa = self.attn_implementation == "flash_attention_2"
-        is_fa |= self.attn_implementation == "sdpa" and self.sdpa_backend == "flash_attention"
-        if is_fa:
+        if is_fa and self.compile_mode is not None:
             logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
             self.compile_mode = None
-        # Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
-        if self.attn_implementation == "sdpa" and self.sdpa_backend is None:
-            default_backend = "flash_attention"  # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
-            logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
-            self.sdpa_backend = default_backend
+        # Handle continuous batching cases
         if self.continuous_batching:
             if self.attn_implementation == "flex_attention":
                 logger.error(
-                    "disabling continuous batching because of invalid configuration: flex attention is not supported"
+                    "Disabling continuous batching because of invalid configuration: flex attention is not supported."
                 )
                 self.continuous_batching = False
-            elif self.attn_implementation == "sdpa" and self.sdpa_backend is not None:
-                logger.warning(
-                    "when continuous batching is enabled, sdpa_backend must be None because of the attention mask, setting it to None"
-                )
-                self.sdpa_backend = "math"

     @property
     def hash(self) -> str:
@@ -115,7 +120,6 @@ def infer_name(self, compact: bool = True) -> str:
             gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
             dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
             attn_code = self.attn_implementation
-            attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
             compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
             kernelize_str = "kernelized" if self.kernelize else "unkernelized"
             continuous_batching_str = "cb" if self.continuous_batching else "generate"
@@ -125,7 +129,6 @@ def infer_name(self, compact: bool = True) -> str:
             gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
             dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
             attn_code = f"{self.attn_implementation} attention"
-            attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
             compile_str = "compiled" if self.compile_mode is not None else "not compiled"
             kernelize_str = "kernelized" if self.kernelize else "not kernelized"
             continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
@@ -145,7 +148,6 @@ def to_dict(self) -> dict[str, Any]:
             "sequence_length": self.sequence_length,
             "num_tokens_to_generate": self.num_tokens_to_generate,
             "attn_implementation": self.attn_implementation,
-            "sdpa_backend": self.sdpa_backend,
             "compile_mode": self.compile_mode,
             "compile_options": self.compile_options | {},  # to avoid inplace modification of the original dict
             "kernelize": self.kernelize,
@@ -162,7 +164,6 @@ def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
             sequence_length=data.get("sequence_length", 128),
             num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
             attn_implementation=data.get("attn_implementation", "eager"),
-            sdpa_backend=data.get("sdpa_backend"),
             compile_mode=data.get("compile_mode"),
             compile_options=data.get("compile_options"),
             kernelize=data.get("kernelize", False),
@@ -213,7 +214,7 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
     configs = []
     # Early return if level is greater than 3: we generate all combinations of configs, maybe even w/ all compile modes
     if level >= 3:
-        for attn_implementation, sdpa_backend in BenchmarkConfig.all_attn_implementations:
+        for attn_implementation in BenchmarkConfig.all_attn_implementations:
             # Usually there is not much to gain by compiling with other modes, but we allow it for level 4
             compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
             for cm in compile_modes:
@@ -222,7 +223,6 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
                 configs.append(
                     BenchmarkConfig(
                         attn_implementation=attn_implementation,
-                        sdpa_backend=sdpa_backend,
                         compile_mode=cm,
                         kernelize=kernelize_on,
                         continuous_batching=cb_on,
@@ -240,5 +240,5 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
     configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_mode="default"))
     configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", kernelize=True))
     configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
-    configs.append(BenchmarkConfig(attn_implementation="paged|sdpa", continuous_batching=True))
+    configs.append(BenchmarkConfig(attn_implementation="sdpa", continuous_batching=True))
     return configs
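A short usage sketch of the refactored config, based only on the names visible in the diff above; it assumes the benchmark_v2 package is importable from the repository root, and exact constructor defaults may differ:

```python
# Usage sketch for the refactored BenchmarkConfig; all names come from the diff above.
from benchmark_v2.framework.benchmark_config import (
    BenchmarkConfig,
    get_config_by_level,
    is_fa2_or_kernel_available,
)

# The @lru_cache-decorated probe runs at most once per process, so repeated checks stay cheap.
print(is_fa2_or_kernel_available())

# sdpa_backend is gone: the attention implementation string alone identifies the variant.
cfg = BenchmarkConfig(attn_implementation="flash_attention_2", compile_mode="default")
cfg.check_validity()  # may downgrade to sdpa and/or drop compile_mode, per the logic above
print(cfg.attn_implementation, cfg.compile_mode)

# Level-based presets now iterate over plain strings instead of (implementation, backend) tuples.
for config in get_config_by_level(3):
    print(config.infer_name(compact=True))
```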
