feat: support logit bias in chat request #3186


Open

drbh wants to merge 14 commits into main from support-logit-bias-in-chat

Conversation

drbh (Collaborator):

This PR adds support for the previously unused `logit_bias` parameter in a chat request.

Example request without `logit_bias`, using Qwen/Qwen2-VL-2B-Instruct:

```shell
curl http://localhost:3000/v1/chat/completions -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "seed": 42,
    "max_tokens": 10,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "say Hello"
                }
            ]
        }
    ]
}'
```

Response:

```json
{
  "object": "chat.completion",
  "id": "",
  "created": 1745338432,
  "model": "Qwen/Qwen2-VL-2B-Instruct",
  "system_fingerprint": "3.2.3-dev0-native",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 10,
    "total_tokens": 31
  }
}
```

With `logit_bias` specified (specifically, the token ID for "Hello" with a large negative bias, to avoid generating it):

```shell
curl http://localhost:3000/v1/chat/completions -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "seed": 42,
    "max_tokens": 10,
    "logit_bias": {
        "9707": -100
    },
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "say Hello"
                }
            ]
        }
    ]
}'
```

It returns a different response, which happens to be a nice greeting in Spanish:

```json
{
  "object": "chat.completion",
  "id": "",
  "created": 1745338592,
  "model": "Qwen/Qwen2-VL-2B-Instruct",
  "system_fingerprint": "3.2.3-dev0-native",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "¡Hola! ¿Cómo puedo ayudarte?"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 10,
    "total_tokens": 31
  }
}
```

Important

This PR contains breaking changes, as the `logit_bias` type has changed from a list to a map.

drbh (Collaborator, Author):

Pinging @hanouticelina regarding changes to the hub library (huggingface/huggingface_hub#2724) once this is merged.

Comment on lines 267 to 270

```python
# Initialize with empty logit biases if none provided
if logit_biases is None:
    logit_biases = [None] * len(do_sample)
```

Narsil (Collaborator):

Can we do like the other arguments, and just send everything initialized instead of `None`?

drbh (Collaborator, Author):

yep that makes sense, updated in latest commit. Thanks!

```python
        self.tokenizer = tokenizer
        self.logit_bias = logit_bias
```
Narsil (Collaborator):

This is not necessary; once we have the processor we should let go of the other object (think of this as `logit_bias` taking ownership).

drbh (Collaborator, Author):

oo yea removed 🙏

Comment on lines 646 to 651

```python
for token_str, bias_value in self.logit_biases.items():
    # Get token ID, either from cache or by computing it
    if token_str not in self.token_id_mapping:
        if token_str.isdigit():
            # If the token string is already a numeric ID
            token_id = int(token_str)
        else:
            # Otherwise, use the tokenizer to get the ID
            tokens = self.tokenizer.encode(token_str, add_special_tokens=False)
            token_id = tokens[0] if tokens else -1  # Use -1 for not found

        self.token_id_mapping[token_str] = token_id

    token_id = self.token_id_mapping[token_str]

    # Apply bias if token ID is valid
    if 0 <= token_id < scores.size(-1):
        scores[:, token_id] += bias_value
```

Narsil (Collaborator):

This implementation is too slow; the `logit_bias` must be a precalculated tensor that we just add to the scores.

drbh (Collaborator, Author):

Updated, along with a much better `LogitBiasProcessor`. The bias tensor is now created in `__init__` and simply added via `add_` in `__call__`. Thanks 🙏
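For illustration, here is a minimal numpy sketch of that precomputed-bias idea (a hypothetical stand-in for the PR's torch-based `LogitBiasProcessor`; class and variable names here are made up): the bias vector is built once in `__init__`, so each call is a single vectorized addition.

```python
import numpy as np

# Hypothetical sketch of the precomputed-bias idea; the PR's actual
# processor works on torch tensors and applies the bias with add_.
class PrecomputedLogitBias:
    def __init__(self, logit_biases, vocab_size):
        # Build one dense (vocab_size,) bias vector up front.
        self.bias = np.zeros(vocab_size, dtype=np.float32)
        for token_id, value in logit_biases.items():
            self.bias[token_id] = value

    def __call__(self, scores):
        # Single in-place addition per step, no per-token Python loop.
        scores += self.bias
        return scores

proc = PrecomputedLogitBias({2: -100.0}, vocab_size=5)
scores = np.ones((1, 5), dtype=np.float32)
out = proc(scores)
print(out[0].tolist())  # [1.0, 1.0, -99.0, 1.0, 1.0]
```

The payoff is that the per-step cost no longer depends on how many tokens are biased.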

Comment on lines 642 to 643

```python
        if not self.logit_biases:
            return scores
```
Narsil (Collaborator):

We should never be in that case, since we're only adding the processor when it's not empty; I'd happily switch to an assert here.

drbh (Collaborator, Author):

good point, updated along with logit bias processor changes.

```python
        self, logit_biases: Optional[dict], tokenizer: PreTrainedTokenizerBase
    ):
        self.tokenizer = tokenizer
        self.logit_biases = logit_biases or {}
```
Narsil (Collaborator):

Suggested change:

```diff
-        self.logit_biases = logit_biases or {}
+        self.logit_biases = logit_biases
```

drbh (Collaborator, Author):

updated along with logit bias processor changes

```python
        self.logit_biases = logit_biases or {}

        # Pre-compute token IDs for each token string
        self.token_id_mapping = {}
```
Narsil (Collaborator):

Where is the pre-computing?

drbh (Collaborator, Author):

updated along with logit bias processor changes

```python
            if 0 <= token_id < scores.size(-1):
                scores[i, token_id] += bias_value

        return scores
```
Narsil (Collaborator):

Same here: no for-loop; a single tensor addition should be doable.

drbh (Collaborator, Author):

updated along with logit bias processor changes

```diff
@@ -125,6 +136,7 @@ def from_pb(
     tokenizer=tokenizer,
     grammar=pb.grammar,
     grammar_type=pb.grammar_type,
+    logit_bias=dict(pb.logit_bias) if pb.logit_bias else None,
```
Narsil (Collaborator):

Why is `pb.logit_bias` not possible? It would maintain consistency better.

drbh (Collaborator, Author):

oh good catch, simplified in the latest commit

```diff
@@ -500,6 +530,9 @@ def from_pb(
     fsm_grammar_states=(
         fsm_grammar_states if fsm_grammar_states else [0] * len(pb)
     ),
+    logit_biases=[
+        dict(pb_.logit_bias) if pb_.logit_bias else None for pb_ in pb
```
Narsil (Collaborator):

Same here.

drbh (Collaborator, Author):

simplified in the latest commit

Comment on lines 709 to 717

```python
if token_str.isdigit():
    # If the token string is already a numeric ID
    token_id = int(token_str)
else:
    # Otherwise, use the tokenizer to get the ID
    tokens = self.tokenizer.encode(
        token_str, add_special_tokens=False
    )
    token_id = tokens[0] if tokens else -1  # Use -1 for not found
```
Narsil (Collaborator):

We should do the sanitation much earlier, way up in the Rust code; we also have the tokenizer there, and we can reject requests that contain an invalid `logit_bias` early, without having to encode or fail here.

drbh (Collaborator, Author):

Great point. I've removed the extra checking logic from the Python side; the request is now rejected early if any values are invalid (token ID not in vocab range).
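As an illustration of that early rejection, here is a hypothetical Python analogue (the real validation lives in the Rust router; the function name and error strings here are made up):

```python
# Hypothetical analogue of the router-side check: every logit_bias key
# must be a numeric token ID inside the vocab range, otherwise the whole
# request is rejected before reaching the model server.
def validate_logit_bias(logit_bias, vocab_size):
    validated = {}
    for token_str, value in logit_bias.items():
        if not token_str.isdigit():
            raise ValueError(f"Token ID {token_str} is not a valid number.")
        token_id = int(token_str)
        if token_id >= vocab_size:
            raise ValueError(
                f"Token ID {token_id} is out of range. "
                f"Must be between 0 and {vocab_size - 1}."
            )
        validated[token_id] = value
    return validated

print(validate_logit_bias({"9707": -100}, vocab_size=151936))  # {9707: -100}
```

Validating up front means the Python processor can assume every ID is already in range.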

drbh requested a review from Narsil on April 29, 2025 at 15:35
drbh force-pushed the support-logit-bias-in-chat branch from 88010ba to 2b996b0 on April 30, 2025 at 14:47
Narsil (Collaborator) left a comment:

Looks much better, I think I found a flaw in the actual computation, but other than that it looks great.

```rust
        .collect(),
    )
}
_ => None,
```
Narsil (Collaborator):

Suggested change:

```diff
-    _ => None,
+    None => None,
```

For readability: having all variants explicit makes it easier to know there are no shenanigans if someone changes the values.

Using `parameters.logit_bias.map(|bias| {....})` is another option.

drbh (Collaborator, Author):

Great points, and I totally agree that a `map` is cleaner here. Updated in the latest changes with a simpler `map`-based solution.

Comment on lines 406 to 430
Some(bias) if !bias.is_empty() => {
for (token_str, _) in bias.iter() {
let token_id = token_str.parse::<u32>().map_err(|_| {
ValidationError::LogitBiasInvalid(format!(
"Token ID {} is not a valid number.",
token_str
))
})?;

if token_id >= self.vocab_size {
return Err(ValidationError::LogitBiasInvalid(format!(
"Token ID {} is out of range. Must be between 0 and {}.",
token_id,
self.vocab_size - 1
)));
}
}

// Transform into the required format
Some(
bias.iter()
.map(|(k, v)| (k.parse::<u32>().unwrap(), *v as f32))
.collect(),
)
}
Narsil (Collaborator):

Didn't you accept actual strings that are tokens before? I'm fine with this version, but it seems different than before; just making sure it wasn't lost in translation.

I think the code can be simplified a bit:

```rust
let logit_bias = request.parameters.logit_bias.map(|bias| {
    let bias: Result<Vec<_>, _> = bias
        .into_iter()
        .map(|(token_str, value)| {
            let token_id: u32 = token_str.parse().map_err(...)?;
            if token_id > self.vocab_size {
                ....
            }
            Ok((token_id, value))
        })
})
```

The current code is fine, but we could most likely remove some unwrapping and the double looping.

drbh (Collaborator, Author):

The intention is to accept token IDs as strings, not the token values.

I believe the early changes included some token encoding/decoding that is now removed, which is an improvement.

Additionally, I've simplified the logic to avoid the unwrap and the extra loop in the latest changes.

Comment on lines 703 to 705

```python
self.bias_matrix = torch.nn.functional.pad(
    self.bias_matrix, (0, scores.shape[1] - self.bias_matrix.shape[1])
)
```
Narsil (Collaborator):

Why do we need the padding? I'm surprised here.

It seems to me that `self.bias_matrix` is (BS, VOCAB), while `scores` is (SEQ_LENGTHS, VOCAB).

In the most common scenario (decode), it's easy, as BS == SEQ_LENGTHS.

But in prefill, and mixed prefill + decode, by using pad you're effectively spilling the bias_matrix onto other users, no?
It seems to me we need cu_seqlengths (or whatever we currently have) in order to expand (probably using https://pytorch.org/docs/stable/generated/torch.repeat_interleave.html#torch.repeat_interleave).

Again, ideally we do this at init time, not at call time (so the `__call__` function is literally just an add).
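To make that expansion concrete, here is a toy numpy sketch (hypothetical shapes and values; the real fix would use `torch.repeat_interleave` over the batch's sequence lengths):

```python
import numpy as np

# Per-request bias matrix, shape (BS, VOCAB). Request 0 biases token 1,
# request 1 biases token 0. Values are made up for illustration.
bias_matrix = np.array([[0.0, -100.0],
                        [5.0, 0.0]], dtype=np.float32)

# Mixed batch: request 0 is prefilling 3 positions, request 1 decodes 1.
seq_lengths = [3, 1]

# Repeat each request's bias row once per position it owns, giving a
# (SUM_SEQ_LENGTHS, VOCAB) matrix, so no request's bias spills onto
# another user's rows.
expanded = np.repeat(bias_matrix, seq_lengths, axis=0)
print(expanded.shape)  # (4, 2)
```

With this layout, `scores += expanded` lines up row-for-row with a (SUM_SEQ_LENGTHS, VOCAB) scores tensor.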

drbh (Collaborator, Author), May 5, 2025 (edited):

I was running into an error that appears to be a bug in the configuration of Qwen/Qwen2-VL-2B-Instruct (used in the test), where the vocab size returned on `Qwen2TokenizerFast` is not the correct size (151643 instead of 151936): https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/blob/main/config.json.

The correct vocab size is set downstream within the custom modeling code, but this is not accessible by the logit processor, so I've added a hacky patch that sets `_vocab_size` on the tokenizer if the vocab size after loading does not match `tokenizer.vocab_size`.

This solution feels hacky and obtuse, yet it reliably resolves the issue. Any ideas on a cleaner approach?

Aside from this, I've removed the padding step, and now the forward is simply an `add_`.

drbh force-pushed the support-logit-bias-in-chat branch from a174f63 to 7659925 on May 5, 2025 at 21:34