
What does LLM Temperature Actually Mean?
At this point, I thought I knew what temperature means for an LLM. A lower temperature increases determinism, reducing the likelihood of hallucinations or inaccurate responses. Google's definition echoes this perception:
“The temperature controls the degree of randomness in token selection. The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Lower temperatures are good for prompts that require a more deterministic or less open-ended response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic, meaning that the highest probability response is always selected.”
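Mechanically, temperature divides the model's logits before the softmax that turns them into token probabilities: values below 1 sharpen the distribution toward the top token, values above 1 flatten it. Here's a minimal, illustrative sketch (real decoders combine this with the top_p/top_k filtering the quote mentions):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick a token index from raw logits, scaled by temperature."""
    if temperature == 0.0:
        # Degenerate case: greedy decoding. The argmax token is always
        # chosen, which is why temperature 0 is called deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide logits by T, then softmax. T < 1 sharpens the distribution;
    # T > 1 flattens it, giving low-probability tokens more of a chance.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

At temperature 0 this always returns the highest-logit index; crank the temperature up and the samples spread across the vocabulary.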
That makes sense - but if temperature is so straightforward, why are my tests with Gemini-1.5-Flash-002 so nonsensical?
We’ve been looking into adding what we’re calling “document-level metadata” in the TrustGraph extraction process. While we did add this feature in release 0.13.2, I had been evaluating using an LLM to extract important entities and topics for the entirety of a text corpus. I normally set the temperature to 0.0, since this should produce the most accurate extraction. I ran an extraction with Gemini-1.5-Flash-002. It looked good - except for one problem: I had accidentally set the temperature to 1.0. I reran it at 0.0, and the results were worse. What’s going on?
I’ve never run comparison tests with TrustGraph where I did nothing but vary the temperature, but I decided, why not? For a single document, I did 3 runs at each temperature, varying only the temperature across 0.0, 0.5, 1.0, 1.5, and 2.0. Yes, the temperature of Gemini goes to 2.0. No, I don’t know why. For the other parameters, I set top_p=1.0, top_k=40, and output tokens maxed out at 8192 for all runs. I also used a JSON schema object for the response type.
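The sweep above can be sketched as a set of Gemini generationConfig payloads (field names follow the Gemini REST API); the actual responseSchema object is omitted here, and only the temperature varies between runs:

```python
# The five temperatures tested, three runs each.
TEMPERATURES = [0.0, 0.5, 1.0, 1.5, 2.0]

def make_config(temperature):
    """Build one generationConfig payload; everything but temperature is fixed."""
    return {
        "temperature": temperature,
        "topP": 1.0,
        "topK": 40,
        "maxOutputTokens": 8192,
        # Constrain the model to JSON output, as in the runs above.
        # A "responseSchema" entry would accompany this in practice.
        "responseMimeType": "application/json",
    }

configs = [make_config(t) for t in TEMPERATURES]
```

Holding top_p at 1.0 means nucleus sampling never truncates the distribution, so any behavioral difference between runs should come from the temperature (and top_k) alone.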
Given my understanding of temperature, I expected Gemini to extract more information, returning more objects as the temperature increased. I would think a more deterministic response would be more conservative in how much information would be extracted. Except that didn’t happen. Then again, my hypothesis wasn’t really disproven either. In fact, I’m not sure what these results mean.
The first document I tested was the Rogers Commission Report on the NASA Challenger disaster. That PDF extracts to 176k tokens, 17.6% of Gemini-1.5-Flash’s advertised context window. For each run, here’s the number of output tokens:
The second document was another NASA report on the decision making of the Columbia disaster. That PDF extracts to 24.4k tokens, 2.4% of the advertised context window.
The inconsistency of the first set of test runs is inexplicable. Most times, Gemini tried to extract more than the maximum 8192 tokens, returning an incomplete and invalid JSON object. Yet what about run 2, when even at a temperature of 0.0 Gemini returned only 1511 tokens? Why did increasing the temperature to 2.0 decrease the output so dramatically? The data is so inconsistent, I don’t know where to begin to draw any conclusions.
The second document’s data is more consistent. For instance, at a temperature of 0.0, it returned the same number of tokens all 3 times. When increasing the temperature to 0.5, the responses did increase as I predicted. And then there’s temperature 1.0, where the response sizes go down. Beyond 1.0, the responses mostly go down, with one outlier at 2.0 where the responses were 2x.
With this data, can I draw any meaningful conclusions? Yes, I think I can.
- Long context windows still aren’t reliable. Even at only 17.6% of Gemini’s advertised context window, the behavior is shockingly inconsistent.
- At a much smaller context, the temperature behavior seems to be more consistent, but still a bit mysterious.
- For knowledge extraction tasks, temperature doesn’t work the way we think it should.
Sure, the consistency of those 3 runs where it returned the same output all 3 times seems great, but what if we want more? For knowledge extraction and graph building in TrustGraph, we’re trying to extract every important detail from the input document. We don’t want just facts, but any meaningful statements or opinions described in the text. It appears allowing the LLM to introduce some randomness in the response tokens produces more objects for information extraction. Bizarrely, I also noticed that increasing the temperature seemed to return more people than at lower temperatures. Based on cursory glances, none of the responses seemed to be producing hallucinations, but that experiment will require more testing.