Movatterモバイル変換

Jump to content

TrueSkill

From Wikipedia, the free encyclopedia

Rating system supporting games with more than 2 players

TrueSkill is a skill-based ranking system developed byMicrosoft for use with video game matchmaking on theXbox network. Unlike the popularElo rating system, which was initially designed forchess, TrueSkill is designed to support games with more than two players.^[1]^[2]^[3] In 2018, Microsoft published details about an extended version of TrueSkill, named TrueSkill2.^[4]

It is based on aThurstonian model with a Gaussian score distribution. It does not satisfyLuce's Choice Axiom.^[5]

History and use

TrueSkill was developed by researchers atMicrosoft Research as a Bayesian skill rating system designed for online games in which players often compete in teams and matches may involve more than two players or teams.^[2] In their first published description of the system, Ralf Herbrich, Tom Minka and Thore Graepel evaluated the method using match data collected by Bungie Studios during the beta test ofHalo 2 and described its operation in theXbox Live service.^[2]

Christopher Bishop later wrote that TrueSkill was deployed on Xbox Live in 2005 and has operated continuously since then, processing millions of game outcomes per day.^[6] Herbrich and colleagues reported that, as of September 2005, Xbox Live had over 2 million subscribed users and that the Xbox 360 Live service used TrueSkill for automatic player rating and matchmaking, processing hundreds of thousands of games per day.^[2]

In a survey of ranking systems for games and assessments, Educational Testing Service researchers described TrueSkill as a generalization of Elo-style ratings for multiplayer and team competitions and noted its use in Microsoft's Xbox online games.^[7] Later research from Microsoft proposed revisions and extensions to the original model, including TrueSkill2.^[4]

Statistical model

In the model described by Herbrich, Minka and Graepel, each player's skill is treated as an unobserved (latent) quantity, and the system maintains a probability distribution over that skill.^[2] The prior over player skills is assumed to factorise across players, with each individual skill modelled as aGaussian distribution with mean $\mu _{i}$ and variance $\sigma _{i}^{2}$ representing the system's uncertainty about that player's skill.^[2]

Match outcomes are modelled as arising from noisy performances. For a given game, each player $i {\displaystyle i}$ generates a latent performance value $p_{i}$ that is normally distributed around their skill $s_{i}$ , with a fixed performance variance parameter $\beta ^{2}$ :^[2] $s_{i}\sim {\mathcal {N}}(\mu _{i},\sigma _{i}^{2}),\qquad p_{i}\sim {\mathcal {N}}(s_{i},\beta ^{2}).$ The parameter $\beta$ controls how strongly a single game's result is expected to vary around underlying skill (smaller $\beta$ implies less randomness in outcomes).^[2]

For games with teams, the model assumes an additive team-performance structure: the latent performance $t_{j}$ of team $j {\displaystyle j}$ is the sum of the performances of its members, $\textstyle t_{j}:=\sum _{i\in A_{j}}p_{i}$ , where $A_{j}$ is the set of players assigned to that team.^[2] A match result is represented as a ranking of the teams (with ties possible), and the likelihood of the observed ranking is defined in terms of inequalities between the corresponding team-performance variables.^[2]

Draws are incorporated by introducing a draw margin $\epsilon >0$ . For two adjacent teams in the ranking, a win is modelled by requiring the higher-ranked team to exceed the lower-ranked team by more than $\epsilon$ , while a draw is modelled by requiring their performance difference to lie within $\pm \epsilon$ .^[2] Herbrich and colleagues related $\epsilon$ to an assumed or empirically estimated draw probability, allowing the draw rate to be calibrated for a given game mode.^[2]

Inference and rating updates

TrueSkill represents the joint model of skills, performances and observed outcomes as afactor graph.^[2] Inference is then framed as the computation of approximate single-variable marginals by message passing using thesum-product algorithm.^[2] For a single match, most of the messages in the graph can be represented as one-dimensional Gaussians, which makes the update computationally efficient.^[2]

A key difficulty is that the outcome constraints (wins and draws) introduce non-Gaussian terms. In the factor graph, these appear as comparison factors that impose inequalities on performance differences (and, for draws, on an interval around zero).^[2] The corresponding exact messages are non-Gaussian; the TrueSkill paper appliesexpectation propagation (EP) to approximate these messages by moment matching, replacing the intractable truncated-Gaussian forms with Gaussian approximations that have the same mean and variance.^[2]^[8] Because the comparison messages are approximate, the algorithm iterates message updates along paths connecting the affected variables until the approximate marginals stabilise.^[2]

After inference for a match, each player's updated marginal remains Gaussian and can be summarised by an updated mean $\mu$ and standard deviation $\sigma$ for use in later matches.^[2] The original system also includes adynamics variance parameter (often written $\tau ^{2}$ ) to model skill changes over time between games.^[2]

Parameterisation and rating display

TrueSkill maintains a Gaussian belief distribution ${\mathcal {N}}(\mu _{i},\sigma _{i}^{2})$ for each player's skill, but practical deployments must also choose a numerical scale and values for model parameters such as the performance noise and the rate at which skills are allowed to drift over time.^[2] Herbrich and colleagues reported that Xbox Live used an initial prior scale with $\mu _{0}=25$ and $\sigma _{0}^{2}=(25/3)^{2}$ , corresponding to a prior in which negative skills are unlikely.^[2]

In the same deployment description, the variance of a single-game performance was set relative to the prior uncertainty as $\beta ^{2}=(\sigma _{0}/2)^{2}$ , and the between-gamedynamics variance (modelling skill changes over time) was set to $\tau ^{2}=(\sigma _{0}/100)^{2}$ .^[2]

For player-facing displays such as leaderboards, Xbox Live displayed a conservative estimate rather than the posterior mean alone. The TrueSkill paper described displaying a player's skill as the 1% lower quantile of the belief distribution, $\mu _{i}-3\sigma _{i}$ .^[2] Herbrich and colleagues wrote that this choice was intended to ensure that top positions in leaderboards were held by players who were both highly rated and estimated with high certainty, and noted that the prior implied an initial displayed value of $0=\mu _{0}-3\sigma _{0}$ .^[2]

Worked example

The following worked example illustrates how TrueSkill updates ratings after a three-player free-for-all match, using the model and win-update equations described by Herbrich, Minka and Graepel.^[2] In TrueSkill, each player $i {\displaystyle i}$ has a skill belief distribution ${\mathcal {N}}(\mu _{i},\sigma _{i}^{2})$ , where $\mu$ is the mean skill estimate and $\sigma$ is the uncertainty.

After a match, the system combines the prior belief with the match result to obtain a new belief distribution (theposterior). Theposterior mean is the mean of that updated distribution, i.e., the system's best single-number estimate of the player's skill given the evidence so far.^[2]

In the original Xbox Live scale described in the TrueSkill paper, a common choice for the performance noise is $\beta =25/6$ .^[2] This example uses that value and assumes no draws.

Displayed rating

In some deployments, a conservative skill estimate is displayed rather than the posterior mean. The TrueSkill paper describes displaying the 1% lower quantile of the belief distribution, which for aGaussian distribution is approximately $\mu -3\sigma$ .^[2] In plain terms, this is a value that the system expects the player's true skill to exceed about 99% of the time, given the current uncertainty. As a result, a player with a high mean but large uncertainty (large $\sigma$ ) will have a lower displayed rating than a similarly rated player whose skill is estimated more confidently.^[2]

Setup

Three players enter a match with the following prior ratings:

Player	$\mu$ (before)	$\sigma$ (before)	Displayed (before) $\mu -3\sigma$
Alice	30	6	12
Ben	25	7	4
Chloe	20	8	−4

The game finishes with Chloe in first place, Alice in second place, and Ben in third place.

How a three-player result is represented

For a strict ranking with no ties, the factor-graph formulation represents the result using adjacent comparison constraints (Chloe beats Alice; Alice beats Ben).^[2] There is no separate "Chloe beats Ben" factor in this minimal representation because it is implied by the ranking: if Chloe is above Alice and Alice is above Ben, then Chloe is above Ben.

In the full TrueSkill algorithm, message passing in the factor graph uses both adjacent constraints together to update all three players in a single inference problem (and may iterate).^[2] The step-by-step calculations below apply the standard two-player win update to each adjacent comparison in sequence to make the mechanics transparent; Chloe's advantage over Ben is still reflected through the chain, because Chloe's win pushes Alice down, and Alice's win pushes Ben down.

Win update equations (two players)

For a winner $w {\displaystyle w}$ and loser $l {\displaystyle l}$ , define $c={\sqrt {\sigma _{w}^{2}+\sigma _{l}^{2}+2\beta ^{2}}},\qquad t={\frac {\mu _{w}-\mu _{l}}{c}}.$

Let $\phi$ and $\Phi$ be thestandard normal probability density and cumulative distribution functions. For a win (no draw margin), define $v(t)={\frac {\phi (t)}{\Phi (t)}},\qquad w(t)=v(t){\bigl (}v(t)+t{\bigr )}.$

The mean updates are $\mu _{w}'=\mu _{w}+{\frac {\sigma _{w}^{2}}{c}}v(t),\qquad \mu _{l}'=\mu _{l}-{\frac {\sigma _{l}^{2}}{c}}v(t),$ and the uncertainty decreases according to $\sigma _{w}'^{2}=\sigma _{w}^{2}\left(1-{\frac {\sigma _{w}^{2}}{c^{2}}}w(t)\right),\qquad \sigma _{l}'^{2}=\sigma _{l}^{2}\left(1-{\frac {\sigma _{l}^{2}}{c^{2}}}w(t)\right).$

Step 1: Chloe beats Alice

Using Chloe $(\mu =20,\sigma =8)$ as winner and Alice $(\mu =30,\sigma =6)$ as loser, with $\beta =25/6$ :

$c=11.61$ , $t=-0.86$
$v(t)=1.42$ , $w(t)=0.78$

Applying the update gives:

Chloe: $\mu$ increases to $27.80 {\displaystyle 27.80}$ ; $\sigma$ decreases to $6.34 {\displaystyle 6.34}$
Alice: $\mu$ decreases to $25.61 {\displaystyle 25.61}$ ; $\sigma$ decreases to $5.33 {\displaystyle 5.33}$

Step 2: Alice beats Ben

The second comparison uses Alice's updated rating from step 1, with Ben unchanged so far. Alice $(\mu =25.61,\sigma =5.33)$ beats Ben $(\mu =25,\sigma =7)$ :

$c=10.59$ , $t=0.06$
$v(t)=0.76$ , $w(t)=0.62$

Applying the update gives:

Alice: $\mu$ increases to $27.66 {\displaystyle 27.66}$ ; $\sigma$ decreases to $4.89 {\displaystyle 4.89}$
Ben: $\mu$ decreases to $21.48 {\displaystyle 21.48}$ ; $\sigma$ decreases to $5.97 {\displaystyle 5.97}$

Ratings before and after the match

Player	$\mu$ (before)	$\mu$ (after)	$\sigma$ (before)	$\sigma$ (after)	Displayed (before) $\mu -3\sigma$	Displayed (after) $\mu -3\sigma$
Alice	30	27.66	6	4.89	12	12.97
Ben	25	21.48	7	5.97	4	3.56
Chloe	20	27.80	8	6.34	−4	8.79

In this example, Chloe's first-place finish requires her performance to exceed Alice's, and Alice's to exceed Ben's, so the inferred skill distributions shift accordingly. Chloe's large increase in $\mu$ reflects that her win over a substantially higher-rated opponent was unexpected under the prior ratings, while the decreases in $\sigma$ reflect that the match provides additional information about the players' relative skills.

TrueSkill2

In 2018, Microsoft researchers publishedTrueSkill2 as an extension of the original TrueSkill model that incorporates additional information available in some online games, beyond the final win–loss ordering.^[4] The paper describes using signals such as player experience, squad membership, individual statistics (for example kill and death counts), quitting behaviour, and performance in other game modes, with the aim of improving the accuracy of inferred skills for matchmaking.^[4]

The authors described TrueSkill2 as retaining the same interface as classic TrueSkill, a single-number skill rating intended to remain compatible with existing matchmaking systems, while changing how skills are inferred from historical data.^[4] They described automatic parameter estimation from batches of historical matches, and two operating modes: an online mode that propagates skills forward in time and a batch mode that infers parameters and skills over all times (referred to asTrueSkill Through Time).^[4]

Compared with classic TrueSkill, the paper lists model changes including: incorporating individual player statistics alongside team win/loss; treating mid-game quitting as a surrender for rating purposes; borrowing information across game modes by modelling skills as correlated between modes; and modelling skill evolution as biased towards improvement, especially in a player's early matches in a mode.^[4] It adds an explicit squad effect, modelling players who queue together as performing better than they would individually.^[4]

On match data fromHalo 5, the authors reported that TrueSkill2 improved prediction of historical match outcomes, giving 68% accuracy on their evaluation compared with 52% for the original TrueSkill system.^[4]

Related models and extensions

TrueSkill is part of a broader family of statistical rating systems used to infer relative strength from match outcomes. TheElo rating system, originally developed for chess, models win probabilities from rating differences and updates ratings after games based on the difference between observed and expected results.^[9]

Glickman later proposed dynamic paired-comparison models that incorporate uncertainty in a player's estimated strength and allow that uncertainty to change over time, motivated in part by limitations of Elo-style updates for large competitive populations.^[10] In a survey of ranking systems used in games and assessments, ETS researchers described TrueSkill as an Elo-style generalisation for multiplayer and team competitions that maintains uncertainty estimates alongside point ratings.^[7]

Some related probabilistic models instead treat rankings as samples from a distribution that satisfiesLuce's choice axiom. Guiver and Snelson note that a Thurstonian model with independent score variables induces a Plackett–Luce distribution if and only if the scores follow a Gumbel distribution, and they contrast this with TrueSkill, which uses Gaussian score variables.^[5] They state that TrueSkill's Gaussian-score model does not satisfy Luce's choice axiom, while noting its use in large-scale online rating systems.^[5]

Several research extensions build on TrueSkill's probabilistic formulation. Dangauthier, Herbrich, Minka and Graepel proposedTrueSkill Through Time, which replaces the original filtering-style updates with smoothing over a time series of player skills, allowing retrospective estimation of skill trajectories while retaining uncertainty estimates and an explicit draw model.^[11]

Patents and trademarks

TrueSkill is patented,^[12] and the name is trademarked, so it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm.^[13] The patent is scheduled to expire on April 9th, 2029.^[14]

See also

Software patents

References

^Murphy, Kevin (2012).Machine Learning: A Probabilistic Perspective. MIT Press.ISBN 978-0262018029.
^^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l ^m ⁿ ^o ^p ^q ^r ^s ^t ^u ^v ^w ^x ^y ^z ^aa ^ab ^ac ^ad ^ae ^afHerbrich, Ralf; Minka, Tom; Graepel, Thore (2006)."TrueSkill™: A Bayesian Skill Rating System"(PDF).Advances in Neural Information Processing Systems 19 (NIPS 2006). MIT Press. pp. 569–576. Retrieved11 January 2026.{{cite conference}}: CS1 maint: url-status (link)
^"TrueSkill™ Ranking System".Microsoft Research. Retrieved11 January 2026.
^^a ^b ^c ^d ^e ^f ^g ^h ⁱMinka, Tom; Cleven, Ryan; Zaykov, Yordan (22 March 2018).TrueSkill 2: An improved Bayesian skill rating system(PDF) (Technical report). Microsoft Research. MSR-TR-2018-8. Retrieved11 January 2026.{{cite tech report}}: CS1 maint: url-status (link)
^^a ^b ^cGuiver, John; Snelson, Edward (2009)."Bayesian inference for Plackett–Luce ranking models"(PDF).Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). Association for Computing Machinery. pp. 377–384.doi:10.1145/1553374.1553423. Retrieved11 January 2026.{{cite conference}}: CS1 maint: url-status (link)
^Bishop, Christopher M. (13 February 2013)."Model-based machine learning".Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.371 (1984).doi:10.1098/rsta.2012.0222.PMC 3538442.PMID 23277612. Retrieved11 January 2026.
^^a ^bRotou, Ourania; Qian, Xiaoyu; von Davier, Matthias (July 2015).Ranking Systems Used in Gaming Assessments and/or Competitive Games(PDF) (Report). ETS Research Memorandum. Educational Testing Service. Retrieved11 January 2026.
^Minka, Thomas P. (2001)."Expectation propagation for approximate Bayesian inference"(PDF).Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI 2001). Morgan Kaufmann. pp. 362–369. Retrieved11 January 2026.{{cite conference}}: CS1 maint: url-status (link)
^Elo, Arpad E. (1978).The Rating of Chessplayers, Past and Present (2nd ed.). Arco Pub.ISBN 9780668047210. Retrieved11 January 2026.
^Glickman, Mark E. (1999)."Parameter estimation in large dynamic paired comparison experiments"(PDF).Journal of the Royal Statistical Society: Series C (Applied Statistics).48 (3):377–394.doi:10.1111/1467-9876.00159. Retrieved11 January 2026.
^Dangauthier, Pierre; Herbrich, Ralf; Minka, Tom; Graepel, Thore (2007)."TrueSkill Through Time: Revisiting the History of Chess"(PDF).Advances in Neural Information Processing Systems 20 (NIPS 2007). Retrieved11 January 2026.{{cite conference}}: CS1 maint: url-status (link)
^"US8538910B2 – Determining relative skills of players".Google Patents. Retrieved11 January 2026.
^"Copyright notice".Microsoft Learn. 28 April 2025. Retrieved11 January 2026.
^"US20090227313A1 – Determining relative skills of players".Google Patents. Retrieved11 January 2026.

External links

v
t
e

Sports rating systems

Concepts

Methods and computer models

Elo family	Chessmetrics DWZ Elo Glicko Universal

Polls and opinion

By sport

People

v
t
e

Microsoft Research (MSR)

Main
projects

Languages, compilers	Bartok Bosque Cω F* Lean P Project Verona Phoenix Polyphonic C# SecPAL
Distributed–grid computing	BitVault Confidential Consortium Framework DeepSpeed Orleans
Internet,networking	AjaxView Avalanche Conference XP Gazelle HoneyMonkey Penny Black Wallop
Other projects	Automatic Graph Layout Cognitive Toolkit Digits Holoportation IllumiRoom Image Composite Editor Infer.NET LightGBM LiveStation MyLifeBits Neural Network Intelligence NodeXL OneFuzz PhotoDNA SEAL SLAM T2 Temporal Prover WorldWide Telescope Z3 Theorem Prover
Operating systems	Barrelfish HomeOS Midori Singularity Verve
APIs	Accelerator Dryad Joins mimalloc
Launched as products	C# Comic Chat Detours F# Sideshow PixelSense (TouchLight) SenseCam ClearType Group Shot Allegiance TrueSkill Songsmith Xbox Kinect

MSR Labs
applied
research

Current	Pivot Seadragon Deep Zoom
Discontinued	Deepfish Listas Live Clipboard Photosynth

Other labs

Retrieved from "https://en.wikipedia.org/w/index.php?title=TrueSkill&oldid=1337743665"

Hidden categories:

[8]ページ先頭

©2009-2026 Movatter.jp