Introduction
The key assumption underpinning most ranking systems is that we can model the performance of players using some statistical distribution. That way, we can model two players competing in a game simply by generating a random number from each of their performance distributions and seeing which is larger. For example, we might assume players' performance is normally distributed, so that player A's performance is modeled by $A\sim N(\mu_A,\sigma_A)$ and player B's performance is modeled by $B\sim N(\mu_B,\sigma_B)$. Then we can find the probability of player A beating player B by finding $P(A>B|A\sim N(\mu_A,\sigma_A), B\sim N(\mu_B,\sigma_B))$.
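Since the difference of independent normals is itself normal (this article derives the result for sums later; differences follow by negating $B$), this probability has the closed form $P(A>B)=\Phi\big((\mu_A-\mu_B)/\sqrt{\sigma_A^2+\sigma_B^2}\big)$. Here is a small illustrative sketch in plain Python, not tied to any particular rating library:

```python
import math
import random

def win_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(A > B) for independent A ~ N(mu_a, sigma_a), B ~ N(mu_b, sigma_b).
    A - B is normal with mean mu_a - mu_b and sd sqrt(sigma_a^2 + sigma_b^2),
    so P(A > B) = P(A - B > 0) = Phi((mu_a - mu_b) / sqrt(sigma_a^2 + sigma_b^2))."""
    z = (mu_a - mu_b) / math.hypot(sigma_a, sigma_b)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Monte Carlo sanity check: simulate 100,000 games between A and B.
random.seed(0)
n = 100_000
empirical = sum(random.gauss(0, 1) > random.gauss(1, 1) for _ in range(n)) / n
print(win_prob(0, 1, 1, 1), empirical)
```

The closed form and the simulated win rate should agree to within Monte Carlo noise.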
A player's rating consists of both parameters ($\mu_A$ and $\sigma_A$ for player A), though other performance distribution families may need fewer or more. For normal distributions, we have these two parameters. By contrast, Elo uses logistic distributions with a fixed scale parameter, and Glicko uses logistic distributions with varying scale parameters.
When two players compete in a game, their ratings are updated afterwards based on the outcome. The center parameter is the player's average skill according to the model; simply put, the losing player's center parameter (rating) is decreased and the winning player's is increased. The size of this adjustment depends on the modeled probability of the win/loss, and potentially on other factors.
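For concreteness, here is a minimal sketch of an Elo-style update; the scale constant $400$ and update factor $k=32$ are conventional Elo choices, used here purely for illustration:

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo update. The winner's expected score comes from a logistic
    curve in the rating difference; the rating change is k times the
    difference between the actual result (1 for a win) and expectation."""
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

# An upset moves ratings much more than an expected result does.
upset = elo_update(1400, 1600)      # underdog wins
routine = elo_update(1600, 1400)    # favorite wins
print(upset, routine)
```

Note how the update is zero-sum and how the surprising result (the underdog winning) produces a much larger rating change.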
A guessing game
In particular, after we observe $X>G$ for some guess $G$, we have a posterior distribution which represents the updated distribution of $X$ given the new information that $X>G$. For a uniform distribution, the posterior distribution given $X>G$ or $X<G$ is also a uniform distribution.
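The uniform case can be sketched directly in code. Assuming for illustration that $X$ is uniform on $[0,1]$ and that we always guess the midpoint of the current posterior, each answer halves the interval:

```python
import random

# Guessing game sketch: X ~ Uniform(lo, hi). Each comparison against a
# guess g collapses the posterior to a uniform distribution on the
# surviving sub-interval, so the pair (lo, hi) fully describes our belief.
random.seed(1)
lo, hi = 0.0, 1.0
x = random.uniform(lo, hi)
for _ in range(20):
    g = (lo + hi) / 2   # guess the midpoint of the current posterior
    if x > g:
        lo = g          # posterior becomes Uniform(g, hi)
    else:
        hi = g          # posterior becomes Uniform(lo, g)
print(lo <= x <= hi, hi - lo)
```

After 20 comparisons the posterior interval still contains $x$ and has width $2^{-20}$.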
However, for most distributions the posterior distribution is not the same. We will explore this more later, although when we do we will be working with the posterior distribution of $A$ given $A>B$ where $A$ and $B$ are both normally distributed (and neither is constant). In this case, $A$ can technically have any value in its support, but the shape of its distribution is concentrated a little further to the right.
Using rating systems to rank things
- select their single favorite
- select $1$ to $k-1$ out of $k$
- put the choices in order of preference
- pick a favorite out of every pair
Having a single user rate each choice individually is not a good approach: it is easier to compare a handful of choices against each other than to assign an exact preference, to many decimal places, to each of a bunch of choices. Having the user select some nonempty strict subset of the choices is the best approach, because it is the easiest decision for the user to make, and it better handles options the user prefers about equally.
The next two things we need to determine are how to choose the subset of choices we present to the user at each step, and how to use the user's selections to rank the choices. For the latter, we have a couple of options.
- Create a directed acyclic graph among the choices and continue querying the user until the graph has a unique topological ordering
- Give each choice a rating using some system such as Elo
At first the first option seems better, since the second seems like overkill. It is inspired by the ranked pairs voting method, which has many desirable properties. In that method, each voter chooses a favorite candidate from each pair of candidates. Votes are then totalled for each pair and, going from the pair with the largest percentage victory to the smallest, we add an edge to a DAG from the winner to the loser, omitting any edge that would introduce a cycle. We can stop early if there is a unique topological ordering on the graph so far.
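The edge-locking step just described can be sketched as follows (illustrative names, not a library API; cycle detection here is a simple depth-first search):

```python
# Ranked-pairs locking sketch: process pairwise results from most to
# least decisive, adding each winner -> loser edge unless it would
# close a cycle in the graph locked so far.

def creates_cycle(edges, u, v):
    """Would adding edge u -> v create a cycle? True iff v already
    reaches u through the locked edges (depth-first search)."""
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for (s, w) in edges if s == node)
    return False

# (winner, loser, margin) tuples, e.g. from tallied ballots.
results = [("A", "B", 0.9), ("B", "C", 0.7), ("C", "A", 0.6)]
locked = []
for winner, loser, _ in sorted(results, key=lambda r: -r[2]):
    if not creates_cycle(locked, winner, loser):
        locked.append((winner, loser))
print(locked)  # the least decisive edge C -> A would close a cycle, so it is dropped
```

Dropping the weakest cycle-closing edge is exactly what makes the method attractive for elections, and exactly what we lose without vote margins.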
However, many of the reasons why ranked pairs voting works so well as a voting method leave it ill suited to ranking the preferences of a single user.
- ranked pairs voting deals with a low number of choices so that voters manually selecting a favorite from each pair is reasonable
- ranked pairs voting has a large number of voters so how decisively the winner of each pair won can be quantified
With a single user picking their favorite from each pair of many small subsets sequentially, we don't get information about all pairs (unless we query the user $O(n^2)$ times), and we can't order preferences by decisiveness. This makes cycles really problematic. We could just add edges in the order we see them and omit edges that would create cycles, but this resolves inconsistent or intransitive preferences arbitrarily, based on whichever pairs happened to be seen first. Moreover, having the user select multiple favorites from each subset is better than having the user pick a favorite from each pair in each subset, as discussed. This is the motivation for using a rating system like Elo.
Using a rating system means that we can deal with inconsistent and intransitive preferences robustly, and extrapolate precise ratings for each choice which are as consistent with each other and the user's choices as possible.
As in the guessing game example, we do not know the true rating of any of the choices, and we refine our estimates of them. However, homing in on ratings is significantly more complicated than homing in on a single real number. For one thing, we have many ratings to estimate. For another, the performance distribution is only an approximation of player skill or user preference, and this skill/preference can even change over time.
Normal Posterior Example
$A$ and $B$ are normal random variables with means $0$ and $1$ respectively and standard deviation $1$, $A|A>B$ is the posterior distribution of $A$ given $A>B$, and $\tilde{A}|A>B$ is a normal approximation to the posterior distribution, chosen by matching the mean and variance of the true posterior. We will be deriving all of these equations in the following sections.
Towards a basic Normal rating system
The normal distribution has two parameters, mean $\mu$ and standard deviation $\sigma$. Its pdf is $f(x;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ and its cdf is $F(x;\mu,\sigma)=\Phi(\frac{x-\mu}{\sigma})$ where $\Phi(x)$ is the cdf of the standard normal distribution with mean 0 and standard deviation 1, namely $\Phi(x)=\frac{1}{2}\left(1+\mathrm{erf}(\frac{x}{\sqrt{2}})\right)$.
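These two formulas translate directly into code; here is a minimal sketch using only the standard library's `erf`:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x; mu, sigma) = exp(-((x - mu) / sigma)^2 / 2) / (sigma * sqrt(2*pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x; mu, sigma) = Phi((x - mu) / sigma), with Phi written via erf."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(normal_pdf(0))  # 1 / sqrt(2*pi) at the standard normal's mode
print(normal_cdf(0))  # 0.5 by symmetry
```

These two helpers are all we need to evaluate the win probabilities and posterior moments derived below.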
An affine function of a normal random variable is normally distributed
A linear combination of normal random variables is normally distributed
\begin{align*} f_C(x)&=\int_{-\infty}^\infty f_A(x-y)f_B(y)dy\\ &=\int_{-\infty}^\infty \frac{1}{\sigma_A\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-y-\mu_A}{\sigma_A}\right)^2}\frac{1}{\sigma_B\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y-\mu_B}{\sigma_B}\right)^2}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2}\left(\left(\frac{y-(x-\mu_A)}{\sigma_A}\right)^2+\left(\frac{y-\mu_B}{\sigma_B}\right)^2\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2}\left(\frac{\sigma_B^2 y^2-2\sigma_B^2(x-\mu_A)y+\sigma_B^2(x-\mu_A)^2+\sigma_A^2 y^2-2\sigma_A^2\mu_B y+\sigma_A^2\mu_B^2}{\sigma_A^2\sigma_B^2}\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left((\sigma_A^2 + \sigma_B^2) y^2-2(\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B)y+\sigma_B^2(x-\mu_A)^2+\sigma_A^2\mu_B^2\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left((\sigma_A^2 + \sigma_B^2) y^2-2\frac{(\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B)\sqrt{\sigma_A^2 + \sigma_B^2}}{\sqrt{\sigma_A^2 + \sigma_B^2}}y+\sigma_B^2(x-\mu_A)^2+\sigma_A^2\mu_B^2\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left(\left(\sqrt{\sigma_A^2 + \sigma_B^2}y-\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)^2-\frac{(\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B)^2}{\sigma_A^2 + \sigma_B^2}+\sigma_B^2(x-\mu_A)^2+\sigma_A^2\mu_B^2\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left(\left(\sqrt{\sigma_A^2 + \sigma_B^2}y-\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)^2-\frac{(\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B)^2}{\sigma_A^2 + \sigma_B^2}+\frac{(\sigma_A^2 + \sigma_B^2)(\sigma_B^2(x-\mu_A)^2+\sigma_A^2\mu_B^2)}{\sigma_A^2 + \sigma_B^2}\right)}dy\\ &=\int_{-\infty}^\infty 
\frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left(\left(\sqrt{\sigma_A^2 + \sigma_B^2}y-\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)^2+\frac{\sigma_A^2\sigma_B^2}{\sigma_A^2+\sigma_B^2}(-2(x-\mu_A)\mu_B+(x-\mu_A)^2+\mu_B^2)\right)}dy\\ \end{align*} \begin{align*} &=\int_{-\infty}^\infty \frac{1}{2\pi\sigma_A\sigma_B}e^{-\frac{1}{2\sigma_A^2\sigma_B^2}\left(\left(\sqrt{\sigma_A^2 + \sigma_B^2}y-\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)^2+\frac{\sigma_A^2\sigma_B^2}{\sigma_A^2+\sigma_B^2}(x-\mu_A-\mu_B)^2\right)}dy\\ &=\int_{-\infty}^\infty \frac{1}{\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{y-\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sigma_A^2 + \sigma_B^2}}{\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}}\right)^2}\frac{1}{\sqrt{\sigma_A^2+\sigma_B^2}\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu_A-\mu_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)^2}dy\\ &=f\left(x;\mu_A+\mu_B,\sqrt{\sigma_A^2+\sigma_B^2}\right)\int_{-\infty}^\infty f\left(y;\frac{\sigma_B^2(x-\mu_A)+\sigma_A^2\mu_B}{\sigma_A^2 + \sigma_B^2},\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)dy\\ &=f\left(x;\mu_A+\mu_B,\sqrt{\sigma_A^2+\sigma_B^2}\right) \end{align*}
So as desired, $C=A+B$ is normally distributed, and the particular parameters are $\mu_C=\mu_A+\mu_B$ and $\sigma_C=\sqrt{\sigma_A^2+\sigma_B^2}$.
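As a quick empirical sanity check of these parameters (a simulation sketch, not part of the proof), with arbitrarily chosen values:

```python
import math
import random

# Empirical check: if A ~ N(2, 1) and B ~ N(-1, 3) independently, then
# A + B should be N(2 + (-1), sqrt(1^2 + 3^2)) = N(1, sqrt(10)).
random.seed(0)
n = 200_000
samples = [random.gauss(2, 1) + random.gauss(-1, 3) for _ in range(n)]
mean = sum(samples) / n
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
print(mean, sd)  # close to 1 and sqrt(10)
```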
$P(A>B)$ for normal random variables
Our next order of business is to construct estimates of the mean and standard deviation of $A$ and of $B$ given $A>B$. We will later use these to update a player's mean and standard deviation after a game.
$E(A|A>B)$ for normal random variables
\begin{align*} \int_{-\infty}^\infty xf_A(x)F_B(x)dx&=\left[\left(-\sigma_A^2 f_A(x)+\mu_A F_A(x)\right)F_B(x)\right]_{-\infty}^\infty+\int_{-\infty}^\infty\left(\sigma_A^2 f_A(x)-\mu_A F_A(x)\right)f_B(x)dx\\ &=\mu_A\left[F_A(x)F_B(x)\right]_{-\infty}^\infty+\sigma_A^2\int_{-\infty}^\infty f_A(x)f_B(x)dx-\mu_A\int_{-\infty}^\infty F_A(x)f_B(x)dx. \end{align*} Now, we can compute each of these in turn. First, $\left[F_A(x)F_B(x)\right]_{-\infty}^\infty=1$ because $F_A$ and $F_B$ are cdfs, so their product goes from $0$ to $1$. Next, $\int_{-\infty}^\infty F_A(x)f_B(x)dx=1-P$ because this is an expression for $P(B>A)$, and for us $P(A=B)=0$ because we are modelling $A$ and $B$ as continuous random variables. That leaves $\int_{-\infty}^\infty f_A(x)f_B(x)dx$. Remember $f_A(x)f_B(x)$ is just shorthand for $f(x;\mu_A,\sigma_A)f(x;\mu_B,\sigma_B)$. By completing the square like we did when proving the sum of two normal random variables is normally distributed, we can show that $$f(x;\mu_A,\sigma_A)f(x;\mu_B,\sigma_B)=f\left(x;\frac{\sigma_B^2\mu_A+\sigma_A^2\mu_B}{\sigma_A^2+\sigma_B^2},\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)f\left(0;\mu_A-\mu_B,\sqrt{\sigma_A^2+\sigma_B^2}\right)$$ and so it becomes a single normal pdf times a normal pdf evaluated at $0$. When we integrate, the normal pdf evaluated at $0$ is a constant that we can pull out of the integral, and the integral of the remaining pdf in $x$ is simply $1$, so we get $$\int_{-\infty}^\infty f_A(x)f_B(x)dx=f_{A-B}(0)$$ and all in all we get $$E(A|A>B)=\frac{1}{P}\left(\mu_A+\sigma_A^2 f_{A-B}(0)-\mu_A(1-P)\right)=\mu_A+\frac{\sigma_A^2}{P}f_{A-B}(0).$$
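A quick numerical check of $E(A|A>B)=\mu_A+\frac{\sigma_A^2}{P}f_{A-B}(0)$ against simulation, with parameter values chosen arbitrarily for illustration:

```python
import math
import random

# Closed form: E(A | A > B) = mu_A + sigma_A^2 * f_{A-B}(0) / P,
# where P = P(A > B) and f_{A-B} is the pdf of the difference A - B.
mu_a, s_a, mu_b, s_b = 0.0, 1.0, 1.0, 1.0
s_d = math.hypot(s_a, s_b)                        # sd of A - B
p = 0.5 * (1 + math.erf((mu_a - mu_b) / (s_d * math.sqrt(2))))
f0 = math.exp(-0.5 * ((mu_a - mu_b) / s_d) ** 2) / (s_d * math.sqrt(2 * math.pi))
closed_form = mu_a + s_a ** 2 * f0 / p

# Simulation: keep only the games where A > B was observed.
random.seed(0)
conditioned = []
for _ in range(200_000):
    a, b = random.gauss(mu_a, s_a), random.gauss(mu_b, s_b)
    if a > b:
        conditioned.append(a)
print(closed_form, sum(conditioned) / len(conditioned))
```

Even though $\mu_A=0$, the conditional mean is pulled well to the right, as the posterior discussion earlier suggested.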
$E(B|A>B)$ for normal random variables
$Var(A|A>B)$ for normal random variables
Now we need to compute the integral by integration by parts. Note that $$\int x^2 f_A(x)dx=(\mu_A^2+\sigma_A^2)F_A(x)-\sigma_A^2(x+\mu_A)f_A(x).$$
So \begin{align*} \int_{-\infty}^\infty x^2 f_A(x)F_B(x)dx=&\left[\left((\mu_A^2+\sigma_A^2)F_A(x)-\sigma_A^2(x+\mu_A)f_A(x)\right)F_B(x)\right]_{-\infty}^\infty\\ &-\int_{-\infty}^\infty\left((\mu_A^2+\sigma_A^2)F_A(x)-\sigma_A^2(x+\mu_A)f_A(x)\right)f_B(x)dx\\ =&\mu_A^2+\sigma_A^2-(\mu_A^2+\sigma_A^2)\int_{-\infty}^\infty F_A(x)f_B(x)dx+\sigma_A^2\int_{-\infty}^\infty(x+\mu_A)f_A(x)f_B(x)dx\\ =&P(\mu_A^2+\sigma_A^2)+\sigma_A^2\int_{-\infty}^\infty(x+\mu_A)f_A(x)f_B(x)dx. \end{align*}
Recall that $f_A(x)f_B(x)=f\left(x;\frac{\sigma_B^2\mu_A+\sigma_A^2\mu_B}{\sigma_A^2+\sigma_B^2},\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)f_{A-B}(0)$. We will define $\mu_{AB}=\frac{\sigma_B^2\mu_A+\sigma_A^2\mu_B}{\sigma_A^2+\sigma_B^2}$ and $\sigma_{AB}=\frac{\sigma_A\sigma_B}{\sqrt{\sigma_A^2+\sigma_B^2}}$.
Note that these do not correspond to the product of $A$ and $B$ but rather come from suitably decomposing the product of their pdfs.
Then $f_{A-B}(0)$ can be factored out of the remaining integral, and we can resolve the rest by finding the expected value of $x+\mu_A$ under $f(x;\mu_{AB},\sigma_{AB})$. Thus we have \begin{align*} \int_{-\infty}^\infty x^2 f_A(x)F_B(x)dx=&P(\mu_A^2+\sigma_A^2)+\sigma_A^2f_{A-B}(0)(\mu_{AB}+\mu_A) \end{align*} and we can substitute this into $Var(A|A>B)=E(A^2|A>B)-E(A|A>B)^2$ to get \begin{align*} Var(A|A>B)=&\mu_A^2+\sigma_A^2+\frac{\sigma_A^2}{P}f_{A-B}(0)(\mu_{AB}+\mu_A)-\left(\mu_A+\frac{\sigma_A^2}{P}f_{A-B}(0)\right)^2\\ =&\sigma_A^2+\frac{\sigma_A^2}{P}f_{A-B}(0)\left(\mu_{AB}-\mu_A-\frac{\sigma_A^2}{P}f_{A-B}(0)\right)\\ =&\sigma_A^2\left(1-\frac{f_{A-B}(0)}{P}\left(\mu_A-\mu_{AB}+\frac{\sigma_A^2 f_{A-B}(0)}{P}\right)\right)\\ =&\sigma_A^2\left(1-\frac{\sigma_A^2 f_{A-B}(0)}{P}\left(\frac{\mu_A-\mu_B}{\sigma_A^2+\sigma_B^2}+\frac{f_{A-B}(0)}{P}\right)\right) \end{align*}
If we define $\mu_{B-A}=\mu_B-\mu_A$ and $\sigma_{B-A}=\sqrt{\sigma_A^2+\sigma_B^2}$, the parameters of the difference $B-A$, we can write this in perhaps the nicest way $$Var(A|A>B)=\sigma_A^2+\frac{\sigma_A^4 f_{B-A}(0)}{P}\left(\frac{\mu_{B-A}}{\sigma_{B-A}^2}-\frac{f_{B-A}(0)}{P}\right).$$ Note that $f_{B-A}(0)=f_{A-B}(0)$, since a normal pdf evaluated at $0$ depends only on the magnitude of its mean.
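A numerical check of the variance result against simulation (arbitrary illustrative parameters; the closed form below is the final expression from the derivation above):

```python
import math
import random

# Closed form: Var(A | A > B) =
#   sigma_A^2 * (1 - (sigma_A^2 * f0 / P) * ((mu_A - mu_B) / (sigma_A^2 + sigma_B^2) + f0 / P))
# where f0 = f_{A-B}(0) and P = P(A > B).
mu_a, s_a, mu_b, s_b = 0.0, 1.0, 1.0, 1.0
var_d = s_a ** 2 + s_b ** 2                       # variance of A - B
s_d = math.sqrt(var_d)
p = 0.5 * (1 + math.erf((mu_a - mu_b) / (s_d * math.sqrt(2))))
f0 = math.exp(-0.5 * ((mu_a - mu_b) / s_d) ** 2) / (s_d * math.sqrt(2 * math.pi))
closed_form = s_a ** 2 * (1 - (s_a ** 2 * f0 / p) * ((mu_a - mu_b) / var_d + f0 / p))

# Simulation: variance of A restricted to games where A > B.
random.seed(0)
conditioned = []
for _ in range(200_000):
    a, b = random.gauss(mu_a, s_a), random.gauss(mu_b, s_b)
    if a > b:
        conditioned.append(a)
m = sum(conditioned) / len(conditioned)
empirical_var = sum((a - m) ** 2 for a in conditioned) / len(conditioned)
print(closed_form, empirical_var)
```

As expected, conditioning on $A>B$ shrinks the variance below $\sigma_A^2$: the observation carries real information about $A$.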
$Var(B|A>B)$ for normal random variables
The Normal rating system
Future Directions
There are a few considerations. We want to prioritize sets of choices with similar mean ratings, since choices with highly disparate mean ratings are unlikely to give us much information. Worse, presenting such sets encourages the user to ignore middling options they might otherwise have selected had all the available choices been similar in mean rating.
Additionally, we want to prioritize choices with high rating variance, since these are the ones whose ratings we are least sure of. We can quantify exactly how desirable each set of choices is by calculating the Shannon information from all the nontrivial subsets. Then we prefer choosing sets of choices with higher Shannon information.
A good way to achieve this is to set up a fast mixing Markov process over the set of sets of choices and use this to pick a set of choices. To do this, we would need to identify whether using probabilities proportional to a power of the Shannon information, an exponential function, or potentially something else is optimal, and then design a Markov process which produces this distribution.