Ranking learning
Where I work, we have a journal club where we share papers we find interesting and one of us explains a paper to the rest. This week was my turn, and since we had lately been talking about recommendation and information retrieval, I thought a paper on ranking would fit nicely. Instead of picking a state-of-the-art paper I decided to start simple, so I chose the paper named "Learning to Rank using Gradient Descent", which is not precisely new. I started reading and quickly found out that I needed to understand the fundamentals first: ordinal regression. So I decided to read about it and explain it along with the paper to my coworkers.
Wikipedia has a nice entry on the topic, so I am just going to copy the relevant parts and comment on them where needed.
Ordinal Regression is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. In the setting of a search engine, we could try to predict how relevant a document is given a specific query, for example from 1 to 5 where 1 means high relevance and 5 means low relevance. I think it is worth focusing on the linear models to understand how ordinal regression tries to solve this ranking problem.
Linear models for ordinal regression
Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of thresholds to a dataset. Suppose one has a set of observations, represented by length-p vectors $\small x_1$ through $\small x_n$, with associated responses $\small y_1$ through $\small y_n$, where each $\small y_i$ is an ordinal variable on a scale $\small 1, ..., K$. For simplicity, and without loss of generality, we assume $\small y$ is a non-decreasing vector, that is, $\small y_i \leq y_{i+1}$. To this data, one fits a length-p coefficient vector $\small \bf w$ and a set of thresholds $\small \theta_1, ..., \theta_{K-1}$ with the property that $\small \theta_1 < \theta_2 < ... < \theta_{K-1}$. This set of thresholds divides the real number line into $\small K$ disjoint segments, corresponding to the $\small K$ response levels. The model can now be formulated as $$ Pr(y\leq i|\bf{x}) = \sigma(\theta_i - \bf{w \cdot x}) $$
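To make the formulation concrete, here is a minimal sketch with made-up values for $\small \bf w$, the thresholds and $\small \bf x$ (none of these are fitted to any data), showing how the cumulative probabilities $\small Pr(y\leq i|\bf{x})$ turn into per-class probabilities by differencing. I use the logistic function as $\small\sigma$ here, a choice discussed just below.
import numpy as np
from scipy.special import expit  # logistic function

# toy parameters (illustrative only): 4 ordered classes need 3 thresholds
w = np.array([0.5, -1.0, 2.0])
thetas = np.array([-1.0, 0.5, 2.0])
x = np.array([0.2, 0.4, 0.1])

# cumulative probabilities P(y <= i | x) = sigma(theta_i - w.x)
cumulative = expit(thetas - w @ x)

# per-class probabilities by differencing, with P(y <= 0) = 0 and P(y <= K) = 1
class_probs = np.diff(np.concatenate(([0.0], cumulative, [1.0])))
print(class_probs, class_probs.sum())  # the probabilities sum to 1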
There are different options for the $\small\sigma$ function (the inverse link function in GLM language), which is our cumulative distribution function, and I am going to mention two:
- logistic function
- probit function
When we use the logistic function we assume that our error term follows a logistic distribution with $\small \mu = 0$ and $\small s = 1$ (scale parameter) conditioned on $\small \bf{x}$, and when we use the probit function we assume our error term follows a standard normal distribution ($\small \mu = 0$, $\small \sigma = 1$) conditioned on $\small \bf{x}$. Both distributions have a similar shape, but the logistic distribution has heavier tails. In practice, both models are very similar and the choice comes down to interpretation (link).
import numpy as np
from scipy.stats import norm, logistic
import matplotlib.pyplot as plt
loc = 0
scale = 1
# generate points for the normal distribution
x_norm = np.linspace(norm.ppf(0.01, loc, scale), norm.ppf(0.99, loc, scale), 100)
y_pdf_norm = norm.pdf(x_norm)
y_cdf_norm = norm.cdf(x_norm)
# generate points for the logistic distribution
x_logistic = np.linspace(logistic.ppf(0.01, loc, scale), logistic.ppf(0.99, loc, scale), 100)
y_pdf_logistic = logistic.pdf(x_logistic)
y_cdf_logistic = logistic.cdf(x_logistic)
fig, axes = plt.subplots(1, 2, figsize=(10,3))
axes[0].plot(x_norm, y_pdf_norm, color='k', label='normal pdf')
axes[0].plot(x_logistic, y_pdf_logistic, color='r', label='logistic pdf')
axes[1].plot(x_norm, y_cdf_norm, color='k', label='normal cdf')
axes[1].plot(x_logistic, y_cdf_logistic, color='r', label='logistic cdf')
axes[0].legend()
axes[1].legend();
There are probably some details that are still not clear at this point, so the Wikipedia article goes on and introduces the latent variable model, which helps put all the pieces together.
Latent variable model
The probit version of the above model can be justified by assuming the existence of a real-valued latent variable (unobserved quantity) $\small y^*$, determined by $$ y^* = \bf{w\cdot x} + \epsilon $$ where the error term is normally distributed conditioned on $\small \bf{x}$ (as mentioned above). The response variable $\small y$ results from an "incomplete measurement" of $\small y^*$, where one only determines the interval into which $\small y^*$ falls: $$ y = \begin{cases} 1 & \text{if $y^* \leq \theta_1$}\\ 2 & \text{if $\theta_1 < y^* \leq \theta_2$}\\ \vdots\\ K & \text{if $\theta_{K-1} < y^*$} \end{cases} $$
From this definition the conditional distribution of $\small y$ can be derived as $$ P(y = k|{\bf x}) = P(\theta_{k-1} < y^* \leq \theta_k | {\bf x})\\ = P(\theta_{k-1} < {\bf w \cdot x} + \epsilon \leq \theta_k)\\ = P(\theta_{k-1} - {\bf w \cdot x} < \epsilon \leq \theta_k - {\bf w \cdot x})\\ = \Phi(\theta_{k} - {\bf w \cdot x}) - \Phi(\theta_{k-1} - {\bf w \cdot x}) $$ where $\small \Phi$ is the standard normal CDF and we use the convention $\small \theta_0 = -\infty$, $\small \theta_K = \infty$.
Now that we have the conditional distribution of $\small y$, we can write the log-likelihood for a single training example: $$ \log \mathcal{L}({\bf w}, \theta|{\bf x}_i, y_i) = \sum_{k=1}^{K}[y_i = k]\cdot \log(\Phi(\theta_k - {\bf w \cdot x}_i) - \Phi(\theta_{k-1} - {\bf w \cdot x}_i)) $$ where $\small [y_i = k]$ is the Iverson bracket (1 if the condition holds, 0 otherwise).
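As a small sanity check, here is a sketch of this log-likelihood for a single observation with the probit link ($\small \Phi$ = norm.cdf), again using made-up parameter values rather than anything fitted:
import numpy as np
from scipy.stats import norm

# toy parameters (illustrative only): 4 classes, 3 thresholds padded with -inf / +inf
w = np.array([0.5, -1.0, 2.0])
thetas = np.array([-np.inf, -1.0, 0.5, 2.0, np.inf])
x_i = np.array([0.2, 0.4, 0.1])
y_i = 3  # observed ordinal label

# P(y = k | x) = Phi(theta_k - w.x) - Phi(theta_{k-1} - w.x)
p_k = norm.cdf(thetas[y_i] - w @ x_i) - norm.cdf(thetas[y_i - 1] - w @ x_i)
log_likelihood = np.log(p_k)
print(p_k, log_likelihood)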
We are at a point where we are familiar with the model and have managed to understand it, so let's see an example. As of writing this article, statsmodels has this model implemented in the v0.13.0.dev0 version, so if you want to give it a try you will have to install it from their GitHub (statsmodels github).
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel
import seaborn as sns
I have downloaded the abalone dataset from here. On this website there are several datasets for ordinal regression. The response variable has 8 categories ranging from 1 to 8.
abalone_data = pd.read_csv('abalone.csv')
# You need to convert the target variable to categorical in order to make it work.
abalone_data.response = pd.Categorical(abalone_data.response, ordered=True)
abalone_data.head()
mod_prob = OrderedModel(abalone_data['response'],
abalone_data[abalone_data.columns[1:]],
distr='probit')
res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()
In the output table, we can see all the fitted coefficients and thresholds separating each region.
num_of_thresholds = 7
# thresholds recovered from the fitted parameters; the returned array is padded with -inf and +inf
thresholds = mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])
# fitted feature coefficients, taken from the summary table above
coefficients = [0.6831, 7.9829, 5.7866, 3.0927, -8.2448, -2.6864, 3.9445]
def generate_result():
    # pick a random row, predict its ordinal label and compute the latent mean w.x
    row = np.random.randint(0, abalone_data.shape[0])
    predicted_label = res_prob.predict(abalone_data.iloc[row: row+1, [1,2,3,4,5,6,7]]).values.argmax() + 1
    mean = np.dot(coefficients, abalone_data.iloc[row, [1,2,3,4,5,6,7]].values)
    true_label = abalone_data.loc[row, 'response']
    return row, true_label, mean, predicted_label
fig, axes = plt.subplots(4, 2, figsize=(8,8))
axes = axes.ravel()
for ax in axes:
    row, true_label, mean, predicted_label = generate_result()
    # latent distribution y* ~ N(w.x, 1) for this observation
    x = np.linspace(norm.ppf(0.01, loc=mean), norm.ppf(0.99, loc=mean), 100)
    sns.lineplot(x=x, y=norm.pdf(x, loc=mean), color='r', alpha=0.6, label='norm pdf', ax=ax)
    ax.axvline(mean, color='b')
    # shade the region of the predicted label (grey)
    if thresholds[predicted_label] == np.inf:
        ax.axvspan(thresholds[predicted_label - 1], ax.get_xlim()[1], alpha=0.2, color='k')
    elif thresholds[predicted_label - 1] == -np.inf:
        ax.axvspan(ax.get_xlim()[0], thresholds[predicted_label], alpha=0.2, color='k')
    else:
        ax.axvspan(thresholds[predicted_label - 1], thresholds[predicted_label], alpha=0.2, color='k')
    # shade the region of the true label (green)
    if thresholds[true_label] == np.inf:
        ax.axvspan(thresholds[true_label - 1], ax.get_xlim()[1], alpha=0.2, color='g')
    elif thresholds[true_label - 1] == -np.inf:
        ax.axvspan(ax.get_xlim()[0], thresholds[true_label], alpha=0.2, color='g')
    else:
        ax.axvspan(thresholds[true_label - 1], thresholds[true_label], alpha=0.2, color='g')
    for th in thresholds[1:-1]:  # skip the -inf / +inf endpoints
        ax.axvline(th, color='k', linestyle='--')
    ax.set_title(f'Row = {row}, True = {true_label}, Predicted = {predicted_label}')
plt.tight_layout();
In the figure above I have plotted eight predictions from the model together with the ground-truth labels. The green region corresponds to the true label, the grey region to the predicted ordinal value, and the blue line is the mean predicted by the model. One nice thing about this model is that it outputs a probability for each ordinal value.
Note that I have not split the dataset into train/test, since this demo was just to showcase how one can fit a model like this, how it works and the kind of output it gives. There is also a nice discussion here on what metrics are good for ordinal regression.
Ranking Learning
So now let's get back to the paper and try to explain it a little bit. I will focus mostly on the things I found most interesting, so feel free to check the paper for more details. The authors claim that casting ranking as ordinal regression solves an unnecessarily hard problem: no mapping to particular rank values or rank boundaries (the thresholds we mentioned above) is needed. This sentence should make a lot more sense now that we know ordinal regression. They propose instead to work with pairs of documents, associating to each pair a target probability $\bar{P}_{AB}$ that document A is preferred over document B.
Main ideas
- Learn a ranking function trained on pairs
- No mapping to particular rank boundaries
- Probabilistic cost function
The authors consider models $f:\mathbb{R}^d \mapsto \mathbb{R}$ such that the rank order of a set of documents is specified by the real values $f$ takes; specifically, $f(\bf{x_1}) \geq f(\bf{x_2})$ means that the model asserts that $x_1$ is to be ranked higher than $x_2$. Cross entropy is used as the cost function: $$ C_{ij} \equiv C(o_{ij}) = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}) $$ with $$ o_i \equiv f({\bf x_i})\\ o_{ij} \equiv f({\bf x_i}) - f({\bf x_j})\\ $$ and the logistic function is used to map the differences $o_{ij}$ to probabilities: $$ P_{ij} \equiv \frac{e^{o_{ij}}}{1 + e^{o_{ij}}} $$
The above model puts constraints on the possible choices of the $\bar P$'s: since the outputs are differences of a single function $f$, they have to be consistent, i.e. $o_{ik} = o_{ij} + o_{jk}$. Pushing this identity through the logistic function gives: $$ \bar P_{ik} = \frac{\bar P_{ij} \bar P_{jk}}{1 + 2\bar P_{ij} \bar P_{jk} - \bar P_{ij} - \bar P_{jk}} $$
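To see where this expression comes from, note that a target probability corresponds to an ideal output difference $\bar o_{ij} = \log\frac{\bar P_{ij}}{1-\bar P_{ij}}$; adding $\bar o_{ij} + \bar o_{jk}$ and mapping back through the logistic function gives $$ \bar P_{ik} = \frac{e^{\bar o_{ij} + \bar o_{jk}}}{1 + e^{\bar o_{ij} + \bar o_{jk}}} = \frac{\frac{\bar P_{ij}}{1-\bar P_{ij}}\cdot\frac{\bar P_{jk}}{1-\bar P_{jk}}}{1 + \frac{\bar P_{ij}}{1-\bar P_{ij}}\cdot\frac{\bar P_{jk}}{1-\bar P_{jk}}} = \frac{\bar P_{ij}\bar P_{jk}}{(1-\bar P_{ij})(1-\bar P_{jk}) + \bar P_{ij}\bar P_{jk}} = \frac{\bar P_{ij}\bar P_{jk}}{1 + 2\bar P_{ij}\bar P_{jk} - \bar P_{ij} - \bar P_{jk}} $$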
In the plot below we have the case where $\bar P_{ij} = \bar P_{jk} = P$. One can see, for example, that when $P = 0$ (meaning we are certain that $o_i \leq o_j$ and $o_j \leq o_k$, or in other words that $j$ is certainly ranked higher than $i$ and $k$ is certainly ranked higher than $j$), this certainty propagates to $\bar P_{ik}$. The same happens for $P = 0.5$ and $P = 1$: if we are completely uncertain about the order of $i$ and $j$ and about the order of $j$ and $k$ (that is, $P = 0.5$), then we are also completely uncertain about the order of $i$ and $k$. So complete certainty and complete uncertainty propagate as one would logically expect. Intermediate confidence, on the other hand, is amplified: take for example $\bar P_{ij} = \bar P_{jk} = 0.6$, then $\bar P_{ik} > 0.6$ (note the nonlinearity of the curve), which makes complete sense, in that if we are somewhat confident that $i$ should be ranked higher than $j$ and that $j$ should be ranked higher than $k$, then we are even more confident that $i$ should be ranked higher than $k$.
def prob_ij(distance):
    # logistic map from an output difference o_ij to a probability P_ij
    return np.exp(distance) / (1 + np.exp(distance))

def cost_function(p, p_ij):
    # pairwise cross-entropy cost for target probability p and model probability p_ij
    return -p * np.log(p_ij) - (1 - p) * np.log(1 - p_ij)

def prob_ik(p_ij, p_jk):
    # combined probability P_ik implied by P_ij and P_jk (formula above)
    return (p_ij * p_jk) / (1 + 2 * p_ij * p_jk - p_ij - p_jk)

# cost as a function of the output difference for target probabilities 0, 0.5 and 1
distances = np.linspace(-5, 5, 100)
p0_results = cost_function(0, prob_ij(distances))
p05_results = cost_function(0.5, prob_ij(distances))
p1_results = cost_function(1, prob_ij(distances))
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
x = np.linspace(0, 1, 100)
sns.lineplot(x=x, y=prob_ik(x, x), ax=ax, color='k')
ax.set_xlabel('P=P_ij=P_jk')
ax.set_ylabel('P_ik')
ax.axvline(0.5, linestyle='--', color='k', alpha=0.2)
ax.axhline(0.5, linestyle='--', color='k', alpha=0.2);
How much freedom is there to choose the pairwise probabilities?
The authors state a theorem saying that, given a set of documents and any permutation of it, specifying a target probability for each adjacent pair is necessary and sufficient to uniquely determine a target probability for every pair in the set. So let's see this:
def p_to_distance(p):
    # invert the logistic map: recover the output difference o from a probability P (the logit)
    return np.log(p / (1 - p))

collection = {'A', 'B', 'C', 'D', 'E'}
# random target probabilities for the four adjacent pairs (A,B), (B,C), (C,D), (D,E)
p = np.random.uniform(0, 1, 4)
p
p_to_distance(p)
- permutation -> 'A', 'B', 'C', 'D', 'E'
- adjacent probs -> 0.77713275, 0.79908694, 0.33050207, 0.67320075
- distances -> 1.24903487, 1.38059745, -0.70591514, 0.72269752
With these distances we can compute the distance between any two elements in our set. The distance between, say, the first and the third element would be $o_{AC} = o_{AB} + o_{BC} = 1.24903487 + 1.38059745 = 2.62963232$. In this way we could compute all the distances and order the set. For example, which element goes first, B or D? We know directly that B is ranked above C (since $o_{BC} > 0$) and that D is also ranked above C (since $o_{CD} < 0$), but to compare B and D we can compute $o_{BD} = o_{BC} + o_{CD} = 1.38059745 - 0.70591514 \approx 0.675 > 0$, so B is ranked above D. Equivalently, since $|o_{BC}| > |o_{CD}|$, we could have seen this without computing anything new. So, without taking E into account, we have so far A > B > D > C.
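To make that bookkeeping explicit, here is a small sketch (the variable names are mine) that fixes a reference score for A, turns the adjacent distances above into a score per document, and recovers the full order:
import numpy as np

# adjacent output differences o_AB, o_BC, o_CD, o_DE from the example above
items = ['A', 'B', 'C', 'D', 'E']
adjacent_o = np.array([1.24903487, 1.38059745, -0.70591514, 0.72269752])

# fix o(A) = 0 and walk down the permutation, so that position(i) - position(j) = o_ij for any pair
positions = np.concatenate(([0.0], -np.cumsum(adjacent_o)))
o = {item: pos for item, pos in zip(items, positions)}

# distance between any two items, e.g. o_BD = o_BC + o_CD
o_BD = o['B'] - o['D']
print(round(o_BD, 5))  # ~0.67468, positive, so B is ranked above D

# full ranking: higher position means ranked higher
ranking = sorted(items, key=lambda it: o[it], reverse=True)
print(ranking)  # ['A', 'B', 'D', 'C', 'E']
Including E, the implied order comes out as A > B > D > C > E.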
As I mentioned earlier, this paper is a bit old, so I do not think it is worth talking much about the architecture. I will just mention that they use two models: a one-layer network with no hidden units and no activation function, i.e. a linear model, and a two-layer neural network. They use the linear network in order to benchmark against other linear models, and this is actually one of the results of the paper: their linear model outperforms the other linear models, which they attribute to the pair-based probabilistic cost function. In the paper a neural network is used, but the ranking function can of course be any differentiable function.
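To close the loop with the paper's title, here is a minimal sketch of learning to rank with gradient descent on the pairwise cross-entropy cost. It uses a plain linear scorer $f(x) = w \cdot x$ and a single pair with target probability 1; this is my own toy illustration of the cost and its gradient, not the paper's architecture, data or training setup.
import numpy as np

def pairwise_grad_step(w, x_i, x_j, p_bar, lr=0.1):
    # one gradient step on C = -p_bar*log(P_ij) - (1-p_bar)*log(1-P_ij)
    # for a linear scorer f(x) = w @ x, where P_ij = sigmoid(f(x_i) - f(x_j))
    o_ij = w @ (x_i - x_j)
    p_ij = 1.0 / (1.0 + np.exp(-o_ij))
    grad = (p_ij - p_bar) * (x_i - x_j)   # dC/dw = (P_ij - p_bar) * (x_i - x_j)
    return w - lr * grad

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)

# we want x_i ranked above x_j with certainty (target probability 1)
for _ in range(1000):
    w = pairwise_grad_step(w, x_i, x_j, p_bar=1.0)

print(w @ x_i > w @ x_j)  # True: the scorer now ranks x_i above x_j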