Nov 17, 2010 19:28

I half-heartedly competed in this Kaggle competition (http://kaggle.com/chess?viewtype=results), which asked entrants to predict chess game outcomes based on observed past games. The penalty was a variant of RMSE (i.e. L2) against aggregated win (1) / loss (0) outcomes. This penalty rewards conservatism: in the absence of strong contrary information, the best guesses stay near the baseline win rate (~0.55-0.60 for White). An absolute-deviation (L1) loss would encourage "bolder" guesses, and an entropy-type loss would have necessitated predicting outcomes rather than probabilities. I am not convinced that RMSE is the best penalty, but that's not in my bailiwick.
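To make the conservatism point concrete, here is a toy sketch (my own illustration, not anything from the competition): for 0/1 outcomes, the constant prediction minimizing L2 loss is the mean, while the L1-optimal constant is the median, which for binary data snaps to 0 or 1.

```python
# Toy illustration (not competition code): compare the constant predictions
# that minimize L2 (RMSE-style) vs. L1 loss on simulated binary outcomes.
import numpy as np

rng = np.random.default_rng(0)
outcomes = (rng.random(10_000) < 0.57).astype(float)  # White wins ~57%

grid = np.linspace(0.0, 1.0, 201)
l2 = [np.mean((outcomes - p) ** 2) for p in grid]
l1 = [np.mean(np.abs(outcomes - p)) for p in grid]

print("L2-optimal constant:", grid[int(np.argmin(l2))])  # ~0.57, the base rate
print("L1-optimal constant:", grid[int(np.argmin(l1))])  # 1.0, a "bold" guess
```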

My team was "acceptablelies" and we came in 73rd out of 258, which is OK given that we basically gave up (too much other work to do) about a third of the way into the competition. If we'd extended our premise to include what was in the literature (e.g. some kind of dynamic model describing a player's evolution over time), I think we could have reached ~30th place. But that isn't the point.

What is remarkable is that our model included no temporal terms at all and still came out competitive with the currently standard Glicko rating system: Glicko came in 72nd, beating our RMSE by only 0.0003. For all practical purposes, the two models are indistinguishable in predictive performance. Of course, it may be that they perform better on distinct subsets of games, in which case even a naïve mixture would be fruitful (a quick sketch below).
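For completeness, the "naïve mixture" is just a convex blend of the two forecasts; a minimal sketch, where glicko_pred and tier_pred are hypothetical arrays of per-game win probabilities (we never actually ran this):

```python
# Hypothetical sketch of a naive mixture of two probability forecasts.
import numpy as np

def rmse(pred, actual):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

def naive_mixture(glicko_pred, tier_pred, w=0.5):
    """Convex blend of two forecasts; w would be chosen on held-out games."""
    blend = w * np.asarray(glicko_pred) + (1 - w) * np.asarray(tier_pred)
    return np.clip(blend, 0.0, 1.0)
```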

Glicko uses a semi-elaborate (but smooth!) dynamic response model in which each player has a rating that is allowed to vary smoothly over their observed performance. By contrast, our system used only a discrete ranking (top tier, 2nd tier, ... down to 6th tier), fit by a simple energy-minimization heuristic. The ranking was then used to generate a two-way linear factor model, whose output was passed through a squashing function (0-1 clipping; oddly enough, a logistic made things worse). An orthogonal L2 penalization term (i.e. an independent Gaussian prior) was used to prevent overfitting.
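As a rough sketch of the model's shape (the parameter names and the exact functional form here are my reconstruction from the description above, not the actual code we submitted):

```python
# Reconstruction/sketch of the tier-based factor model described above.
# Assumptions: 6 tiers, additive tier effects plus a White-advantage term,
# 0-1 clipping as the squashing function, ridge (Gaussian-prior) penalty.
import numpy as np

N_TIERS = 6

def predict(white_tier, black_tier, params):
    """Predicted score for White, given each player's tier (0 = top)."""
    raw = (params["baseline"]
           + params["white_adv"]
           + params["strength"][white_tier]
           - params["strength"][black_tier])
    return np.clip(raw, 0.0, 1.0)  # clipping beat a logistic squash for us

def penalized_loss(preds, actuals, params, lam=0.1):
    """Squared-error fit plus an independent L2 (Gaussian prior) penalty."""
    mse = np.mean((np.asarray(preds) - np.asarray(actuals)) ** 2)
    return mse + lam * np.sum(params["strength"] ** 2)
```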

Despite this simple-minded approach (we had only ~62 parameters (**)) we still essentially matched Glicko (*). Thus, from a purely predictive standpoint, there is perhaps evidence that these two hypotheses are in a "dead heat": 'players gradually change over time and are mostly comparable' versus 'players don't change much, and can further be ordered into a few mostly identical groups'.

Given that we match in predictive performance, there is something to be said for our relatively parsimonious ~40-parameter model as opposed to modeling a continuous trajectory for each player. Although we would have done better by including a "temporal twist", I think smooth/continuous methods could also benefit from a discretization/factor approach. In the real world the two are probably quite complementary (even if the combination can seem a bit hodge-podge). I think this is a good lesson to have learned, and I would be surprised if the winning teams did not use such a combination.

*: I'm fairly convinced that if we'd used a more principled approach to selecting the rankings (factors), we'd have placed around 55th.

**: OK, there are many more parameters than that, since each player has a 6-way tier membership; in that sense, the model has O(n) parameters. However, counted this way, Glicko has O(n*t) parameters, since each player gets their own trajectory across time.