Tuesday, November 12, 2019

What can we learn from tournament results?

Tournaments are fun for everyone! We catch up with old friends, play some good games of X-Wing, and crown a winner. After they're done, we can look at the results to see what to complain about for the next few weeks/months! :)

That raises an interesting question: how much information do tournament results give us about card or list strength?

I think the answer might be: not much. We can be more confident when we have more data, but we should be careful drawing firm conclusions from tournaments.

We've had a history of drawing hasty conclusions from tournament results. When 2.0 first launched, Wedge had the worst performance of the popular Rebel ships across several months and many tournaments. People thought he was a bad ship. In the span of just a few tournaments, he shot up to be the best-performing Rebel ship. He continues to be one of the best Rebel ships, even after his recent nerfs. Turns out, it was just hard to build a good Rebel list back then, and Wedge was really good.

When the January 2019 points came out, VTG-Ion Y-Wings were all the rage. A list built around them won the Hyperspace Qualifier at the Toronto System Open. We had post after post talking about the "Y-on" menace. But looking back on that meta, they never really performed at that level again. Horton sometimes showed up in a Rebel Beef list, but VTG Ion Turret Y-Wings otherwise didn't do too well after that first tournament.

With the January 2019 points, it was widely known that the Decimator was awful. I chuckled when Marc brought out Deci Whisper for a game, and you can imagine my surprise when it completely dismantled me two games in a row. Turns out, Rear Admiral Chiraneau can be very strong with the right combination of upgrades (Moff Jerjerrod and Darth Vader crew are key), and it took a strong list-builder like Dalli to see that.

When CIS first launched, there was a question of which Belbullab to run with a Vulture swarm. My model really loved Captain Sear, but the early results and the community favored Wat Tambor. Many months of experience later, it's no longer a question: the Sear Swarm is a noted top-tier list, and no one runs Wat Tambor anymore.

In the early days after the July 2019 points adjustment, Anakin Obi Ric won a big tournament. People were all up in arms about how broken Jedi with regen are. Now, it seems clear that list is just one good list in a field of good lists. Jedi with regen are strong, but so are many other ships.

I think tournament results can give us examples of viable lists that can do well or win, and a truly broken list could stand out in tournaments. Beyond that, it's hard to use tournament results to rank lists, and we should be very careful drawing conclusions because of data limitations. It's especially easy to overrate lists that make the top 2, and especially the winning list. The biggest problem is the small sample size of games compared to the large amount of variance (e.g. skill, matchups, dice, mind-games) in the game. Ideally, we'd also have some type of matchup-based, Elo-like rating for specific ship builds, not just final results aggregated across pilots and builds that can differ wildly in function and strength.

What do we want to know?

When we think about list strength, we have an experiment in mind. Take the same player, swap out their list, and assume they can play the lists equally well. How does their tournament performance change? We're interested in a causal effect, not merely the correlation between a list and winning.

We can imagine a list's strength as something abstract, but objective and real. The chance to win a game depends on factors including:
  • your list's average strength
  • opponent's list's average strength
  • matchup-specific list strength adjustment
  • your average skill
  • your list-specific skill adjustments (e.g. familiarity with list; range depends on skill floor/ceiling)
  • opponent's average skill
  • opponent's list-specific skill adjustments
  • net matchup-specific skill adjustments
  • time-dependent variation in your skill (e.g. rust, fatigue; may depend on list)
  • time-dependent variation in opponent's skill (e.g. rust, fatigue; may depend on list)
  • performance-dependent variation in your skill (e.g. tilt, nerves)
  • performance-dependent variation in opponent's skill (e.g. tilt, nerves)
  • random variation in player skill (e.g. brain farts)
  • random variation in dice luck
  • other random variation (e.g. 50/50 decisions, barely hitting/missing a rock)
The strength of your list affects your chance of winning any particular game, along with many other factors.
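
To make that more concrete, here's a minimal sketch (with entirely made-up weights and noise levels; nothing here is estimated from real data) of how those factors might combine into a single win probability:

```python
import math
import random

def win_probability(my_list_strength, opp_list_strength, matchup_adj,
                    my_skill, opp_skill, skill_noise_sd=0.5, dice_noise_sd=0.5):
    """Illustrative only: combine the factors above into one win probability.

    All inputs are on an arbitrary 'advantage' scale where 0 is average;
    the weights and noise levels are made up for the sketch, not estimated.
    """
    # Structural advantage: list strength gap, matchup adjustment, skill gap.
    advantage = (my_list_strength - opp_list_strength) + matchup_adj + (my_skill - opp_skill)
    # Random variation: brain farts, dice luck, 50/50 decisions, etc.
    advantage += random.gauss(0, skill_noise_sd) + random.gauss(0, dice_noise_sd)
    # Squash the total advantage onto a 0-1 probability with a logistic curve.
    return 1 / (1 + math.exp(-advantage))

# Example: a slightly stronger list, neutral matchup, evenly matched players.
print(win_probability(0.2, 0.0, 0.0, 0.0, 0.0))
```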

How can we figure out list strength?

Obviously, we can't just see how strong a list is; we have to figure it out somehow. We need something we can observe that corresponds with list strength. We can imagine things that correspond well with list strength, and things that don't. Printing out all the lists, throwing them down the stairs, and ranking them by how far they went would be a bad way to figure out the strength of lists.

There are two ways to judge how good a method for figuring out the "true" list strength is. First, we want it to be accurate. If the method gets things right on average, then we might say it's pretty good. If not, the method could give us something that looks like list strength at first glance but really isn't (the method is biased). Second, we want it to be precise. If a method gets things right on average but is all over the place most of the time, we'd need a lot of data before we can trust what it says.
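
Here's a quick simulated illustration of those two properties, using assumed numbers rather than anything real: method A is unbiased but noisy (a small sample of games), while method B is precise but biased (lots of games, but stronger players over-pick the list and inflate its observed rate).

```python
import random
import statistics

random.seed(0)
TRUE_STRENGTH = 0.55  # hypothetical "true" per-game win rate of the list we care about

def observed_win_rate(n_games, true_p):
    """Win rate observed over n_games games with true per-game win probability true_p."""
    return sum(random.random() < true_p for _ in range(n_games)) / n_games

# Method A: unbiased but imprecise -- a small, clean sample of games.
method_a = [observed_win_rate(20, TRUE_STRENGTH) for _ in range(1000)]

# Method B: precise but biased -- many games, with the observed rate
# inflated by a made-up 5 points of player-skill confounding.
method_b = [observed_win_rate(500, TRUE_STRENGTH + 0.05) for _ in range(1000)]

for name, est in [("A (unbiased, noisy)", method_a), ("B (biased, precise)", method_b)]:
    print("%s: mean %.3f, sd %.3f" % (name, statistics.mean(est), statistics.stdev(est)))
```

Method A is centered on the true 0.55 but bounces around a lot; method B barely moves but is centered in the wrong place, and no amount of extra data fixes that.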

Ideally, we would take every possible list and play lots of games against every other list with players of a broad range of skills. Failing that, we could randomly assign lists to players X weeks/months in advance of tournaments, let them do their stuff normally, and record their tournament performance. With a large number of tournaments, we could probably get a good picture of which lists are strong and which are weak.

Problems with our current methods

Instead, we often use tournament results. Many websites currently show final-result statistics (e.g. the average tournament percentile of lists that include a given card) for ship chassis, pilots, and upgrade cards. Are tournament results (e.g. average percentile or win rates) a good way to figure out how strong a list is, or are they likely to be wrong or suboptimal? They're probably closer to the X-Wing playtest sweatshop method than the gravity-stairs method, but how good are they?

If we think about this carefully, we can see some differences between tournament win-rate and average-percentile statistics and the numbers those idealized (and infeasible) methods would produce. Unfortunately, these differences mean tournament results are limited in what they can tell us about how strong a list is.

The biggest problem is the small sample size of tournament games compared to how much variance there is in the game. This creates several problems.

First, there's a lot of variance in the game, and without enough data, it's hard to draw strong conclusions about any results. If you shuffle a deck of poker cards, one of the cards will be on top. Is that card more likely to show up on or near the top, or is this just random? Without a lot of shuffles, it can be hard to tell.

In X-Wing, someone has to win the tournament. In general, we seem to give the winning list a lot more attention than other well-performing lists. With so few tournaments, it's hard to say without outside theory whether a list won because it was much stronger than other lists, because the player was better, or because their opponents made more mistakes. Right now, I believe most competitive lists run around 210-220 points of Academy Pilot value. That's less than a 5% difference, and variation in skill and dice can easily swamp that.
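
As a rough illustration, suppose a "strong" list's edge works out to a 55% per-game win probability against the field, versus 50% for an average list (both numbers are assumptions, and the binomial calculation below ignores Swiss pairings entirely). The gap in how often each finishes 5-1 or better over six rounds is small enough that a handful of tournaments won't reliably separate them.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k wins in n independent games with per-game win prob p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed per-game win probabilities, not measured from real data.
for label, p in [("strong list (55%)", 0.55), ("average list (50%)", 0.50)]:
    print("%s: chance of going 5-1 or better over 6 rounds = %.1f%%"
          % (label, 100 * prob_at_least(5, 6, p)))
```

The "strong" list comes out around 16% and the average one around 11%, so even a meaningful edge only nudges the odds of a headline finish.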

It's even more difficult to draw conclusions from tournaments because there are differences in player skill. It may look like a list generally performs well, but that may be because it's usually played by a stronger player. In fact, if we don't have multiple people playing the same list in the data, it's impossible to separately identify the list's strength from the player's skill. We'd be trying to pin down two unknowns with a single statistic.

Second, we don't have enough data to sample the entire range of cards and lists. There are over 500 pilots in the game. A list usually has multiple pilots, but most tournaments have an order of magnitude fewer players than there are pilots in the game. This is compounded by transformational upgrades that dramatically change how a pilot functions. For example, Obi-Wan Kenobi might as well be flying a different ship with the Calibrated Laser Targeting upgrade versus the Delta 7-B upgrade. Most websites would give you the average performance of Obi-Wan Kenobi and the average performance of Delta 7-B separately, but Delta 7-B may have a very different effect on Obi-Wan Kenobi than on the generic Jedi Knight.
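
As a rough upper-bound illustration of the combinatorics (ignoring factions, points, upgrades, and duplicate generics, so the real count of legal squads is different but still enormous), just choosing distinct pilots from a 500-pilot pool gives:

```python
from math import comb

PILOT_POOL = 500  # rough count; the exact number doesn't matter for the point
for ships in (2, 3, 4, 5):
    print("%d-ship squads: %s distinct pilot combinations (before any upgrades)"
          % (ships, format(comb(PILOT_POOL, ships), ",")))
```

No tournament scene generates anywhere near enough games to sample even a fraction of that space.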

Similarly, a card's strength may be hard to observe since it exists in a list with other cards. In the early days of 2.0, Wedge was strong but his wingmates often couldn't keep up, so his performance looked weak. More concerning is the preliminary evidence that high-initiative pilots have dramatically different win rates depending on whether or not they are the first player. We'll never have enough tournament game data to infer the strength of every list, simply because of how many different lists there are.

This can be fine if all the unused ships and lists are weak, but that's not necessarily the case. Part of a ship's play rate depends on its strength, but play rate also depends on "coolness factor," previous tournament performance, and whether it has misleading builds. There are simply too many ships and upgrades to explore, so it's not surprising if some strong options are missed.

For example, I'm pretty sure the generic E-Wing is strong, but no E-Wings were played at Worlds 2019. The problem is a combination of the generic E-Wing's poor strength historically going back to 1.0, misleading builds and ideas of what it does (it's a jouster), and the fact that most E-Wing fans are Corran fans (Corran is the weakest of the E-Wings). I'm pretty sure FFG will buff the E-Wing again in January and we'll have E-Wing-Pocalypse for 6 months :). Similarly, ships like Rear Admiral Chiraneau and Latts Razzi may be strong in certain contexts and with the right upgrades, and it may take a keen listbuilder to realize this before they get played.

Besides the problems of small sample size, there's a question of whether the commonly reported statistics like win rate or average percentile are good measures of list strength. When we look at the list of things that affect your chance of winning, it's clear that much of it is left out of these statistics. If what is left out is correlated with list strength, then our method of figuring out list strength would actually be telling us some unholy amalgamation of list strength and other stuff rather than what we're really interested in, which is the causal effect of switching lists (this is known as omitted variable bias).

There are several ways this problem can show up. First, we can imagine that stronger players are more likely to play strong lists. If this isn't accounted for, a strong list's win rate will reflect both higher list strength and higher player skill. As such, win rates and average percentiles are likely to overstate a strong list's strength and understate a weak list's strength.
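
To see how much this first effect alone can distort things, here's a small simulation sketch with made-up numbers: both lists are exactly equally strong, but when stronger players disproportionately pick list A, its raw win rate comes out higher anyway.

```python
import random

random.seed(1)
N_GAMES = 100_000

def simulate(p_strong_player_picks_a):
    """Both lists are equally strong; only player skill differs.

    Assumed numbers: strong players win 60% of their games, average players 50%."""
    wins = {"A": 0, "B": 0}
    games = {"A": 0, "B": 0}
    for _ in range(N_GAMES):
        strong_player = random.random() < 0.5
        if strong_player:
            pick = "A" if random.random() < p_strong_player_picks_a else "B"
        else:
            pick = "B" if random.random() < p_strong_player_picks_a else "A"
        p_win = 0.60 if strong_player else 0.50
        games[pick] += 1
        wins[pick] += random.random() < p_win
    return {k: round(wins[k] / games[k], 3) for k in wins}

print("List choice unrelated to skill:", simulate(0.5))
print("Strong players favor list A:   ", simulate(0.8))
```

With no correlation, both lists land around a 55% win rate; once strong players favor list A, its observed win rate climbs to roughly 58% and list B's drops to roughly 52%, even though the lists are identical in strength.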

Second, in a Swiss tournament, you play against other players with your same record. That means stronger lists are more likely to face other strong lists, while weaker lists are more likely to face other weak lists. A list that goes 4-2 by winning its first four games against progressively stronger lists and losing the last two against strong lists is very different from one that goes 4-2 by losing its first two games against average lists and winning the next four against weak lists. If this isn't accounted for, then win rates and average percentiles will understate the strength of strong lists and overstate the strength of weak lists.

These effects bias the win rate figure in opposite directions. It'd be nice to say they cancel out, but it can be hard to tell which effect is stronger; it could be that one is very large and the other is weak. That makes it hard to know whether differences in win rates reflect real differences in list strength or just the net effect of these biases.
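
To get a feel for the Swiss-pairing effect in isolation, here's a rough simulation sketch with made-up numbers: half the field brings a "strong" list, half an "average" one, every player has identical skill, and we compare the strong lists' observed win rate under Swiss pairings versus purely random pairings. The pairing rule and the strength-to-win-probability mapping are crude simplifications, not a model of real tournament software.

```python
import random

random.seed(2)

def win_prob(strength_a, strength_b, scale=0.5):
    # Map a strength gap onto a per-game win probability (Elo-style curve).
    return 1 / (1 + 10 ** (-(strength_a - strength_b) / scale))

def run_event(pairing, n_players=64, rounds=6, strong_gap=0.25):
    # Half the field has a "strong" list, half an "average" one (made-up gap);
    # all players have identical skill, so only list strength and pairings matter.
    strengths = [strong_gap] * (n_players // 2) + [0.0] * (n_players // 2)
    wins = [0] * n_players
    strong_wins = strong_games = 0
    for _ in range(rounds):
        ids = list(range(n_players))
        if pairing == "swiss":
            # Pair players with similar records, roughly like Swiss does.
            ids.sort(key=lambda i: (-wins[i], random.random()))
        else:
            random.shuffle(ids)
        for a, b in zip(ids[::2], ids[1::2]):
            a_won = random.random() < win_prob(strengths[a], strengths[b])
            wins[a] += a_won
            wins[b] += not a_won
            for i, won in ((a, a_won), (b, not a_won)):
                if strengths[i] > 0:
                    strong_wins += won
                    strong_games += 1
    return strong_wins, strong_games

for pairing in ("random", "swiss"):
    total_wins = total_games = 0
    for _ in range(500):
        w, g = run_event(pairing)
        total_wins += w
        total_games += g
    print("%-6s pairings: strong-list win rate %.3f" % (pairing, total_wins / total_games))
```

In this setup the strong lists' observed win rate comes out lower under Swiss pairings than under random pairings, because they spend more of the event beating up on each other; how big the gap is depends heavily on the assumed numbers.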

There's also a weird issue with simply using play rate statistics without adjusting for points. For example, pretend Captain Seevor and fully-loaded Rebel Han Solo were equally strong. Captain Seevor is much easier to throw into a list as cheap filler, while Fat Han is more than two-thirds of the points budget and only goes into lists that feature him. Just looking at raw play rate would overstate the prominence of cheap filler ships and understate that of expensive or synergy-reliant ships.
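
One simple adjustment, sketched below with hypothetical counts and costs, is to weight each appearance by the share of the squad's points the ship takes up rather than just counting list inclusions.

```python
# Hypothetical appearance counts and point costs, purely for illustration.
ships = {
    "Captain Seevor (filler)": {"lists": 40, "points": 28},
    "Fat Han (centerpiece)":   {"lists": 25, "points": 140},
}
TOTAL_LISTS = 100   # hypothetical field size
SQUAD_POINTS = 200  # points budget per squad

for name, s in ships.items():
    raw_rate = s["lists"] / TOTAL_LISTS
    points_share = s["lists"] * s["points"] / (TOTAL_LISTS * SQUAD_POINTS)
    print("%-26s raw play rate %2.0f%%, points-weighted share %4.1f%%"
          % (name, 100 * raw_rate, 100 * points_share))
```

By raw play rate the filler ship looks dominant; weighted by how much of each squad it actually occupies, the centerpiece comes out ahead.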

What can we do?

With these limitations, it's hard to find differences in list strength from tournament data beyond the extreme outliers in either direction. So, what can we do about it?

First, we should remember the value of patience. We should pay less attention to lists that win a single tournament. For overpowered lists, we should be looking at lists that consistently do well across tournaments. For viable lists, it's still best to look at ships that perform consistently across tournaments, but we can also look at lists that performed well at single tournaments (e.g. 4-2 and above) for ideas.

Second, I'd like to see continued innovation in the tournament results reporting space. At the very least, I'd like to see builds with transformational upgrades reported separately. The statistics for Delta 7-B Obi should be separate from those for CLT Obi, just like X-Wing Luke would be very different from a hypothetical A-Wing Luke. Other examples include Supernatural Reflexes, Special Forces Gunner, and maybe even Afterburners. It may not be possible with limited data, but a matchup-based Elo rating may provide more accurate information than raw win rates.
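
For the Elo idea, the core update itself is simple. The sketch below (with hypothetical archetype names and results) applies the standard Elo formula to list archetypes rather than players; the genuinely hard parts, which it punts on, are defining archetypes consistently, accounting for player skill, and getting enough games per archetype.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Update both archetypes' ratings after one game (a_won is 1 or 0)."""
    expected_a = expected_score(rating_a, rating_b)
    return rating_a + k * (a_won - expected_a), rating_b + k * (expected_a - a_won)

# Hypothetical game log: (archetype A, archetype B, did A win?)
games = [
    ("Delta 7-B Obi + friends", "Sear Swarm", 1),
    ("Sear Swarm", "Rebel Beef", 1),
    ("Delta 7-B Obi + friends", "Rebel Beef", 0),
]

ratings = {}
for a, b, a_won in games:
    ra, rb = ratings.get(a, 1500), ratings.get(b, 1500)
    ratings[a], ratings[b] = update_elo(ra, rb, a_won)
print(ratings)
```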

Finally, I'd like to see more theory-based approaches to identifying list strength. The benefit of a theory-driven approach is that, while it has to be informed by the data, it doesn't rely on the data alone and so can avoid some of these data limitations. My Ship Effectiveness Model takes a crack at this, but it's not perfect. Some of it is based on fundamental math concepts, but a large chunk reflects my own judgement about ships, with the computer doing the work of applying a consistent standard across 600+ ship builds. It's a huge effort, but I'd love it if someone else also made a serious attempt at modeling ship strength. If nothing else, it'd be interesting to see other people dig into how the model works so they can apply their own assumptions to it.

Anyway, if you read this far, thanks! :) I know this can be a dry topic, and hopefully I shared something you find interesting.
