Saturday, May 15, 2021

Can we improve the way we talk about X-Wing ships?

(I hope Betteridge's Law doesn't apply.)

This is an article where I say there's a problem and have no idea how to fix it.

[Image: The source of all of society's problems]

Imagine a discussion about Punishing One Dengar. He was extremely weak at 74 points at release and is now a competitive option at 58 points. If someone had asked at release why Punishing One Dengar was weak, the responses would probably have mentioned his awful dial, his clunky big base, not really having a turret, the lack of extra dice mods, dying under focus fire, etc. If someone asked why Punishing One Dengar is strong now, the responses would probably mention Initiative 6, above-average health, his double-tap ability, that you can learn to fly his clunky dial, etc.

Besides the problems with cherry-picking and confirmation bias, the funny thing is that both sides are right and always have been. Punishing One Dengar has had all of these features the whole time, when he was weak and when he was strong. What rarely gets discussed is the only thing that actually changed: his points cost, and how many points he is worth compared to that changing cost. If his features are worth, say, roughly 60 points, then he was badly overcosted at 74 and is a bargain at 58, with no change to the ship itself. A discussion of features alone isn't useful for understanding why Dengar is good or bad.

(Edit to add: there is one case where I find a discussion of features helpful, and that's when I didn't know about a certain interaction with a ship. But even then, the discussion rarely covers exactly how beneficial that interaction is or how often it comes up.)


It's really hard to talk about whether a ship is good or bad. Even comparing two vanilla ships often requires non-trivial math and several runs of a dice calculator to get an accurate comparison. Conditional abilities are harder still, because we also have to guesstimate the chance the ability triggers, which means calculating even more dice probabilities. Talking about less numerical features like arc-dodging or dials is harder yet.
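
To make the "non-trivial math" concrete, here's a minimal sketch in Python of what a dice calculator does. It assumes the standard X-Wing dice faces, that both sides hold a focus token and always spend it, and no rerolls or other mods; a real calculator handles optimal token spending and many more modifiers.

```python
import random

ATTACK_DIE = ["hit"] * 3 + ["crit"] + ["focus"] * 2 + ["blank"] * 2  # red die faces
DEFENSE_DIE = ["evade"] * 3 + ["focus"] * 2 + ["blank"] * 3          # green die faces

def expected_damage(n_red, n_green, trials=200_000):
    """Monte Carlo estimate of expected damage for one focused attack
    against a focused defender (tokens always spent, no rerolls)."""
    total = 0
    for _ in range(trials):
        atk = [random.choice(ATTACK_DIE) for _ in range(n_red)]
        dfn = [random.choice(DEFENSE_DIE) for _ in range(n_green)]
        # Spending a focus token converts all focus results.
        hits = sum(r in ("hit", "crit") for r in atk) + atk.count("focus")
        evades = dfn.count("evade") + dfn.count("focus")
        total += max(hits - evades, 0)
    return total / trials

# e.g. a 3-dice attack into 3 agility, both sides holding focus:
print(round(expected_damage(3, 3), 2))
```

Even this simplified version has to be re-run for every combination of dice counts and tokens, which is exactly the tedium that makes casual comparisons hard.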

Even when we can convey how effective a ship is, whether it's good or bad depends crucially on its points cost. That means dividing two numbers that are not friendly for division (is 0.9 expected damage per round at 62 points better than 0.8 at 54? That's roughly 0.0145 versus 0.0148 damage per point), and we have to redo the math every 6 months when points change.

I noticed this most recently when talking with Raithos about Darth Vader in the TIE Defender after a test game. It was an unproductive discussion of our feelings, some head-sims that were probably in completely different places, and whether the dice or strategy in our sample size of one favored one side or the other. I don't think either of us changed our minds on the ship after our discussion.


(I remembered this after posting, so I'm editing it in now.) One option is to compare ships to other ships, which makes it easier to weigh a ship's effectiveness against its point cost. The challenge is that the comparison ship still has to be evaluated. For example, a recent post argued the First Order Provocateur was underpowered by comparing it to a Saber Squadron TIE Interceptor; my evaluation is that the First Order Provocateur is one of the best options in the game and the Saber is bordering on overpowered. This isn't too problematic if we're just trying to find the best ships in the game and stick to comparison ships that are known to be strong. Still, that limits the number of usable comparison ships, and thus the applicability of the method.


What if we just talk about tournament results? Besides the fact that results can't tell us what to play before the event happens, I stand by my previous article on why this may not work. Imagine tournaments as shuffling a deck of cards, where we want to know whether a certain card is more likely to end up on or near the top after a shuffle. We shuffle the deck and the Jack of Hearts is on top. Think about how many shuffles we'd need to figure out whether the Jack of Hearts is actually more likely to be near the top or whether this was complete luck. We don't have that many tournaments/shuffles.

Even with the tournaments we have, we often don't pay enough attention to the fact that the Queen of Spades was #2 or the 8 of Clubs was #7. On top of that, think of all the pilots in the game, or the even vaster number of possible lists, only a small portion of which appear in any single tournament: a tournament doesn't even shuffle the full deck. And if there's a systematic way players of varying strengths pick lists, then even an infinite sample size would give us a biased estimate of how strong lists are.
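
To put a rough number on the shuffle analogy, here's a sketch in Python. The setup is an assumption for illustration: under the null hypothesis a given card tops a 52-card deck with probability 1/52, and we want to detect a card that is twice as likely (2/52), using the standard one-proportion sample-size formula at 5% one-sided significance and 80% power.

```python
from math import sqrt

p0 = 1 / 52   # null: the card tops the deck as often as any other
p1 = 2 / 52   # alternative: the card is twice as likely to be on top

Z_ALPHA = 1.645  # 5% one-sided significance
Z_POWER = 0.84   # 80% power

# Standard sample-size formula for a one-proportion test:
n = ((Z_ALPHA * sqrt(p0 * (1 - p0)) + Z_POWER * sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(f"shuffles needed: {n:.0f}")  # ~400
```

Even for a card that is twice as likely to end up on top, we'd need on the order of 400 shuffles, and no X-Wing season comes close to 400 comparable tournaments.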


Obviously, this is all just a plug for my model, right? Well, sort of. For example, version 1.9 of my model rates the TIE Silencer "Avenger" at 57 points, which is roughly fair. Chris Allen, on a recent Fly Better Podcast episode, thinks Avenger is much better than that. We can dig into the model and see that I based Avenger's strength on his ability triggering once per game on average and granting an extra dice mod for his attack when it triggers. Chris would likely point out that I got it wrong: Avenger is usually flown in 5-ship lists, so his ability would trigger more often than once a game on average, and being able to reposition makes the ability more valuable than a mere dice mod. We can have a productive conversation about how often the ability triggers: I might say that sometimes Avenger dies first, sometimes the ability triggers while Avenger is stressed or can't benefit from the action, and sometimes none of your ships die during the game and you win or lose on time. But in the end, the focused area of disagreement makes it easy to change my mind or at least understand why we disagree. If I give Avenger 1.5 uses of his ability and value the benefit halfway between the extra dice mod and a full initiative-7 coordinate, then I'd think Avenger is worth 62 points (+10% over his current cost of 56 points). What an improvement this is over vague talks about feelings or generic listings of features!
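
As a sketch of how this kind of disagreement narrows down to numbers: the per-game value of an ability is roughly (expected triggers) x (points per trigger). The constants below are back-solved purely to match the totals above (an assumed 54-point baseline, a dice mod worth ~3 points per game, a coordinate worth ~7); they are not values from the actual model.

```python
# Illustrative only: constants back-solved to match the article's totals,
# not taken from the actual model.
BASE_COST = 54          # assumed value of the chassis without the ability
DICE_MOD_VALUE = 3.0    # assumed points for one extra dice mod per game
COORDINATE_VALUE = 7.0  # assumed points for a full I7 coordinate per game

def avenger_cost(expected_triggers, points_per_trigger):
    return BASE_COST + expected_triggers * points_per_trigger

# My reading: ~1 trigger/game, each worth a dice mod -> ~57 points.
print(avenger_cost(1.0, DICE_MOD_VALUE))
# Chris's reading: ~1.5 triggers, worth halfway between mod and coordinate -> ~61.5.
print(avenger_cost(1.5, (DICE_MOD_VALUE + COORDINATE_VALUE) / 2))
```

The point isn't these exact constants; it's that the disagreement reduces to two estimable inputs instead of feelings.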

The model is great for talking about ship strength -- for the two people in the world who understand how it works. It's a messy and poorly-documented jumble of equations and assumptions. I often forget how some of the more obscure parts of it work. It's exponentially harder to read someone else's code and understand their logic. Otherwise, you're just taking my numbers at face value and relying on my judgement. I've spent more time systematically thinking about and evaluating X-Wing ships than most people (than everyone?), but many different people will have a better evaluation of a specific ship they're very familiar with. Anyone who thinks I got things exactly right in my model should look at how many versions there have been :).

And there are definitely ships the model gets wrong. For example, version 1.9 of the model values Eta-2 Anakin at 45 points, a whopping -20% difference from his current point cost of 56 points. From the games I've seen, including Paul Heaver's VASSAL League games, and the one game in which I've play-tested him, I know Eta-2 Anakin is almost certainly better than that. But is Anakin average, competitive, among the best options in the game, or broken OP? I have no idea, and I probably won't know until there's more data on the ship or I figure out what my model (and by extension, I) got wrong about how to evaluate him. In the meantime, I don't know how to have a productive conversation about how strong this ship is.


Do I have any ideas for how to solve this? Eh. I've written some articles in the Evaluation and Calculation series to try to explain some of the math behind evaluating ships, but that's still not easy to have a discussion about. I've yet to write about more difficult topics like arc-dodging, and the articles are a bit time-consuming to write. I've thought about creating a table (actually 6 tables, one for each initiative) of fair vanilla ship point values for attack and hull combinations; that still requires some assumptions about the meta's attack and defense distributions, but it could help establish a baseline for further conversation. And I could also expose more of the model's intermediate calculations, such as durability and damage output, to show how I'm getting results. All of these solutions are very mathy. I'm not sure any of them would solve the broader problem, especially in more casual conversations. Hopefully some more clever people will come up with a good solution for this :).
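
For what it's worth, here's a rough sketch of what generating one of those tables (for a single initiative) could look like. Every constant in it is an illustrative assumption rather than a value from my model: the meta-average attack and agility, the crude expectation-difference damage formula, and the base/scale constants that convert effectiveness into points.

```python
# Sketch: fair vanilla point values for one initiative, indexed by attack
# dice and total health. All constants are illustrative assumptions.

P_HIT = 6 / 8    # P(one focused red die contributes a hit)
P_EVADE = 5 / 8  # P(one focused green die contributes an evade)

def avg_damage(n_red, n_green):
    # Crude expectation difference; a real table would use full dice math.
    return max(n_red * P_HIT - n_green * P_EVADE, 0.1)

BASE = 10.0   # assumed points for any chassis (dial, actions, board presence)
SCALE = 13.0  # assumed points per unit of sqrt(offense x durability)
META_DEFENSE = 2                        # assumed meta-average agility
INCOMING = avg_damage(3, META_DEFENSE)  # damage/round from a meta-average attack

print("atk/hp" + "".join(f"{hp:>6}" for hp in range(3, 9)))
for atk in (2, 3, 4):
    offense = avg_damage(atk, META_DEFENSE)  # our damage output per round
    # durability = rounds survived = hp / INCOMING; cost grows with
    # the geometric mean of offense and durability.
    row = [BASE + SCALE * (offense * hp / INCOMING) ** 0.5 for hp in range(3, 9)]
    print(f"{atk:>6}" + "".join(f"{p:>6.0f}" for p in row))
```

Even this toy version makes its assumptions explicit (the meta-average attack, the damage formula, the scaling), which is exactly what could make disagreements about such a table productive.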