Voltorb Flip: Overanalyzing a Minigame in Pokemon HeartGold and SoulSilver -- Part 4
Building a Better Board, Decision Trees, Optimality, and a Retrospective
Here we are in the final post of this four-part series on Voltorb Flip. In Part 1 of this series, we explored deterministic rules for assisting us in our quest to beat Voltorb Flip. In Part 2, we developed a heuristic algorithm that succeeded at clearing 71% of Voltorb Flip boards. In Part 2.5, we demonstrated that our heuristic algorithm is statistically superior to at least one other simpler algorithm. In Part 3, we developed an Optical Character Recognition engine to load the Voltorb Flip board into our algorithm automatically. So, that’s it, right? We’ve developed a great algorithm for playing and semi-automated the playing of it. We even have our Slugma and Shinx from the Game Corner!
Maximum Effort, Minimum Reward
Of course we’re not done; there are tons of questions remaining!
Why does our algorithm perform better than Evin Jaff’s algorithm?
My board generation process is actually slightly different from the way real Voltorb Flip boards are generated. Does that account for the difference?
Our Figure of Merit algorithm from Part 2 is surprisingly good. But is it optimal? What characteristics should an optimal algorithm have?
Given some constraints on a 25 tile board, can we design an algorithm which will construct all valid Voltorb Flip boards?
Building a Better Board
The way I construct Voltorb Flip boards for algorithm testing is slightly different from the way real Voltorb Flip boards are constructed.
Here’s the Level 1 and Level 2 table again:
My method for constructing Level 2 boards is as follows. Take the average number of 2 tiles (3), the average number of 3 tiles (2), and the average number of Voltorbs (7). Divide each by 25, and you get the probability of each tile being a 2 (12%), a 3 (8%), or a Voltorb (28%); the remaining 52% of the time, the tile is a 1. I then randomly assign each tile according to that probability distribution. Let’s call this the “probabilistic” method.
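Here’s a minimal sketch of the probabilistic generator, assuming Voltorbs are encoded as 0 (that encoding is just my convention for these examples, not anything from the game):

```python
import numpy as np

def generate_probabilistic_board(rng=None):
    """Assign each of the 25 tiles independently from the averaged Level 2 distribution."""
    rng = rng or np.random.default_rng()
    values = [0, 1, 2, 3]               # 0 = Voltorb
    probs = [0.28, 0.52, 0.12, 0.08]    # P(Voltorb), P(1), P(2), P(3)
    return rng.choice(values, size=(5, 5), p=probs)
```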
The way Voltorb Flip boards are actually constructed is that the game randomly assigns you one of the five possible cases listed above (one 2 tile, three 3 tiles, and seven Voltorbs, for example) and then randomly places those tiles on the 25-tile board. This method, while very similar on average, is not exactly the same as mine. Let’s call this the “semi-probabilistic” method, modify our algorithm to use it, and see if we get different success rates.
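And a sketch of the semi-probabilistic generator. Note that I’ve only filled in the one case quoted above; the other four entries would come from the game’s Level 2 table:

```python
import numpy as np

# Illustrative case list: (number of 2s, number of 3s, number of Voltorbs).
# Only the example case from the text is included; fill in the rest from the table.
LEVEL_2_CASES = [(1, 3, 7)]

def generate_semi_probabilistic_board(cases=LEVEL_2_CASES, rng=None):
    """Pick one of the Level 2 cases, then scatter exactly those tiles on the board."""
    rng = rng or np.random.default_rng()
    n_twos, n_threes, n_voltorbs = cases[rng.integers(len(cases))]
    tiles = [2] * n_twos + [3] * n_threes + [0] * n_voltorbs   # 0 = Voltorb
    tiles += [1] * (25 - len(tiles))                           # everything else is a 1
    return rng.permutation(tiles).reshape(5, 5)
```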
Here’s the result of our algorithm on the probabilistic method:
Recalling that our standard deviation for 10,000 games should be 0.45%, this is well within one standard deviation of our previous result, a 70.83% win rate.
After writing some code to generate boards correctly, we get:
Wow, that’s actually a much better success rate! It’s four standard deviations better — way above the fluctuations expected from noise. Clearly, the way you generate the boards does make a difference.
We can now answer the question we posed earlier: Does the difference in board generation method account for the difference between my algorithm and the algorithms studied by Jaff et al.? We answer in the negative — actually, it seems proper level 2 board generation exaggerates the difference.
A token effort at replicating previous work.
Can we replicate the success rates of Jaff’s algorithms? Unfortunately, Jaff et al. don’t tell us exactly how their algorithms work, but we know the general idea. The “Maximum Points” algorithm is probably quite similar to our R score algorithm, where we calculate the remaining reward in each line (equivalent to the remaining points in each line) and use that as the Figure of Merit for choosing a tile. Let’s try that out on the semi-probabilistic boards and see how it works:
Much worse, and not too far off from the 62% achieved by Jaff et al. Both intuitively and computationally, I’m convinced that the R score algorithm is likely close to the “Maximum Points” algorithm implemented by Jaff et al.
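For concreteness, here is roughly what an R-score-style figure of merit could look like. This is my own sketch, not Jaff’s code: I’m assuming the remaining reward in a line is its printed point total minus the tile values already flipped in it, and that a tile’s score is the sum of its row and column rewards:

```python
import numpy as np

def r_score(board, flipped, row_sums, col_sums):
    """board: 5x5 flipped values (0 where unknown), flipped: 5x5 bool mask,
    row_sums / col_sums: the printed point totals along the board's edges."""
    row_r = row_sums - (board * flipped).sum(axis=1)   # remaining reward per row
    col_r = col_sums - (board * flipped).sum(axis=0)   # remaining reward per column
    fom = (row_r[:, None] + col_r[None, :]).astype(float)
    fom[flipped] = -np.inf                              # never re-flip a tile
    return fom                                          # flip the argmax of this
```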
Decision Trees, Branching and the BEAST
Our Figure of Merit algorithm is pretty good. But is it optimal? In all likelihood, it isn’t. How do we find an optimal algorithm? I don’t know, but I do know how to brute-force this problem.
Consider the following half-solved board:
This board is actually deterministic, but let’s pretend it isn’t. Given this board, how do we construct all possible boards?
First of all, why do we want to do that? I’m betting that if we have a list of all possible valid boards for any given scenario, it is possible to “optimally” guess the next tile to flip. Second of all, how many boards are possible? If each tile were independent, there would be 4^25, or 1,125,899,906,842,624 possible boards. But we have much more information than that: the row and column statistics drastically reduce the number of boards we have to construct to a manageable number. It might be fun to calculate, on average, how many possible boards exist.
But I digress. To construct all possible boards, we’re going to perform an exhaustive search with a branching tree algorithm. We choose a tile as our root, say (1,1). That tile can be either a 1 or a Voltorb. We create two new boards, A and B, where we assign that tile a 1 or a Voltorb, respectively. We now consider board A and pick a new tile on which to “branch out,” creating two more boards, C and D. Then we consider the three remaining boards (B, C, D) and branch on those as well. By iterating this process until all tiles are assigned, we construct all possible boards.
Of course, we can be a little smarter and use the row and column statistics to “prune” the tree as it grows. At every step, we check whether the board violates any of the statistics we are given (the point totals and Voltorb counts of its rows and columns), and if it does, we delete that board. This algorithm can be implemented elegantly using recursion.
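My actual implementation is on GitHub; here is a compact sketch of the same idea, assuming tiles are encoded as 0 (Voltorb), 1, 2, 3, with -1 for unknown, and that each row/column constraint is a (point total, Voltorb count) pair:

```python
import numpy as np
from itertools import product

def all_boards(board, row_stats, col_stats):
    """Enumerate every complete board consistent with the known tiles and line statistics."""
    unknown = [(r, c) for r, c in product(range(5), range(5)) if board[r, c] == -1]

    def line_ok(line, stats):
        pts, volts = stats
        known = line[line != -1]
        left = (line == -1).sum()
        s, v = known[known > 0].sum(), (known == 0).sum()
        # Prune: the point total and Voltorb count must still be exactly reachable.
        return v <= volts <= v + left and s <= pts and s + 3 * (left - (volts - v)) >= pts

    def ok(b):
        return all(line_ok(b[r, :], row_stats[r]) for r in range(5)) and \
               all(line_ok(b[:, c], col_stats[c]) for c in range(5))

    def branch(b, i):
        if not ok(b):
            return []                      # this partial board can never be valid
        if i == len(unknown):
            return [b.copy()]              # fully assigned and consistent
        r, c = unknown[i]
        out = []
        for val in (0, 1, 2, 3):           # try Voltorb, 1, 2, 3 in turn
            b[r, c] = val
            out += branch(b, i + 1)
        b[r, c] = -1                       # undo before backtracking
        return out

    return branch(board.copy(), 0)
```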
Once we have all possible boards, let’s first try to use the collection of boards to construct a more accurate figure of merit.
We can keep our old FoM = R / (1 + V), but instead of calculating it naively from the row and column statistics, let’s calculate it like this:
Given the set of all possible boards, for each tile find the probability that it is a 2, a 3, or a Voltorb, notated P(2), P(3), and P(V), respectively.
FoM* = (2P(3) + P(2)) / (1 + P(V))
We recognize this FoM* as simply the old heuristic Figure of Merit, but with a more accurate calculation of the remaining reward R and number of Voltorbs V.
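Computing FoM* from the enumerated boards is just counting. A sketch, reusing the board encoding from the earlier examples (Voltorb = 0):

```python
import numpy as np

def fom_star(boards):
    """boards: list of 5x5 arrays, every board consistent with the current clues."""
    stack = np.stack(boards)            # shape (n_boards, 5, 5)
    p2 = (stack == 2).mean(axis=0)      # per-tile P(2)
    p3 = (stack == 3).mean(axis=0)      # per-tile P(3)
    pv = (stack == 0).mean(axis=0)      # per-tile P(Voltorb)
    return (2 * p3 + p2) / (1 + pv)
```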
After an hour or so of messing with recursive functions (all code available on GitHub), I think I have a working implementation, and it has two big problems:
Half of the time when I try to run it, I exceed Python’s allowed recursion depth
When it does run, it is painfully slow
I think we can cheat a little here. Our Figure of Merit algorithm is impressively good, fully clearing 71% of boards. Since it (almost always) takes more than one move to clear a board, that means that at a minimum, our Figure of Merit algorithm guesses correctly (or at least not incorrectly) ~85% of the time (if a win takes at least two correct guesses, the per-guess success rate must be at least sqrt(0.71) ≈ 0.84). Also, each time we flip a tile, we reduce the number of board states by at least a factor of two (usually more). So this is going to be our new branching algorithm, in pseudo-code:
while not solved(board):
    try:
        boards = constructAllBoards(board)   # exhaustively enumerate consistent boards
        fom = constructFOM(boards)           # refined figure of merit from the board set
    except RecursionError:
        fom = FoMAlgorithm(board)            # fall back on the heuristic FoM
    choice = choose(fom)
    flip(board, choice)
At each guessing step, we try to construct all possible boards and compute the refined figure of merit from them; if we hit a recursion error, we fall back on the heuristic Figure of Merit algorithm and retry the branching algorithm on the next iteration.
Since this algorithm takes much longer, I only did 3,000 trials.
74.4% success rate! That’s better! But hold on, we only did 3,000 trials, not 10,000. Is that still a significant difference? Recall in Part 2.5 where I showed the analytic expression for the standard deviation of the binomial distribution:
σ = sqrt( N μ (1 - μ) ), where μ is the win rate and N the number of trials
Also remember that this gives us the standard deviation in terms of games won, not win rate as a percentage, so we divided σ by N to get σ* = 0.45% standard deviation. Let’s write that expression in full:
σ* = sqrt(μ (1 - μ)) / sqrt(N)
The standard deviation of our win rate σ* goes like 1/sqrt(N). So if we do twice as many trials, we expect 1/sqrt(2) times the standard deviation in our probability distribution. In this case, we did 3/10ths the number of trials, so we expect sqrt(10/3) ≈ 1.8 times the standard deviation, or σ* = 0.82%.
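As a quick sanity check of the arithmetic (the tiny discrepancy with the 0.82% above is just rounding of the 0.45% figure):

```python
import numpy as np

# Win-rate standard deviation sigma* = sqrt(mu * (1 - mu) / N)
mu = 0.71
for n in (10_000, 3_000):
    print(n, f"{np.sqrt(mu * (1 - mu) / n):.2%}")   # ~0.45% and ~0.83%
```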
For the Branching Exhaustive Algorithm Search Tree (BEAST) algorithm, assisted by the fast heuristic FoM when necessary so as not to brick my computer, we find a win rate of 74.4%, which is still more than 4 standard deviations above the mean FoM win rate of 70.6%.
Optimality?
What does optimal even mean? Now that we have the set of all possible boards, what does it mean to play “optimally?” Clearly, we would like to maximize the win rate, but even with all possible boards out in front of us, how do we design an algorithm to do that? And even if we do, how do we prove the algorithm is optimal? These are very deep questions that I don’t actually know the answer to. If there are any CS people reading this blog, I’d love to know more about optimal algorithms and best performance bounds.
MinVoltorbs
One obvious thing to try is simply never hitting a Voltorb. Let’s replace our figure of merit in the BEAST algorithm with 1/(P(V) + ε), where ε is a small number to avoid divide-by-zero errors. After 200 trials, still using the heuristic FoM as a fallback in case of tree overflow, I get 70.5% success. Definitely not better, not much worse, either.
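Reusing the same board-set representation as before, MinVoltorbs is a one-line change to the figure of merit (again just a sketch):

```python
import numpy as np

def fom_min_voltorbs(boards, eps=1e-9):
    pv = (np.stack(boards) == 0).mean(axis=0)   # per-tile P(Voltorb); 0 = Voltorb
    return 1 / (pv + eps)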
MaxPoints
Let’s try the same thing with FoM = 2*P(3) + P(2). I get 71.3% after 300 trials. Not significantly better.
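And MaxPoints, in the same style:

```python
import numpy as np

def fom_max_points(boards):
    stack = np.stack(boards)
    return 2 * (stack == 3).mean(axis=0) + (stack == 2).mean(axis=0)   # FoM = 2*P(3) + P(2)
```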
Others?
There are so many possible algorithms we could try. So far, I’ve tried to cast each algorithm as a “Figure of Merit-choosing” algorithm, where we find the “best” square to choose by manually defining a function on each square. We don’t have to do it this way — we can choose the square which gives us the most “information,” i.e. eliminates the most possible boards from the set of possible boards.
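One way to make “most information” concrete: score each tile by the Shannon entropy of its value distribution over the set of consistent boards, since flipping a high-entropy tile shrinks the candidate set the most on average (in the expected-log sense). This is just my own formalization of the idea, and it notably ignores the risk that the revealed tile is a Voltorb:

```python
import numpy as np

def tile_entropy(boards):
    stack = np.stack(boards)
    entropy = np.zeros((5, 5))
    for val in (0, 1, 2, 3):                             # 0 = Voltorb
        p = (stack == val).mean(axis=0)
        entropy -= p * np.log2(np.where(p > 0, p, 1.0))  # 0 * log(0) treated as 0
    return entropy
```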
But at the end of the day, I simply don’t know how to prove that any given algorithm for Voltorb Flip is optimal. I’m inclined to believe this Quora answer, which basically says that proving something is optimal is really hard, and there’s no known way that works in the general case. Perhaps that will be the (yet to be planned) Part 5!
Retrospective
If you count the algorithmically played games, I’ve probably played over 100,000 games of Voltorb Flip. Why did I do this? What did I gain from this?
First and most importantly, I got a few thousand coins in the Game Corner, allowing me to buy Hot Slugg the Slugma and a Shinx.
But more than that, I think this project demonstrated to me the value of thinking deeply about silly problems. In attempting to solve a minigame within a kid’s video game, I researched and wrote about number theory, algorithm design, significance testing, Optical Character Recognition, and theoretical computer science. I wrote nearly a thousand lines of code, over six thousand words, and a branching tree algorithm I’d never used before. And I had fun! I’ve loved playing Pokemon games since the first time my brother and I played Emerald Version. I still fondly remember spending hours trying to beat the Altaria in the Fortree City Gym. This project felt like a fitting tribute to those childhood memories.
UPDATE:
A guest post from a concerned reader, blaine141, puts a lovely capstone on this project. Please see Part 5, Reinforcement Learning and Optimality Testing!