Voltorb Flip: Overanalyzing a Minigame -- Part 5
Reinforcement learning and optimality testing by a guest author, blaine141
Disclaimer: This post is part 5 in a series on Voltorb Flip, which can be found here:
Part 1, Introduction to Voltorb Flip
Part 2, Heuristic Solution Algorithm
Part 2.5, Solution Significance Testing
Part 3, OCR for faster play
Part 4, Optimality and Retrospective
I recommend reading at least part 1 of this series for context on this post.
Introduction
About a year ago, I posted my series on Voltorb Flip, a sudoku/minesweeper-like minigame from Pokemon HeartGold/SoulSilver. In that series, I attempted to computationally “solve” Voltorb Flip, with the ultimate goal of buying a Slugma.
A few weeks ago, I was contacted by an interested reader, blaine141. He generously offered to test my Voltorb Flip algorithm against a reinforcement learning algorithm, to assess whether my heuristic algorithm is optimal or not. Since optimality is hard to prove in general, the idea is that the reinforcement learning algorithm serves as a proxy for a true optimal playing algorithm.
So, he built a reinforcement learning algorithm to play Voltorb Flip, and compared all the various playing algorithms against each other on boards from level 1 to level 8. His code is available here and his analysis is presented below.
Critically, while trying to replicate my results, he found the bug in my code that caused the Voltorb Flip win rate to be the same on all board levels (now fixed on the Voltorb Flip Github Repository). That DID seem weird at the time... This is why peer review is important!
blaine141’s Analysis of Voltorb Flip Solution Policies
The following is written by blaine141 concerning his Voltorb Flip results, lightly edited by me.
Multiple people online have proposed strategies to solve the mini-game Voltorb Flip from the games Pokemon HeartGold and SoulSilver (for example, the work of Evin Jaff). This analysis aims to test each solution policy and determine if any are sub-optimal.
Because a general proof of optimality is very difficult, a reinforcement learning model was trained on a Voltorb Flip simulator and used as a pseudo-optimal benchmark against which all solution policies were tested. Each policy’s win rate is compared to the best win rate achieved across all policies, including the reinforcement learning model.
For each policy, 200 valid Voltorb Flip boards (see Bulbapedia for details) were generated and used to compute statistics on the likelihood of each cell containing a 1, 2, 3, or Voltorb tile. Each policy was then fed these likelihoods as input and selected its best move according to its own figure of merit.
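As an illustration of the likelihood step (my own sketch, not blaine141’s code): given a set of sampled board configurations consistent with the clues seen so far, the per-cell likelihoods are just the fraction of samples containing each tile value in each cell. The function name `tile_probabilities` and the 5×5 board representation are my own assumptions.

```python
from collections import Counter

# A board is a 5x5 grid of tile values: 1, 2, 3, or 0 (0 stands for a Voltorb).
# `samples` is assumed to be a list of such boards, each consistent with the
# row/column point totals and Voltorb counts observed so far.
def tile_probabilities(samples):
    """Estimate P(tile value) for every cell from sampled boards."""
    n = len(samples)
    probs = [[None] * 5 for _ in range(5)]
    for r in range(5):
        for c in range(5):
            counts = Counter(board[r][c] for board in samples)
            # Fraction of sampled boards in which cell (r, c) holds each value.
            probs[r][c] = {value: counts[value] / n for value in (0, 1, 2, 3)}
    return probs
```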
The following policies were tested (a rough sketch of some of their figures of merit follows the list):
Avoid Bomb: Select the tile that is the least likely to contain a Voltorb.
Greedy: Select the tile that is the most likely to be a 2 or 3.
Greedy With Penalty: Select the tile that is the most likely to be a 2 or 3 and least likely to be a Voltorb by computing a figure of merit combining both P(2 or 3) and P(Voltorb). This algorithm is a slight modification of Dylan Black’s heuristic algorithm in Part 2 of his Voltorb Flip Series to use more statistically accurate tile likelihoods, and is very similar to his “branching tree exhaustive search” algorithm in Part 4 of his series.
Max Entropy: Select the tile that reveals the most information about the board configuration while avoiding Voltorbs.
Reinforcement Learning: The pseudo-optimal strategy determined by reinforcement learning.
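To make the figures of merit concrete, here is a hedged sketch of how the Avoid Bomb, Greedy, and Max Entropy scores might be computed from the per-cell likelihoods described above. blaine141’s actual implementations may formulate or weight these differently, and the entropy score in particular is only a stand-in for his information measure.

```python
import math

# Each `p` below is a dict like {0: P(Voltorb), 1: P(1), 2: P(2), 3: P(3)}
# for a single unrevealed cell, as produced by something like
# tile_probabilities() above.

def avoid_bomb_score(p):
    # Higher is better: least likely to be a Voltorb.
    return 1.0 - p[0]

def greedy_score(p):
    # Higher is better: most likely to be a 2 or 3.
    return p[2] + p[3]

def max_entropy_score(p):
    # Shannon entropy of the cell's own tile distribution, used here as a
    # crude stand-in for "reveals the most information about the board";
    # the real policy also penalizes Voltorb risk.
    return -sum(q * math.log2(q) for q in p.values() if q > 0)
```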
Each policy was tested on 2,500 games on each board level from 1 to 8, for a total of 20,000 games per policy.
The per-level results for each policy were used to compute the per-level win rates and the overall win rate. Below are the win rates for each policy. Slight variations of some policies, with different coefficients in their figures of merit, were also tried; these variations are denoted by a serial number, e.g. “Greedy with Penalty 2.”
To test whether a policy was sub-optimal, I computed the z-score of each policy’s win rate against the highest win rate for each level. Below is a table showing all policies that performed sub-optimally at each level, at the 95% confidence level. The optimal algorithm is assumed to be the reinforcement learning algorithm.
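blaine141’s exact test statistic is in his code; a common form for comparing two win-rate proportions, which I assume is close to what was done here, is the pooled two-proportion z-test:

```python
import math

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """Pooled two-proportion z-statistic comparing two win rates."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    p_pool = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / games_a + 1 / games_b))
    return (p_a - p_b) / se

# Hypothetical example (not real results): a policy winning 1,800 of 2,500
# games versus a best policy winning 1,900 of 2,500 gives |z| ~ 3.2 > 1.96,
# so it would be flagged as sub-optimal at the 95% level (two-sided).
z = two_proportion_z(1800, 2500, 1900, 2500)
```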
Overall, the only policy that was not rejected as sub-optimal compared with Reinforcement Learning was “Greedy with Penalty” (specifically, Greedy with Penalty 2, though no Greedy with Penalty strategy was significantly different from the others).
Thus, we find the best non-reinforcement-learning method of playing Voltorb Flip is the Greedy with Penalty algorithm, which is implemented as follows (a short code sketch follows these steps):
Calculate a selection score for each unrevealed cell: score = P(cell is a 2 or 3) / (ε + P(Voltorb)), where ε is a small number intended to avoid divide-by-zero errors (0.001, 0.02, etc.).
Choose the cell with the highest selection score.
Repeat until all 2 and 3 tiles are revealed or a Voltorb is flipped.
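A minimal sketch of that selection rule, reusing the per-cell likelihood dictionaries from the earlier sketch (the names EPSILON, selection_score, and best_cell are mine):

```python
EPSILON = 0.001  # small constant to avoid division by zero (exact value varies)

def selection_score(p):
    # p is {0: P(Voltorb), 1: P(1), 2: P(2), 3: P(3)} for one unrevealed cell.
    return (p[2] + p[3]) / (EPSILON + p[0])

def best_cell(probs, unrevealed):
    """Pick the unrevealed cell (r, c) with the highest selection score."""
    return max(unrevealed, key=lambda rc: selection_score(probs[rc[0]][rc[1]]))
```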
The code for this analysis can be found in this Github Repository.
Author’s Addendum
blaine141 finds that the original heuristic approach to Voltorb Flip I took in Part 2 of my series is non-inferior to a fully-trained reinforcement learning approach. This is both gratifying and very surprising, as the heuristic algorithm is quite simple.
I think this is a wonderful conclusion to my (and blaine141’s) exploration of Voltorb Flip, and I would like to give a huge thank you to blaine141 for taking the time to do this. In training a reinforcement learning algorithm to solve Voltorb Flip, blaine141 has done an exemplary job of showcasing the qualities most important to this blog: Maximum Effort, Minimum Reward.