| Hyperparameter     | Value | Best Value So Far |
| ------------------ | ----- | ----------------- |
| units_0            | 16    | 24                |
| units_1            | 24    | 32                |
| units_2            | 8     | None              |
| tuner/epochs       | 10    | 4                 |
| tuner/initial_e... | 0     | 2                 |
| tuner/bracket      | 0     | 2                 |
| tuner/round        | 0     | 1                 |
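The trial summary above is keras-tuner Hyperband output. Below is a hedged sketch of a search that could produce trials like it; the layer-count/width ranges, the `val_mae` objective, and the commented-out training call are illustrative assumptions, not the exact setup used.

```python
# Hedged sketch of a keras-tuner (1.0.2) Hyperband search like the one above.
# Search ranges and data names are assumptions.
import kerastuner as kt
import tensorflow as tf


def build_model(hp):
    model = tf.keras.Sequential()
    # A variable number of layers explains why `units_2` can be None in a trial.
    for i in range(hp.Int("num_layers", 1, 3)):
        model.add(tf.keras.layers.Dense(
            units=hp.Int(f"units_{i}", min_value=8, max_value=32, step=8),
            activation="relu"))
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model


tuner = kt.Hyperband(build_model, objective="val_mae", max_epochs=10)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val))  # hypothetical data
```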
- Regroup in Jan 2021.
- We achieve mae: 5.4591e-11, val_mae: 2.4891e-04 with a deep model on Rep A
  and DISCOUNTED_RETURN, but the model doesn't perform well in games. Trained for
  50 game epochs (was aiming to use 1K games, but early stopping kicked in).
  loss: 1.2714e-20 - mae: 5.4591e-11 - val_loss: 3.2321e-07 - val_mae: 2.4807e-04
- Similar results with Rep B. But it played terribly...
- Online MCTS is looking promising. Did a first run with a 1-neuron model (normalizing
  features), playouts=25, training every time there are >1K samples, in batches of 64.
  There was also a bug in the data (we were creating samples like root-node => MCTS result
  for all branches, instead of branch-node => MCTS result). It didn't win games,
  but there seemed to be a VP improvement from a 2 VP average to a 5 VP average over 100 games.
- A heuristic player plays better than WeightedRandom (but we have a memory leak).
  The memory leak was due to functools.lru_cache being keyed on copied boards.
- The simple MiniMax algorithm takes too long: ~20 seconds per decision,
  which makes for 30-minute games. Even if debugged, the implementation seems pretty slow.
  As such, alpha-beta pruning doesn't seem like it will help much (even if it cuts time in half).
- Using PyPy (removing TF and other Python 3.9 features) speeds up playing random
  games by around 33% (1,000 games with CPython 3.9 took 2m52s and 1,000 games
  with PyPy 3.7 took 1m52s).
- Looked at RL Python frameworks. keras-rl doesn't work with TF 2.0, and the same goes for
  stable-baselines. Many are undocumented or hard to use. The best one seems to be
  tensorforce. TF-Agents seems possible, but its API is pretty raw right now.
  tensorforce is hard to use because it's not clear how to specify the Rep B mixed model.
- The VictoryPoint player is better than Random (much better) and WeightedRandom (not by much).
- Were able to train a somewhat strong NN with clipnorm=1, 2 layers (32 and 8 neurons),
  batch normalization of the inputs, and batches of 32, on 7,000,000 samples split over 10 epochs.
  Still, this player doesn't beat the hand-crafted heuristic (ValueFunctionPlayer). This used
  DISCOUNTED_RETURN as the label, MSE as the loss, and LR=0.0001 (a sketch follows this list).
  `loss: 1.5084e-07 - mae: 2.1945e-04 - val_loss: 4.2677e-05 - val_mae: 5.0571e-04`
- Did a RandomForest, but the data seems wrong; it played very poorly. Tried again with
  all Rep 1 features; no dice. Hard to analyze... probably overfitted. It keeps ending its turn.
- The Scikit regression player doesn't win: it just accumulates a lot of wheat production,
  builds roads to extend as far as it can (doesn't hold cards to build settlements),
  places the robber on itself, and doesn't trade (trades have the same value); there are no hand-diversity features.
- Greedy (even when budgeted to run the same number of playouts as MCTS) does better than
  MCTS.
- ValueFunction 2.0 (using EFFECTIVE_PRODUCTION and REACHABILITY features) plays
pretty strongly. Better than Greedy=25 and MCTS=100.
- AlphaBeta(depth=2), using VF2.0 as its heuristic, plays stronger than ValueFunction 2.0.
- Tried Bayesian optimization to search for better weights for the hand-crafted heuristic,
  but it seems too slow and ineffective.
- Note: in a random game, 38.6% of actions are rolls (action=0) and 27% are end turns (action=5557).
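A minimal Keras sketch of the value network described in the clipnorm=1 note above. The hyperparameters (two hidden layers of 32 and 8 units, input batch normalization, MSE, LR=0.0001, batch size 32) come from the note; NUM_FEATURES and the data names are assumptions.

```python
# Hedged sketch of the value network described above; NUM_FEATURES and the
# training data names are assumptions.
import tensorflow as tf

NUM_FEATURES = 1000  # assumed width of the feature vector

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.BatchNormalization(),  # batch norm of the inputs
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),  # regresses DISCOUNTED_RETURN
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001, clipnorm=1),
    loss="mse",
    metrics=["mae"],
)
# model.fit(samples, returns, batch_size=32, epochs=10, validation_split=0.1)
```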
Future Work:
- Should probably try an NN player with better tuning (more layers, a different label).
- Interested in seeing MCTS play in real life. Should compare G10 to MCTS50.
- Work on performance.
- Generate data as 1 Big CSV per representation. (samples + labels), (bt + labels)
- Idea: Make test-evaluation framework. Use gaussian optimization to find best weights.
- Attempt the DQN tutorial with the TF 2.0 Summary module, to see progress. If it works,
  adapt it to a Catan DQN tutorial.
- Paperspace: Dockerfile: install catanatron and the Pipenv dependencies, so that we can
  run experiment/play.py, have a Game in Overview.ipynb, and run a catan-dqn.py job faster
  (should only do this once we can confirm the data and DQN approach improves...).
- Simplify ROBBER_ACTIONS to "Play Knight" targeting the player with the most points.
- Learn from offline MCTS data, using Rep B (seems slow to generate MCTS-labeled data).
- Use tensorforce (since it is TF 2.0 compatible and its DQN agent will be bug-free).
  Seems hard because the API is not too easy.
- Learn an offline DQN (using VPs) with Rep B. That is, use the reward as the label
  (so that early moves don't get a 1).
- Consider implementing alpha-beta pruning to speed up the game, although
  the speedup seems to be at best half the time, which still doesn't cut it
  unless we speed up game copying...
- Actually use the MCTS result to improve the network. Do it online so that we can
  see progress.
- Add epsilon-playing to ensure we are exploring the space. (Well, I think the fact
  that we have fixed random players does this job for now, since we collect
  samples from their perspective as well.)
- Play a game against Greedy player. Play against Greedy with look-ahead.
- Idea: Sum up "enemy features" to make it look like there is a single enemy.
- Try the online DQN approach (using the PythonProgramming tutorial).
- Use the value estimator with tree-lookahead alpha-beta pruning (see the sketch after this list).
- Try a CNN action space (i.e., BUILD_SETTLEMENT at 3 maps to a plane in the board tensor).
- Try policy-learning and Q-learning on a simpler action space.
- Try Cross-Entropy approach (using only top X features and dropping END TURNs).
- Try AlphaZero single-neural network learning (state) => (value, action).
- Use in tree-search (take N top actions to explore via RollOuts).
- Try putting the features back in. Is this VPlayer better than the raw-features one?
- Can autokeras use tf.data.Dataset?
- Is using autokeras with the whole 1 GB dataset better?
- Does playing against your V1 self, and training on that, improve play?
- Try Q-learning, but iterate on playing, learning, playing, learning... epsilon-greedy.
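As referenced in the alpha-beta item above, here is a hedged sketch of depth-limited alpha-beta search that uses a value function at the leaves. The `game.copy()`, `game.execute()`, `game.playable_actions`, and `game.winning_color()` names are assumptions, not the project's actual API.

```python
# Hedged sketch of depth-limited alpha-beta search with a value-function
# heuristic at the leaves. Game/action method names are assumptions.
def alphabeta(game, depth, alpha, beta, maximizing, value_fn):
    if depth == 0 or game.winning_color() is not None:
        return value_fn(game)  # heuristic evaluation (e.g. ValueFunction 2.0)

    if maximizing:
        best = float("-inf")
        for action in game.playable_actions:
            child = game.copy()
            child.execute(action)
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False, value_fn))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # beta cutoff: opponent would avoid this branch
        return best
    else:
        worst = float("inf")
        for action in game.playable_actions:
            child = game.copy()
            child.execute(action)
            worst = min(worst, alphabeta(child, depth - 1, alpha, beta, True, value_fn))
            beta = min(beta, worst)
            if beta <= alpha:
                break  # alpha cutoff
        return worst
```

A player would evaluate each playable action at the root this way (e.g. with depth=2 and VF2.0 as `value_fn`) and pick the best-scoring action, which is what the AlphaBeta(depth=2) result above refers to.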
Performance Improvements
- Separate immutable state from the Board (the Map?), so that copying is a lot faster and
  we can cache functions (say, node production) globally, across game copies (see the sketch below).
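A hedged sketch of that idea, assuming hypothetical `CatanMap`/`Board`/`get_node_production` names: the static map is shared by reference across copies, so a module-level `functools.lru_cache` keyed on it stays small and valid (avoiding the copied-board leak mentioned earlier).

```python
# Hedged sketch: keep the static map immutable and shared across board copies,
# so a module-level cache keyed on (map, node_id) is reused instead of growing
# with every copy. CatanMap/Board/get_node_production are hypothetical names.
import copy
import functools


class CatanMap:
    """Static tile/number layout; never mutated, so identity-hashing is safe."""

    def __init__(self, tiles):
        self.tiles = tiles


class Board:
    def __init__(self, catan_map):
        self.map = catan_map   # shared, immutable
        self.buildings = {}    # mutable per-game state

    def copy(self):
        new = Board(self.map)                      # share the map by reference
        new.buildings = copy.copy(self.buildings)  # copy only the mutable parts
        return new


@functools.lru_cache(maxsize=None)
def get_node_production(catan_map, node_id):
    # Expensive computation that depends only on the immutable, shared map,
    # so results are reused across every game/board copy.
    return [tile.resource for tile in catan_map.tiles if node_id in tile.nodes]
```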
Toy problems
- Idea: hot-encode 5 layers per player (one per resource) to denote income and buildings,
  then use a 3D convolution of size WxHx5 with stride=5 (so player planes don't overlap); see the sketch after this list.
- An easier problem would be to count house/city VPs (only uses 1 plane): the count-vp-problem.
- The next, medium problem will be to guess wheat production (to see if it can combine the two planes).
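Referenced above: a hedged Keras sketch of the 5-planes-per-player 3D convolution; the board dimensions (W, H), player count, and regression head are assumptions.

```python
# Hedged sketch of the per-player-plane 3D convolution idea. W, H, and the
# number of players are assumptions for illustration.
import tensorflow as tf

W, H, NUM_PLAYERS, PLANES = 21, 11, 4, 5  # assumed board tensor dimensions

inputs = tf.keras.Input(shape=(W, H, NUM_PLAYERS * PLANES, 1))
# The kernel spans all 5 planes of one player; stride=5 on that axis keeps
# players' planes from overlapping.
x = tf.keras.layers.Conv3D(
    filters=16, kernel_size=(3, 3, PLANES), strides=(1, 1, PLANES),
    padding="same", activation="relu")(inputs)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1)(x)  # e.g. regress VPs (count-vp-problem)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```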
Appendix
To install auto-keras and auto-sklearn, add these entries to the Pipfile:
auto-sklearn = "*"
emcee = "*"
pyrfr = "*"
scikit-optimize = "*"
pydoe = "*"
autokeras = "*"
kerastuner = {git = "https://github.com/keras-team/keras-tuner.git", ref = "1.0.2rc4"}
Learnings
- Basic tf.Tensor manipulations
- Visualize data with `tfdv.visualize_statistics(stats)` (see the example after this list).
- Basic toy problems on Keras.
- Used auto-keras and auto-sklearn.
- Basic how CNNs work. How LSTM (RNNs in general) work.
- Epochs, steps, generator datasets. GZIP compression.
- Paperspace.
- DQN Algorithm.
- Noise can mislead a NN.
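For reference, a minimal example of the `tfdv.visualize_statistics` usage noted above; the CSV of samples is a placeholder.

```python
# Minimal TensorFlow Data Validation example for the learning item above.
# The samples.csv path is a placeholder.
import pandas as pd
import tensorflow_data_validation as tfdv

df = pd.read_csv("samples.csv")                      # hypothetical samples file
stats = tfdv.generate_statistics_from_dataframe(df)  # per-feature summary statistics
tfdv.visualize_statistics(stats)                     # renders an interactive view in a notebook
```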
Catan State Space Analysis
- Each of the 19 tiles can be any resource, so there are 19! resource-tile orderings.
- Each tile must get one of the numbers, so 18! (the desert has no number).
- There are 9 ports, so there are 9! ways of ordering them.
Finally, there are 19! \* 18! \* 9! boards in Catan, so around 10^38 boards.
The configuration states inside each board are upper-bounded by:
- Each player has at most 5 houses. There are (54 choose 5) ways of placing all 5 houses,
  (54 choose 4) ways of placing 4 houses, and so on: Sum\_{i in [0, 5]} (54 choose i) = 3.5M.
  Actually, it is better to ignore colors and consider all 20 houses, so:
  `sum([comb(54, i) for i in range(21)]) = 683689022885551`.
  Actually, including cities, there are 9 pieces per player that can be on the board:
  `sum([comb(54, i) for i in range(4*9 + 1)]) = 17932673125137243` (reproduced in the snippet below).
- There are 14 roads per player, so 56 roads in total, and 72 possible locations:
  `sum([comb(72, i) for i in range(56)]) = 4722360822358286581300`
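As noted above, the quoted counts can be reproduced with Python's `math` module:

```python
# Reproduces the order-of-magnitude counts quoted above.
from math import comb, factorial, log10

boards = factorial(19) * factorial(18) * factorial(9)   # tile, number, and port orderings
buildings = sum(comb(54, i) for i in range(4 * 9 + 1))  # settlement/city placements on 54 nodes
roads = sum(comb(72, i) for i in range(56))             # road placements on 72 edge locations

print(f"boards ~ 10^{log10(boards):.0f}")                                # ~10^38
print(f"configurations per board ~ 10^{log10(buildings * roads):.0f}")   # ~10^38
```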
If we include the number of cards in hand, that makes the state space much, much bigger,
but in practice it's a lot smaller (it's rare for a player to have 20+ cards). So,
just using the board states, we see there are:
10^38 boards, each with 17932673125137243 \* 4722360822358286581300 configurations,
which is almost 10^38, so we are talking around 10^76 possible board states.
In terms of cards-in-hand state: assuming players have on average 5 cards in hand,
then out of the 19*5 resource cards we start with, we are talking about:
`comb(19*4, 20) = 1090013712241956540` (or 10^18).
The grand-ballpark estimate is ~**10^100** states in Catan. Chess is 10^123,
Go is 10^360, and the number of atoms in the Universe is 10^78.
Branching Factor
The following stats suggest that the decision tree players navigate usually has
around ~5 branches, and only very rarely something huge like 279 branches
(trading options after a monopoly(?), a late road-building card(?)).
Branching Factor Stats: