|
Abstract
We have applied an
embodied intelligence
approach to
learn optimal playing strategies for both PacMan and the opponent ghosts.
Over many thousands of iterations, the linear creature-centered perceptron
network is continually modified with reinforcements based on rewards
gained through out each game. To make such a formidable learning task
feasible, we begin training with extremely simple boards, then increase
their complexity once simpler, basic behaviors have been learned.
Games have been a prominent theme within Machine Learning, especially
in the domains of reinforcement learning and genetic programming. One
class of games, which simulate competitions for survival in some
synthetic world approach the problem of opposing "players" who learn
the domain simultaniously. Work in this area has been predominantly in
"Tom & Jerry"-type competitions where the two competitors have some
restrictions on movement, speed, or available energy. They typically
however, do not attempt to incorporate what we will call environmental
policy learning.
In most "Tom & Jerry" competitions the objective is to learn to
predict and optimally react to your opponents actions in a world with
few environmental constraints on direction or speed of motion.
Because of this trivial environment the behaviors learned in these
experiments are simple responses to the action of the opponent. For
example behaviors like "chase", "bolt", "juke", "random walk" are
learned. Given the more complex environment of PacMan, more
sophisticated behaviors and policies, perhaps even long term planning
could be learned.
Introduction
The world of PacMan is more complex than the environment of a typical
competition scenario, because of several of the games characteristics:
- Spatial configurations of walls severely limits mobility of
both preditor and prey; including such possibilities
as dead-ends which force policy reversals.
- Motion-blocking walls permit more complex "hiding" and "dodging"
behaviors.
- Consumption of "Powerdots" temporarily reverse the predator
prey relationship; necessitating a behavioral response due to
environmental changes.
- Gradual changes in environment occur as a result of actions
taken by the competitors.
- Extreme changes in environment occur as the the playing
board increases.
- Topologically toriodal environment (i.e. the wrap around at
board edges) prevents simple "run away" behaviors.
1 : The PacMan Environment
The domain over which PacMan and the Ghost must learn consist consists of
external environmental constrains determined by the current level, and by
their current policies and the policies of their opponents.
The environment can be described with a matrix which encodes the presence
or absence of each of the environmental components listed in Table 1.1.
- Inanimate elements
- Walls
- Open Spaces
- Dots
- Power dots
- Ghost regeneration points
|
- Animated creatures
- PacMan
- Ghosts
- Edible Ghosts
|
Table 1.1 Environmental components
|
---|
2 : Internal Representation
When given to the system as input, the world is represented by a 2d array which
identifies the contents of each cell on the game board.
To make a system reason about these possible arbitrary labels would be quite difficult.
For example consider if 1=PowerDot, 2=Ghost, 3=Dot, there is no linear relationship
between label and goodness.
To overcome this we split the world into bit planes where each bit indicates the
presence or absence of a particular component at a particular position on the board.
|
Figure 2.2:
The current state of the world is represented by a three-dimensional
binary array where each bit plane indicates presence or abence of a component at a
particular position.
|
|
Figure 2.1:
The current state of the world is represented by a three-dimensional
binary array which allows linear perceptrons to capture relations within the world.
|
|
Each creature has binary detectors for each of the possible elements
of the environment for a 10x10 grid centered on the creature. This defines
the current environment for the creature.
|
3 : Our Reinforcement Learing Paradigm
Figure 3.1:
Action decisions are based on perceptron-like policy networks, which
are centered about the creature.
|
|
Because we use a perceptron-like network as our fundamental representation of
each of the creatures evaluation functions, we have implicitly assumed that the
factors in the world are linearly separable.
This is in fact not the case, as in some situations an optimal policy might
include conditions such as exclusive or, which cannot be represented with a
linearly separable mechanism.
|
Given our creature-centric internal representation of the world state,
as depicted in Figure 3.1,
a perceptron network over this space can be represented by a real valued 3 dimensional
array.
Each creature has a weighting matrix which defines the desirability
of having each of the elements in each particular location for each
possible move.
By taking the dot product analog of this array with the state
of the world represented as a 3 dimensional binary array, we can essentially
record the response of this network to the word-state stimulus.
|
Figure 3.1:
The response of a given unit can be computed by taking the 3 dimensional
analogue of a dot product of an array encoding the perceptron network for a given
decision with the binary array representing the world state.
|
|
Figure 3.2:
The response of a given unit can be computed by taking the 3 dimensional
analogue of a dot product of an array encoding the perceptron network for a given
decision with the binary array representing the world state.
|
|
By comparing the responses of four separate networks, each which encodes
the drive to move in one of four directions, we can determine the
most preferred action for the creatures.
It would be possible to use four matrices for Up, Down, Left and Right
but this would take longer to learn than one matrix which is
directionally specific. This matrix is rotated and multiplied by the
current environment to attain the expected benefit of making that
move.
|
Because in our simulation there are eight possible units, and each creature
examines a 10 x 10 window, yielding 800 free parameters to be learned.
The update rule for the perceptron network is based on a reinforcement
learning scheme. Essentially upon competition for each game, each the value
assigned to each move is the immediate score acquired on that turn plus
some percentage of the value of the previous move.
4 : Testing Procedure
The testing procedure involved introducing the ghost and pacman to
environments of increasing complexity. By starting with a very simple
environment, the pacman would begin with reinforcing very basic behavior
such as avoiding walls, avoiding ghosts, eating pacman, eating edible
ghosts, and eating dots.
After a number of iterations, the board was changed and more
complex behaviors needed to be developed. There is a chance that
simple behaviors could be unlearned, if they are not as useful at
higher levels, but our intention is to learn behaviors which can be
generally applied to most situations.
These behaviors could be categorized as simple, general
drives(i.e. a will to live, a desire for food, adversion moving into
walls). Each of these drives are competing against each other, but
depending on the strength of the reinforcement, certain drives will
dominate. For example, the drive to avoid getting eaten should outweigh
the drive to eat dots.
Figure 4.1:
Level 1 was extremely simple. It has only three open squares. If the
pacman moves down, it wins. If it moves up, it dies. If it moves any
other direction, its chance of dying increases. This is an attempt to
force pacman to learn 4 basic behaviors:
- moving directly into pacman is bad
- moving away from the ghost is good
- moving towards dots is good
- moving into walls doesn't benefit
The ghosts are expected to learn
- moving onto pacman is good
- any other move is less good
|
Figure 4.2:
Level 2 was only a little more complex. It has four open squares and
included a power dot. This level is intended to reinforce:
- pacman's preference for power dots, especially when ghosts are near
- large pacman's desire to eat edible ghosts
- the ghost's fear of the large PacMan
- undesirability of moving directly into walls
|
Figure 4.3:
Level 3 has more open space than either of the proceeding levels.
There are many dots, but only one power dot. If pacman successfully
eats the ghost then the ghost is exiled to the upper left corner of the
board, further reinforcing the desirability of that action. An important
behavior which might be learned is moving towards dots which are not
directly in front of pacman.
|
Figure 4.4:
Level 4 shows a jump in complexity. It has circular paths and multiple power
dots. The ghost will appear on the board again even if eaten. The paths
are intended to reinforce continuing on the same path, once chosen,
because continuing along a path gives positive marginal reinforcement and
turning around give mildly negative reinforcement.
|
Figure 4.5:
Levels 5-10 are more and more complex. There are more complex paths,
more power dots, and paths from one side of the image to the other.
If the lower level driving
behaviors were properly learned(i.e. avoid walls, run from ghosts, etc.),
the pacman and ghost should be more successful on the higher levels.
It is not expected that this simple perceptron model can solve higher
level tasks, it is only intended to provide drives based on the
surrounding environment. In fact, it can only learn tasks which are
linearly separable.
|
5 : The Perceptron Network As Accumulated Drives
Figure 5.1.1
The influence of ghost position on measurment of goodness of PacMan's forward movement.
The dark spot above the center square indicates that it is very unfavorable to move forward when
a ghost is before pacman. Similarly the light square below the center indiates that PacMan should
move away from the ghost.
|
Figure 5.1.2
The influence of an edible ghost position on measurment of goodness of PacMan's forward movement.
In general it is advantagious to eat it when it is close.
|
Figure 5.1.3
The influence of regular dots on PacMan's forward movement.
|
Figure 5.1.4
The influence of powerdots on PacMan's forward movement. Notice the strong attraction when they can be
eaten in only a small number of steps.
|
Figure 5.1.5
The influence of walls on PacMan's forward movement. It is generally advantagous to move away
from walls when possible.
|
Figure 5.2.1
The drive of a ghost to move towards pacman when he is edible.
|
Figure 5.2.2
The drive of a ghost to move away from pacman when he can eat the ghost.
|
Figure 5.2.3
The dots generally have no influence on the direction of the ghosts movement.
This is evidence of the failure of this model to capture complex concepts like the fact
that PacMan eventually will visit points with dots on them.
|
Figure 5.2.4
The ghosts tend to stay away from power dots, wether or not PacMan is actucally posing
a real threat.
|
Figure 5.2.5
The influence of walls on the Ghost's forward movement.
Just as for PacMan, it is generally advantagous to move away from walls when possible.
|
6 : Cross Generational Testing on Level 1
Cross Generational Testing
|
|
|
GHOST
|
|
LEVEL |
1
|
2
|
3
|
4
|
5
|
6
|
7
|
P A C
M A N |
1
|
64%
|
66%
|
70%
|
60%
|
68%
|
65%
|
66%
|
2
|
61%
|
53%
|
55%
|
51%
|
48%
|
42%
|
49%
|
3
|
48%
|
59%
|
55%
|
54%
|
55%
|
52%
|
45%
|
4
|
55%
|
61%
|
59%
|
57%
|
55%
|
60%
|
58%
|
5
|
52%
|
52%
|
51%
|
49%
|
48%
|
50%
|
62%
|
6
|
53%
|
52%
|
41%
|
56%
|
55%
|
59%
|
55%
|
7
|
52%
|
48%
|
56%
|
55%
|
49%
|
57%
|
48%
|
Figure 6.1:
Cross generational testing results, curve color indicates maxiumum PacMan training level,
abcissica indicates maximum Ghost training level, curve height indicates percentage of PacMan
wins.
|
6 : Cross Generational Testing on Level 2
Cross Generational Testing
|
|
|
GHOST
|
| LEVEL
| 1
| 2
| 3
| 4
| 5
| 6
| 7
|
P A C
M A N
| 1
| 52%
| 60%
| 50%
| 51%
| 44%
| 54%
| 61%
| 2
| 72%
| 59%
| 61%
| 66%
| 66%
| 74%
| 67%
| 3
| 81%
| 80%
| 77%
| 87%
| 76%
| 80%
| 83%
| 4
| 81%
| 80%
| 78%
| 69%
| 70%
| 82%
| 81%
| 5
| 80%
| 78%
| 75%
| 87%
| 70%
| 76%
| 72%
| 6
| 77%
| 74%
| 70%
| 87%
| 70%
| 76%
| 72%
| 7
| 82%
| 77%
| 68%
| 80%
| 78%
| 86%
| 76%
|
Figure 6.2:
Cross generational testing results, curve color indicates maxiumum PacMan training level,
abcissica indicates maximum Ghost training level, curve height indicates percentage of PacMan
wins.
|
6 : Cross Generational Testing on Level 3
Cross Generational Testing
|
|
|
GHOST
|
| LEVEL
| 1
| 2
| 3
| 4
| 5
| 6
| 7
|
P A C
M A N
| 1
| 26%
| 30%
| 16%
| 27%
| 23%
| 27%
| 27%
|
2
| 50%
| 52%
| 41%
| 42%
| 48%
| 41%
| 48%
|
3
| 59%
| 61%
| 53%
| 46%
| 55%
| 62%
| 56%
|
4
| 47%
| 54%
| 55%
| 57%
| 56%
| 44%
| 45%
|
5
| 55%
| 48%
| 55%
| 51%
| 67%
| 56%
| 54%
|
6
| 51%
| 53%
| 46%
| 51%
| 50%
| 51%
| 55%
|
7
| 51%
| 54%
| 62%
| 60%
| 60%
| 49%
| 53%
|
Figure 6.3:
Learning performance quickly improves then asymptotes.
The effect of improved ghost training on pacman victory begins to appear.
|
A1 : Images " of" " the" Levels
Page loaded on November 11, 2024 at 01:15 PM.
Page last modified on
2006-05-27
|
Copyright © 1997-2024, Jeremy S. De Bonet.
All rights reserved.
|
|