Jeremy S. De Bonet & Chris P. Stauffer: Learning to Play PacMan Using Incremental Reinforcement Learning

BODY

Resume

RESEARCH

Publications

Image Compression

Texture Synthesis

Image Database Retrieval

Segmentation

Registration

Discrimination

Projects

Web Hacks

Abstract

We have applied an embodied intelligence approach to learn optimal playing strategies for both PacMan and the opponent ghosts. Over many thousands of iterations, the linear creature-centered perceptron network is continually modified with reinforcements based on rewards gained through out each game. To make such a formidable learning task feasible, we begin training with extremely simple boards, then increase their complexity once simpler, basic behaviors have been learned.

Games have been a prominent theme within Machine Learning, especially in the domains of reinforcement learning and genetic programming. One class of games, which simulate competitions for survival in some synthetic world approach the problem of opposing "players" who learn the domain simultaniously. Work in this area has been predominantly in "Tom & Jerry"-type competitions where the two competitors have some restrictions on movement, speed, or available energy. They typically however, do not attempt to incorporate what we will call environmental policy learning.

In most "Tom & Jerry" competitions the objective is to learn to predict and optimally react to your opponents actions in a world with few environmental constraints on direction or speed of motion. Because of this trivial environment the behaviors learned in these experiments are simple responses to the action of the opponent. For example behaviors like "chase", "bolt", "juke", "random walk" are learned. Given the more complex environment of PacMan, more sophisticated behaviors and policies, perhaps even long term planning could be learned.

Introduction

The world of PacMan is more complex than the environment of a typical competition scenario, because of several of the games characteristics:

Spatial configurations of walls severely limits mobility of both preditor and prey; including such possibilities as dead-ends which force policy reversals.
Motion-blocking walls permit more complex "hiding" and "dodging" behaviors.
Consumption of "Powerdots" temporarily reverse the predator prey relationship; necessitating a behavioral response due to environmental changes.
Gradual changes in environment occur as a result of actions taken by the competitors.
Extreme changes in environment occur as the the playing board increases.
Topologically toriodal environment (i.e. the wrap around at board edges) prevents simple "run away" behaviors.

1 : The PacMan Environment

The domain over which PacMan and the Ghost must learn consist consists of external environmental constrains determined by the current level, and by their current policies and the policies of their opponents.

The environment can be described with a matrix which encodes the presence or absence of each of the environmental components listed in Table 1.1.

Table 1.1 Environmental components
Inanimate elements Walls Open Spaces Dots Power dots Ghost regeneration points	Animated creatures PacMan Ghosts Edible Ghosts

2 : Internal Representation

When given to the system as input, the world is represented by a 2d array which identifies the contents of each cell on the game board. To make a system reason about these possible arbitrary labels would be quite difficult. For example consider if 1=PowerDot, 2=Ghost, 3=Dot, there is no linear relationship between label and goodness. To overcome this we split the world into bit planes where each bit indicates the presence or absence of a particular component at a particular position on the board.

Figure 2.2:
The current state of the world is represented by a three-dimensional binary array where each bit plane indicates presence or abence of a component at a particular position.

Figure 2.1:
The current state of the world is represented by a three-dimensional binary array which allows linear perceptrons to capture relations within the world.

Each creature has binary detectors for each of the possible elements of the environment for a 10x10 grid centered on the creature. This defines the current environment for the creature.

3 : Our Reinforcement Learing Paradigm

Figure 3.1:
Action decisions are based on perceptron-like policy networks, which are centered about the creature.

Because we use a perceptron-like network as our fundamental representation of each of the creatures evaluation functions, we have implicitly assumed that the factors in the world are linearly separable. This is in fact not the case, as in some situations an optimal policy might include conditions such as exclusive or, which cannot be represented with a linearly separable mechanism.

Given our creature-centric internal representation of the world state, as depicted in Figure 3.1, a perceptron network over this space can be represented by a real valued 3 dimensional array. Each creature has a weighting matrix which defines the desirability of having each of the elements in each particular location for each possible move. By taking the dot product analog of this array with the state of the world represented as a 3 dimensional binary array, we can essentially record the response of this network to the word-state stimulus.

Figure 3.1:
The response of a given unit can be computed by taking the 3 dimensional analogue of a dot product of an array encoding the perceptron network for a given decision with the binary array representing the world state.

Figure 3.2:
The response of a given unit can be computed by taking the 3 dimensional analogue of a dot product of an array encoding the perceptron network for a given decision with the binary array representing the world state.

By comparing the responses of four separate networks, each which encodes the drive to move in one of four directions, we can determine the most preferred action for the creatures. It would be possible to use four matrices for Up, Down, Left and Right but this would take longer to learn than one matrix which is directionally specific. This matrix is rotated and multiplied by the current environment to attain the expected benefit of making that move.

Because in our simulation there are eight possible units, and each creature examines a 10 x 10 window, yielding 800 free parameters to be learned.

The update rule for the perceptron network is based on a reinforcement learning scheme. Essentially upon competition for each game, each the value assigned to each move is the immediate score acquired on that turn plus some percentage of the value of the previous move.

4 : Testing Procedure

The testing procedure involved introducing the ghost and pacman to environments of increasing complexity. By starting with a very simple environment, the pacman would begin with reinforcing very basic behavior such as avoiding walls, avoiding ghosts, eating pacman, eating edible ghosts, and eating dots.

After a number of iterations, the board was changed and more complex behaviors needed to be developed. There is a chance that simple behaviors could be unlearned, if they are not as useful at higher levels, but our intention is to learn behaviors which can be generally applied to most situations.

These behaviors could be categorized as simple, general drives(i.e. a will to live, a desire for food, adversion moving into walls). Each of these drives are competing against each other, but depending on the strength of the reinforcement, certain drives will dominate. For example, the drive to avoid getting eaten should outweigh the drive to eat dots.

Figure 4.1:
Level 1 was extremely simple. It has only three open squares. If the pacman moves down, it wins. If it moves up, it dies. If it moves any other direction, its chance of dying increases. This is an attempt to force pacman to learn 4 basic behaviors:

moving directly into pacman is bad
moving away from the ghost is good
moving towards dots is good
moving into walls doesn't benefit

The ghosts are expected to learn

moving onto pacman is good
any other move is less good

Figure 4.2:
Level 2 was only a little more complex. It has four open squares and included a power dot. This level is intended to reinforce:

pacman's preference for power dots, especially when ghosts are near
large pacman's desire to eat edible ghosts
the ghost's fear of the large PacMan
undesirability of moving directly into walls

Figure 4.3:
Level 3 has more open space than either of the proceeding levels. There are many dots, but only one power dot. If pacman successfully eats the ghost then the ghost is exiled to the upper left corner of the board, further reinforcing the desirability of that action. An important behavior which might be learned is moving towards dots which are not directly in front of pacman.

Figure 4.4:
Level 4 shows a jump in complexity. It has circular paths and multiple power dots. The ghost will appear on the board again even if eaten. The paths are intended to reinforce continuing on the same path, once chosen, because continuing along a path gives positive marginal reinforcement and turning around give mildly negative reinforcement.

Figure 4.5:
Levels 5-10 are more and more complex. There are more complex paths, more power dots, and paths from one side of the image to the other. If the lower level driving behaviors were properly learned(i.e. avoid walls, run from ghosts, etc.), the pacman and ghost should be more successful on the higher levels. It is not expected that this simple perceptron model can solve higher level tasks, it is only intended to provide drives based on the surrounding environment. In fact, it can only learn tasks which are linearly separable.

5 : The Perceptron Network As Accumulated Drives

Figure 5.1.1 The influence of ghost position on measurment of goodness of PacMan's forward movement. The dark spot above the center square indicates that it is very unfavorable to move forward when a ghost is before pacman. Similarly the light square below the center indiates that PacMan should move away from the ghost.	Figure 5.1.2 The influence of an edible ghost position on measurment of goodness of PacMan's forward movement. In general it is advantagious to eat it when it is close.
Figure 5.1.3 The influence of regular dots on PacMan's forward movement.	Figure 5.1.4 The influence of powerdots on PacMan's forward movement. Notice the strong attraction when they can be eaten in only a small number of steps.
Figure 5.1.5 The influence of walls on PacMan's forward movement. It is generally advantagous to move away from walls when possible.

Figure 5.2.1
The drive of a ghost to move towards pacman when he is edible.

Figure 5.2.2
The drive of a ghost to move away from pacman when he can eat the ghost.

Figure 5.2.3
The dots generally have no influence on the direction of the ghosts movement. This is evidence of the failure of this model to capture complex concepts like the fact that PacMan eventually will visit points with dots on them.

Figure 5.2.4
The ghosts tend to stay away from power dots, wether or not PacMan is actucally posing a real threat.

Figure 5.2.5
The influence of walls on the Ghost's forward movement. Just as for PacMan, it is generally advantagous to move away from walls when possible.

6 : Cross Generational Testing on Level 1

Cross Generational Testing
		GHOST
	LEVEL	1	2	3	4	5	6	7
P A C M A N	1	64%	66%	70%	60%	68%	65%	66%
	2	61%	53%	55%	51%	48%	42%	49%
	3	48%	59%	55%	54%	55%	52%	45%
	4	55%	61%	59%	57%	55%	60%	58%
	5	52%	52%	51%	49%	48%	50%	62%
	6	53%	52%	41%	56%	55%	59%	55%
	7	52%	48%	56%	55%	49%	57%	48%

Figure 6.1:
Cross generational testing results, curve color indicates maxiumum PacMan training level, abcissica indicates maximum Ghost training level, curve height indicates percentage of PacMan wins.

6 : Cross Generational Testing on Level 2

Cross Generational Testing
		GHOST
	LEVEL	1	2	3	4	5	6	7
P A C M A N	1	52%	60%	50%	51%	44%	54%	61%
	2	72%	59%	61%	66%	66%	74%	67%
	3	81%	80%	77%	87%	76%	80%	83%
	4	81%	80%	78%	69%	70%	82%	81%
	5	80%	78%	75%	87%	70%	76%	72%
	6	77%	74%	70%	87%	70%	76%	72%
	7	82%	77%	68%	80%	78%	86%	76%

Figure 6.2:
Cross generational testing results, curve color indicates maxiumum PacMan training level, abcissica indicates maximum Ghost training level, curve height indicates percentage of PacMan wins.

6 : Cross Generational Testing on Level 3

Cross Generational Testing
		GHOST
	LEVEL	1	2	3	4	5	6	7
P A C M A N	1	26%	30%	16%	27%	23%	27%	27%
	2	50%	52%	41%	42%	48%	41%	48%
	3	59%	61%	53%	46%	55%	62%	56%
	4	47%	54%	55%	57%	56%	44%	45%
	5	55%	48%	55%	51%	67%	56%	54%
	6	51%	53%	46%	51%	50%	51%	55%
	7	51%	54%	62%	60%	60%	49%	53%

Figure 6.3:
Learning performance quickly improves then asymptotes. The effect of improved ghost training on pacman victory begins to appear.

A1 : Images " of" " the" Levels

Jeremy S. De Bonet
jsd@debonet.com

Page loaded on July 12, 2025 at 06:07 AM.
Page last modified on 2006-05-27