Jeremy S. De Bonet : TRACKING : PART OF THE HCI PROJECT

Multiple Room Occupant Location and Identification

Purpose

To interact effectively with the users of the Intelligent Room, it is critical to know where the occupants are. With such information, HCI input and output can be adapted to the physical context of each occupant. Examples include using knowledge of an occupant's location to resolve speech ambiguities such as "this" and "that", and displaying information on devices which are near and/or visible to the occupant to whom the information is relevant. Occupant location information can further be used as input for additional processing, such as providing a foveal area for a gesture-recognition system, or allowing occupants to be "objectified" so that they may be labeled or pointed to by other systems.

Task Description
To provide a list of three-dimensional coordinates or bounding boxes for each room occupant, given the current accumulated knowledge about the room and input from the room's sensory devices.

Overview of the Current Implementation

The current tracking system uses cameras located in two corners of the room, as shown in figure 1.1, whose views are shown in figure 1.2.

Figure 1.1:
Locations of the two tracking cameras in the HCI room. These cameras are mounted on the walls about 8 feet above the floor, and are tilted slightly downward.

Figure 1.2:
Sample output images from the two tracking cameras.

The output image from each camera is analyzed by a program which labels each occupant in the room and identifies a bounding box around them. This information is then sent to a coordination program, which synchronizes the findings from the individual cameras and combines their output using a neural network to recover a 3D position for each room occupant.

Additional Implementation Features

In addition to producing the 3D position of room occupants, the tracking subsystem controls several other aspects of the room's "behavior". These additional behaviors are more effectively implemented as direct parts of this subsystem than as independent systems, because this circumvents the need for interaction with the Brain. The resulting mechanisms act very much like reflex behaviors in living creatures. Currently three such reflexes have been implemented:

Control of mobile cameras to follow a particular room occupant. In the current implementation these cameras are mounted above the stationary tracking cameras, and are controlled by servo motors, which allow the cameras to rotate laterally. These cameras are shown in figure 1.3.
Selection of the optimal view of a particular occupant. Using the 3D position of the occupant who is being followed by the mobile cameras, this mechanism selects the "best" (as defined by predetermined room locations) view of that occupant from the two mobile-camera views and a third stationary camera centered in the room, as shown in figure 1.4.
Direct output of occupant location to a display program. This output interface allows real-time visualization of the location of each of the room's occupants. To date, two output programs have been designed to connect to this interface: one which shows a top-down view of the current and past locations, and another which uses a 3D Inventor display. These are shown in figure 1.5.

Figure 1.3:
One of the dual-camera mounts in the HCI room. The lower camera, which uses a wide-angle lens, is used for tracking room occupants. The upper camera is mounted on a computer-controlled servo motor, which the tracking subsystem drives to follow occupants around the room.

Figure 1.4:
Locations and approximate view frustums of the three cameras whose views the selection mechanism chooses between.

Figure 1.5:
Sample output from programs connected to the visualization interface.

The Current Implementation
Code Files
Most of the code used for this implementation is based upon the Vision-Development (visiondev) library, developed by members of the HCI project. For more information about this library, contact the author.
Single camera tracking code
- ~jsd/hci/TRACKING.cpp: performs tracking of multiple occupants using a single view.
- ~jsd/hci/HCI_INTERFACE.cpp: allows trackers to pass model information back and forth, and contains the interface for communication with the coordination system.
Single camera tracking configuration
- ~jsd/hci/TRACKER.cfg: various parameters which control tracker performance.
- ~jsd/hci/BoundingBoxesL.cfg: bounding boxes of legal occupant positions in the left camera.
- ~jsd/hci/BoundingBoxesR.cfg: bounding boxes of legal occupant positions in the right camera.
Mobile camera control program
- ~jsd/hci/Follow.cpp: interfaces with the Motorola serial IO chip which controls the servo motors on the mobile cameras.
Camera coordination and Brain interface
- ~jsd/hci/Coordinator.cpp: combines output from the trackers to produce a 3D occupant position. Interfaces with the Brain to output occupant positions and to handle requests for entering and exiting occupants.
Compilation: make TRACKER from the ~jsd/hci directory

Hardware requirements and configurations

Currently the Tracking system requires the concurrent execution of 6 programs. A bare minimum requirement is two SGI Indy stations, due to the single analog line-in limitations of these systems. However, the system is currently configured to use 3 SGIs, to reduce the workload on each of these systems. The current systems are:

raphael.ai.mit.edu
- Coordinator
brave-heart.ai.mit.edu
- TRACKER -L (NOTE: The input from the left camera must be fed into the serial line-in camera connection.)
- Follow -L (NOTE: The serial-port must be connected to the left servo controller.)
- gameout
diablo.ai.mit.edu
- TRACKER -R (NOTE: The input from the right camera must be fed into the serial line-in camera connection.)
- Follow -R (NOTE: The serial-port must be connected to the right servo controller.)

Bootup sequence

diablo and brave-heart must be logged into with xhost+ set.
If a connection to the Brain is desired, the Brain must be running before startup.
On raphael: run ~jsd/hci/Coordinator with options:
- use [+/-]B to use or ignore the Brain
- use [+/-]A # to automatically grab the first # people in the virtual doorway
- use [+/-]T "<TRACKER OPTIONS>" to send options to the trackers
- use [+/-]F to follow the first occupant with the mobile cameras.

Algorithms

The Tracker algorithm
The purpose of the tracker algorithm is, given an image from a single camera, to identify the optimal bounding box for each occupant of the room.

Background subtraction for segmentation

The principal segmentation method used by this algorithm is background subtraction. Because of the particular configuration and general nature of the HCI project, we can take advantage of the fact that the HCI environment is generally static. Further, we can exploit the fact that the cameras used for tracking are stationary. Thus, by accumulating an image of what we expect the "true" background of the room to be, we can subtract this background from the current image to find which pixels differ, and hence contain non-background objects. An example of such a background image is shown in figure 1.6.

Figure 1.6:
A sample background image used for segmentation.
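The subtraction step described above can be sketched as follows. This is a minimal illustration, not the tracker's actual code: the function name foreground_mask, the use of a summed absolute channel difference, and the threshold value are all assumptions made for the example.

```python
def foreground_mask(frame, background, threshold):
    """Mark pixels of `frame` that differ from `background`.

    Both images are 2D lists of (r, g, b) tuples with the same shape.
    The difference metric (sum of absolute channel differences) is an
    assumption; the page does not say which metric the tracker uses.
    """
    mask = []
    for frame_row, bg_row in zip(frame, background):
        mask_row = []
        for pixel, bg_pixel in zip(frame_row, bg_row):
            diff = sum(abs(c - b) for c, b in zip(pixel, bg_pixel))
            mask_row.append(diff > threshold)
        mask.append(mask_row)
    return mask


# A 2x2 background and a frame in which one pixel is covered by an occupant.
background = [[(10, 10, 10), (10, 10, 10)],
              [(10, 10, 10), (10, 10, 10)]]
frame = [[(12, 9, 10), (200, 180, 160)],
         [(10, 10, 10), (11, 11, 11)]]
mask = foreground_mask(frame, background, threshold=30)
# mask is [[False, True], [False, False]]: only the changed pixel is flagged.
```

Small differences caused by sensor noise stay below the threshold, which is why the choice of threshold matters so much in shadowed regions.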

There are several issues which arise within this method:

Our principal objective is to identify room occupants, not chairs or other mobile items. Backgrounding should eliminate such objects from consideration.
Stationary people, or people who are present during start-up, present extra difficulties. A Backgrounding method must be able to "learn" when to replace its previous belief about the background with new information.
The system must set the threshold so that room occupants are above it regardless of their local environment. Shadows make this problem particularly severe, because in shadows the variation between background and foreground (occupant) is far lower than elsewhere.

The current Backgrounding scheme handles these problems using several techniques. To handle the first two problems, the background must be allowed to update to incorporate image regions which are not believed to be people. Background updating is therefore governed by two mechanisms, one passive and one active.

The passive mechanism keeps track of differences from the background and their longevity. Two backgrounds are kept: the background which is used for subtraction, and a second, accumulating background which averages over previous time-steps which are within threshold. When an input image is acquired which differs from the accumulating background by more than the threshold, the count for those pixels restarts. If the incoming image stays within threshold for more than a certain count of frames, the background which is used for subtraction is updated with the new value. This has the effect of incorporating whole regions of steady pixels into the subtraction background. For example, when a chair is moved, it will be detected in the subtraction image only for a certain number of frames; then the whole chair will be incorporated into the background. Examples of the accumulation background and the counter image are shown in figure 1.7.

Figure 1.7:
The background is updated using an accumulation buffer which keeps track of steady regions which differ from the background. When a region has been steady long enough, the background is updated.
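The passive update rule can be sketched for a single grayscale pixel as below. This is only an illustration under stated assumptions: the function name update_pixel, the running-average formula, and the parameter names threshold and min_steady_frames are hypothetical, and the real tracker operates on whole color images.

```python
def update_pixel(value, subtraction_bg, accum_bg, count, threshold, min_steady_frames):
    """Advance the passive background update for one grayscale pixel.

    Returns the new (subtraction_bg, accum_bg, count) triple.
    """
    if abs(value - accum_bg) > threshold:
        # The pixel changed: restart the accumulation from the new value.
        return subtraction_bg, float(value), 0
    # The pixel is steady: count the frame and fold it into the running average.
    count += 1
    accum_bg = accum_bg + (value - accum_bg) / (count + 1)
    if count > min_steady_frames:
        # Steady for long enough: adopt it as the background used for subtraction.
        subtraction_bg = accum_bg
    return subtraction_bg, accum_bg, count


# A chair is moved in front of a pixel whose background value was 50.
state = (50.0, 50.0, 0)
for frame_number in range(1, 6):
    state = update_pixel(120, *state, threshold=10, min_steady_frames=3)
    print(frame_number, state[0])
```

With these illustrative settings the subtraction background keeps its old value for the first four frames and switches to the chair's value on the fifth, which mirrors the "detected only for a certain number of frames" behavior described above.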

The active mechanism simply suppresses the updating of the background accumulation buffer within the bounding boxes surrounding room occupants. This suppression only slows the "steadiness" counter down by a factor, instead of completely halting it; this is done to prevent the protection of accidentally selected regions. For example, if a moved chair is accidentally identified as an occupant, suppressing the counter completely would prevent the Backgrounding mechanism from ever incorporating the chair.
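One way to picture the slowed counter is as a fractional increment. The sketch below is an assumption-laden illustration: the name steadiness_increment, the box format, and the slowdown factor of 4 are invented for the example and are not taken from TRACKER.cfg.

```python
def steadiness_increment(x, y, occupant_boxes, slowdown=4.0):
    """Return how far to advance the steadiness counter for pixel (x, y).

    occupant_boxes holds (left, top, right, bottom) tuples in pixel
    coordinates, inclusive on every edge. Inside any box the counter
    advances `slowdown` times more slowly instead of stopping.
    """
    for left, top, right, bottom in occupant_boxes:
        if left <= x <= right and top <= y <= bottom:
            return 1.0 / slowdown
    return 1.0


boxes = [(0, 0, 10, 10)]
inside = steadiness_increment(5, 5, boxes)    # 0.25
outside = steadiness_increment(20, 5, boxes)  # 1.0
```

Because the increment never reaches zero, a chair mistaken for an occupant is still absorbed into the background eventually; it just takes `slowdown` times as many frames.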

To compensate for shadows which cause dramatic changes in the brightness and chromaticity of occupants as they move around the room, a color correction system is used.

Color Correction

Color correction for the tracking mechanism occurs in two phases.

The first uses previously acquired knowledge of the lighting variation in the room to correct the incoming image. This calibration has only been done roughly, yet it seems to do a good job of correcting for many of the deepest shadows. Unfortunately, because the room is dynamically lit by two projection TVs, the usefulness of a static color correction mechanism is limited. Calibration consists of collecting a time-averaged image of a uniform white object (ideally one with spectrally uniform reflectance). The incoming video stream can then be color corrected by normalizing each pixel by its corresponding pixel in the collected image. One such color correction image is shown in figure 1.8.

Figure 1.8:
A calibration image used for color correction.
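The per-pixel normalization can be sketched as a channel-wise division by the calibration image. The function name color_correct, the output scale of 255, and the floor that guards against division by zero are assumptions for this example.

```python
def color_correct(frame, calibration, scale=255.0, floor=1.0):
    """Divide each pixel by the matching pixel of the white calibration image.

    frame and calibration are 2D lists of (r, g, b) tuples. `floor` keeps
    very dark calibration pixels from causing a division by zero, and the
    result is clipped to `scale`; both choices are illustrative.
    """
    corrected = []
    for frame_row, cal_row in zip(frame, calibration):
        corrected_row = []
        for pixel, white in zip(frame_row, cal_row):
            corrected_row.append(tuple(
                min(scale, scale * value / max(reference, floor))
                for value, reference in zip(pixel, white)
            ))
        corrected.append(corrected_row)
    return corrected


# The white reference appears half as bright in the shadowed left pixel.
calibration = [[(100, 100, 100), (200, 200, 200)]]
# The same gray object seen in shadow (left) and in full light (right).
frame = [[(50, 50, 50), (100, 100, 100)]]
corrected = color_correct(frame, calibration)
# Both pixels become (127.5, 127.5, 127.5) after correction.
```

The correction is only as good as the calibration image, so lighting that changes after calibration (such as the projection displays) is not compensated.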

The second method of color correction transforms the color space of the incoming image. Currently, (r, g, b) is mapped to (r/(g+b), g/(r+b), b/(r+g)). Figure 1.9 shows a segmented and color-transformed image. This is the weakest link in the current system: the stability of this normalization is less than optimal, and it is a place for great improvement.

Figure 1.9:
Segmented output which has been transformed into a more stable though still suboptimal color space.
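The color space transformation is simple enough to write out directly. In the sketch below only the mapping itself comes from the text; the function name normalize_color and the eps guard against zero denominators are assumptions.

```python
def normalize_color(r, g, b, eps=1e-6):
    """Map (r, g, b) to (r/(g+b), g/(r+b), b/(r+g)).

    eps replaces a zero denominator (for example a black pixel); the page
    does not say how the tracker handles that case.
    """
    return (
        r / max(g + b, eps),
        g / max(r + b, eps),
        b / max(r + g, eps),
    )


lit = normalize_color(200, 100, 50)
shadowed = normalize_color(100, 50, 25)  # same surface at half the brightness
# Both calls give the same triple, because a uniform brightness scale cancels.
```

The ratios cancel a uniform change in brightness, which is what makes the space more stable under shadows. For dark pixels the denominators become small, so a little sensor noise produces large swings in the output; this is one plausible source of the instability noted above.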

Occupant detection

Given the segmented output from the earlier processes, the remaining task is to determine bounding boxes around each room occupant, and consistently label them.

The first step is to detect the regions which are most likely to contain room occupants, so that further attention can be paid to these selected regions to perform consistent labeling. Candidate regions are selected from a precalibrated list of legal bounding boxes (currently stored in the file HCIcfg/BoundingBoxes*.cfg, where '*' is one of 'L' or 'R'). By predetermining legal bounding boxes, we can compensate for occlusions by assuming the existence of a full-sized person, even though part of their body may be hidden from view. This also reduces the space of candidates which we must consider for occupancy; for example, we do not need to consider people on the ceiling. An example of the bounding boxes is shown in figure 1.10. The N+2 bounding boxes (N being the number of room occupants) with the highest density of pixels which differ from the background are then selected, and the selected regions are passed on to the labeling system.

Figure 1.10:
Example of the legal bounding boxes for a room occupant, shown in gray. Black boxes are partially filled, and the green box is the best model fit.
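Candidate selection can be sketched as ranking the legal boxes by the fraction of foreground pixels they contain. The names box_density and select_candidates, the box format, and the tiny mask are hypothetical; only the "N+2 densest boxes" rule comes from the text.

```python
def box_density(mask, box):
    """Fraction of pixels inside `box` that differ from the background.

    mask is a 2D list of 0/1 (or bool) values indexed as mask[y][x];
    box is (left, top, right, bottom), inclusive on every edge.
    """
    left, top, right, bottom = box
    area = (right - left + 1) * (bottom - top + 1)
    filled = sum(mask[y][x]
                 for y in range(top, bottom + 1)
                 for x in range(left, right + 1))
    return filled / area


def select_candidates(mask, legal_boxes, num_occupants):
    """Return the num_occupants + 2 legal boxes with the highest density."""
    ranked = sorted(legal_boxes, key=lambda box: box_density(mask, box), reverse=True)
    return ranked[:num_occupants + 2]


mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
legal_boxes = [(2, 2, 3, 3), (4, 0, 5, 1), (0, 0, 1, 1), (0, 0, 2, 1)]
candidates = select_candidates(mask, legal_boxes, num_occupants=1)
# candidates is [(0, 0, 1, 1), (0, 0, 2, 1), (4, 0, 5, 1)]
```

Keeping two more candidates than there are occupants leaves the labeling stage some slack when a box is filled by a moved object rather than a person.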

To identify each occupant the tracker algorithm makes use of a time averaged model of each occupant. This model is initialized when a new occupant announces himself to the room. The Brain then sends a message to the coordinator, which simply passes the message on to the trackers to look for a new occupant in the pre-selected virtual doorway (see the HCIcfg/TRACKER.cfg file for the current settings). When both trackers signal that they have located the new occupant, they begin tracking.

Every several frames the two trackers pass their models back and forth so that over time they have the same model of the occupant. This is useful because presumably all sides of the occupant will be observed by both cameras. Oversized samples (for viewing purposes) are shown in figure 1.11.

Figure 1.11:
Examples of the room occupant models used to identify each occupant. Enlarged to show details.

Labeling is performed by exhaustively calculating the error for matching the preselected boxes with each model. This process is O(N!) in the number of room occupants; however, for small N this is acceptable. The labeling which minimizes the total error is then selected. This labeling, acquired independently for each of the two tracking cameras, is then passed on to the Coordination system.
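The exhaustive search over labelings can be sketched with itertools.permutations. Here each occupant model and each candidate box is reduced to a mean (r, g, b) color and the error is a squared distance; both simplifications, and the names match_error and best_labeling, are assumptions made for the example rather than the tracker's actual model or error measure.

```python
from itertools import permutations


def match_error(model, candidate):
    """Squared distance between two (r, g, b) summaries (illustrative metric)."""
    return sum((m - c) ** 2 for m, c in zip(model, candidate))


def best_labeling(models, candidates):
    """Try every assignment of distinct candidates to models.

    Returns (assignment, total_error), where assignment[i] is the index of
    the candidate box given to model i.
    """
    best_assignment, best_error = None, float("inf")
    for assignment in permutations(range(len(candidates)), len(models)):
        error = sum(match_error(models[i], candidates[j])
                    for i, j in enumerate(assignment))
        if error < best_error:
            best_assignment, best_error = assignment, error
    return best_assignment, best_error


models = [(200, 50, 50), (50, 50, 200)]          # two occupants
candidates = [(60, 55, 190), (120, 120, 120),    # N + 2 = 4 candidate boxes
              (190, 60, 55), (0, 0, 0)]
assignment, error = best_labeling(models, candidates)
# assignment is (2, 0) with a total error of 450.
```

With N models and N+2 candidates the loop visits (N+2)!/2 assignments, so the factorial growth is tolerable only because the number of occupants in the room is small.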

The Coordination Algorithm
In addition to handling communication with the Brain, the Coordinator program combines the output from the two independent tracking systems to provide a single 3D location for each occupant.
Several techniques have been tried to perform this 3D reconstruction:
Nearest-Neighbor: Using the known positions of the occupant when the bounding boxes were collected, the simplest solution was to return the average of the 3D positions corresponding to the bounding boxes from each image.
Linear approximation: The positions of the closest four neighbors are averaged to produce a 3D position, in the same fashion as above.
Neural Network: The bounding box data was fed into a neural network, the result of which is a function which closely approximates the projective transformation.
Reversing the projective transformation: The best solution is to find the projective transformation which best fits the bounding boxes and then invert it to find the 3D position. Because the Neural Network solution works well, this has not been implemented.
Currently the neural network is used to find the 3D position. Clearly, reversing the projective transformation is a more robust and well-founded technique, and it should at some point be applied.
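The two simplest reconstruction techniques can be sketched together, since they differ only in how many calibration neighbors are averaged. Everything concrete here is an assumption: the calibration tables pairing a bounding box with the 3D position at which it was collected, the squared-coordinate box distance, and the names estimate_from_camera and estimate_position. The neural network currently in use is not reproduced.

```python
def box_distance(a, b):
    """Squared difference between two (left, top, right, bottom) boxes."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def average(points):
    """Component-wise mean of a list of (x, y, z) points."""
    return tuple(sum(p[i] for p in points) / len(points) for i in range(3))


def estimate_from_camera(box, calibration, k=1):
    """Average the known 3D positions of the k calibration boxes nearest `box`.

    calibration is a list of (box, (x, y, z)) pairs. k=1 corresponds to the
    nearest-neighbor scheme and k=4 to the four-neighbor average.
    """
    nearest = sorted(calibration, key=lambda entry: box_distance(box, entry[0]))[:k]
    return average([position for _, position in nearest])


def estimate_position(left_box, right_box, left_calibration, right_calibration, k=1):
    """Combine the per-camera estimates by averaging them."""
    return average([
        estimate_from_camera(left_box, left_calibration, k),
        estimate_from_camera(right_box, right_calibration, k),
    ])


left_calibration = [((0, 0, 10, 20), (0.0, 0.0, 0.0)),
                    ((50, 0, 60, 20), (2.0, 0.0, 0.0))]
right_calibration = [((0, 0, 10, 20), (0.0, 0.0, 2.0)),
                     ((50, 0, 60, 20), (2.0, 0.0, 2.0))]
position = estimate_position((1, 0, 11, 20), (49, 0, 59, 20),
                             left_calibration, right_calibration)
# position is (1.0, 0.0, 1.0)
```

Both variants can only return positions inside the span of the calibration samples, which is one reason a fitted projective model would be the better-founded choice.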

Occupant Following and View Selection

Each tracking camera independently controls its mobile camera simply by setting its rotation to be the X coordinate of the desired occupant in the tracking image.
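A rough sketch of this mapping is given below. The text only says that the rotation is set from the occupant's X coordinate; the linear conversion to an angle, the function name pan_angle, and the image width and field of view used in the example are assumptions, and the serial protocol spoken by Follow.cpp is not modeled.

```python
def pan_angle(box, image_width, field_of_view_deg):
    """Convert the horizontal center of an occupant's box to a pan angle.

    box is (left, top, right, bottom) in tracking-image pixels. The result
    is in degrees, with 0 meaning the center of the tracking image.
    """
    left, _, right, _ = box
    center_x = (left + right) / 2.0
    return (center_x / image_width - 0.5) * field_of_view_deg


centered = pan_angle((300, 0, 340, 100), image_width=640, field_of_view_deg=90.0)
# centered is 0.0
to_the_right = pan_angle((470, 0, 490, 100), image_width=640, field_of_view_deg=90.0)
# to_the_right is 22.5
```

Because each tracker uses only its own image, no 3D position is needed for following, which fits the "reflex" character of this behavior.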

The optimal view is selected based upon the location of the occupant as determined by the Coordination system. View selection is MUX-controlled from within the Coordinator program. Figure 1.12 shows the regions of the room which, when entered by an occupant, cause a change to the corresponding camera. The camera positions are shown in figure 1.4. Notice that when the occupant is on the right of the room, the left follow camera is selected, because the occupant is presumably facing into the room and toward the left camera. In the "dead zones" in between, the selected view retains its last output, to create a steady flow and prevent rapid flipping back and forth when the occupant is on a region border.

Figure 1.12:
The regions in the room which trigger selection of a particular camera output for the optimal view of a particular occupant. Useful for applications such as automatic occupant filming (during a lecture for example).
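The dead-zone behavior amounts to a small piece of hysteresis, which can be sketched as below. The region rectangles, the view names, and the function select_view are invented for the example; the real regions are those shown in figure 1.12.

```python
def select_view(position, regions, current_view):
    """Pick the camera view for an occupant at floor position (x, y).

    regions is a list of (view_name, (x_min, y_min, x_max, y_max)) pairs.
    Outside every region (a "dead zone") the current view is kept.
    """
    x, y = position
    for view, (x_min, y_min, x_max, y_max) in regions:
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return view
    return current_view


# Hypothetical layout: the right side of the room selects the left follow camera.
regions = [
    ("left_follow", (6.0, 0.0, 10.0, 10.0)),
    ("right_follow", (0.0, 0.0, 4.0, 10.0)),
]
view = "center"
for position in [(8.0, 5.0), (5.0, 5.0), (2.0, 5.0)]:
    view = select_view(position, regions, view)
    print(position, view)
```

In this run the view switches to left_follow, stays there while the occupant crosses the gap between the regions, and changes to right_follow only once the other region is entered.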

Jeremy S. De Bonet
jsd@debonet.com

Page last modified on 2006-05-27