Our approach to real-time vision is heavily biased towards the needs of a mobile robot navigating in a structured environment. We start with the observation that low-bandwidth sensors (such as Polaroid-type sonar) are adequate for simple obstacle-avoidance and path-following behaviors. Earlier research with the Spatial Semantic Hierarchy (SSH) has shown that robots with no visual abilities can build useful maps of their environment and navigate through it. However, one thing that sonar and other range-finding senses, such as laser scanners, lack is the ability to use surface properties to localize and identify objects.
ARGUS is our framework for visual identification and tracking of objects. Color video allows us to identify objects by their color, shape, and the spatial relationships among their constituent parts. Since ARGUS is designed to be a sense for mobile robot map-building, real-time performance is critical. See the publications section for papers describing the system.
The problem with live, real-time color video is the tremendous data bandwidth of the image stream. For any particular visual task, however, such as tracking an object moving across the visual field, most of the data in the video signal is irrelevant and can be safely ignored. This presents the potential for a substantial reduction in computational complexity. ARGUS exploits this potential by using simple image features, such as single-color blobs and prominent edges, as the building blocks for its object representations.
If we can sample the image stream at a sufficiently high rate, each individual blob or edge segment of an object can be tracked by examining a very small area around its last location, since it will not move far enough between frames to escape. Intelligently "cropping" the video image to this small area focuses processing power on a small subset of the image pixels. For many tasks, processing can be further reduced by subsampling the cropped image. For example, high-contrast straight edges will appear much the same regardless of the number of pixels used to display them. As an example, consider the low-resolution view of the interior of our lab below.
The following sequence of images shows the left edge of the lab door using 2500, 1500, 1000, 200, and 100 pixels. The images have been adjusted to be the same size so that they can be more easily compared.
Notice that even when using only 100 pixels, the slope and extent of the door's edge are clear. If the task allocated to this particular site of attention were to track the edge of the door, 100 pixels would be all the information needed. Processing 100 pixels instead of 2500 cuts the per-frame pixel count by a factor of 25. By subsampling to an appropriate level, the tracking system can allocate processing power to the tasks that need it and save time on those features that can be tracked using low-resolution windows.
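The cropping and subsampling described above can be sketched in a few lines. This is a minimal illustration, not the ARGUS implementation: the function name `crop_and_subsample`, the square window, and the synthetic frame are all assumptions made for the example.

```python
def crop_and_subsample(image, center, half_size, step):
    """Return a subsampled square window of `image` around `center`.

    image:     list of rows, each a list of pixel values
    center:    (row, col) of the feature's last known location
    half_size: the window extends this many pixels either side of center
    step:      keep every `step`-th pixel in each direction
    """
    rows, cols = len(image), len(image[0])
    r0 = max(0, center[0] - half_size)
    r1 = min(rows, center[0] + half_size)
    c0 = max(0, center[1] - half_size)
    c1 = min(cols, center[1] + half_size)
    return [row[c0:c1:step] for row in image[r0:r1:step]]


# Synthetic 100x100 frame (an assumption for the demo): a vertical
# light-to-dark edge at column 50.
frame = [[255 if c < 50 else 0 for c in range(100)] for r in range(100)]

full = crop_and_subsample(frame, (50, 50), 25, 1)    # 50 x 50 = 2500 pixels
coarse = crop_and_subsample(frame, (50, 50), 25, 5)  # 10 x 10 = 100 pixels
```

In the 100-pixel window the edge is still a clean transition halfway across each row, which is all an edge tracker needs in order to recover its position.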
It is possible to further optimize processing by reducing the resolution of the color representation. As few as 8 bits per pixel in total (3 red, 3 green, and 2 blue) are enough to compute meaningful results. The advantage goes beyond storage and transmission efficiency: because a pixel can take only 256 distinct values, arbitrarily complex one- or two-argument computations can be handled as precomputed table lookups. All of the images on this page use the 3/3/2 representation.
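As a rough sketch of why the 3/3/2 representation makes table lookups practical, the code below packs a pixel into one byte and precomputes a one-argument table (256 entries) and a two-argument table (256 x 256 entries). The bit layout (red in the high bits), the plain-average intensity, and the sum-of-absolute-differences distance are assumptions chosen for illustration; the text does not specify them.

```python
def pack_332(r, g, b):
    """Pack 8-bit R, G, B into one byte: 3 bits red, 3 green, 2 blue.

    Layout RRRGGGBB is an assumption; the source only gives the bit counts.
    """
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)


def unpack_332(p):
    """Recover approximate 8-bit channels (the discarded low bits are zero)."""
    return ((p >> 5) & 0x7) << 5, ((p >> 2) & 0x7) << 5, (p & 0x3) << 6


# One-argument lookup: intensity of every possible packed pixel value.
INTENSITY = [sum(unpack_332(p)) // 3 for p in range(256)]


def _color_distance(p, q):
    """Sum of absolute channel differences (a hypothetical distance measure)."""
    return sum(abs(a - b) for a, b in zip(unpack_332(p), unpack_332(q)))


# Two-argument lookup: distance between any pair of packed pixel values.
DISTANCE = [[_color_distance(p, q) for q in range(256)] for p in range(256)]

red = pack_332(255, 0, 0)
black = pack_332(0, 0, 0)
print(INTENSITY[red], DISTANCE[red][black])
```

However expensive the underlying function is, the per-pixel cost at run time is a single index operation, since the tables are built once up front.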
ARGUS tracks simple features by connecting a variable-geometry window (called an attention buffer) with a computational agent. We refer to this combination as a feature tracker. The attention buffer is a subsampled, cropped, color-reduced region of the incoming image stream. The computational agent applies simple image-processing algorithms to its associated attention buffer and adjusts the buffer's geometry (position and orientation) with respect to the image stream to follow a feature as it moves from frame to frame. At present, we have trackers for straight edge segments and for blobs of a constant color. (We are working on making the blob tracker's color recognition adaptive.)
Conceptually, ARGUS represents objects as a hierarchically structured set of image features and the geometric relations between them. The figure below depicts a schematic representation of a possible model for a door in an office environment.
ARGUS uses models like this to guide the object recognition process. In steady state, each edge in the model would be tracked by a feature tracker attuned to edges, and a blob tracker would follow the knob and the main body of the door. There is a higher level object, a visual schema, that is responsible for the creation and placement of the individual trackers. This schema is also responsible for ensuring that the individual trackers are tracking features in the image that satisfy the geometric constraints associated with the model. If not, it repositions them appropriately.
The schema matches the model in an incremental fashion. At any given point, only a subset of the features of the model may be represented by active trackers following features in the image stream, but this does not prevent the schema from accurately pursuing those parts that have been located. Compared to a system that requires matching the entire object in every frame, the incremental nature of ARGUS's matching process means that it can track faster-moving objects with a given computational effort.
In one of our early experiments we found that even though we had a very limited connection to the frame grabber, which kept the frame rate under a couple of frames per second, the system could still track figures on flash cards (our test case) moving at a reasonable rate. This was due to the fact that the system only needed to locate a single feature in order to anchor its search for the other model parts.
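The idea of anchoring the search on a single located feature can be illustrated as follows. This is a simplified sketch under stated assumptions: the model is a flat table of hypothetical offsets rather than a hierarchy, and only translation is handled (no rotation or scale). The names `DOOR_MODEL`, `predict_from_anchor`, and `out_of_place` are invented for the example.

```python
# Hypothetical door model: feature name -> (x, y) position in model coordinates.
DOOR_MODEL = {
    "left_edge": (0, 50),
    "right_edge": (40, 50),
    "top_edge": (20, 0),
    "bottom_edge": (20, 100),
    "knob": (35, 55),
}


def predict_from_anchor(model, anchor_name, anchor_image_pos):
    """Given one located feature, predict where the other parts should be."""
    ax, ay = model[anchor_name]
    ix, iy = anchor_image_pos
    dx, dy = ix - ax, iy - ay  # translation from model to image coordinates
    return {name: (mx + dx, my + dy)
            for name, (mx, my) in model.items() if name != anchor_name}


def out_of_place(model, tracked, anchor_name, tolerance):
    """Names of active trackers whose positions violate the model geometry."""
    expected = predict_from_anchor(model, anchor_name, tracked[anchor_name])
    bad = []
    for name, (x, y) in tracked.items():
        if name == anchor_name:
            continue
        ex, ey = expected[name]
        if abs(x - ex) > tolerance or abs(y - ey) > tolerance:
            bad.append(name)
    return bad


# Only three of the five model features currently have active trackers.
tracked = {"left_edge": (110, 80), "right_edge": (151, 79), "knob": (200, 85)}
predicted = predict_from_anchor(DOOR_MODEL, "left_edge", tracked["left_edge"])
to_reposition = out_of_place(DOOR_MODEL, tracked, "left_edge", tolerance=5)
print(predicted["knob"], to_reposition)
```

Here the knob tracker has drifted, so a schema built along these lines would move it back to the predicted location while leaving the two edge trackers alone.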
Each tracker maintains a low-resolution sub-window of the image stream focused on the feature it is following. In each frame, the tracker must isolate its feature and predict a position and orientation for that feature in the next frame. This task is made simpler by the fact that the tracker knows the feature's characteristics from the previous frame. This section illustrates the tracker's operation and shows some of the processing performed by the edge and blob varieties.
The image below depicts potential sub-windows for trackers following the four outer edges of a door under the control of a door schema. The windows are shown here with their orientations with respect to the model, but in the real system, each edge tracker has its own local coordinate system in which the tracked edge is always kept vertical.
The edge tracker is designed to track a single edge. Its algorithms generate a set of candidate edges and select the one that most closely matches the edge from the previous frame. The match is computed as a weighted distance measure incorporating length, orientation, position, color, and intensity. Once a candidate has been selected, its parameters are used as a prediction for the appearance of the feature in the next frame.
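A weighted distance of this kind might look like the sketch below. The five attributes come from the description above, but their units, the scalar encoding of color, the use of absolute differences, and the weight values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    length: float       # pixels
    orientation: float  # degrees from vertical in the tracker's local frame
    position: float     # horizontal offset within the window, pixels
    color: float        # hypothetical scalar color measure
    intensity: float    # signed contrast across the edge (negative: light to dark)


# Hypothetical weights; real values would have to be tuned.
WEIGHTS = {"length": 1.0, "orientation": 2.0, "position": 1.0,
           "color": 0.5, "intensity": 0.5}


def distance(a, b, weights=WEIGHTS):
    """Weighted sum of absolute attribute differences between two edges."""
    return sum(w * abs(getattr(a, f) - getattr(b, f)) for f, w in weights.items())


def select_edge(candidates, previous):
    """Pick the candidate closest to the edge found in the previous frame."""
    return min(candidates, key=lambda c: distance(c, previous))


previous = Edge(20, 0, 12, 100, -80)
candidates = [
    Edge(19, 2, 13, 98, -75),  # same edge, slightly moved
    Edge(8, 1, 12, 100, -80),  # much shorter fragment
    Edge(20, 0, 12, 100, 80),  # opposite contrast (dark to light)
]
best = select_edge(candidates, previous)
```

The selected candidate then becomes `previous` for the next frame, which is how the tracker carries its prediction forward.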
The images below depict an edge tracker in action. The image on the left shows the actual sub-window of the image stream that the tracker has selected, and the image on the right depicts an intermediate stage in the edge tracker's image processing. In the processed image, only vertical edges are enhanced, and it is easy to pick out the desired feature, especially when you know that in the last frame the edge was about the height of the window, represented a light-to-dark transition, and was located slightly to the right.
Tracker image | Processing result
The blob tracker is intended to track a blob in the image stream given an aspect ratio and a target color. The initial implementation was rather primitive, with a rigid color specification, and no provision for detecting rotations, but rough rotation detection (when appropriate) has been added, and we are working on tolerating reasonably slow color drifts through an adaptive color matching scheme.
The images below demonstrate a blob tracker in action. As for the edge tracker, the image on the left shows the sub-window of the image stream selected by the tracker, and the image on the right depicts an intermediate stage of the blob tracker's processing. The original image has been binarized to represent only the pixels that belong to the tracked blob.
Tracker image | Processing result
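The binarization step can be approximated as below. This sketch assumes the rigid color specification is a fixed set of accepted packed pixel values and that the blob's position is taken as the centroid of the matching pixels; neither detail is stated in the text, and the function names are invented for the example.

```python
def binarize(window, accepted):
    """Mark pixels whose value is in the accepted color set (rigid matching)."""
    return [[1 if p in accepted else 0 for p in row] for row in window]


def centroid(mask):
    """Mean (row, col) of the marked pixels, or None if the blob was lost."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return None
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)


RED = 224  # a hypothetical target color as a packed 8-bit value
window = [[0] * 5 for _ in range(5)]
for r in (1, 2, 3):
    for c in (2, 3):
        window[r][c] = RED

mask = binarize(window, accepted={RED})
center = centroid(mask)
```

The offset of the centroid from the window's center gives the tracker a direction in which to move its sub-window for the next frame.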
Because ARGUS uses closely cropped sub-windows with low spatial and color resolution for its trackers, a given tracker's image stream is relatively low bandwidth. Consider the 200-pixel door edge image above. At one byte per pixel and 20 frames per second, this tracker's image stream requires about 4,000 bytes per second, just under 4 KB per second of bandwidth. This means that we can effectively distribute a substantial number of trackers across a network of PC-class machines using ordinary Ethernet.
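The bandwidth figure follows from simple arithmetic, worked through below. The tracker numbers come from the text; the 10 Mbit/s link capacity is an assumption about what "ordinary Ethernet" means, and protocol overhead is ignored.

```python
def stream_bytes_per_second(pixels, bits_per_pixel, fps):
    """Raw payload rate of an uncompressed image stream."""
    return pixels * bits_per_pixel * fps // 8


# Figures from the text: 200 pixels, 8 bits per pixel (3/3/2), 20 frames/s.
tracker_rate = stream_bytes_per_second(200, 8, 20)

# Assumed link capacity: 10 Mbit/s Ethernet, ignoring protocol overhead.
ETHERNET_BYTES_PER_SECOND = 10_000_000 // 8
max_trackers = ETHERNET_BYTES_PER_SECOND // tracker_rate

print(tracker_rate, max_trackers)
```

Even if real-world overhead consumed most of the link, this rough estimate leaves room for dozens of trackers on a single network segment.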
We have developed a simple distributed object system for this purpose, and as a result each tracker can be spawned on any available machine. Trackers receive their input image stream from another distributed object, a frame server, which must run on a machine with an appropriate frame grabber. The frame server is responsible for continuously transmitting the image stream inside the window that each of its client trackers has selected, and repositioning the window whenever the client requests.
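The division of labor between frame server and trackers can be sketched in-process as follows. The class and method names are hypothetical and do not reflect the actual distributed object system; network transport, subsampling, and window orientation are omitted.

```python
class FrameServer:
    """Keeps one window per client and crops each incoming frame to it."""

    def __init__(self):
        self._windows = {}  # client id -> (top, left, height, width)

    def set_window(self, client_id, top, left, height, width):
        """Called by a tracker to select or reposition its window."""
        self._windows[client_id] = (top, left, height, width)

    def deliver(self, frame):
        """Return the cropped region of `frame` for each registered client."""
        out = {}
        for client_id, (top, left, h, w) in self._windows.items():
            out[client_id] = [row[left:left + w] for row in frame[top:top + h]]
        return out


# Synthetic frame whose pixel value encodes its own (row, col) position.
frame = [[r * 10 + c for c in range(10)] for r in range(10)]

server = FrameServer()
server.set_window("edge-1", top=2, left=3, height=2, width=2)
first = server.deliver(frame)

# The tracker decides its feature has moved one pixel to the right.
server.set_window("edge-1", top=2, left=4, height=2, width=2)
second = server.deliver(frame)
```

Only the cropped window ever leaves the machine with the frame grabber, which is what keeps each tracker's stream small enough to send over the network.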
A visual schema has even lower-bandwidth inputs than the trackers. It only has to manage a trickle of symbols from each of its active trackers reporting their current status, and then decide whether the values are satisfactory. If not, the schema must take some action (most often, repositioning some of its trackers). The low bandwidth means that visual schemas can be distributed as well. As a result, ARGUS has no requirement that a schema run on the same machine as its trackers; each may be placed wherever the load is lowest when it is created.