"Image
Retrieval and Classification using Local Distance Functions" - A. Frome,
Y. Singer, and J. Malik
The paper introduces a framework for learning the distance or similarity function between a training image and other images in the training set for visual recognition. The distance functions are built on top of elementary distance measures between patch-based features, where the authors use two scales of geometric blur features and a color feature. The distance functions are used for image browsing, retrieval, and classification.
The main contribution of the paper is incorporating the function for measuring similarity or distance between two different images into the machine learning process. The authors aim to demonstrate that the relative importance of visual features at a finer scale can be useful for visual categorization, and do so by implementing the distance functions. Their objective is to learn the weights of the features for each training image, which yields a quantitative measure of the relative importance of parts in an image. This allows the authors to combine and select features of different types.
The primary strength of the paper is its breakdown of the overall approach into detailed segments, so that the reader understands the goal and purpose of each subsection. The authors also explain how their approach of using "triplets" of images for learning is based on the distance metric learning proposed by Schultz and Joachims, and show that the algorithm by Schultz and Joachims is more widely applicable than originally presented.
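To make the triplet formulation concrete, here is a minimal sketch (not the authors' code) of the hinge loss that such a per-focal-image learner might minimize, assuming the elementary patch distances have already been computed; names and shapes are illustrative.

    import numpy as np

    def triplet_hinge_loss(w, d_near, d_far):
        """Hinge loss over triplets (focal, near, far) for one focal image.

        w      : non-negative weight vector over the focal image's patch features, shape (P,)
        d_near : (T, P) elementary distances from focal patches to each "near" (same-class) image
        d_far  : (T, P) elementary distances from focal patches to each "far" (other-class) image

        Each triplet asks that the weighted distance to the far image exceed
        the weighted distance to the near image by a margin of 1.
        """
        violation = 1.0 - (d_far @ w - d_near @ w)
        return np.maximum(0.0, violation).sum()

The paper's full maximum-margin formulation also regularizes the weights and keeps them non-negative; those details are omitted here.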
Another strength is the explanation of specific settings in which the algorithm can be used: image retrieval, browsing for an image, and classifying a query image.
The authors use the Caltech101 dataset, which contains images from 101 different categories with a different number of images in each category, and which is one of the standard benchmarks for multi-class image categorization and object recognition. However, the authors ignored the background class, which, if included in the experiments, might have produced poorer classification results.
The authors should experiment with different choices of "K" when determining the closest in-class and out-of-class images used to form "triplets" for training; a sketch of this selection step follows.
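As a rough illustration of what varying K would mean, the sketch below forms triplets from the K closest in-class and the K closest out-of-class images under some initial distance to the focal image; this is an assumed selection procedure for illustration, not necessarily the authors' exact one.

    import numpy as np

    def make_triplets(focal, labels, dist, K=5):
        """Form (focal, near, far) index triplets for one focal image.

        labels : class labels of all training images, shape (N,)
        dist   : initial distances from the focal image to every training image, shape (N,)
        K      : number of closest in-class and out-of-class images to pair up
        """
        same = np.where(labels == labels[focal])[0]
        same = same[same != focal]
        diff = np.where(labels != labels[focal])[0]
        near = same[np.argsort(dist[same])][:K]   # K closest same-class images
        far = diff[np.argsort(dist[diff])][:K]    # K closest other-class images
        return [(focal, n, f) for n in near for f in far]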
The authors have also explored how the features perform separately and in different combinations, which can be used to determine the best features to use for classification. Different combinations may perform better for certain image classes, which would be an area for further exploration.
The work can be extended to reduce the computation time. The unoptimized implementation of the algorithm takes about 5 minutes per test image. Since the authors use an exact nearest-neighbor computation, an approximate nearest-neighbor algorithm could be used to speed up the process.
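For example, one could replace the exact nearest-neighbor search over patch features with an approximate one. The sketch below uses SciPy's KD-tree with a nonzero eps as a stand-in for whatever approximate scheme might be chosen; it is illustrative, not the authors' implementation.

    from scipy.spatial import cKDTree

    def patch_nn_distances(focal_feats, query_feats, eps=0.5):
        """Approximate L2 nearest-neighbor distance from each focal patch feature
        (rows of focal_feats) to the query image's patch features.
        eps > 0 trades accuracy for speed in the KD-tree search."""
        tree = cKDTree(query_feats)
        dists, _ = tree.query(focal_feats, k=1, eps=eps)
        return dists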
Review of
A. Frome, Y. Singer, and J. Malik. Image Retrieval and Classification Using Local Distance Functions. NIPS 2006.
___________________________________________________________________________
A key step in many algorithms for object recognition is the computation of the distance between a new image and a set of training images in order to determine the class of the new image. This paper presents an approach to automatically learn such distance functions from training images.
While there have been other approaches to automatically learning the distance function, the main contribution of this paper is to learn a distance function for _each_ training image. By doing so, the authors are able to produce recognition rates that are at least equal to those of current state-of-the-art approaches on the Caltech101 dataset.
Each image consists of patches, each of which is described by a feature vector. The distance from a patch in a focal image to a query image is found by computing the L2 distance to the nearest neighbor in the query image. The distance from the focal image to the query image is defined as a weighted sum of these patch distances. The goal is to learn the weights for each training image. This is done by maximizing the difference between the distances to images labeled similar and dissimilar to the focal image (that is, by a maximum-margin formulation).
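In code form, the focal-to-query distance described above might look like the following sketch (the feature extraction and the learned weights are assumed to be given; this is an illustration, not the authors' implementation):

    import numpy as np

    def focal_to_query_distance(focal_feats, query_feats, w):
        """Weighted sum of per-patch distances from a focal image to a query image.

        focal_feats : (P, D) patch feature vectors of the focal image
        query_feats : (Q, D) patch feature vectors of the query image
        w           : (P,) learned non-negative weights of the focal image
        """
        # L2 distance from every focal patch to its nearest query patch
        diffs = focal_feats[:, None, :] - query_feats[None, :, :]
        nn_dist = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
        return float(w @ nn_dist)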
A query image is assigned a ranking with respect to each training image based on the distance functions. To identify the class of a query image, a binary classifier is learnt for each training image, and voting is then used to generate the probability of the query image belonging to a particular class.
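A rough sketch of the voting step, assuming each per-exemplar binary classifier exposes a probability that the query belongs with its focal image (the prob interface and the normalization are assumptions made for illustration):

    from collections import defaultdict

    def classify_by_voting(query, exemplars):
        """exemplars: list of (class_label, classifier) pairs, one per training image.
        Each classifier is assumed to expose prob(query) in [0, 1];
        votes are accumulated per class and normalized into class probabilities."""
        votes = defaultdict(float)
        for label, clf in exemplars:
            votes[label] += clf.prob(query)
        total = sum(votes.values()) or 1.0
        return {label: v / total for label, v in votes.items()}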
The paper is clearly written and easy to understand. Some advantages and
limitations of the approach are:
(1) One significant advantage is that the approach does not require each image to be defined by a fixed-length feature vector: this allows the use of local features such as those obtained by interest point detectors rather than a global feature description. This means that one can compare a "simple" image with very few interest points to a complex cluttered scene with hundreds of interest points.
(2) Another advantage is that many of the learnt weights are zero, reducing the number of feature comparisons between a query image and a training image.
(3) An important feature of this method is that a distance function is learnt for each training image:
    (a) This may provide greater discriminative power.
    (b) The distance functions are not directly comparable since they are not in a normalized form. Thus, a binary classifier is trained for each training image rather than for each class. This could be a big disadvantage as the number of classes, and hence the total number of training images, grows. It also seems to generate redundant information. A human being can identify the relative "distance" between all images, creating a ranking in a global reference frame. It should be possible for a machine to do the same, rather than producing a relative ranking in each training image's reference frame. This is worth exploring further.
-------------------
"Unsupervised
Learning of Models for Recognition" - M. Weber, M. Welling, P.
Perona
The authors present a method to learn object class models, defined as a collection of objects which share characteristic features that are visually similar and occur in similar spatial configurations, from unlabeled and unsegmented cluttered scenes. Their algorithm automatically selects distinctive parts of the object class and learns the joint probability density function encoding the object's appearance. They show that the automatically constructed object detector is robust to clutter and occlusion, and demonstrate the algorithm on frontal views of faces and rear views of cars.
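To give a flavor of the generative side of such a model, the sketch below scores one hypothesis (an assignment of detected candidate locations to model parts) under a learned Gaussian shape density; the appearance, occlusion, and clutter terms of the full model are omitted, and the mean/covariance are assumed to be already learned.

    import numpy as np
    from itertools import product
    from scipy.stats import multivariate_normal

    def shape_log_likelihood(hypothesis_xy, mean, cov):
        """Log-likelihood of one hypothesis under a Gaussian shape model.

        hypothesis_xy : (P, 2) candidate (x, y) locations assigned to the P model parts
        mean, cov     : learned mean (2P,) and covariance (2P, 2P) of the part positions
        """
        return multivariate_normal.logpdf(hypothesis_xy.ravel(), mean=mean, cov=cov)

    def best_hypothesis(candidates_per_part, mean, cov):
        """Exhaustively score every way of picking one candidate per part
        (tractable only for small examples; the paper also handles missing parts)."""
        best_ll, best_combo = -np.inf, None
        for combo in product(*candidates_per_part):
            ll = shape_log_likelihood(np.array(combo), mean, cov)
            if ll > best_ll:
                best_ll, best_combo = ll, combo
        return best_ll, best_combo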
The main contribution of the paper is the demonstration that it is feasible to learn object models directly from unsegmented cluttered images. By automating the running of part detectors over the image and the formation of likely object hypotheses, the authors have extended the algorithm presented by Burl et al., which only estimated the joint probability density function automatically. Although there are many areas for improvement in their algorithm, the authors have shown that unsupervised learning of models for recognition is possible and can be done efficiently.
The paper's primary strengths are its organization and detailed explanation of the implementation. The authors present the general problem at hand, related work in the area, their approach to the problem, details of the model, and experimental results. Each section transitions to the next with purpose, so that the reader is able to follow the intuition the authors had when building their algorithm.
An area for improvement would be direct comparisons to other algorithms, for example a table or graph comparing computation times and detection/recognition performance. To assess the validity of an unsupervised learning algorithm as opposed to a human-supervised one, a comparison with a detector of the same implementation but trained with human supervision could have been presented.
The experiments are convincing. The authors allowed their model to classify the images without any intervention. However, further tests could have been made to improve the overall assessment of their algorithm. More training and test images of both the face and car classes could have been used to observe changes in the detector's performance. More parts could have been learned by the model (more than 5 parts for both the face and car data sets), again to observe changes in the detector's performance. Of course, both of these tests would have drastically increased the computation times, but by treating training as an off-line process, further experimentation would have been feasible.
The work can be extended by incorporating different approaches into the detection algorithm, such as multiscale image processing, multiorientation-multiresolution filters, neural networks, etc. The scale and orientation of the image patch, along with parameters describing the appearance and likelihood of the patch, should be incorporated into the algorithm in addition to the current use of the candidate part's location. Optimizing the interest operator and the unsupervised clustering of parts is another possible extension. A further extension would be to build a model that is invariant to translation, rotation, and scale, which would enable learning and recognition of images with much larger viewpoint variations.
-----------------------------
Generic Visual Categorization (GVC) is the problem of identifying objects of multiple classes in images. This paper presents a method for GVC that extends the bag-of-keypoints approach of Csurka et al. (2004).
The main contribution of the paper is a fast method for GVC (computationally the fastest so far) that uses two types of vocabularies to represent objects: a universal vocabulary and a class-specific adaptation of the universal vocabulary. Instead of building one single vocabulary by aggregating class vocabularies (of size C x N, where C is the number of categories and N is the class vocabulary size), the authors use a universal vocabulary together with a class-adapted vocabulary (of size 2 x N), thus reducing the computational complexity.
The paper uses a Gaussian Mixture Model to represent a visual vocabulary. The approach consists of two main steps. In the first step, the parameters of a universal vocabulary are learnt from a training set of images of all categories using Maximum Likelihood Estimation. The vocabulary parameters are then adapted to a specific class, using images from that class and the MAP criterion, to obtain an adapted vocabulary. The two vocabularies are then combined. In the second step, m linear SVMs, one per class, are learnt using the above vocabularies and the training images.
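A minimal sketch of the mean-adaptation step, using the standard MAP update for GMM means; the relevance factor r, the scikit-learn mixture object, and the restriction to means only are assumptions for illustration rather than details taken from the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def map_adapt_means(universal: GaussianMixture, class_feats, r=16.0):
        """MAP-adapt the means of a universal GMM vocabulary to one class.

        universal   : a fitted sklearn GaussianMixture (the universal vocabulary)
        class_feats : (T, D) low-level descriptors from images of that class
        r           : relevance factor balancing the prior means against class data
        """
        resp = universal.predict_proba(class_feats)            # (T, K) responsibilities
        n_k = resp.sum(axis=0)                                 # soft counts per Gaussian
        x_bar = (resp.T @ class_feats) / np.maximum(n_k, 1e-12)[:, None]
        alpha = (n_k / (n_k + r))[:, None]
        return alpha * x_bar + (1.0 - alpha) * universal.means_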
The experiments have been carried out on three large datasets: an in-house dataset (8 scenes with multiple objects), LAVA7 (7 categories), and Wang (10 categories). The results show an accuracy of 95.8% in 150 ms on LAVA7, which is the highest accuracy and the fastest runtime reported so far on this database.
Limitations and some comments:
(1) The results show that the approach correctly categorizes the image. Was there only one object of the category set in an image? Did any image have objects of more than one category? If so, what percentage of the image area was occupied by the "main" category? Did the method work for occluded objects?
(2) I am a little wary of the speed comparison results because they depend a lot on how well the code is written and optimized, the programming language used, the machine used, etc. The speed comparison does not make sense unless there is some way of standardizing these factors.
(3) The experimental results for accuracy are convincing. The experiments on the in-house database are great because the test images were collected independently by a third party.
(4) The good results were obtained even with color features, not just SIFT features.
(5) Perhaps reducing the size of the SIFT vectors from 128 to 50 dimensions using PCA had some impact on the speed? A small sketch of that reduction follows.
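If one wanted to isolate that effect, the projection itself is straightforward; the sketch below is illustrative (the descriptors here are random stand-ins, and the 50-component choice mirrors the paper's reported reduction).

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for real SIFT descriptors: N vectors of dimension 128.
    sift_descriptors = np.random.rand(1000, 128)

    # Project to 50 dimensions; the categorization pipeline could then be timed
    # with and without this projection to isolate its effect on speed.
    pca = PCA(n_components=50).fit(sift_descriptors)
    reduced = pca.transform(sift_descriptors)   # shape (1000, 50)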