Implicit Motion-Shape Model and Approaches to Visual-based Video Searching
Tuan Hue THI, Jian Zhang (NICTA-Sydney), Li Cheng (TTI-Chicago), Li Wang (SEU-China), and Shin'ichi Satoh (NII-Tokyo)

In this project we aim to develop a complete visual-based video search using its motion content. Traditional text-based video search engines rely solely on annotated text description of a video to carry out similarity search. Due to the fast growth of video domain, manual annotation are normally labor expensive and can also be errorprone. Our motivation is to develop a system that uses motion content inside each video to carry out matching task, then rank video similarity according to the returned matching values. We are particularly interested in two aspects of the matching algorithm, firstly, it should work for realistic cases with cluttered backgrounds, secondly, it should be generic enough to work on any kind of action. Taking these into account, we turn our focus to Invariant Local Features and Nonparametric Implicit Shape Model.
The general framework can be illustrated by the following figure

Basically, a video shot is decomposed into sparse set of invariant features, and with suitable techniques for detection/descriptor, we can eventually represent the video as a collection of feature vectors obtained from these local patches. These invariant local features will make the representation of video data more robust to illumination change and occlusion. Using Implicit Shape Model with Hough Transform as implementation, we construct a Hough Space from the query video. All video candidates will be projected on to this Model Hough Space and the Model Fitting Region is derived by searching for the projected region with highest density. Since this approach relies on no particular parameter of the action feature data, and works solely using motion shape of the action.
1. Local Feature Detection and Extraction
In our system, we use two different approaches for Local Feature Extraction and Selection, one with Sparse Bayesian Kernel Filter of Space-Time Interest Points and the other with Local Motion-Shape Feature. The output of these two techniques are the most representative local features of the action data, and will be fed into the Implicit Shape Model for matching value.
1.1. Space-Time Interest Point Approach
1.1.1. Local Features as Video Representation
A video shot in this perspective is a complex set of local features
under various configurations. Tackling action recognition
this way as we discussed earlier helps to lighten the
dependency on view and scale variance of action visual appearance.
We adopt the existing Space Time Interest Point
(STIP) detection technique from Laptev et al. [13] to detect
points with high motion change. In addition, by observing
that the certain regions around these detected points are also
contributive to the action context, we refine STIP detection
results with a post Inpainting procedure, which idea is similar
to Image Inpainting described in [4] by Criminisi et al. The
inpainting process starts on the boundary of connected STIP
point regions, base on the median scale and frequency of theseSTIP points to generate hypothesis about whether other points
in the neighborhood should be included. The following figure illustrates
the effect of our improved technique, inpainted Space Time
Interest Point (iSTIP), over the traditional STIP.
The surrounding areas of detected pixels are then described
using a concatenation of Histogram of Oriented
Gradients (HOG) and Histogram of Oriented Flow (HOF)
[13]. In order to better organize the interest points in terms of
their appearance, we use the agglomerative clustering scheme
from Agarwal and Roth [1] to group similar interest regions
based on pairwise Normalized Greyscale Correlation (NGC)
values, which helps to yield the highest similarity compactness
of pixel appearance regardless of the number of clusters
being produced. With this analysis, a video shot is now represented
as a sparse set of all interest points xi(i; c) having
cluster identity i and 3D coordinate c.

1.1.2. Sparse Bayesian Learning
Among all detected interest points from the video shots, there
are usually motion noise from the scattered background that
do not contribute to the action motion. In fact, those points
normally make the modeling computation much harder and
in some cases might completely distract the core parts of the
action. In order to filter out these irrelevant elements, we develop
an extended version of the Sparse Bayesian Kernel Machine
from Object Recognition work of Carbonetto et al. [3].
For each interest point xi(i;c) as notated in previous section,
there will be associated a class label y(-1;1). The idea
is to build a hierarchical Baysesian classifier model with parameters
learned from the limited amount of available training
data. Following [3], we adopt a sparse kernel machine for
classification purpose. Following Figure shows our successful adoption of this Sparse Bayesian
Machine for the task of feature labeling on Weizmann [6]
dataset, all red colored circles represent points noisy background
while green circles are those positively labeled as
parts of the action configuration, and denoted as action elements
in our system.

1.2. Motion-Shape Features for Video
Human action in our perspective is represented as a sparse
set of local Motion-Shape features. Those are the selective
patches detected at different scale from the Motion History
Image (MHI), each contains information about the motion
field and the shape of the actor, hence we loosely use the term
Motion-Shape to describe. By analyzing these local attributes
and their global configuration, we can conceptually articulate
the behavior of the formulated action. The following Figure describes our
feature extraction algorithm. From the query video (a), we
compute a collection of MHIs following [1]. For each MHI,
there will be detected the dominant motion region and its center
point (white rectangle and circle in Figure (b)). This motion
blob and its center are used as references for Motion-
Shape searching. In our design, we call those center points
Local Action Centers, as opposed to Global Action Center,
the mean position of all Local Action Centers in each video.

Sliding the window search at different scales in (c) will help to produce a collection of Motion-Shapes,
drawn as the thin colored circles. Each Motion-Shape m is
(a) Query Video: Action Model
will be built based on this video
(b) MHI: nominate action search
region with Local Action Center
(c) Motion-Shape: extracted as
m(x; y; t; c; omega); circle color represents
cluster identification
(d) Hough Space: represented as
Lookup Table, 3-tuple m(x; y; t)
is filled using index pair m(c; omega)
Fig. (d) Constructing Hough Space for Action Model
defined by its relative position (x; y; t) to the Action Centers.
Since each Motion-Shape by its nature is a motion field, we
easily compute the Histogram of Oriented Gradients (HOG),
and also the orientation ! of the most dominant gradient in
each patch. Using HOG values obtained from each patch,
we cluster the patches into different representative group,
each defined by a cluster identification number c. Figure (c)
shows the cluster id of Motion-Shape circles using different
colors. At this stage, a video shot is genuinely decomposed
into a set of Motion-Shapes m(x; y; t; c; omega), which will be
encoded into our model, described in the next section.
2. Implicit Motion-Shape Model
In order to produce a generic design for modeling the human
action, we extend the Generalized Hough Transform [2] technique
to detect the 3D action structure formed by different
distinctive Motion-Shapes. Using Global Action Center as the
reference point in our 3D structure, we build a Hough Space
to quantitatively represent the relative position of all Motion-Shapes in the action model, illustrated as the Lookup Table in
Figure (d). In that coordinate system, each Motion-Shape is
indexed by a key pair I = (c; omega) consisting of its cluster id c
and gradient orientation id omega. The 3-tuple entries (x; y; t) in
the Lookup Table are filled by those Motion-Shapes attributes
extracted from the model video based on index key value.

Providing this Hough Space, the action matching task now
becomes projecting the Motion-Shapes collected from video
candidate to this space, matching value will be calculated as
how well those projected points fit in the model. The above Figure
illustrates the main steps of the matching task, starting with
the MHI construction (Figure (a)) to calculate new Motion-Shapes in (b). Using the filled Model Hough Space (top left
of (b)), a particular Motion-Shape mo (circled in white) will
use its index key (c;omega) to find its corresponding entries in
the Lookup Table. In this example, at its located index, there
are 5 records about the relative position of the model reference
point (Model Action Center) and will all be considered
as possible locations for the new Action Center of this video.
The search for this new Action Center will help to determine
how well this video action matches the model action. Formally,
with each hypothesis about Action Center position
we can factor the marginalization probability using
Motion-Shape evidence and its location l = (x; y; t) based
on the pair index I = (c; omega), this factorization is adopted from
2D Object Recognition work of Leibe et al. [14]. The quantitative
matching score of an unknown motion compared to the action
model is then defined as the summation of all Motion-Shape
entities and Action Center locations.
In our design, we run mean-shift Parzen window density estimation
to find the Model Fitting Region, which is the projected region that has the highest density of voted Action
Centers (Figure (c)). In our model, we adopt two kinds of
reference points, Global Action Center and Averaging Local
Action Centers. While the former runs 3D volume density
searches using 3-tuple (x; y; t) location of a Global Action
Center, the latter runs 2D area searches using 2-tuple (x; y)
location of all Local Action Centers on multiple MHIs and the
average of all best matching will be used. We will discuss the
effects of these two referencing systems in next section. The search criteria for mean-shift is the ratio r of vote
counts on searching region. Matching score can then be interpreted
in different ways using this ratio. In our video search
system (described in the next section), we rank video relevancy
based on this ratio in ascending order, best match has
the highest r.
3. Experimental Results
We developed a complete video search system based on the
proposed action matching algorithm, Figure 4 shows a snapshot
of its user interface. On the left panel, the Query Video
is loaded and processed to generate the codebook dictionary
of Motion-Shapes (in this particular query, there are 289 clusters
obtained); also a Segmentation thumbnail is generated to
illustrate the relative position of the nominated Action Center.
The right panel shows the video from the database that have
high matches with the query video. There are three different
views for each match (Original, Motion-Shapes, or Segmentation).
The results are ordered according to the matching
score r and the relative matching correspondence is implied
by the yellow square Model Fitting Regions.

The following Figure lists the most representative action elements
obtained from 16 actions of both datasets.

A detailed snapshot of the actual classification process is also shown in the following Figure
In order to attain a thorough evaluation of our technique,
we use this Video Search system to run the tests on the two
datasets KTH [16] (2400 video shots of 6 actions) and Weizmann
[6] (92 video shots of 10 actions). The common benchmark
train/test amount for these two datasets is 2/3 Split on
KTH, 16 persons for training and 9 persons for testing, and
leave-one-out onWeizmann, 8 persons for training and 1 person
for testing. Since we are carrying out searching task, we
only need one training sample per query. Therefore, in order
to produce fair comparison with reported works, we do a
random selection (with equal samples of each action) of the
search queries, and run the search on the same testing amount.
In our evaluation process, we are also interested in understanding
the accuracy-time tradeoff relationship and view
change sensitivity. We conduct three independent test runs
based on three principal methods, namely, Global.Mirrored,
Local.Mirrored, and Local.NonMirrored. The first term indicates
whether the Global or average of Local Action Centers
is used as reference point, the second term specifies if the search is mirrored in left-right direction, that is, each video
candidate will be flipped horizontally to generate a mirror of
itself, the test is then done on both instances, and the maximum
matching score is used. Empirical results show that
on average, the Global search requires double the processing
time of Local search, while the Mirrored algorithm is 1.5
times slower than NonMirrored. Figure 3 summarizes the performance
of three techniques on KTH and Weizmann using
the Receiver Operating Characteristic (ROC) curve.

Tests on Weizmann dataset generally return better results
than KTH, which is quite reasonable since the backgrounds
in Weizmann are static, while the cameras in KTH are not
stable. It is also shown that using Global Action Center as
a reference for 3D point cloud search does yield better result
than averaging individual 2D Local Action Centers. The performance
decline in two NonMirrored techniques does imply
that our technique is sensitive to view change, which is expected
since we rely on the motion field shape to analyze action
behavior, this drawback can be overcome with Mirrored
method, sacrificing an overhead portion of processing time.
The black straight line in the above Figure is called Cut-off Line
which connects the two ends of True Positive Rate [0; 1] and
True Negative Rate [1; 0], intersections of this line with ROC
curves are called Cut-off points (represented as black circles),
indicating the position where Sensitivity is equal to Specificity,
and we use those points to analyze the average performance
of our system specific for each action and as compared
to other reported works on KTH and Weizmann. The performance
of Global.Mirrored are elaborated as per action at Cutoff
points using the Confusion Matrix with normalized sum on
each row and column, as shown in the following Figure.

Using the average accuracy, we also conduct a comparison
survey between our methods and reported state-of-the-art
action recognition approaches, as shown in Table 9. Interestingly,
while other works seem to work well for either one
of the datasets, our system performs equally well, having the
second best classification score on both KTH and Weizmann.

References
[1] S. Agarwal and D. Roth. Learning a sparse representation for
object detection. In ECCV ’02: Proceedings of the 7th European
Conference on Computer Vision-Part IV, pages 113–130,
London, UK, 2002. Springer-Verlag.
[2] G. Bradski and J. Davis. Motion segmentation and pose recognition
with motion history gradients. MVA, 13:174–184, 2002.
[3] P. Carbonetto, G. Dorko, C. Schmid, H. Kuck, and N. de Freitas.
Learning to recognize objects with little supervision. Intl.
Journal of Computer Vision, 77(1-3):219–237, May 2008.
[4] A. Criminisi, P. Prez, and K. Toyama. Region filling and object
removal by exemplar-based image inpainting. IEEE Transactions
on Image Processing, 13:1200–1212, 2004.
[5] A. Fathi and G. Mori. Action recognition by learning midlevel
motion features. pages 1–8, 2008.
[6] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri.
Actions as space-time shapes. Transactions on Pattern Analysis
and Machine Intelligence, 29(12):2247–2253,
2007.
[7] M. Grundmann, F. Meier, and I. Essa. 3d shape context and
distance transform for action recognition. In ICPR, 2008.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically
inspired system for action recognition. pages 1–8, 2007.
[9] H. Jiang, M. Drew, and Z. Li. Successive convex matching for
action detection. In Proc. Conf. Computer Vision and Pattern
Recognition, pages II: 1646–1653, 2006.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event
detection using volumetric features. In ICCV, 2005.
[11] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in
crowded videos. In Proc. Intl. Conf. Computer Vision, pages
1–8, 2007.
[12] H. Kuck, P. Carbonetto, and N. de Freitas. A constrained semisupervised
learning approach to data association. In Proc. European
Conf. Computer Vision, pages Vol III: 1–12, 2004.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning
realistic human actions from movies. In Proc. Conf. Computer
Vision and Pattern Recognition, 2008.
[14] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization
and segmentation with an implicit shape model. In ECCV workshop on statistical learning in computer vision,
pages 17–32, 2004.
[15] J. Niebles, H. Wang, and F. Li. Unsupervised learning of human
action categories using spatial-temporal words. Intl. Journal
of Computer Vision, 79(3), September 2008.
[16] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human
actions: a local svm approach. In Proc. Intl. Conf. Pattern
Recognition, pages III: 32–36, 2004.
[17] J. Sivic and A. Zisserman. Video Google: A text retrieval
approach to object matching in videos. In ICCV, 2003.
[18] Y.Wang and G. Mori. Human action recognition by semilatent
topic models. PAMI, 31(10), 2009.
|