The goal of this project is to automatically process video streams to estimate crowd density and detect abnormal events in a crowd to infer whether there is a threat that should be signaled to a human operator.
Crowd density is an important feature of a crowd, and different levels of crowd density call for different levels of attention. Polus et al. [1] define levels of service for pedestrian flow as free flow, restricted flow, dense flow and jammed flow, based on the number of pedestrians per unit area.
The literature on crowd density estimation presents two main approaches: counting by detection and counting by regression. In counting by detection [2][3], each pedestrian in the scene is detected individually using an object detector, and the detected objects are then tracked. Finally, the number of people is taken as the total number of tracks.
Most recent work addresses the counting problem within the counting-by-regression framework [4][5]. These methods aim to learn a direct mapping from global or local low-level features to the number of objects using supervised machine learning algorithms such as Support Vector Regression (SVR), Gaussian Process Regression (GPR) or Bayesian Poisson Regression. An alternative is pixel-wise density learning, in which the number of people in the region of interest (ROI) is computed by integrating over the crowd density map.
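The density-map alternative can be illustrated with a minimal sketch. It assumes a per-pixel density map has already been estimated by some learned model (not shown here); the function name `count_in_roi`, the array shapes and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def count_in_roi(density_map, roi_mask):
    """Integrate a per-pixel density map over a region of interest.

    density_map: 2-D array whose values sum to the number of people.
    roi_mask:    2-D boolean array of the same shape (True inside the ROI).
    """
    density_map = np.asarray(density_map, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    if density_map.shape != roi_mask.shape:
        raise ValueError("density_map and roi_mask must have the same shape")
    # On a discrete pixel grid, the integral reduces to a sum over ROI pixels.
    return float(density_map[roi_mask].sum())

# Toy density map (hypothetical values): two people in a 4x6 frame.
density = np.zeros((4, 6))
density[1, 1:3] = 0.5   # one person whose mass is spread over two pixels
density[2, 4] = 1.0     # a second person concentrated in one pixel

# Hypothetical ROI covering the left half of the frame.
roi = np.zeros((4, 6), dtype=bool)
roi[:, :3] = True

count = count_in_roi(density, roi)
```

Because the count is a sum, it can be restricted to any ROI without re-running the estimator, and fractional values are natural when a person straddles the ROI boundary.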
Although regression methods have achieved strong performance in recent years, the feature set or the kernel function usually depends on the size of the training set, and overfitting occurs when the feature dimension is high. Thus, learning a regression function remains a challenging problem.
To avoid these problems in crowd density estimation, we approach the task with image retrieval methods. The basic idea is that, given a large annotated dataset of pedestrians, for any test image or frame we can find several of the closest images in the training set and compute the number of people as the average of their annotated counts. Specifically, we first segment the local crowd region in each frame, and for each region we extract local low-level features such as motion or optical flow. We then use these features as the basis and apply sparse representation to train a well-fitted dictionary with the K-SVD method. At test time, for each given frame or image, we find the closest images in the training set, namely those with the minimum reconstruction cost. The final count is the average of the annotated counts of the images retrieved from the training set.
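The retrieval step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's implementation: instead of a dictionary learned with K-SVD, the normalized training feature vectors themselves serve as atoms, and Orthogonal Matching Pursuit (OMP) is used as the sparse coder. The names `omp` and `retrieval_count`, the parameter `k` and all toy values are hypothetical.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedily select k columns of D to fit x."""
    residual = x.astype(float).copy()
    support = []
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Re-fit the coefficients on all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    return support, residual

def retrieval_count(train_features, train_counts, test_feature, k=2):
    """Average the annotated counts of the k training images selected by OMP.

    Assumption: each row of train_features is the feature vector of one
    annotated training image, used directly as a dictionary atom.
    """
    F = np.asarray(train_features, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    D = (F / norms).T  # columns are unit-norm atoms
    support, _ = omp(D, np.asarray(test_feature, dtype=float), k)
    return float(np.mean(np.asarray(train_counts, dtype=float)[support]))

# Toy data (hypothetical): four training images with annotated counts.
train_features = np.diag([2.0, 1.0, 3.0, 0.5])
train_counts = [2, 4, 10, 20]
test_feature = np.array([0.01, 0.9, 0.5, 0.0])

estimate = retrieval_count(train_features, train_counts, test_feature, k=2)
```

In this toy case the test feature is best reconstructed from the second and third training images, so the estimate is the mean of their annotated counts. A learned dictionary would require an additional mapping from the selected atoms back to annotated training images.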
Anomaly detection, also referred to as outlier detection, aims to identify patterns in a given data set that do not conform to established normal behavior in the crowd.
For anomaly measurement, mainstream algorithms compare test samples with training events using a probability model [8][9]. A variety of statistical models have been used, including the Gaussian model, Gaussian Mixture Model, Hidden Markov Model, Markov Random Field (MRF) or spatio-temporal MRF, and Latent Dirichlet Allocation. For these conventional models, high-dimensional features are preferred because they represent events better, but the amount of training data required grows exponentially with the feature dimension, so it is unrealistic to have enough training data for estimation in practice. Thus, the main problem left unsolved by most state-of-the-art methods is how to represent an event using high-dimensional features.
In our work, we propose to use sparse representation, which can represent high-dimensional samples with less training data; this motivates us to detect abnormal events through sparse reconstruction from normal ones. Our work will mainly focus on how to select the optimal basis and choose a dictionary that fits the training set well. Finally, we aim to detect abnormal events through a sparse reconstruction over the normal bases.
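One way to picture detection by sparse reconstruction over normal bases is the sketch below. It assumes an L1-regularized least-squares cost, solved here with the iterative shrinkage-thresholding algorithm (ISTA); the cost form, the regularization weight `lam`, the decision threshold and the toy dictionary are all assumptions made for illustration and are not taken from the project.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_reconstruction_cost(D, x, lam=0.1, n_iter=200):
    """Approximate min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA; return the cost."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term's gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return float(0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a)))

# Hypothetical dictionary of normal bases: two atoms in a 3-D feature space.
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x_normal = np.array([1.0, 0.5, 0.0])    # lies in the span of the normal bases
x_abnormal = np.array([0.0, 0.0, 1.0])  # orthogonal to every normal basis

cost_normal = sparse_reconstruction_cost(D, x_normal)
cost_abnormal = sparse_reconstruction_cost(D, x_abnormal)

THRESHOLD = 0.3  # hypothetical; in practice it would be tuned on validation data
is_abnormal = cost_abnormal > THRESHOLD
```

A sample that the normal bases reconstruct well receives a low cost, while a sample outside their span cannot be reconstructed and receives a high cost, which is what flags it as abnormal.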
This project aims to measure crowd density accurately and to detect abnormal events in the crowd, with the hope of improving on state-of-the-art performance.
This work was carried out at the International Doctoral Innovation Centre (IDIC). The authors acknowledge the financial support from Ningbo Education Bureau, Ningbo Science and Technology Bureau, China's MOST, and the University of Nottingham. The work is also partially supported by EPSRC grant no EP/G037574/1.