Abstract:
With statistics showing a recent worldwide rise in public violence, automated violence detection in surveillance footage has become a matter of high importance. This work introduces an end-to-end trainable 3D Convolutional Neural Network (3D CNN) for detecting violence in video footage. The proposed network inherently processes both spatial and temporal information, obviating the need for additional models that would increase computational requirements and complexity. This work makes two main contributions: 1) a lightweight 3D CNN suitable for inference on edge devices such as mobile systems, and 2) a comprehensive explanation of all components comprising a CNN model, thereby enhancing model interpretability. Experiments were conducted to assess the performance of the proposed model on a consolidated dataset combining four benchmark datasets. The experimental results, which are discussed in detail, support both contributions.