STF: Spatio-Temporal Fusion Module for Improving Video Object Detection

Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feature maps between nearby frames to obtain more robust object representations. Second, we introduce a dual-frame fusion module that merges feature maps in a learnable manner to improve them. Our evaluation is conducted on three different benchmarks including video sequences of moving road users. The performed experiments demonstrate that the proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors. Code is available at https://github.com/noreenanwar/STF-module


I. INTRODUCTION
Computer vision has made remarkable progress in single-frame object detection, that is, in localizing and identifying objects in a single image ([1], [2]). However, relying solely on a single frame is not always sufficient, as argued in several recent works ([3], [4], [5], [6], [7], [8]). Single-frame object detectors are prone to errors when objects are poorly visible because of occlusions, motion blur, or small object sizes. When objects are occluded or blurred by motion, their appearance features can be severely altered. The detector should also be robust to the wide range of scales and sizes at which objects appear. Furthermore, small objects have less distinctive features, making them harder to detect.
To address these problems, as in previous work ([3], [4], [5], [6], [7], [8], [9], [10]), we propose to use multiple frames to obtain better feature representations. Since we are interested in processing videos for road safety analysis, our application is well suited to object detection from multiple frames. Using several sequential frames has the significant benefit of providing complementary temporal information about a given instance, generally observed over a short time. Some existing multiple-frame object detection methods exploit this temporal information by first applying a single-frame object detector and then integrating the resulting bounding boxes across frames using an off-the-shelf motion estimation method ([9], [10]). Their performance improvement relies on heuristic post-processing, and those methods do not combine features from several frames to compensate for poor feature quality in some frames.
Another solution for multiple-frame detection is to fuse features from several frames in a learnable manner for better feature alignment ([3], [6], [4], [5], [7]). Using multiple frames is not trivial, however, because the features of consecutive frames are not always aligned, and the visibility of an object can change between frames (e.g., different parts of an object may be occluded). As a result, there is no trivial way to determine which features are more important for detection. Therefore, feature fusion has to be done carefully.
Global contextual attention captures long-range dependencies and relationships between different regions of an image in order to understand both specific parts and their context. We therefore propose a global contextual attention model for feature selection and fusion from a pair of frames. We present an end-to-end framework that learns multiple-frame information and fuses it without prior knowledge of motion or temporal relations. We aim to improve detection accuracy by effectively using temporal and spatial information from two frames, the current frame and a past frame. As mentioned above, the features corresponding to the same object instance in two frames often lack spatial alignment because of movements or occlusions. To take this into account, we introduce multi-frame (temporal) and single-frame (spatial) attention-based modules. In addition, to handle small objects, we use the multi-resolution feature layers of our backbone, and our attention modules operate on those layers.
Our proposed approach, STF (Spatio-Temporal Fusion), is based on per-frame feature learning through temporal and spatial fusion of features from the current frame and a past frame. To achieve this, we propose two new attention-based modules: the first applies multi-frame attention, while the second applies single-frame attention. We hypothesize that global contextual information combined with spatio-temporal information can address the detection problems better than previous works, which are limited to single-frame methods and to multi-frame methods that fuse feature maps without attention ([3], [4], [11]). Our dual-frame fusion module then fuses the learned features from the past and current frames to improve detection accuracy under challenging conditions such as occlusion or motion blur. We evaluate our method on three popular traffic-related datasets, KITTI MOT [12], Cityscapes [13], and UAVDT [14], and obtain competitive results compared to SOTA detectors.
Our main contribution is an end-to-end learnable fusion module that combines the current frame and a past frame by using their temporal, spatial, and channel feature information. Our specific contributions are:
• A multi-frame attention (MFA) module with temporal convolutions, used after the backbone feature extractor, that efficiently exploits the feature maps of two frames and enhances features for detecting occluded or blurred objects;
• A single-frame attention (SFA) module that weights the current frame feature maps in the channel and spatial dimensions to reduce false positive detections;
• An efficient dual-frame fusion module that integrates single-frame and multi-frame feature maps at different scales.

II. RELATED WORK
The use of multiple frames in object detection has been studied in several previous works because it facilitates the association of detected objects, thereby improving the precision and robustness of detection. It consists of detecting objects in frames using both their spatial and temporal features. Video-based object detection has nevertheless received comparatively less attention than single-frame detection, even though its applications are numerous and impactful, including video surveillance for security, robot navigation, and autonomous driving. Sequential frames carry significant complementary information about the same instances, which are generally observed in multiple frames over a short period of time. Existing multiple-frame object detection methods, such as those proposed by Kang et al. [9] and Lee et al. [10], capture this type of temporal information. These methods first apply single-frame object detectors and then integrate bounding boxes across frames using off-the-shelf motion estimation, which may compromise detection quality because of hand-crafted rules. Their performance improvement depends on heuristic, box-level post-processing without end-to-end training.
Zhu et al. [3] introduced flow-guided feature aggregation, where optical flow warping is used to integrate feature maps from temporally adjacent frames in order to increase detection accuracy. Another approach, proposed in [8], calculates the offsets between temporally adjacent frames. These offsets enable the sharing of features from adjacent frames, which improves detection. Similarly, FFAVOD (Feature Fusion Architecture for Video Object Detection) [6] shares feature maps between nearby frames. FFAVOD proposes a feature fusion module that learns to merge feature maps to improve video-based object detection and classification. RN-VID [15] uses information from nearby frames and merges feature maps of similar dimensions using 1 × 1 convolutions and a re-ordering of channels to enhance detection. Zhou et al. [4] presented CenterTrack, a method that uses a point-based framework to perform simultaneous detection and tracking of objects. This method takes two frames and a prior heatmap, concatenated, as input and associates objects through time while performing detection from the two frames. Previous research also explores using both motion and appearance cues of objects in a video sequence with models such as Recurrent Neural Networks (RNNs). Using an RNN, the Spatio-Temporal Memory Module (STMM) [16] introduced a concatenated spatial-temporal memory across consecutive frames to improve detection. Additionally, Long Short-Term Memories (LSTMs) have been employed to interpolate feature maps, resulting in a notable improvement in inference speed [5]. The Recurrent Multi-frame Single Shot Detector (MF-SSD) [7] combines features extracted from multiple consecutive frames by integrating a recurrent convolutional module, which enables the integration of characteristics that extend across multiple frames.
The above-mentioned works mainly either concatenate or simply sum feature maps rather than fusing them in a fully learnable way. Unlike these methods, our approach trains a learnable fusion module that uses temporal, spatial, and channel feature information in a completely end-to-end manner, from the current frame and a past frame.

III. METHODOLOGY

A. Overview
An overview of our attention-based framework, STF, is shown in Figure 1. Given a pair of frames, a pre-trained HRNet [17], in which we freeze the first and third layers, is used to extract features. The features then go through two attention modules: 1) a multi-frame attention (MFA) module that uses the two extracted feature maps to perform temporal and spatial attention, assigning adaptive temporal weights to them, and 2) a single-frame attention (SFA) module that uses spatial and channel attention to improve the current frame feature maps. The idea behind using the temporally prior frame is to combine, in a learnable manner, the extracted features of the past and current frames for object detection. To combine features from the two frames after applying attention, our network fuses temporal, channel, and spatial information by aggregating them simultaneously with our dual-frame fusion module. In the following, we introduce these modules in detail.

B. Multi-Frame Attention (MFA) Module
Given an input video, the multi-scale feature maps of two frames (the current frame and a past frame) are extracted with the HRNet backbone. Our goal is then to merge the features of these two frames. The Tada Convolution, introduced by Huang et al. [18], efficiently addresses temporal modeling by relaxing the temporal invariance of 2D convolutions. This is achieved with adaptive temporal weights that are superimposed onto the convolution. Similarly, Cao et al. proposed TCTrack [19], which applies Tada Convolutions and their adaptive temporal weights to improve temporal modeling in object tracking. Inspired by this previous research, we designed a Multi-Frame Attention (MFA) module (see Figure 2) to obtain adaptive temporal weights for each frame. The key idea is to adjust the model behavior in real time as it processes each sequence of frames. This helps to deal with size variations, movement, overlapping, or interaction of objects across frames.
Global information in object detection refers to semantic details that are consistent across frames and help to identify objects based on shared characteristics. Local temporal information is gathered from nearby frames, for example motion, and helps to localize objects, especially when their existence in a specific frame is uncertain. This module improves the representation ability with multi-frame features by: 1) assigning adaptable weights to each frame to enhance the ability to detect and analyze changes over time, 2) combining both global and local information from multiple frames, and 3) better capturing both detail and broader spatial and temporal information using a multi-scale integrator.
Our MFA module works as follows. Let us assume that we have an input sequence of frames $I_n$ and that the HRNet backbone outputs a sequence of features $X_n \in \mathbb{R}^{B \times C \times T \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, $T$ is the temporal dimension, and $H$ and $W$ are the spatial dimensions. To capture the global spatial context, we start with global average pooling (GAP) across the spatial dimensions of the past and current frame features. We then obtain the frame descriptor $S_n = \mathrm{GAP}_S(X_n)$, which encompasses the global spatial context. To integrate the local temporal context effectively, global average pooling across both the spatial and temporal dimensions is applied to obtain a spatio-temporal descriptor. The global spatial context and the local temporal information are then aggregated, and this combined information is passed through a bottleneck block (BNB). The output of the bottleneck block gives the local weights $\omega_t$, as illustrated in Figure 2; these weights thus combine the spatial and spatio-temporal descriptors. The total weights used in our model are the element-wise product of the weights $\omega_t$ and the weights $W_p$, the initial set of convolution kernel weights shared across all frames. Note that the local weights $\omega_t$ are set to 0 during initialization, which has the advantage of reducing the training time. An adaptive convolution with the total weights $\omega_t \odot W_p$, where $\odot$ denotes element-wise multiplication, is then applied to the current frame.
To effectively integrate spatio-temporal information and address the limitations of the spatial features of a given frame, we finally apply a multi-scale integrator, as shown in Figure 2. It takes as input $X_o$, the output of the adaptive convolution, and uses two operators, $\lambda$ and $\gamma$, which represent distinct normalization functions, together with an average pooling (AP) layer. The goal of the AP layer is to enlarge the receptive field in order to capture a wider range of spatial contexts.
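The weight calibration described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-layer bottleneck (`w1`, `w2`), the reduction to a 1 × 1 kernel, and the `1 +` offset (borrowed from the temporally adaptive convolution of [18] so that a zero-initialized bottleneck leaves the shared kernel unchanged) are all assumptions. The batch dimension and the multi-scale integrator are omitted.

```python
import numpy as np

def mfa_adaptive_weights(x, w1, w2):
    """Per-frame calibration weights from pooled descriptors.

    x  : (C, T, H, W) features of one clip (batch dimension omitted).
    w1 : (C_r, C) first bottleneck layer (hypothetical parameter).
    w2 : (C, C_r) second bottleneck layer (hypothetical parameter).
    Returns omega of shape (T, C), one weight vector per frame.
    """
    s = x.mean(axis=(2, 3))             # spatial GAP, one descriptor per frame: (C, T)
    g = x.mean(axis=(1, 2, 3))          # spatio-temporal GAP: (C,)
    z = s + g[:, None]                  # aggregate both descriptors: (C, T)
    hidden = np.maximum(w1 @ z, 0.0)    # bottleneck with ReLU: (C_r, T)
    omega = 1.0 + w2 @ hidden           # assumed offset: identity when w2 == 0
    return omega.T

def adaptive_conv1x1(x_t, w_p, omega_t):
    """Apply the calibrated 1x1 kernel (omega_t ⊙ W_p) to one frame.

    x_t     : (C_in, H, W) features of the current frame.
    w_p     : (C_out, C_in) kernel shared across frames.
    omega_t : (C_in,) calibration weights of that frame.
    """
    w_total = w_p * omega_t[None, :]    # element-wise product with the shared kernel
    return np.einsum("oc,chw->ohw", w_total, x_t)
```

With `w2` set to zero, `omega` is all ones and the adaptive convolution reduces to the ordinary shared-kernel convolution, which is one way to read the zero initialization mentioned in the text.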

C. Single-Frame Attention (SFA) Module
Besides temporal attention, attention in the spatial and channel dimensions can also enhance the feature maps derived from single frames. In Convolutional Neural Networks (CNNs), the attention mechanism assigns an additional weight to individual pixels in a specific dimension, indicating the significance of particular information. These learned weights strengthen valuable features and weaken less useful ones, facilitating feature screening and enhancement. Furthermore, in videos with generally stable backgrounds, spatial and channel attention, as in the methodology proposed by Hou et al. [20], can efficiently suppress false positive detections in the background area.
Inspired by this work, we propose a Single-Frame Attention (SFA) module that uses channel and spatial attention mechanisms, as illustrated in Figure 3. The SFA module aims to refine the feature representation within a single frame. In the SFA module, each frame, denoted $I_n$, is processed to enhance the channel and spatial information of its feature maps $X_n$. First, for channel attention, average pooling (AP) and max pooling (MP) are applied to condense the spatial information. To help our model learn complex feature representations, we integrate a 1 × 1 convolutional layer, as shown in Figure 3. This results in the channel attention $A_c$. For spatial attention, a comparable approach is applied, but it operates within the spatial domain. Here, the features weighted by channel attention are subjected to average pooling (AP) and max pooling (MP) operations that focus on spatial features. The resulting features are then concatenated and processed through a 5 × 5 convolutional layer to enhance the spatial aspects of the frame. This gives the spatial attention $A_s$, computed with $\mathrm{Conv}_{5 \times 5}$, the convolution operation ($*$) using a 5 × 5 filter.
Finally, the two attention tensors are concatenated with $X_n$ to obtain the new features $X_s$. This fusion process allows the model to focus on the relevant information captured by the attention mechanisms, enhancing the representation of the feature maps. We observed that, by using convolutional layers, the module can more effectively capture and enhance the intricate patterns in the features. This ensures that the model captures the spatial features of each frame well.
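As an illustration of the two attention steps, the following NumPy sketch computes a channel attention vector and a spatial attention map for one frame. It is only a sketch under assumptions: the sigmoid gating, the 1 × 1 convolution shared between the two pooled vectors, and all function and parameter names are hypothetical choices in the spirit of common channel/spatial attention designs, not the formulation used in this paper. The final combination of the attention tensors with $X_n$ is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_1x1):
    """x: (C, H, W); w_1x1: (C, C) weights of a 1x1 convolution (assumed shared)."""
    avg = x.mean(axis=(1, 2))            # average pooling over space: (C,)
    mx = x.max(axis=(1, 2))              # max pooling over space: (C,)
    return sigmoid(w_1x1 @ avg + w_1x1 @ mx)

def conv2d_same(x, k):
    """Zero-padded cross-correlation. x: (C, H, W); k: (C, kh, kw). Returns (H, W)."""
    _, h, w = x.shape
    kh, kw = k.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((h, w))
    for i in range(kh):
        for j in range(kw):
            # accumulate the contribution of kernel tap (i, j) over all channels
            out += (k[:, i, j, None, None] * xp[:, i:i + h, j:j + w]).sum(axis=0)
    return out

def spatial_attention(x, a_c, k_5x5):
    """x: (C, H, W); a_c: (C,) channel attention; k_5x5: (2, 5, 5) kernel."""
    xc = x * a_c[:, None, None]                           # channel-weighted features
    pooled = np.stack([xc.mean(axis=0), xc.max(axis=0)])  # pool over channels: (2, H, W)
    return sigmoid(conv2d_same(pooled, k_5x5))            # (H, W)
```

The explicit loop over kernel taps keeps the example dependency-free; a deep learning framework would replace `conv2d_same` with its own convolution layer.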

D. Dual-Frame Fusion Module
Figure 1 illustrates the feature maps $X_o$ and $X_s$, obtained after the MFA and SFA modules, respectively, which serve as inputs to our dual-frame fusion module. The proposed dual-frame fusion module combines the semantic information of the high-level feature maps with the spatial information of the low-level feature maps. Instead of traditional up-sampling, and inspired by [21], we use Adaptive Feature Pooling for a more flexible approach. This offers an expanded receptive field, facilitating improved integration of both core and contextual semantics.
The high-level feature map is adaptively pooled to match the size of the low-level feature maps. These feature maps are then combined via pixel-wise summation and further processed through deformable convolutions. This offers better adaptation to different object sizes, shapes, and other geometric deformations. The input has a total of four layers, and the aforementioned process is iterated twice to produce the final output. With the above process, we obtain channel and spatial attention feature maps for single frames and temporal attention feature maps for multiple frames. These feature maps are aggregated to generate a fused feature map.
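The pooling and summation step can be illustrated as follows. This is a minimal NumPy sketch, assuming an adaptive average pooling with the usual floor/ceil bin boundaries; the function names are hypothetical, and the deformable convolution that follows the summation in the text is deliberately omitted.

```python
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Resize x of shape (C, H, W) to (C, out_h, out_w) by averaging over bins.

    Bin i covers rows floor(i*H/out_h) .. ceil((i+1)*H/out_h), so the function
    also works when the output is larger than the input (values are repeated).
    """
    c, h, w = x.shape
    out = np.empty((c, out_h, out_w), dtype=float)
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, -(-(i + 1) * h // out_h)      # floor, ceil
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, -(-(j + 1) * w // out_w)  # floor, ceil
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def fuse_levels(low, high):
    """Pool the high-level map to the low-level size, then sum pixel-wise."""
    pooled = adaptive_avg_pool2d(high, low.shape[1], low.shape[2])
    return low + pooled

# Usage: a 2x2 high-level map fused with a 4x4 low-level map.
high = np.arange(4.0).reshape(1, 2, 2)
low = np.zeros((1, 4, 4))
fused = fuse_levels(low, high)   # shape (1, 4, 4)
```

Both inputs are assumed to have the same number of channels; in practice a 1 × 1 convolution would typically align the channel counts first.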

E. Detection Head
Our detection head is similar to CenterNet [22]. We compute the fused object probability heatmap on the merged feature maps. However, the sizes and offsets of the bounding boxes are generated from single-frame features. The loss function comprises three components: a fusion heatmap loss computed with the focal loss, and two regression losses (offset and size) computed with the L1 loss. $L_Z$ is the fusion heatmap loss, where $\hat{Q}_{ij}$ is the predicted heatmap value for each pixel, $Q_{ij} = 1$ signifies that the pixel is the center of an object, and $\epsilon$ and $\zeta$ are the hyper-parameters of the modified focal loss. $L_Y$ is the loss for the heatmap offset, where $\hat{P}_q$ is the predicted offset, $T$ is the position after downsampling, and $q$ is the actual center point. $L_X$ is the loss for the size of the bounding box, where $\hat{R}_j$ is the predicted size and $R_j$ is the ground truth size. The overall training objective is
$$L = L_Z + \lambda_{dim} L_X + \lambda_{pos} L_Y,$$
where $\lambda_{dim}$ and $\lambda_{pos}$ are the hyper-parameters weighting the size and offset loss components, respectively.
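To make the three loss terms concrete, here is a small pure-Python sketch. It assumes that the heatmap loss takes the penalty-reduced focal loss form of CenterNet [22], with the exponent on the prediction and the exponent on the ground-truth penalty passed as `eps_exp` and `zeta_exp`; which of $\epsilon$ and $\zeta$ plays which role, as well as the function names and the normalization by the number of elements in `l1_loss`, are assumptions rather than details taken from this paper.

```python
import math

def heatmap_focal_loss(pred, target, eps_exp, zeta_exp):
    """Penalty-reduced focal loss over flattened heatmaps (CenterNet-style).

    pred, target : sequences of floats in [0, 1]; target == 1 marks a center.
    eps_exp      : exponent applied to the prediction error (assumed role).
    zeta_exp     : exponent applied to (1 - target) for non-center pixels (assumed role).
    """
    clip = 1e-6
    total, n_centers = 0.0, 0
    for p, t in zip(pred, target):
        p = min(max(p, clip), 1.0 - clip)     # avoid log(0)
        if t == 1.0:
            n_centers += 1
            total += (1.0 - p) ** eps_exp * math.log(p)
        else:
            total += (1.0 - t) ** zeta_exp * p ** eps_exp * math.log(1.0 - p)
    return -total / max(n_centers, 1)         # normalize by the number of centers

def l1_loss(pred, target):
    """Mean absolute error, used here for both the offset and the size terms."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / max(len(pred), 1)

def total_loss(l_heatmap, l_size, l_offset, lambda_dim, lambda_pos):
    """Weighted sum of the heatmap, size, and offset losses."""
    return l_heatmap + lambda_dim * l_size + lambda_pos * l_offset
```

For example, `heatmap_focal_loss([0.9, 0.1], [1.0, 0.0], 2.0, 4.0)` is small because both pixels are predicted almost correctly, while less confident predictions such as `[0.5, 0.5]` give a larger value.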

IV. EXPERIMENTS
In this section, we assess the performance of our proposed method compared to SOTA methods and perform an ablation study.

A. Datasets and Evaluation Metrics
Datasets: As our method relies on more than one frame, the evaluation requires video datasets. Our evaluation focuses on traffic surveillance, given its relevance to our research. Some of the video datasets that we used are not standard datasets for object detection, but they were used in previous work on video object detection. We chose KITTI MOT (Multi-Object Tracking) [12] and Cityscapes [13], which are not usually used for object detection but provide videos, and UAVDT [14], which is used for object detection in videos. Each of these datasets presents unique challenges and contains sequences captured from different viewpoints with objects of different sizes. Because KITTI MOT and Cityscapes are not standard object detection datasets, we had to compute some results ourselves for competing SOTA methods for a fair comparison. This is not the case for UAVDT, for which we use the standard training and test split.
Evaluation Metrics: To evaluate detection accuracy, we use Average Precision (AP) for multiple object scales, mean Average Precision (mAP) across varying IoU thresholds, and mAP50 and mAP75, computed at IoU thresholds of 0.5 and 0.75, respectively. Intersection over Union (IoU) is used to evaluate bounding box precision on all datasets.
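For reference, the IoU between two axis-aligned boxes and the thresholding behind mAP50 and mAP75 can be written as below. This is a generic sketch, assuming boxes given as `(x1, y1, x2, y2)` corner coordinates; the function names are illustrative and the full AP computation (ranking detections and integrating the precision-recall curve) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold):
    """A detection counts as a match when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold

# Two 2x2 boxes overlapping on a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

With this overlap of about 0.14, the pair would not count as a match at either the 0.5 or the 0.75 threshold.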

B. Implementation Details
For feature extraction, we used HRNet [17] and pre-trained it on the COCO dataset [23], following the methodology described in [22]. Our global architecture follows CenterNet [22]. However, our training process is done in two steps. First, the backbone is fine-tuned on each dataset, starting from the weights pre-trained on COCO. Then, the first and third layers of the backbone are frozen, and the MFA, SFA, and dual-frame fusion modules as well as the network heads are trained. Training is conducted over 250 epochs with the Adam optimizer, starting with a learning rate of 1 × 10^-4, which is divided by 10 after the 130th and 140th epochs. To ensure training stability, we use gradient clipping. The same training protocol was used for the overall architecture and for all the base detectors in order to demonstrate the contribution of our approach.
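The step schedule described above can be expressed as a small function. The sketch assumes 1-indexed epochs and reads "after the 130th epoch" as taking effect from epoch 131; the function name and its defaults are illustrative only.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(130, 140), factor=0.1):
    """Learning rate for a 1-indexed epoch under a multi-step decay schedule."""
    # count how many milestones have already been passed
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * factor ** drops

# Learning rates at a few epochs of the 250-epoch schedule.
schedule = {e: learning_rate(e) for e in (1, 130, 131, 140, 141, 250)}
```

Under these assumptions the rate stays at 1 × 10^-4 through epoch 130, drops to 1 × 10^-5 for epochs 131 to 140, and to 1 × 10^-6 from epoch 141 onward.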

C. Results and Discussion
Comparisons with SOTA methods on the Cityscapes dataset are reported in Table I. They show that our attention-based fusion detector consistently outperforms the other SOTA detectors, with a significant improvement in the detection results. This improvement is due to our two attention modules and our dual-frame fusion module, which all contribute positively to detecting objects better (especially small or occluded ones). In Table I, we also compare our model with the vanilla HRNet, since our feature extractor is based on the HRNet architecture. This allows us to observe the impact of our STF module on a similar backbone. This comparison demonstrates a gain in accuracy for all object sizes. Furthermore, we changed the backbone of CenterNet [22] to observe how HRNet affects its performance, since it uses a detection head similar to ours. It can be concluded from the results that using HRNet alone does not yield a significant improvement. This is another demonstration that our method, which uses a classification head similar to CenterNet, performs better because of our SFA and MFA modules. Compared with YOLOv5 and the recent YOLOX, our model also shows improved precision and accuracy. Finally, we also perform better than PPNet, which uses multiple frames.
Table II presents the results on the KITTI validation dataset. The conclusions are the same as for Cityscapes, with similar improvements compared to baseline methods. Our proposed method outperforms the other SOTA detectors, with improvements for all object size categories (small, medium, and large). It improves the detection results when compared to both SOTA single-frame and two-frame detectors.
Results on the UAVDT test dataset are reported in Table III. Our Spatio-Temporal Fusion (STF) module consistently outperforms the base detectors. SOTA multi-frame detectors such as FFAVOD and RN-VID fuse features without attention. Although this helps compared to single-frame detectors, the results show that a more sophisticated fusion approach, like the one we propose, is required to obtain even better results.

D. Ablation Study
An ablation study was performed to evaluate the contribution of the different parts of the proposed method: the multi-frame attention module, the single-frame attention module, and the combination of both attention modules with the dual-frame fusion module. The effect of each component is shown in Table IV. We find that the method with the MFA module and the method with the SFA module both detect better than the baseline method (HRNet + CenterNet head). We also observe that the proposed method (STF), with both modules and the dual-frame fusion, performs the best. In the proposed STF module, the MFA module plays a crucial role in combining features from two frames. Similarly, the SFA module aims to improve the accuracy of detection within a single frame. This is achieved through the combination of single-frame channel and spatial attention, which effectively suppresses false positive detections in background regions. We noted that each module independently contributes to the performance improvement. Moreover, a synergistic effect is observed when both modules are combined, leading to a larger improvement in results. Our proposed model therefore demonstrates superior results compared to the other configurations.
To illustrate the specific contribution of our proposed dual-frame fusion method, we also conducted an ablation study on it. We aimed to understand the individual impact of different fusion strategies on the overall performance of our model. For that, we used different strategies for combining the two frames, namely concatenation, median, mean, and max fusion. In all cases, this decreased the performance by a large margin, as shown in Table V. We attribute this to the misalignment of features across frames, which calls for a more intricate operation for aggregating these features. Admittedly, our model requires additional parameters to learn the optimal combination of feature maps. However, as indicated in Table V, our findings strongly support the benefit of our dual-frame fusion method for integrating feature maps.
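The four baseline strategies compared in this ablation are simple, parameter-free operations on the two feature maps. The following NumPy sketch shows one plausible way to implement them; the function name, the `(C, H, W)` layout, and the choice to concatenate along the channel axis are assumptions made for illustration.

```python
import numpy as np

def fuse_frames(past, current, strategy):
    """Parameter-free fusion of two (C, H, W) feature maps."""
    if strategy == "concat":
        # channel-wise concatenation: output has 2C channels
        return np.concatenate([past, current], axis=0)
    stacked = np.stack([past, current])   # (2, C, H, W)
    if strategy == "mean":
        return stacked.mean(axis=0)
    if strategy == "median":
        return np.median(stacked, axis=0)
    if strategy == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion strategy: {strategy}")

# Usage on two constant feature maps.
past = np.zeros((2, 3, 3))
current = np.full((2, 3, 3), 4.0)
fused_max = fuse_frames(past, current, "max")        # all values equal 4
fused_cat = fuse_frames(past, current, "concat")     # shape (4, 3, 3)
```

None of these operations can re-weight or re-align features between the two frames, which is consistent with the explanation given above for their lower performance.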

V. CONCLUSION
In this work, we designed a spatio-temporal fusion module as a new approach for multi-frame object detection. Specifically, we identified the ineffectiveness and inadequacy issues present in single-frame object detectors. We then proposed to solve these problems using multi-frame and single-frame attention modules, as well as a dual-frame fusion module, to improve object representation. Our results show that, by exploiting sequential frames, we can improve the efficiency and accuracy of detection under challenging conditions.

Figure 3. The channel and spatial attention modules of our proposed single-frame attention module.

Table I COMPARISON OF OUR METHOD WITH SOTA METHODS ON THE CITYSCAPES VALIDATION DATASET. BOLDFACE INDICATES BEST RESULTS. Trained by ourselves.

Table II COMPARISON OF OUR METHOD WITH SOTA METHODS ON THE KITTI MOT VALIDATION DATASET. BOLDFACE INDICATES THE BEST RESULT.

Table III COMPARISON OF UAVDT TEST DATASET WITH DIFFERENT METHODS. BOLDFACE INDICATES THE BEST RESULT.

Table IV ABLATION STUDY ON THE MFA, SFA, AND DUAL-FUSION MODULES. * Trained by ourselves.

[6] H. Perreault, G.-A. Bilodeau, N. Saunier, and M. Héritier, "FFAVOD: Feature fusion architecture for video object detection," Pattern Recognition Letters, vol. 151, pp. 294-301, 2021.

Table V ABLATION STUDY OF THE DIFFERENT FUSION STRATEGIES ON CITYSCAPES DATASET.