Multi Player Tracking in Ice Hockey with Homographic Projections

Multi Object Tracking (MOT) in ice hockey pursues the combined task of localizing and associating players across a given sequence to maintain their identities. Tracking players from monocular broadcast feeds is an important computer vision problem offering various downstream analytics and enhanced viewership experience. However, existing trackers encounter significant difficulties in dealing with occlusions, blurs, and agile player movements prevalent in telecast feeds. In this work, we propose a novel tracking approach by formulating MOT as a bipartite graph matching problem infused with homography. We disentangle the positional representations of occluded and overlapping players in broadcast view, by mapping their foot keypoints to an overhead rink template, and encode these projected positions into the graph network. This ensures reliable spatial context for consistent player tracking and unfragmented tracklet prediction. Our results show considerable improvements in both the IDsw and IDF1 metrics on the two available broadcast ice hockey datasets.


I. INTRODUCTION
Multi-Object Tracking (MOT) is an important computer vision problem subsuming several tasks such as object detection, localization, re-identification and association across a temporal sequence. It is a highly studied and established problem due to its many applications in robotics, autonomous vehicles, industrial automation, surveillance, and sports. While most existing MOT approaches focus primarily on tracking crowded pedestrians [1], [2], [3], group dancing [4], and autonomous driving [5], sports tracking is a pivotal task in vision due to its numerous downstream applications in game analytics and statistics, strategic planning, player evaluation, injury prevention, and crucial game decisions. Player tracking saves many hours of manual labor by automating game understanding and player performance assessment. With the advent of deep networks [6], important strides have been made in tracking various sports including soccer [7], handball [8], basketball [9], [10], [11], and volleyball [12], with public MOT datasets [13], [14] to support principled evaluations.
Unlike the aforementioned sports, ice hockey poses unique challenges to tracking due to its highly physical and fast-paced nature. Specifically, three major challenges exist: (i) significant occlusion between two or more players at a given instant within the field-of-view (FoV); (ii) non-linear player dynamics due to unpredictable player motion; and (iii) blur and reduced visibility of players due to frequent camera motion. When faced with these issues, tracking in monocular view suffers from increased identity swaps and tracklet fragmentation. Early attempts at tracking ice hockey players used ensembles of handcrafted methods. Okuma et al. [15] use AdaBoost [16] detection with mixed particle filters [17] to track players in television videos. Cai et al. [18] improve upon [15] by using the mean-shift algorithm to stabilize player trajectories, and use rink coordinates (homography) for the particle filter. However, mixed particle filters are susceptible to identity switches and losses during mutual occlusions, background changes, blurs, and lighting effects, which are common in hockey. Further, these works provide no quantitative evaluation to show the efficacy of their models.
Recent approaches using deep networks have shown significant improvements in the accuracy of tracking hockey players from broadcast feeds. Vats et al. [19] compare five different tracking models fine-tuned on broadcast ice hockey clips and obtain state-of-the-art results using graphs with message passing networks (MPN) [20]. They subsequently use this method to generate tracklets (sequences of player tracks) for player identification and team recognition [21]. As far as we know, this is the only existing benchmark for MOT in ice hockey. However, their approach has two major limitations. First, their model encodes the bounding box attributes of players as graph edge embeddings, which leads to identity swaps and tracklet fragmentation. When players heavily occlude one another, as is common in broadcast hockey feeds, their bounding boxes overlap significantly in the monocular view (high Intersection-over-Union (IoU)), causing either a misassociation of players (identity switch) or a missed connection with previous tracks (lost tracklet). Second, their node embeddings are encoded solely from player appearance features, which are ambiguous in hockey due to the bulky, fully covering gear worn by players, similarly colored team jerseys, blur, and lighting effects. Given these setbacks, we ask the question: "Given only a monocular broadcast feed, is it possible to declutter occluded players and track their movements with high fidelity?" Our results show that the answer is yes.
In this work, we formulate MOT as a link prediction problem in the graph domain, since graph networks offer a natural way of representing players and their relationships. The novelty of our approach lies in coupling this with homography to obtain reliable spatial/positional cues for ice hockey players. Specifically, we map every frame from broadcast view to an overhead view (rink template) using an off-the-shelf homography estimator [22] to obtain the projected footpoints of players in rink coordinates. This yields a bird's-eye effect, reducing the positional ambiguities caused by overlapping players and perspective projection.
We encode these projected positions along with player re-identification features as graph embeddings, and propagate information through the graph using the message passing network (MPN) [23], [24] framework. Our model follows the popular tracking-by-detection paradigm, and we show evaluations with both ground-truth detections and off-the-shelf detector outputs [25], consistent with the current state-of-the-art (SOTA) benchmark [19]. Through this approach, our model uses the overhead positional information of players as additional cues to track consistently during occlusions, blurs, and dynamic player movements.
Our contributions can be summarized as follows:
• We design a simple, yet effective spatio-temporal graphical network and adopt the MPN framework to track players from broadcast feeds consistently;
• We propose a novel approach based on homography to provide additional positional information to the graph network during occlusions, blurs and non-linear movements; and
• We show significant improvements in the IDsw and IDF1 scores on two available broadcast ice hockey datasets.

II. RELATED WORK

A. MOT for Pedestrians
Most existing MOT methods [20], [26], [27], [28], [29], [30], [31], [32], [33] focus exclusively on tracking pedestrians in crowded scenes [1], [2], [3]. Initial approaches combine Kalman filters [34] and the Hungarian method [35] for next-state motion prediction and object association, respectively. Subsequent approaches adopted the two-stage tracking-by-detection (TBD) paradigm: first, all objects present in a sequence are detected; next, their association features are extracted to link similar objects across frames. Simple Online and Realtime Tracking (SORT) [31] establishes a TBD baseline for MOT and argues that SOTA associations can be obtained using only traditional methods [34], [35]. DeepSORT [32] answers this argument by embedding appearance features using a ReID network for association, showing fewer identity switches and better tracking results on the MOT16 [2] challenge benchmark. FairMOT [33] combines the detection and ReID steps for Joint Detection and Tracking (JDT) using an anchor-free detector [36]. Tracktor [29] frames tracking as a bounding box regression problem by converting a detector [25] into a tracker. However, a major limitation of all these methods is their inability to handle crowded scenes and declutter occluded pedestrians. ByteTrack [28] tries to handle occlusion by also associating low-confidence detections to recover true objects, but fails in heavily crowded (overlapping) situations where the object is largely obstructed. SparseTrack [37] tries to handle occlusion by decomposing dense, crowded pedestrian scenes into sparse subsets using pseudo-depth map estimation, but incurs high computational cost.

B. MOT for Sports
Initial methods in sports use Kalman filtering [38], [39] and particle filters [15], [18] for tracking, but were unable to preserve identities due to their linear motion model assumptions. Nillius et al. [40] encode player trajectories into track graphs and use a Bayesian framework to predict the most likely configuration of player paths. Figueroa et al. [41] track players with multiple cameras by encoding their segmentation blobs as nodes and their relative distances as edges. With YOLOv2 [42] for detection, Acuna et al. [43] track basketball players using SORT [31], while Theagarajan et al. [44] track soccer players using DeepSORT [32]. However, extending methods originally designed for pedestrian tracking to sports is a non-trivial task, due to the domain-specific challenges in modeling arbitrary player movements, occlusions, fast camera motions, and perspective projection errors.

C. Tracking with Graphs
Graph-based formulations offer a flexible way to model target movements, interactions and their relative features. Wang et al. [45] propose a graph-based MOT framework for joint detection and tracking. The MOT neural solver [20] exploits the network flow formulation of MOT to define a message passing network for data association. Vats et al. [19] fine-tune the neural solver architecture on the broadcast ice hockey dataset to create the first tracking benchmark for hockey. Luna et al. [46] extend the graph domain to multi-camera tracking, and ReST [47] builds on this with a spatio-temporal network for online tracking. Both of these algorithms use the intrinsic camera parameters to map multiple camera views onto a common ground plane for additional positional cues. In our case, however, we cannot infer camera parameters directly from broadcast feeds. Therefore, we use an off-the-shelf homography model [22] trained on a top-view rink template to map each broadcast frame to the overhead rink coordinates and obtain homographic footpoint coordinates for the graph MPN.

III. PROPOSED METHOD
Given an ice hockey broadcast feed, the objective of our work is to track multiple players consistently despite prevalent occlusions, blurs, and arbitrary player movements. We leverage Graph Neural Networks (GNN) for this task, due to their efficiency in modeling relationships, to build a spatio-temporal graph, and use homography for reliable positional information. Specifically, we project player foot keypoints onto a common overhead rink template and encode these projections along with ReID embeddings [48] as node features and their relative distances as edge features. Since we cannot obtain camera parameters from broadcast feeds directly, we use an off-the-shelf homography estimation model [22] specifically designed for ice hockey to estimate the homography transformation matrix $H \in \mathbb{R}^{3 \times 3}$. We use the message passing network (MPN) framework [23], [24] to propagate player features across the entire graph $G$, and update the node and edge embeddings at each message passing step. This helps our model reason globally over the entire sequence when predicting player trajectories. We frame the association between two consecutive frames $f_i$ and $f_j$ with $j > i$ as a bipartite graph problem, and use a binary classifier with a sigmoid final layer to output association probabilities. During inference, we post-process each graph by pruning low-confidence associations and resolving many-to-one violations. Finally, we assign tracklet IDs to nodes: a node with a prior connection inherits the same ID, and is assigned a new ID otherwise. (Ref Fig.

A. Problem Formulation
Consider a broadcast hockey sequence of frames $S_t = \{F_i \mid i = 1, 2, \ldots, n\}$, where $n = t / \Delta t$, $t$ is the total duration of the sequence, and $\Delta t$ is the duration per frame. For each frame $F_i$, assuming that there exists at least one player to track, we have players $P^{i}_{j}$, where $j \geq 1$. For each player $P$, the ground-truth annotation includes $\{f_{id}, t_{id}, x, y, wd, ht, c, x_{proj}, y_{proj}\}$, where $f_{id}$ is the frame ID, $t_{id}$ is the player ID (only used during training), $\{x, y, wd, ht\}$ are the bounding box coordinates, $c$ is the annotation confidence score, and $(x_{proj}, y_{proj})$ are the homography coordinates of the player footpoints. As per the tracking-by-detection paradigm, player annotations at the inference stage are detections from an off-the-shelf detector. We report both evaluations: first, evaluating directly on the annotated ground truth, which gives the clearest picture of our tracking performance (Ref. Section IV-C); and second, inference and evaluation on the detections produced by F-RCNN [25], following the current SOTA benchmark [19] for a fair comparison.

We formulate multi-object tracking (MOT) as a bipartite graph matching problem. We deterministically create one graph per frame in the given sequence. Each node in this graph corresponds to one player in that frame, and the edges correspond to its relationships with neighboring nodes in other frames. Consider the undirected graph $G_t = (V_t, E_t)$ with bidirectional connections, where $V_t$ denotes the vertex set and $E_t$ denotes the edge set at time $t$. $NodeFeatures(\cdot)$ represents the concatenation of the appearance features and homographic projections of players at time $t$, and $E_t = V_i \times V_j$ represents a one-to-one mapping between the vertex sets $V_i$ and $V_j$ with $i \neq j$, meaning that no edge connects players within the same frame.
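As an illustration of the annotation tuple and the frame count $n = t / \Delta t$ described above, the following minimal sketch stores one annotation per record and groups records by frame. The class name, field names, and helper functions are hypothetical, and treating $(x, y)$ as the top-left corner of the bounding box is an assumption inferred from the footpoint definition used later.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerAnnotation:
    """One annotation record (field names are illustrative, not from a released format)."""
    f_id: int      # frame ID
    t_id: int      # player (tracklet) ID, only used during training
    x: float       # bounding box x (assumed top-left corner)
    y: float       # bounding box y (assumed top-left corner)
    wd: float      # bounding box width
    ht: float      # bounding box height
    c: float       # annotation confidence score
    x_proj: float  # projected footpoint x in rink coordinates
    y_proj: float  # projected footpoint y in rink coordinates


def num_frames(duration_s: float, frame_duration_s: float) -> int:
    """Number of frames n = t / delta_t, rounded to the nearest whole frame."""
    return round(duration_s / frame_duration_s)


def group_by_frame(annotations):
    """Map each frame ID to the list of players annotated in that frame."""
    frames = defaultdict(list)
    for ann in annotations:
        frames[ann.f_id].append(ann)
    return dict(frames)
```

Each per-frame list corresponds to the vertex set of one frame graph; a 36-second clip at 30 fps would yield `num_frames(36.0, 1 / 30)` frames.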
Node Formulation. Each node $v_i \in V_t$ represents one unique player $i$ found in $F_t$ and contains the frame ID (timestamp), the ReID features, and the homographic projection of that player. The ReID features are generated for each node $i$ via an off-the-shelf re-identification network [48], described by

$$r^{id}_i = \mathrm{ReID}\left(b_i|_{crop}\right),$$

where $b_i|_{crop}$ denotes the cropped bounding box for the $i$-th player. The homographic projections are estimated by projecting the bottom-mid point (approximately the foot keypoint) of the bounding box from the monocular broadcast view. The footpoint coordinates $f_l = x_i + \frac{wd_i}{2}$ (horizontal) and $f_r = y_i + ht_i$ (vertical) for player $i$ are projected as

$$p_i = H_i \, [f_l, \; f_r, \; 1]^{\top},$$

with $H_i$ being the $3 \times 3$ homography matrix (Ref. Eq. 5).

Edge Formulation. Each edge $e_{ij} \in E_t$ represents the connection between two players from two distinct frames. It is encoded as the concatenation of the relative appearance features $\Delta r^{id}_{ij}$ and the relative positional features $\Delta p_{ij}$ between the pair of nodes $v_i$ and $v_j$, where $e_{ij} = v_i \times v_j$, $i \neq j$. This can be represented as

$$e_{ij} = \left[\Delta r^{id}_{ij}, \; \Delta p_{ij}\right],$$

where $[\cdot, \cdot]$ denotes concatenation: $\Delta r^{id}_{ij}$ concatenates the Euclidean distance and cosine similarity between the ReID features, and $\Delta p_{ij}$ concatenates the Euclidean and Manhattan distances between the projected positions. This choice is inspired by [46] and yields higher-dimensional, more distinctive features.
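The edge encoding above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `edge_features` and the ordering of the four entries are assumptions, and the positions are taken to be 2-D rink coordinates.

```python
import numpy as np


def edge_features(reid_i, reid_j, pos_i, pos_j, eps=1e-12):
    """Return the 4-D edge feature for a pair of nodes.

    Entries (order assumed): Euclidean distance and cosine similarity between
    ReID vectors, then Euclidean and Manhattan distance between projected positions.
    """
    reid_i = np.asarray(reid_i, dtype=float)
    reid_j = np.asarray(reid_j, dtype=float)
    pos_i = np.asarray(pos_i, dtype=float)
    pos_j = np.asarray(pos_j, dtype=float)

    # Relative appearance features
    reid_euclidean = np.linalg.norm(reid_i - reid_j)
    norm_product = np.linalg.norm(reid_i) * np.linalg.norm(reid_j)
    reid_cosine = float(reid_i @ reid_j) / max(norm_product, eps)

    # Relative positional features in rink coordinates
    pos_euclidean = np.linalg.norm(pos_i - pos_j)
    pos_manhattan = np.abs(pos_i - pos_j).sum()

    return np.array([reid_euclidean, reid_cosine, pos_euclidean, pos_manhattan])
```

Using two distance measures per cue keeps the edge feature small (four values) while still separating pairs that are close in appearance but far apart on the rink.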

B. Homographic Projection
Due to the monocular nature of broadcast ice hockey sequences, there are high levels of player occlusion and dynamic camera movement effects (blurs, pans, tilts, zooms). This limits the ability to track players consistently when they are completely obscured or hidden, even for very short intervals. To provide the tracker with reliable positional cues in such scenarios, we propose using homography to warp player positions from the broadcast video feed onto a common overhead rink template. This reduces the frame-to-frame variance caused by camera motion across the sequence, and provides a pseudo top-view tracking effect that disentangles overlapping players.
At any given frame $F_t$, the foot keypoint coordinates of a player $P_i$ represent their point of contact with the ice, which, when projected to the overhead rink plane, provides additional positional cues for uncluttered tracking. For the player $P$ with footpoints $(P_{fx}, P_{fy})$ in broadcast view:

$$p_i = s \begin{bmatrix} P_{x'} \\ P_{y'} \\ 1 \end{bmatrix} = H \begin{bmatrix} P_{fx} \\ P_{fy} \\ 1 \end{bmatrix} \tag{5}$$

where $p_i$ is the homographic projection of the $i$-th node, $(P_{x'}, P_{y'})$ are the projected homogeneous player footpoints in the overhead view, $s$ is the scale factor, and $H$ is the $3 \times 3$ homography matrix.
However, this is a non-trivial task: the camera parameters for broadcast feeds are unknown, and therefore the values of $H$ are unknown as well. We address this using an off-the-shelf homography estimator [22] pre-trained on an ice hockey top-view rink template, which maps each broadcast frame onto the overhead view and yields its homographic projection.
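Once a homography matrix is available for a frame, projecting a footpoint reduces to a matrix product followed by division by the scale factor. The sketch below assumes that $(x, y)$ is the top-left corner of the bounding box and that $H$ is supplied by an external estimator; the function names are hypothetical.

```python
import numpy as np


def foot_keypoint(x, y, wd, ht):
    """Bottom-mid point of a bounding box given as top-left corner plus width and height."""
    return np.array([x + wd / 2.0, y + ht], dtype=float)


def project_to_rink(H, point_xy):
    """Map a broadcast-view point to overhead rink coordinates with a 3x3 homography."""
    H = np.asarray(H, dtype=float)
    if H.shape != (3, 3):
        raise ValueError("H must be a 3x3 matrix")
    src = np.array([point_xy[0], point_xy[1], 1.0])  # homogeneous coordinates
    dst = H @ src
    if abs(dst[2]) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return dst[:2] / dst[2]  # divide out the scale factor s
```

Because the result is divided by its third component, multiplying $H$ by any non-zero constant leaves the projected point unchanged.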

C. Temporal Graph Association
To facilitate association of players across frames, we design a simple temporal graph network to correlate player features. Inspired by [47], we iteratively correlate the learned graph $G^T_{t-1}$ at time $t-1$ with the next graph $G_t$ at time $t$ to form a new graph $G^T_t$. Assuming this is the $n$-th iteration, each node $v_{t-1} \in G^T_{t-1}$ contains aggregated embeddings from $n-1$ iterations and is connected with the new nodes $v_t \in G_t$ to form the new temporal graph $G^T_t$ at time $t$ (Ref. Algorithm 1). The edge set thus created can be denoted as

$$E^T_t = \{(v_{t-1}, v_t) \mid v_{t-1} \in G^T_{t-1}, \; v_t \in G_t\}.$$

This aggregation of uniterated graphs with learned graphs helps propagate learned features and assign consistent identities to the same player (Ref. Fig. II-C).
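The construction of temporal edges between the learned graph at $t-1$ and the new graph at $t$ can be sketched as a full bipartite connection. This is a simplified illustration under the assumption that every previous node is linked to every new node; the function name and the use of `(frame, index)` tuples as node identifiers are hypothetical.

```python
from itertools import product


def temporal_edges(prev_nodes, new_nodes):
    """Connect every node of the learned graph at t-1 to every new node at t.

    No edge is created between two nodes of the same set, so the result is bipartite.
    """
    return [(u, v) for u, v in product(prev_nodes, new_nodes)]


def build_edges_over_sequence(nodes_per_frame):
    """Build bipartite edges for each consecutive pair of frames in a sequence."""
    edges = []
    for prev_nodes, new_nodes in zip(nodes_per_frame, nodes_per_frame[1:]):
        edges.extend(temporal_edges(prev_nodes, new_nodes))
    return edges
```

With $m$ previous nodes and $k$ new nodes, this yields $m \cdot k$ candidate associations, which the classifier later scores and the post-processing step prunes.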

D. Message Passing Network
We adopt the Message Passing Network (MPN) structure introduced by [24] to propagate node and edge information across the graph $G$. Message passing helps the graph elements learn from their neighbors: each edge learns about the projection and appearance features of its neighboring nodes, and each node learns about the geometric features of its neighboring edges. To begin with, we initialize the node embeddings $h^{0}_{v_i}$ by encoding the concatenation of the ReID features and the homographic projection $p_i$ of each node, and the edge embeddings $h^{0}_{e_{ij}}$ by encoding the edge features $[\Delta r^{id}_{ij}, \Delta p_{ij}]$. Note that we add $p$ and $\Delta p$ to the node and edge features, respectively, to propagate homographic (positional) information throughout the graph. Given these initial embeddings, and following well-established methods [19], [20], [23], [46], [47], we perform $L$ iterations of edge updates and node updates as two separate steps [49].

Edge Update. We use a learnable multi-layer perceptron (MLP) to perform edge encoding for $l = \{1, \ldots, L\}$ steps using the source and destination nodes connected by the edge:

$$h^{l}_{e_{ij}} = f^{ME}_{e}\left(\left[h^{l-1}_{v_i}, \; h^{l-1}_{v_j}, \; h^{l-1}_{e_{ij}}\right]\right) \tag{8}$$

where $f^{ME}_{e}$ is the edge encoder. This shares the appearance and projection embeddings of the neighboring nodes $h_{v_i}$, $h_{v_j}$ with their connecting edge $h_{e_{ij}}$.

Node Update. Similar to Eq. (8), we use a learnable MLP to perform the node update for $l = \{1, \ldots, L\}$ steps, using the aggregated messages coming from its neighboring nodes:

$$h^{l}_{v_i} = \sum_{j \in N(v_i)} f^{ME}_{v}\left(\left[h^{l-1}_{v_i}, \; h^{l}_{e_{ij}}\right]\right)$$

where $f^{ME}_{v}$ is the node encoder and $N(v_i)$ denotes the neighboring nodes of $v_i$. Note that $f^{ME}_{v}$ and $f^{ME}_{e}$ are two separate networks with different dimensions that share the same MLP architecture (Ref. Table I). In both updates, the MLP encodes the information into a higher-dimensional feature space. The number of message passing steps $L$ is akin to the receptive field of convolutional neural networks [6]: a higher value of $L$ propagates information farther in the graph, at the cost of additional computation.

Classification. We learn the association between nodes as a link prediction task by framing player tracking as a graph partition problem. That is, after $L$ iterations, we perform binary classification to predict the edge probabilities $\hat{y}_{e_{ij}}$ connecting nodes $v_i$ and $v_j$ as

$$\hat{y}_{e_{ij}} = f_{cls}\left(h^{L}_{e_{ij}}\right)$$

where $f_{cls}$ is a learnable MLP with a sigmoid final layer to output probabilities. During inference, edges with low confidence scores (weak connections) are removed.
During training, the binary ground-truth labels $y_{e_{ij}}$ and their corresponding predictions $\hat{y}_{e_{ij}}$ are compared using the sigmoid focal loss [50].
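To make the edge and node updates concrete, the following NumPy sketch runs message passing with untrained, randomly initialized MLPs. Several details are assumptions rather than the paper's specification: the edge encoder sees the concatenation of both endpoint embeddings and the edge embedding, messages are aggregated by summation, each undirected edge sends a message to both endpoints, and the hidden width of 16 is arbitrary. The 32-D node and 6-D edge embedding sizes follow the training details reported in this paper.

```python
import numpy as np


def mlp(in_dim, out_dim, rng, hidden=16):
    """Single-hidden-layer ReLU MLP with random weights (untrained, for illustration only)."""
    w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, out_dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def message_passing_step(h_v, h_e, edges, f_e, f_v):
    """One iteration: update every edge from its endpoints, then every node from its edges."""
    src, dst = edges[:, 0], edges[:, 1]

    # Edge update: encode [source node, destination node, edge] embeddings
    new_h_e = f_e(np.concatenate([h_v[src], h_v[dst], h_e], axis=1))

    # Node update: sum the messages arriving over all incident edges (assumed aggregator)
    new_h_v = np.zeros_like(h_v)
    np.add.at(new_h_v, src, f_v(np.concatenate([h_v[src], new_h_e], axis=1)))
    np.add.at(new_h_v, dst, f_v(np.concatenate([h_v[dst], new_h_e], axis=1)))
    return new_h_v, new_h_e


def run_mpn(h_v, h_e, edges, f_e, f_v, f_cls, num_steps=6):
    """Run num_steps iterations and return one association probability per edge."""
    for _ in range(num_steps):
        h_v, h_e = message_passing_step(h_v, h_e, edges, f_e, f_v)
    return sigmoid(f_cls(h_e)).ravel()
```

Sharing `f_e` and `f_v` across iterations is a further simplification made here; a trained model would learn these weights with the focal loss on the edge probabilities.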

E. Post-Processing
We adopt a post-processing step during inference to prune the final graphs, resolve violations in them, and assign consistent tracklet IDs.

Pruning. This is the first step in refining the edge confidence scores $\hat{y}_{e_{ij}}$ predicted by our classifier. We define a confidence threshold hyperparameter $\xi$ and discard every edge whose score $\hat{y}_{e_{ij}}$ falls below it. This eliminates the low-confidence edges and retains only the strongest correspondences; in our experiments, we use $\xi = 0.9$. In addition, similar to [47], we remove many-to-one edge violations, so that at most one connection exists between two nodes within the connected graphs. This ensures that each player identity is represented by a unique connected component, since no two different players can share the same identity at the same time.

Assigning IDs. The final graph contains unique connected components, where each component represents one unique player tracklet. As we iterate temporally, we assign a new identity to a node when no previously connected component exists for it, and propagate the existing identity otherwise.
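The post-processing stage can be sketched as threshold pruning followed by conflict resolution and ID propagation. The greedy highest-confidence-first rule used below to resolve many-to-one violations is an assumption made for illustration, since the text defers to [47] for this step; all function names are hypothetical.

```python
def prune_and_match(edges, xi=0.9):
    """Keep edges with probability >= xi, then resolve conflicts greedily.

    edges: iterable of (prev_node, new_node, probability).
    Returns a dict mapping each matched new node to its previous node.
    """
    kept = sorted((e for e in edges if e[2] >= xi), key=lambda e: e[2], reverse=True)
    used_prev, used_new, matches = set(), set(), {}
    for prev, new, _prob in kept:
        # Skip edges that would give a node a second connection
        if prev in used_prev or new in used_new:
            continue
        matches[new] = prev
        used_prev.add(prev)
        used_new.add(new)
    return matches


def assign_ids(new_nodes, matches, prev_ids, next_id):
    """Propagate the ID of a matched previous node, or allocate a fresh ID otherwise."""
    ids = {}
    for node in new_nodes:
        if node in matches:
            ids[node] = prev_ids[matches[node]]
        else:
            ids[node] = next_id
            next_id += 1
    return ids, next_id
```

An optimal assignment (for example, the Hungarian method) could replace the greedy rule; with a threshold as strict as 0.9, few conflicting edges are expected to survive pruning in the first place.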

IV. IMPLEMENTATION DETAILS
This section contains details about the datasets used, the training scheme, the evaluation metrics, and the final results.

A. Datasets
We evaluate our method on the two available ice hockey tracking datasets. First, like the current SOTA benchmark [19], we train and test on the broadcast hockey dataset for a fair comparison; second, we evaluate on the public VIP Hockey Tracking Dataset (VIP-HTD) [51] to demonstrate the generalization of our method. Both datasets contain side-of-the-rink broadcast videos with occlusions, blurs, and challenging player movements.

Broadcast Tracking Dataset [19]. This dataset contains 84 broadcast clips sampled from 25 NHL games, with an average duration of ∼36 seconds per clip. The clips have a resolution of 1280 × 720 (720p, High Definition) at a frame rate of 30 fps, and are split 58:13:13 into train:validation:test clips. We follow the same training and testing scheme as [19] for an equal comparison, and show superior results with our method (Ref. Table II).

VIP-HTD [51]. This public dataset contains 22 broadcast hockey clips sampled from 8 NHL games, with both 30 and 60 Hz frame rates, recorded at 1280 × 720 resolution. We perform cross-dataset validation (trained on the broadcast dataset; tested on VIP-HTD) on all 7 test clips in this dataset to showcase the generalizability of our method to unseen broadcast hockey feeds (Ref. Table III).

B. Training Details
Given the player bounding boxes and tracklet IDs, we find the footpoint projection coordinates using an off-the-shelf homography model [22] trained specifically for NHL ice hockey rinks. This model is currently the SOTA for hockey and predicts highly accurate overhead projections. Next, we use the OSNet architecture [48], pre-trained on the ImageNet dataset [52], as our ReID network for player feature extraction (Eq. (1)). We encode the 512-D ReID feature vectors along with the 3-D homography features (homogeneous coordinates) into 32-D node embeddings (Eq. (6)), and the 4-D edge features (Eq. (7)) into 6-D edge embeddings. We run the MPN for $L = 6$ iterations and pass the final graph output to our binary classifier to predict edge probabilities.

Algorithm 2: Model Training (the message passing updates are repeated while $l \leq L$).

During training, we use the ground-truth player annotations and calculate the prediction losses using the focal loss [50] between $\hat{y}^{l}_{e_{ij}}$, the edge prediction at iteration $l$, and $y_{e_{ij}}$, the ground-truth indicator function: for every player match, $y_{e_{ij}} = 1$, and $0$ otherwise. We use the Adam optimizer [53] without weight decay to update our model parameters. The learning rate (LR) is initialized at 0.01, with a gradual warmup for the first 10 epochs and a cosine annealing schedule (minimum LR = 0.001) thereafter. We trained our model for 30 epochs with a batch size of 16 on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory and a 2.5 GHz clock speed, and performed validation after every 2 training epochs.
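The loss and learning-rate schedule described above can be sketched as follows. The focal loss is written in its standard binary form; the values `alpha=0.25` and `gamma=2.0` are the common defaults from the focal loss literature and are assumptions here, since the text does not state them. The linear shape of the warmup is likewise an assumption, while the base rate, minimum rate, warmup length, and epoch count follow the text.

```python
import math

import numpy as np


def sigmoid_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over edges, given sigmoid outputs and 0/1 labels."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(targets, dtype=float)
    p_t = y * p + (1.0 - y) * (1.0 - p)              # probability of the true class
    alpha_t = y * alpha + (1.0 - y) * (1.0 - alpha)  # class-balancing weight
    # The (1 - p_t)^gamma factor down-weights edges that are already well classified
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def learning_rate(epoch, base_lr=0.01, min_lr=0.001, warmup_epochs=10, total_epochs=30):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The focal term matters in this setting because most candidate edges connect different players, so negative labels heavily outnumber positive ones.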

C. Evaluation Metrics
The most common evaluation metrics used by popular SOTA tracking methods are the Multi-Object Tracking Accuracy (MOTA) [54] and the IDF1 score [55]. In the context of our problem, they are defined as follows.

MOT Accuracy: It is calculated as the complement of three distinct errors:
• False Positives (FP): number of false players detected;
• False Negatives (FN): number of true players missed;
• Identity Switches (IDsw): number of identity swaps or re-initializations made for players within the field-of-view.

$$MOTA = 1 - \frac{\sum_t \left(FP_t + FN_t + IDsw_t\right)}{\sum_t GT_t}$$

where $GT_t$ denotes the number of ground-truth annotations in frame $t$.

IDF1 Score: It is defined as the ratio of correctly identified players over the average number of ground-truth and computed identities:

$$IDF1 = \frac{2\,TP_{id}}{2\,TP_{id} + FP_{id} + FN_{id}}$$

where $TP_{id}$, $FP_{id}$, and $FN_{id}$ are the true positive, false positive, and false negative player identities, respectively. Equivalently, the IDF1 score is the harmonic mean of ID Precision and ID Recall.
The FPs and FNs used to calculate MOTA rely solely on the detector's quality. Even if the tracker consistently associates players, MOTA will be skewed by high FP and FN counts, since detection errors contribute two of its three error terms. Since this does not give a clear picture of the tracker's association capabilities, we focus on the IDsw and IDF1 scores as the key metrics for player tracking. These metrics are especially relevant in ice hockey, since they measure how consistently a player is tracked with respect to their original identity. Therefore, our primary objective is a lower IDsw (↓) and a higher IDF1 score (↑) for consistent player tracking. Our preferred evaluations are based directly on ground-truth annotations; however, to be consistent with [19], we also report results on similar detection outputs.
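For reference, both metrics can be computed from aggregate counts with the standard definitions, as in the minimal sketch below. The function names are hypothetical, and the counts are assumed to be summed over all frames of a sequence beforehand; computing the ID-level counts themselves requires a global identity matching that is not shown here.

```python
def mota(fp, fn, idsw, num_gt):
    """MOTA = 1 - (FP + FN + IDsw) / GT, with all counts summed over the sequence."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fp + fn + idsw) / num_gt


def idf1(tp_id, fp_id, fn_id):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * tp_id + fp_id + fn_id
    return 2 * tp_id / denom if denom else 0.0


def id_precision_recall(tp_id, fp_id, fn_id):
    """ID precision and ID recall, whose harmonic mean equals IDF1."""
    precision = tp_id / (tp_id + fp_id) if (tp_id + fp_id) else 0.0
    recall = tp_id / (tp_id + fn_id) if (tp_id + fn_id) else 0.0
    return precision, recall
```

Holding IDsw fixed, any increase in detector FPs or FNs lowers MOTA while leaving the association quality untouched, which is why the text emphasizes IDsw and IDF1.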

D. Results
We report our model's performance on the test sets of the broadcast ice hockey dataset [19] and the public VIP-HTD [51]. Note that we reproduce the results of the benchmark [19] under the same hardware and testing conditions as our own method's evaluations, to avoid any discrepancies. As Table II shows, in the ground-truth evaluations our model outperforms the SOTA model by a large margin: a 23.3% increase in IDF1 score and a 10× reduction in IDsw. This is due to the ability of our model to handle the heavy occlusions and blurs prevalent in these videos. We see similar trends with the F-RCNN [25] detection inputs, where our model surpasses all methods with an 8.4% increase in IDF1. Our MOT Accuracy is still higher than that of all other methods despite incurring more FPs and FNs, owing to the ability of our tracker to recover robustly from mistakes made by the detector. We show qualitative results for all 7 videos below the reference section. In Table III, we cross-validate our model on VIP-HTD [51], showing a clear superiority in both the IDsw and IDF1 metrics compared to [19]. This indicates two important things: (i) our model generalizes well to unseen hockey feeds despite varying environmental conditions; and (ii) our model's performance is not affected by the increase in frame rate.

V. CONCLUSION
We present a novel approach that combines graph neural networks and homography to effectively track ice hockey players in broadcast feeds. We project player footpoints to an overhead rink template to maintain consistent positional cues, especially during occlusions and blur. This provides a pseudo 'top-view' effect to disentangle overlapping players and maintain their trajectories. A message passing network (MPN) is used to aggregate player features and model their temporal relationships, followed by a classifier to predict player association probabilities. We achieve a significant increase in IDF1 and decrease in IDsw compared to the current SOTA benchmark, on both its dataset and a public tracking dataset. We believe that our work can also benefit various other sports in the future.

From Table IV, it can be observed that using only appearance features underperforms by a noticeable margin in each listed sequence, while augmenting them with homography cues boosts the IDF1 score and reduces the number of IDsw incurred. This empirically validates our core hypothesis that homography cues improve tracking consistency. The number of message passing steps $L$, analogous to the receptive field of a CNN, performs best at $L = 6$ in our model. $L < 6$ is inadequate to propagate information across neighboring nodes and edges, leading to reduced capability (as evident from the higher IDsw), while $L > 6$ leads to possible overfitting on the training data. Thus, $L$ should be treated as a hyperparameter that is best chosen empirically.

Video-wise quantitative results
Table VI presents a detailed summary of our test results on the 13 test videos in the broadcast ice hockey dataset [19]. Our model achieves SOTA results on all sequences despite occlusions and camera movements.

Qualitative results
We show qualitative results for our tracker on the 7-clip test set from VIP-HTD [51]. It can be observed that our tracker generalizes well to unseen, out-of-distribution (OOD) data, incurring few IDsw and a high IDF1 score (Ref. Table III) despite varying lighting conditions, jersey colours, and camera angles. We hope to further test the generalization capacity of our model across sports in the future.

Figure 1. Other trackers vs. our approach. a) During a significant occlusion scenario (t + 2), other trackers incur an ID switch error; b) Our method consistently tracks players before and after occlusion; c) Overhead rink template used for homography projection; d) Player footpoint coordinates mapped to the overhead template. At (t + 2), there is a clear distinction between overlapping players from the top view. This information aids our tracker in maintaining player tracklets.

Figure 2. Proposed approach. a) The general pipeline of our spatio-temporal graph. b) G denotes the three stages of Graph Initialization, MPN, and Classification. c) PP denotes the Post-Processing stage, where we prune, resolve graph violations, and assign player IDs.

Figure 3. Number of message passing steps L vs. number of IDsw incurred.

Figure 4. Qualitative results for the VIP-HTD [51] test set. Our tracker generalizes well to this unseen dataset, incurring negligible ID switches.

Table IV. EFFECT OF HOMOGRAPHY ON IDSW AND IDF1 SCORES FOR THE BROADCAST ICE HOCKEY TEST-SET

Table VI. SEQUENCE-WISE RESULTS ON THE 13 TEST-SET VIDEOS FROM BROADCAST ICE HOCKEY DATASET. WE EVALUATE BOTH METHODS ON GROUND TRUTH ANNOTATIONS. ± DENOTES CURRENT SOTA BENCHMARK RESULTS