Change of Scenery: Unsupervised LiDAR Change Detection for Mobile Robots

This paper presents a fully unsupervised deep change detection approach for mobile robots with 3D LiDAR. In unstructured environments, it is infeasible to define a closed set of semantic classes. Instead, semantic segmentation is reformulated as binary change detection. We develop a neural network, RangeNetCD, that uses an existing point-cloud map and a live LiDAR scan to detect scene changes with respect to the map. Using a novel loss function, existing point-cloud semantic segmentation networks can be trained to perform change detection without any labels or assumptions about local semantics. We demonstrate the performance of this approach on data from challenging terrains; mean intersection over union (mIoU) scores range between 67.4% and 82.2% depending on the amount of environmental structure. This outperforms the geometric baseline used in all experiments. The neural network runs faster than 10Hz and is integrated into a robot's autonomy stack to allow safe navigation around obstacles that intersect the planned path. In addition, a novel method for the rapid automated acquisition of per-point ground-truth labels is described. Covering changed parts of the scene with retroreflective materials and applying a threshold filter to the intensity channel of the LiDAR allows for quantitative evaluation of the change detector.


I. INTRODUCTION
While significant progress has been made towards robot navigation in unstructured environments, challenges with robust perception and terrain assessment remain [1].These can be exacerbated by a domain gap where methods designed for one location do not transfer reliably to another [2].On-road autonomous driving benefits from clear rules and common types of objects that assist with risk assessment and detection [3].Using extremely large, labelled datasets such as the Waymo Open Dataset [4] or SemanticKITTI [5] have allowed deep 3D networks to progress rapidly in the semantic and instance segmentation of known classes [6], [7].
This paper aims to solve a broader problem.We envision a situation where a robot can safely navigate autonomously after an initial mapping step.The map could be generated through manual exploration or prior experience from other systems.We argue that a map is a stronger prior for the types of features and obstacles that may be encountered than a predefined set of classes.It is desirable to perform per-point segmentation because bounding box regression implicitly imposes classrelated sizes.Based on these conditions, we formulate the problem as binary change detection; points in the live scan are labelled as either changed or consistent with respect to the All authors are with the University of Toronto Robotics Institute.Fig. 1.The Clearpath Warthog UGV with the Ouster OS-1 LiDAR, driving a previously taught path.A section of the path is now blocked and the change detection algorithm proposed in this paper will be used to allow the robot to safely navigate around the obstruction.The mannequin is wearing the retroreflective suit that is used for dataset generation and evaluation.map.A conceptualization of the desired behaviour is provided in Figure 1.Classical change-detection methods use nearestneighbour distances [8], normal distances [9], or ray tracing [10] to classify changes.However, they are most effective indoors and are less accurate in unstructured environments [11].Vegetation is often dynamic on small length scales leading to variable scans with small changes irrelevant to planning.We hypothesize that a deep neural network can filter these planning-irrelevant changes to detect higher-quality clusters of changed points.Rather than tackle the full scope of terrain traversability assessment [12], all changes are treated as a threat and are avoided by the path planner.In the absence of a change, regions where the robot has driven before are considered traversable.
A novel loss formulation is used to provide an unsupervised training signal.This loss leverages inductive biases based on the amount of change, distance of changes from the existing map, and temporal permanence of objects.
These factors can be represented as a standard optimization problem.However, finding the optimal solution is computationally intractable.Our approach uses a neural network to learn a good heuristic for the problem instead.An unsupervised network training process can be used to learn an initial, general network, which can be fine-tuned to new environments using data collected from the robot itself.
Once the network is trained, it is integrated into an existing autonomy stack to detect obstacles on a Clearpath Warthog unmanned ground vehicle (UGV) [13].The robot can avoid obstacles reliably in real time.
In summary, we make the following contributions: • an unsupervised deep LiDAR change-detection method for use on mobile robots, • a loss formulation to train neural networks in an unsupervised manner, • a rapid evaluation process for per-point semantic labelling, • a change-detection neural network running in closed-loop on an unmanned ground vehicle.The corpus of LiDAR datasets that contain multiple trajectories through the same environment is limited.A custom dataset with a combination of on-road and off-road driving is created and used for evaluation.Figure 2 shows the satellite view of the dataset paths.

II. RELATED WORK A. Change Detection
Change detection in point clouds has been well studied in both mobile robotics [14] and remote sensing [15].Classical change detection is primarily based on distances between two scans at different times.Some works [16], [10] use the labels Changed and Unchanged, which carry the same meaning as Changed and Consistent in this work.Our notation emphasizes that the classifications are not absolute and that points are typically consistent with mapping evidence, not identical.The simplest method evaluates the distance from each point in the first point cloud to the closest point in the second [8].Distant points are classified as changed if they exceed a threshold value.Wu [11] proposed a Gaussian roughness model of a local neighbourhood to set a dynamic threshold.Alternatively, Underwood et al. [10] use ray tracing to determine which parts of the scene have changed.Any points along a ray that are closer than a previous scan are considered changed.A common Fig. 2. The dataset trajectories and sample views.In total, the dataset is 8.5 km long with 1.75 km of unique paths.
challenge with these methods is that occluded regions are implicitly assumed to be free space, leading to false positives.Voelsen et al. [17] combine these approaches and use region growing to create instance clusters.However, these clusters require a pre-defined set of classes for the changes.
More recently, change detection has been approached using deep learning.There is a focus on airborne LiDARs that detect the long-term change of large geographic areas [18], [19].Datasets such as SHREC 2023 [20] contain matched pairs of large point clouds for change detection but do not contain trajectory information.De Gelis et al. [21] present SiameseKPConv: a supervised semantic segmentation network that detects six classes of changes on 3D point clouds of cities. Twin encoders extract features from the map and livescan point clouds.Once encoded, the feature-space difference of every live point to its nearest map neighbour is used for classification.A follow-up work uses the same network architecture but with self-supervised training [16].They propose three losses: a contrastive loss, a deep clustering loss, and a temporal consistency loss.The contrastive loss pushes points classified as changed to have different embeddings than those classified as consistent.By clustering the predictions, some misclassifications are corrected to improve pseudo-labels for the next epoch.Finally, the temporal consistency loss encourages predictions to be consistent over time, imposing the bias that the number of changed points should be small.This work focuses on large-scale point clouds and does not consider the local trajectory of a mobile system.
Deep learning for change detection on mobile robots is beginning to take shape.Zhao et al. [22] perform supervised change detection on an indoor robot.They use four classes of changes: unchanged, dynamic, structural change, and temporary change.They show that a range image can be used to make efficient predictions about changes in the environment.To the best of our knowledge, there are no unsupervised deep LiDAR change detection systems for mobile robots.

B. Map Refinement
The construction of the point-cloud map impacts changedetection algorithms.Ideally, maps contain only static structures.Map refinement is the process of updating maps, removing dynamic objects, and rejecting outliers [23].Pioneering works, Erasor [24] and PeopleRemover [25], use voxelized pseudo-occupancy-grid maps to estimate whether a region is free at a given time.These methods run in batches, allowing information from future scans to improve the detection at every frame.Points that exist within voxels that are often empty are removed from the map.Additionally, Erasor2 [26] uses instance segmentation to reduce the number of misclassifications at contact regions.Maintenance can be achieved by using multiple passes through the same environment and voting.
An alternative approach to creating static maps detects dynamic objects while the robot is moving and never adds them to the map.Pomerleau et al. [9] use the visibility of the map points and estimate every normal vector to determine whether a point is dynamic or static.Static points are added to the SLAM problem and dynamic points are rejected.Yoon [27] applies a combination of free-space checks with region growth to detect class-free dynamic points.Deep learning has also been applied to moving-object segmentation.SLIM [28] is a self-supervised moving-object segmentation network that operates on point clouds directly.RVMOS [29] uses sequences of range images and temporal augmentation to train effective supervised classifiers.Moving-object segmentation alone is insufficient for long-term autonomy because dangerous changes may be static during traversal.

C. Teach and Repeat Framework
The methods presented in this paper apply to any system that localizes against a point-cloud map.For the purposes of evaluation, we integrate the detector into the Visual Teach and Repeat 3 (VT&R3) framework 1 .VT&R was designed for stereo cameras and allows robots to precisely localize and repeat a path taught by a human operator [30].Wu [11] and Sehn [31] modified the VT&R3 framework to use LiDAR for odometry and localization.During the teaching process, iterative-closest-point (ICP) [32] odometry determines the motion of the robot.These frames are collected into a sequence of topologically connected submaps.A relative transformation between each submap is stored for later localization.The repeat phase involves autonomous navigation to any location within the network of paths.By relaxing the path-following constraint to a corridor [31], the robot is given the flexibility to navigate around obstacles on the path.A local cost map is required by the existing planner within the corridor around the robot.The proposed method detects changes and treats their locations as unsafe to drive.

A. Problem Definition
We let the problem of LiDAR change detection be one of binary classification.Given two point clouds in a common reference frame, the map M and current scan S, every point s i ∈ S is assigned a binary label (l i ) of Changed or Consistent.Let the number of points in S be n.Consistent is loosely defined as a point that could be in M under a slightly different mapping process. 1 The framework is available at: github.com/utiasASRL/vtr3 To solve the problem without ground-truth labels, a combinatorial optimization problem is defined over a sequential set of two point-cloud pairs ({M 0 , S 0 }, {M 1 , S 1 }).The first live scan is recorded at time t 0 .Consider t 1 = t 0 + ∆t where ∆t is a hyperparameter of the system.All four point clouds are transformed into a common frame.The optimization has three terms: a chamfer loss [33] between the map and the Consistent live scan, a class-balance loss that penalizes the number of points that are considered to be Changed, and a temporalconsistency loss that penalizes labels that change between the sequential scans.The total loss function is defined as where λ 1 and λ 2 are hyperparameters to be tuned.In theory, the solution to this problem does not require machine learning.For a fixed set of maps and live-scans, a gradient descent method could be iterated to find a local minimum to the problem for each frame starting from random labels.However, with thousands of points, it is intractable to solve as a classical optimization.To respect the run time requirements of a 10 Hz LiDAR scanner, a neural network is used to learn a fast heuristic solution to the optimization problem.To make the loss functions differentiable, the softmax probability of being classified as Changed is used instead of the actual prediction.Next, each of the terms in (1) is explained in detail.
1) Chamfer Loss: The chamfer loss uses the nearestneighbour Euclidean distance between points in live scan (s i ∈ S) and the map (m j ∈ M ).It is weighted by the likelihood that the given point belongs to the Consistent class (l i = 0): If the weighting of the chamfer loss is large, all points that do not exactly match the map will be classified as Changed.
The chamfer loss pushes points to be classified as Changed.
2) Class-Balance Loss: The class-balance loss is the summation of the likelihood of a point being classified as Changed (l i = 1): This loss pushes all points to be classified as Consistent.This loss opposes the chamfer loss.
3) Temporal-Consistency Loss: The temporal-consistency loss uses a bidirectional chamfer loss between points in the two consecutive live scans, S 0 and S 1 , that are classified as Changed.l ki is the label of point s i in point cloud S k : This loss acts as a form of outlier rejection and encourages predictions to be consistent through time.This loss will push the network to classify all points as Consistent because an empty obstacle class has a loss of zero.

B. Deep Network Architecture
The novelty of this paper lies in the training method rather than the network architecture used to process the 3D points.In Figure 3, the neural network could be any 3D architecture that can be modified to accept pairs of maps and live scans.A convolutional neural network using range images as the input was chosen in this paper for its fast inference time [34].Speed is critical to the autonomous operation of the UGV.Preliminary experiments were also performed on a 3D architecture based on KPConv [35] but the run time was too long.
a) Range Image Generation: The inputs to the network are aligned range images from the map and the live scan.These images are rendered by transforming the live scan and local map 3D point clouds into a common spherical coordinate system centred on the LiDAR.Each range image pixel, (u i , v i ), contains the value r i = x 2 i + y 2 i + z 2 i , to the closest point in the frustum.Dimensions H × W = 64 × 1024 are used over a 25 • ×360 • field of view (fov).The lower limit of the fov below the horizon (fov down ) is a sensor property.An indexed mapping between the range image and point cloud is stored to reassign labels in the image to the corresponding 3D points.Given a point (x i , y i , z i ) ∈ R 3 the following image mapping R 3 − → R 2 is used: b) Neural Network Architecture: The range image network (RangeNetCD) is based on RangeNet++ [34] but modified to accept both a map and live scan.The input has a dimension of H × W × 2 with the live scan and map on each channel.The network encoder consists of four double convolutions followed by maxpooling.The decoder uses bilinear upsampling and skip connections from the encoder layers for per-pixel prediction.We observed that effective classification requires non-square convolution kernels on the first and last layers by experimenting with different shapes.The kernel in the first and last layer has dimension 1 × 2.

IV. EXPERIMENTS
Experimental evaluation occurred in forested and off-road areas around the University of Toronto Institute for Aerospace Studies (Figure 2).A Clearpath Warthog UGV [13] equipped with an Ouster-OS1 LiDAR [36] was used for experiments.LiDAR ICP localization [32] aligned the live scan with the corresponding map prior to performing any neural-network inference.The horizon of the local planner is 10 m.Accordingly, detections are limited to this range around the robot.Empirically, it was observed that the sensor-aligned frame with the origin on the ground below the LiDAR scanner performed the best.This may occur because the nearby planar ground points are projected into the same pixel frustum.

A. Retroreflective Semantic Evaluation
While the method presented in this paper is fully unsupervised, for quantitative evaluation of the results, groundtruth labels are required.For change detection, an offline raytracing approach similar to Thomas et al. [37] could classify the points corresponding to changes of interest.However, these ray-tracing approaches are challenged in foliage [11] and are computationally expensive.Instead, we propose using a rapid technique for implicitly labelling the data.
During dataset evaluation, all objects introduced into the scene are coated in retroreflective material.Figure 4 highlights the difference in intensity between a reflective suit, regular clothes, and the background.The intensity channel of LiDAR is unique to each distinct sensor and not standardized between different models.For this reason, we discard the intensity values and render the intensity channel out-of-band.However, the intensity can be used for automatic labelling: a threshold filter reliably classifies the points that belong to reflective objects.
The effectiveness of this approach requires a controlled environment without ambient dynamic actors, so public scenarios are unsuitable.However, for many cases when the rapid labelling of an object is required, this method is highly effective.While we use pedestrians with reflective suits in our

B. Results
The network is trained and evaluated on a collection of four different mapping and repeating sequences.In total, the raw dataset consists of 50, 000 frames and the robot traverses 8.5 km over 1.75 km of unique paths.A LiDAR frame is stored every 30 cm of driving rather than every 0.1 s, which reduces the total frames used to 20, 568.This prevents biased training due to interruptions in the data collection when the robot stops moving.The four sequences are traced out in Figure 2. The longest sequence is the Parking Loop, which is uncontrolled and contains moving vehicles, bicycles, and pedestrians.The other three sequences were recorded with reflective obstacles for evaluation.Table I describes the details of each run of the dataset.
For comparison, a classical nearest-neighbour baseline [8] is used as an alternative to unsupervised learning.Table II, Table III, and Table IV provide the intersection-over-union (IoU) score of the Changed class as well as the mean IoU (mIoU).Note that, because the labels are highly imbalanced, metrics evaluated on the Changed class are the basis of comparison.We adopt an additional planning-oriented metric to represent the ability of the robot to perform path planning.The IoU is re-evaluated within the known planning corridor around the robot, which is 5 m wide for the Warthog running Teach and Repeat.The results in Tables I-III are evaluated with map voxel size of 0.2 m and live-scan voxel size 0.05 m.The loss function hyperparameters during training were λ 1 = 15 and λ 2 = 1.0.
On the Grass Yard (Table II), the corridor IoU ch is 16.9% higher than the IoU ch evaluated on the entire scan for RangeNetCD.This highlights the practical effectiveness of the  system because false positives on static structures outside the corridor of the planner do not affect the trajectory.We find that pre-training a general network followed by environment-specific fine-tuning is the most effective training approach.To pre-train, the unlabelled and uncontrolled data from the Parking Loop is used.Finetuning occurs on a specific sequence with a lower learning rate in the optimizer.In Figure 5, pre-training improves the initial performance to 68%.While the general network is capable in all of the environments tested, fine-tuning with more specific examples boosts detection accuracy further.An avenue of future work is to fine-tune live on the robot once it has been deployed in operation.The detection quality is suitable for autonomous navigation.
C. Ablation Studies a) Scan and Map Density: The OS-1 LiDAR used for these experiments produces 1.3 million points per second [36].Using every point is computationally impractical for ICP odometry and localization [11].Additionally, storing every point in the map is prohibitively expensive.For this reason, the map is voxel downsampled.The voxel sizes of the map and live scan impact the accuracy of change detection.VI shows the impact of each loss function on the performance for the 0.2 m and 0.05 m map and live-scan voxelizations of the Grass Yard.As expected, if the class loss is removed, all points are classified as Changed.This occurs because the trivial optimal solution classifies all points as Changed.Conversely, if only the class loss is used, all points are classified as Consistent.The correct balance of the chamfer and class losses accounts for the basic capability of the system.Adding the temporal loss improves the performance by penalizing transient points, reducing the number of false-positive detections.

D. Qualitative Evaluation
Several interesting phenomena are observed in the output of the network.First, this change-detection approach is capable of generalizing to new types of objects.In Figure 6, a cyclist is accurately detected biking past the robot.No cyclists were included in the training data.While RangeNetCD is designed to allow for the detection of any changes in the scene, it is possible to overfit the network to shapes seen during training.For example, when the network was trained on data that changed only by introducing pedestrians, it failed to detect new vehicles at test time.This was solved by using more, diverse training data.With extensive datasets, it is interesting to consider if these objects could be differentiated in the feature space.We observe that if training sequences contain changes in every frame, the network forces detections in every frame.The most false positives appear when the correct solution is no change at all.
Change-detection performance is limited by the map quality.The map is assumed to capture only the permanent environment but errors appear when it does not.Using the uncontrolled parking lot data, a map was generated containing cars that had moved before later runs were recorded.This led to two types of misdetections.Points on static objects that were previously occluded are false positives.For example, a fire hydrant behind a parked car was later detected as a change even though it was always present.Points in the same location as an object that moved are false negatives.For example, a pedestrian walking through an empty parking space where a car was located during mapping is not detected even though it is a change.These issues rarely impact the planner, because the taught path will not exist inside obstacles in the map.Significant issues with the mapping process can destabilize network training because patterns are less consistent.These errors were eliminated by generating the map when the parking lots were empty.Applying a map-cleaning approach offline would be beneficial as well.

E. Closed-Loop Performance
The range-image network was tested on a laptop NVIDIA RTX A4500 GPU located on the robot.Inference takes 14.9 ± 2.3 ms.Most of this time is spent transferring the tensors between CPU and GPU memory.More optimal pipelines may exist that reduce the amount of transfer.RangeNetCD was exported to TorchScript and executed in C++ as part of the existing LiDAR Teach and Repeat [11] pipeline.
Points detected as Changed are passed to a cost map inflation module that projects them into 2D and accounts for the robot's radius.The planner maintains a queue of cost maps to overcome the LiDAR's close-range blind spot.These cost maps allow the robot to avoid local obstacles.The supplementary video2 contains sequences of the robot navigating around a series of introduced obstacles.In Figure 7, a mannequin was placed on the path as a change.The robot successfully detects and avoids it and then continues to follow the originally planned path.As a side effect of maintaining a queue of obstacles, dynamic changes cause smearing in the cost map leading to suboptimal routes.As a future extension to this work, a self-supervised prediction layer could be used to extrapolate the motion of changes over a short time horizon.

V. CONCLUSION AND FUTURE WORK
In this paper, we present a novel method for the unsupervised training of a deep network that performs change detection on a pair of point clouds.We show that by exploiting inductive biases related to the amount of change, distance of changes from the map, and temporal consistency of physical scenes, it is possible to train a network.We demonstrate the applicability of the approach on a range-image convolutional neural network that is trained on data collected on an unmanned ground vehicle.Once trained, the network runs quickly and is added to the robot to improve its ability to navigate for long periods.Future work using a pre-trained contrastive encoder for feature extraction will add semantic differences to the training loss.

Fig. 3 .
Fig. 3. Data flow of the training procedure.Only a single map and live-scan pair are used for inference.

Fig. 4 .
Fig. 4. A LiDAR scan of two pedestrians coloured by intensity.Left: A pedestrian wearing regular clothes.Right: A pedestrian wearing the reflective suit.

Fig. 5 .
Fig. 5. Changed IoU vs Distance Traveled Fine-tuning for a map-voxel of 0.3 m and live voxel of 0.05 m.Pre-training improves the performance.

Fig. 6 .
Fig. 6.The input range images and change detection of a cyclist passing by the robot.The two range images are used by the network to determine changes.Part of the live scan is blocked by the robot's structure, this area is purple.In the result, green regions are classified as changed and grey regions are consistent.

Fig. 7 .
Fig. 7. Left: an image view of an obstacle and the path the robot drives to avoid it.Right: The point cloud view of the same scene.

TABLE I STATISTICS
FOR THE DATASET USED FOR TRAINING AND QUANTITATIVE EVALUATION Table V shows the corridor IoU of the Changed class for different

TABLE V CORRIDOR
IOUCH FOR DIFFERENT COMBINATIONS OF THE MAP AND LIVE SCAN VOXEL DOWNSAMPLING ON THE GRASS YARD

TABLE VI ABLATION
STUDY ON LOSS FUNCTIONS EVALUATED ON THE GRASS YARD WITH MAP VOXEL 0.2 M AND LIVE VOXEL 0.05 M