Vision-centric Semantic Occupancy Prediction for Autonomous Driving


Here I will first summarize the explosion of research studies during the past year at a high level, and then follow up with a summary of the various technical details. Below is a diagram summarizing the overall development of the work to be reviewed. It is worth noting that the field is still rapidly evolving and has yet to converge to a universally accepted dataset and evaluation metric.

The development timeline of the field of semantic occupancy prediction (source: created by the author)

MonoScene (CVPR 2022), the first vision-input attempt

MonoScene is the first work to reconstruct outdoor scenes using only RGB images as inputs, as opposed to the lidar point clouds that previous studies used. It is a single-camera solution, focusing on the front-camera-only SemanticKITTI dataset.

The architecture of MonoScene (source: MonoScene)

The paper proposes many ideas, but only one design choice seems critical: FLoSP (Feature Line of Sight Projection). This idea is similar to feature propagation along the line of sight, also adopted by OFT (BMVC 2019) and Lift-Splat-Shoot (ECCV 2020). Other novelties such as the Context Relation Prior and the novel losses inspired by directly optimizing the metrics seem not that useful, according to the ablation study.
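To make the line-of-sight idea concrete, here is a minimal sketch of FLoSP-style feature lifting. The function name, shapes, and nearest-pixel gather are my assumptions for illustration, not the official MonoScene implementation: each 3D voxel center is projected into the image with the camera intrinsics, and the 2D feature at that pixel is gathered, so all voxels along one ray share the same image feature.

```python
import numpy as np

def flosp_lift(feat_2d, voxel_centers, K):
    """Hypothetical FLoSP-style lifting.
    feat_2d: (H, W, C) image feature map; voxel_centers: (N, 3) points in
    camera coordinates; K: (3, 3) intrinsics. Returns (N, C) voxel features."""
    H, W, C = feat_2d.shape
    # Project voxel centers onto the image plane (pinhole model).
    uvw = voxel_centers @ K.T                    # (N, 3)
    z = uvw[:, 2]
    u = uvw[:, 0] / np.clip(z, 1e-6, None)
    v = uvw[:, 1] / np.clip(z, 1e-6, None)
    # Keep only voxels in front of the camera that land inside the image.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((voxel_centers.shape[0], C), dtype=feat_2d.dtype)
    ui, vi = u[valid].astype(int), v[valid].astype(int)
    out[valid] = feat_2d[vi, ui]                 # nearest-pixel gather
    return out
```

The real implementation samples multi-scale features with bilinear interpolation, but the core mechanism, broadcasting a 2D feature to every voxel along its viewing ray, is the same.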

VoxFormer (CVPR 2023), significantly improved MonoScene

The central insight of VoxFormer is that SOP/SSC has to address two issues simultaneously: scene reconstruction for visible areas and scene hallucination for occluded regions. VoxFormer proposes a reconstruct-and-densify approach. In the first reconstruction stage, the paper lifts RGB pixels to a pseudo-LiDAR point cloud with monocular depth methods, and then voxelizes it into initial query proposals. In the second densification stage, these sparse queries are enhanced with image features and use self-attention for label propagation to generate a dense prediction. VoxFormer significantly outperformed MonoScene on SemanticKITTI and is still a single-camera solution. The image feature enhancement architecture heavily borrows the deformable attention idea from BEVFormer.
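The first stage can be sketched in a few lines. This is an illustrative simplification under assumed shapes and a default grid size, not the released VoxFormer code: a predicted depth map is back-projected into a pseudo-LiDAR point cloud, which is voxelized, and the occupied voxels become the sparse query proposals refined in stage two.

```python
import numpy as np

def depth_to_query_proposals(depth, K, voxel_size=0.5, grid=(128, 128, 16)):
    """Hypothetical stage-1 sketch. depth: (H, W) monocular depth in meters;
    K: (3, 3) intrinsics. Returns a boolean grid marking query-proposal voxels."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    # Back-project every pixel to 3D camera coordinates (pseudo-LiDAR).
    pix = np.stack([u * depth, v * depth, depth], axis=-1).reshape(-1, 3)
    pts = pix @ np.linalg.inv(K).T
    # Voxelize: mark every cell that contains at least one point.
    idx = np.floor(pts / voxel_size).astype(int)
    occ = np.zeros(grid, dtype=bool)
    in_grid = np.all((idx >= 0) & (idx < np.array(grid)), axis=1)
    occ[tuple(idx[in_grid].T)] = True
    return occ
```

Only the `True` cells are turned into queries, which is what keeps the second-stage attention sparse and affordable.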

The architecture of VoxFormer (source: VoxFormer)

TPVFormer (CVPR 2023), the first multi-camera attempt

TPVFormer is the first work to generalize 3D semantic occupancy prediction to a multi-camera setup, extending the idea of SOP/SSC from SemanticKITTI to NuScenes.

The architecture of TPVFormer (source: TPVFormer)

TPVFormer extends the idea of BEV to three orthogonal axes. This allows modeling of 3D space without suppressing any axis while avoiding cubic complexity. Concretely, TPVFormer proposes two attention steps to generate TPV features. First, it uses image cross-attention (ICA) to obtain TPV features. This essentially borrows the idea of BEVFormer and extends it to the other two orthogonal directions to form a TriPlane View feature. Then it uses cross-view hybrid attention (CVHA) to enhance each TPV feature by attending to the other two.
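The memory argument is easiest to see in how a tri-plane representation decodes a single 3D point. The sketch below uses an assumed plane layout and a nearest-cell lookup rather than the official TPVFormer sampling code: the point is projected onto the three orthogonal planes and the three features are summed, so storage is O(HW + DH + WD) instead of O(HWD) for a full voxel grid.

```python
import numpy as np

def tpv_feature(point, plane_hw, plane_dh, plane_wd):
    """Hypothetical TPV decode. point: (x, y, z) integer grid index;
    the planes are (H, W, C), (D, H, C) and (W, D, C) feature maps.
    Returns the (C,) fused feature for that 3D location."""
    x, y, z = point
    # Drop one axis per plane and sum the three projected features.
    return plane_hw[y, x] + plane_dh[z, y] + plane_wd[x, z]
```

The real model uses bilinear interpolation for continuous coordinates, but the sum-of-three-projections structure is the defining design choice.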

The prediction is denser than the supervision in TPVFormer, but still has gaps and holes (source: TPVFormer)

TPVFormer uses supervision from sparse lidar points from the vanilla NuScenes dataset, without any multiframe densification or reconstruction. It claimed that the model can predict denser and more consistent volumetric occupancy for all voxels at inference time, despite the sparse supervision at training time. However, the denser prediction is still not as dense as that of later studies such as SurroundOcc, which uses a densified NuScenes dataset.

SurroundOcc (Arxiv 2023/03) and OpenOccupancy (Arxiv 2023/03), the first attempts at dense label supervision

SurroundOcc argues that dense prediction requires dense labels. The paper successfully demonstrated that denser labels can significantly improve the performance of previous methods, such as TPVFormer, by about 3x. Its most significant contribution is a pipeline for generating dense occupancy ground truth without the need for costly human annotation.

GT generation pipeline of SurroundOcc (source: SurroundOcc)

The generation of dense occupancy labels involves two steps: multiframe data aggregation and densification. First, multi-frame lidar points of dynamic objects and static scenes are stitched separately. The accumulated data is denser than a single-frame measurement, but it still has many holes and requires further densification. The densification is performed by Poisson Surface Reconstruction of a triangular mesh, with Nearest Neighbor (NN) search to propagate the labels to newly filled voxels.
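The NN label-propagation step is simple enough to sketch directly. The function below is an assumed, brute-force illustration (real pipelines would use a KD-tree), not the released SurroundOcc code: each voxel filled by the surface reconstruction inherits the semantic label of the nearest labeled aggregated lidar point.

```python
import numpy as np

def propagate_labels(labeled_pts, labels, new_voxel_centers):
    """Hypothetical NN propagation. labeled_pts: (N, 3) aggregated lidar
    points with (N,) semantic labels; new_voxel_centers: (M, 3) centers of
    voxels filled by densification. Returns (M,) propagated labels."""
    # Squared distance from every new voxel center to every labeled point.
    d2 = np.sum((new_voxel_centers[:, None, :] - labeled_pts[None, :, :]) ** 2,
                axis=-1)
    # Copy the label of the nearest labeled point.
    return labels[np.argmin(d2, axis=1)]
```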

OpenOccupancy is contemporary to and similar in spirit to SurroundOcc. Like SurroundOcc, OpenOccupancy also uses a pipeline that first aggregates multiframe lidar measurements for dynamic objects and static scenes separately. For further densification, instead of the Poisson Reconstruction adopted by SurroundOcc, OpenOccupancy uses an Augment-and-Purify (AAP) approach. Concretely, a baseline model is trained with the aggregated raw label, and its prediction result is fused with the original label to generate a denser label (aka "augment"). The denser label is roughly 2x denser and is manually refined by human labelers (aka "purify"). A total of 4000 human hours were invested to refine the labels for nuScenes, roughly 4 human hours per 20-second clip.

The architecture of SurroundOcc (source: SurroundOcc)
The architecture of CONet (source: OpenOccupancy)

Compared to the contribution of introducing the dense label generation pipeline, the network architectures of SurroundOcc and OpenOccupancy are not as innovative. SurroundOcc is largely based on BEVFormer, with a coarse-to-fine step to enhance 3D features. OpenOccupancy proposes CONet (cascaded occupancy network), which uses an approach similar to that of Lift-Splat-Shoot to lift 2D features to 3D and then enhances the 3D features through a cascaded scheme.

Occ3D (Arxiv 2023/04), the first attempt at occlusion reasoning

Occ3D also proposed a pipeline to generate dense occupancy labels, which includes point cloud aggregation, point labeling, and occlusion handling. It is the first paper that explicitly handles the visibility and occlusion reasoning of the dense label. Visibility and occlusion reasoning are critically important for the onboard deployment of SOP models. Special attention to occlusion and visibility is indispensable during training to avoid false positives from over-hallucination about the unobservable scene.

It is noteworthy that lidar visibility is different from camera visibility. Lidar visibility describes the completeness of the dense label, as some voxels are not observable even after multiframe data aggregation, and it is consistent across the entire sequence. Meanwhile, camera visibility concerns what the onboard sensors could possibly detect without hallucination, and it differs at each timestamp. Evaluation is only performed on the voxels "visible" in both the LiDAR and camera views.
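Visibility-masked evaluation can be sketched as follows, in the spirit of Occ3D. The mask names and the flat per-voxel layout are assumptions for illustration: per-class IoU is accumulated only over voxels marked observable in both visibility masks, so the model is neither penalized nor rewarded for what it predicts in unobservable space.

```python
import numpy as np

def masked_miou(pred, gt, lidar_vis, cam_vis, num_classes):
    """pred, gt: integer class labels per voxel; lidar_vis, cam_vis: boolean
    visibility masks of the same shape. Returns mIoU over the visible voxels."""
    m = lidar_vis & cam_vis            # evaluate only jointly visible voxels
    p, g = pred[m], gt[m]
    ious = []
    for c in range(num_classes):
        inter = np.sum((p == c) & (g == c))
        union = np.sum((p == c) | (g == c))
        if union > 0:                  # skip classes absent from this sample
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```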

In the generation of dense labels, Occ3D only relies on multiframe data aggregation and does not have the second densification stage as in SurroundOcc and OpenOccupancy. The authors claimed that for the Waymo dataset, the label is already quite dense without densification. For nuScenes, although the annotation still has holes after point cloud aggregation, Poisson Reconstruction leads to inaccurate results, so no densification step is performed. Maybe the Augment-and-Purify approach of OpenOccupancy is more applicable in this setting.

The architecture of CTF-Occ in Occ3D (source: Occ3D)

Occ3D also proposed a neural network architecture, Coarse-to-Fine Occupancy (CTF-Occ). The coarse-to-fine idea is largely the same as that in OpenOccupancy and SurroundOcc. CTF-Occ proposed incremental token selection to reduce the computation burden. It also proposed an implicit decoder to output the semantic label of any given point, similar to the idea of Occupancy Networks.

The Semantic Occupancy Prediction studies reviewed above are summarized in the following table, in terms of network architecture, training losses, evaluation metrics, and detection range and resolution.

Network Architecture

Most of the studies are based on proven state-of-the-art methods for BEV perception, such as BEVFormer and Lift, Splat, Shoot. The architecture can be roughly divided into two stages: 2D-to-3D feature lifting and 3D feature enhancement. See the above table for a more detailed summary. The architecture seems to have largely converged. What matters most is the dense occupancy annotation generation pipeline and dense supervision during training.

Below is a summary of the autolabel pipelines used to generate dense occupancy labels in SurroundOcc, OpenOccupancy, and Occ3D.

Summary of the dense label pipelines in SurroundOcc, OpenOccupancy, and Occ3D (source: created by the author)

Training Loss

The semantic occupancy prediction task is very similar to semantic segmentation, in that SOP has to predict one semantic label for each voxel in 3D space, while semantic segmentation has to predict one semantic label for each measurement point, be it a pixel per image or a 3D point per lidar scan. The main losses for semantic segmentation have been cross-entropy loss and Lovasz loss. The Lovasz extension enables direct optimization of the mean intersection-over-union (IoU) metric in neural networks.
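To illustrate what "directly optimizing IoU" means, here is a minimal soft-IoU (Jaccard) surrogate. This is a deliberate simplification of the idea, not the actual Lovasz-Softmax extension used by the papers, which relies on a convex surrogate of the sorted error vector rather than this naive relaxation.

```python
import numpy as np

def soft_iou_loss(probs, onehot, eps=1e-6):
    """probs: (N, C) predicted class probabilities per voxel; onehot: (N, C)
    one-hot ground truth. Returns 1 minus the mean soft IoU over classes."""
    inter = np.sum(probs * onehot, axis=0)                      # soft TP
    union = np.sum(probs + onehot - probs * onehot, axis=0)     # soft union
    return float(np.mean(1.0 - inter / (union + eps)))
```

Unlike cross-entropy, this loss rewards overlap per class, which is exactly what the mIoU metric measures, at the cost of noisier gradients for rare classes.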

Perhaps inspired by Lovasz, MonoScene proposed several other losses that can directly optimize evaluation metrics. However, they seem esoteric and are not fully supported by the ablation studies.

Evaluation Metrics

The primary metrics are IoU for geometric occupancy prediction (whether a voxel is occupied) and mIoU (mean IoU) for semantic classification (which class an occupied voxel belongs to). These metrics may be inadequate for industrial applications.

The vision-based SOP task needs to mature for industrial use if it is to replace lidar. Although both precision and recall matter, as captured by the IoU metric, precision is more important for ADAS (Advanced Driver Assistance Systems) applications in order to avoid phantom braking, as long as we still have a driver behind the wheel.

Detection Range and Resolution

All current works predict 50 meters around the ego vehicle. Voxel resolution varies from 0.2 m for SemanticKITTI to 0.4 m or 0.5 m for the NuScenes and Waymo datasets. This is a good starting point, but perhaps still inadequate for industrial applications.

A more reasonable resolution and range may be 0.2 m for the range within 50 m, and 0.4 m for the range between 50 m and 100 m.
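A quick back-of-the-envelope calculation shows why the two-resolution scheme is attractive. The height range of 8 m is my assumption; the rest follows the numbers above: the mixed grid needs roughly a third of the voxels of a uniform 0.2 m grid out to 100 m.

```python
def voxel_count(xy_extent, voxel, height=8.0):
    """Voxels in a square footprint of (2 * xy_extent)^2 meters times an
    assumed 8 m height range, at the given voxel size in meters."""
    n_xy = round(2 * xy_extent / voxel)
    return n_xy * n_xy * round(height / voxel)

inner = voxel_count(50.0, 0.2)                             # 0.2 m grid, 0-50 m
outer = voxel_count(100.0, 0.4) - voxel_count(50.0, 0.4)   # 0.4 m ring, 50-100 m
uniform = voxel_count(100.0, 0.2)                          # 0.2 m everywhere
# inner + outer = 13.75M voxels vs. uniform = 40M voxels
```

The mixed scheme totals about 13.75M voxels against 40M for the uniform fine grid, which matters for both label storage and onboard memory.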

There are two tasks related to SOP, surround depth map prediction and lidar semantic segmentation, which we will briefly review below.

The surround depth map prediction task (tackled by methods such as FSM and SurroundDepth) extends monocular depth prediction and leverages the consistency in overlapping camera fields of view to further improve performance. It focuses more on the measurement side by giving every pixel in the images a depth value (bottom-up), while SOP focuses more on the application side in the BEV space (top-down). The same analogy holds between Lift-Splat-Shoot and BEVFormer for BEV perception, where the former is a bottom-up approach and the latter is top-down.

Lidar semantic segmentation focuses on assigning each point in a lidar scan a semantic class label. Real-world sensing in 3D is inherently sparse and incomplete. For holistic semantic understanding, it is insufficient to only parse the sparse measurements while ignoring the unobserved scene structures.

  • The neural network architecture in semantic occupancy prediction seems to have largely converged. What matters most is the autolabel pipeline to generate dense occupancy labels and dense supervision during training.
  • The detection range and voxel resolution adopted by current common datasets would be inadequate for industrial applications. We need more detection range (e.g., 100 m) at a finer resolution (e.g., 0.2 m).
  • The current evaluation metrics would be inadequate for industrial applications as well. Precision is more important than recall for ADAS applications to avoid frequent phantom braking.
  • Future directions of semantic occupancy prediction may include scene flow estimation. This would aid the prediction of future trajectories of unknown obstacles and collision avoidance during trajectory planning for the ego vehicle.

Note: All images in this blog post are either created by the author or taken from publicly available academic papers. See captions for details.
