\documentclass[conference]{IEEEtran}
\usepackage{times}
\usepackage{epsfig}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage[font=footnotesize]{caption}
\usepackage{subcaption}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{algorithm}
\usepackage{color}
\definecolor{light-gray}{gray}{0.4}
\usepackage[noend]{algpseudocode}
%\usepackage[noadjust]{cite}
\usepackage[sort&compress,numbers]{natbib} % required by RSS
\pdfminorversion=5
\pdfobjcompresslevel=3
\pdfcompresslevel=9
% list of commenters
\newcommand{\ssnote}[1]{{\xxnote{SS}{red}{#1}}}
% implement conditional notes (turn on/off with \hidenotes above)
\newcommand{\xxnote}[3]{}
\ifx\hidenotes\undefined
\renewcommand{\xxnote}[3]{\color{#2}{#1: #3}}
\fi
% Labels in IEEE format
% Equation
\newcommand{\eref}[1]{(\ref{#1})}
% Section
\newcommand{\sref}[1]{Section~\ref{#1}}
% Figure
\newcommand{\figref}[1]{Fig.~\ref{#1}}
% Table
\newcommand{\tabref}[1]{Table~\ref{#1}}
% Algorithm
\newcommand{\algoref}[1]{Alg.~\ref{#1}}
\newcommand{\naive}{baseline\xspace}
\newcommand{\ie}{\textit{i.e.}\xspace}
\newcommand{\eg}{\textit{e.g.}\xspace}
\newcommand{\etal}{\textit{et al.}\xspace}
\newcommand{\Tango}{\textit{Tango}\xspace}
\newcommand{\TSDF}{TSDF\xspace}
\newif\iffinalcopy
\finalcopyfalse % or
\usepackage{xspace}
\newcommand{\chisel}{\textsc{Chisel}\xspace}
\usepackage{booktabs} % Pretty tabular
\usepackage{multirow} % Cells that span multiple rows
% Include other packages here, before hyperref.
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}
\begin{document}
%%%%%%%%% TITLE
\title{\Large \chisel: Real-Time Large-Scale 3D Reconstruction Onboard a Mobile
Device
\\
Using Spatially-Hashed Signed Distance Fields}
\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}
\algnewcommand{\LineComment}[1]{\State \(\triangleright\)
\textcolor{light-gray}{\textit{#1}}}
%\nipsfinalcopy % Uncomment for camera-ready version
\iffinalcopy
\author
{
Matthew Klingensmith \\
Carnegie Mellon Robotics Institute\\
\texttt{[email protected]}
\and
Ivan Dryanovski \\
The Graduate Center,\\
City University of New York\\
\texttt{[email protected]}
\and
Siddhartha S. Srinivasa \\
Carnegie Mellon Robotics Institute\\
\texttt{[email protected]}
\and
Jizhong Xiao \\
The City College of New York \\
\texttt{[email protected]}
}
\else
\author{\Large{Author names omitted for review. Paper Number 32} \\
\small{\textit{Note to reviewers, a high resolution video is
available at:} \url{http://vimeo.com/117544631}}}
\fi
\maketitle
\thispagestyle{empty}
\begin{abstract}
We describe a system for real-time house-scale (300 square meter or more)
dense 3D reconstruction onboard a Google \Tango
\cite{Tango} mobile device by using a dynamic spatially-hashed truncated
signed distance field \cite{NiessnerHashing} for mapping, and visual-inertial
odometry for localization. By aggressively culling parts of
the scene that do not contain surfaces, we avoid needless computation and wasted
memory. Even under very noisy conditions, we produce high-quality
reconstructions through the use of space carving. We are able to reconstruct and
render very large scenes at a resolution of 2--3~cm in real
time on a mobile device without the use of GPU computing. The user is able to
view and interact with the reconstruction in real-time through an intuitive
interface. We provide both qualitative and quantitative results on publicly
available RGB-D datasets \cite{FREIBURG}, and on datasets collected in
real-time from two devices, as well as a motivating example involving an aerial
robot.
% reality and indoor building scanning applications
% Google's \textit{Project Tango}\cite{Tango} has made integrated depth sensing
% and onboard visual-inertial odometry \cite{VINS} available to mobile devices such
% as phones and tablets. In this work, we explore the problem of dense,
% large-scale 3D realtime reconstruction on a mobile device with integrated
% depth sensing. Solving this problem is a necessary pre-requisite for many indoor navigation, augmented
% reality and indoor building scanning applications. Existing 3D reconstruction approaches on mobile
% devices either only build a sparse reconstruction \cite{KleinSparse}, offload
% their computation to other devices \cite{DTAM}, or require seconds-long
% post-processing per frame \cite{TanskanenMetric}. state-of-the-art approaches
% in large-scale dense reconstruction \cite{Newcombe, Whelan2013} require large
% amounts of memory and high-performance GPU computing, which is often
% unavailable on mobile devices. Our approach accomplishes large-scale
% reconstruction on a mobile device by exploiting visual inertial odometry and by
% using chunked, dynamic spatial hash maps \cite{SpatialHashing} of truncated
% signed distance volumes \cite{Curless1996}, we are able to reconstruct and
% render very large (300 square meter) scenes at a resolution of 2-3cm in real
% time on a mobile device using only the CPU, with a memory footprint that is
% approximately 84 percent smaller than \textit{Kinect Fusion} \cite{Newcombe}.
% We provide both qualitative and quantitative results on publicly available
% RGB-D datasets \cite{FREIBURG}, and on datasets collected in real-time from two
% devices.
\end{abstract}
\section{Introduction}
Recently, mobile phone manufacturers have started adding
high-quality depth and inertial sensors to mobile phones and tablets. In
particular, the devices we use in this work, Google's \Tango \cite{Tango}
phone and tablet, have very small active infrared projection depth sensors
combined with high-performance IMUs and wide field-of-view cameras
(\sref{section:hardware}). Other devices, such as the Occipital Inc.
\textit{Structure Sensor} \cite{StructureSensor}, have similar capabilities.
These devices offer an onboard, fully integrated sensing platform for 3D mapping
and localization, with applications ranging from mobile robots to handheld,
wireless augmented reality.
Real-time 3D reconstruction is a well-known problem in computer vision and
robotics \cite{Hartley2004}. The task is to extract the true 3D geometry of a
real scene from a sequence of noisy sensor readings online. Solutions to this
problem are useful for navigation, mapping, object scanning, and more. The
problem can be broken down into two components: localization (\ie estimating
the sensor's pose and trajectory), and mapping (\ie reconstructing the scene
geometry and texture).
Consider house-scale (300 square meter) real-time 3D mapping and localization on
a \Tango device. A user (or robot) moves around a building, scanning
the scene. At house-scale, we are only concerned with features with a resolution
of about 2--3~cm (walls, floors, furniture, appliances, etc.). To facilitate
scanning, real-time feedback is given to the user on the device's screen. The
user can export the resulting 3D scan without losing any data.
\figref{fig:first_figure} shows an example of this use case (\sref{section:mapping})
in progress. As an application of house-scale mapping, we consider a small
flying robot equipped with a \Tango sensor onboard (\figref{fig:robot}). The
robot must navigate indoor spaces by mapping them online (\sref{section:robot}).
Having all of the sensing and processing onboard the robot reduces the need for
streaming sensor data to offboard devices.
\begin{figure}[t!]
\centering
\begin{subfigure}{\linewidth} \centering
\includegraphics[width=0.75\textwidth]{img/mapdevice}
\caption{\chisel creating a map
of an entire office building floor on a mobile device in real-time.}
\label{fig:map_device}
\end{subfigure}
\begin{subfigure}{\linewidth} \centering
\includegraphics[width=0.75\textwidth]{img/apartment_scene_color.png}
\caption{Reconstructed apartment scene at a voxel resolution of 2cm.}
\label{fig:apartment_color}
\end{subfigure}
\caption{\chisel running on Google's \Tango~\cite{Tango} device.}
\label{fig:first_figure}
\end{figure}
These use cases require the 3D reconstruction algorithm to run entirely
onboard, fast enough to allow real-time interaction. Importantly, the entire
dense 3D reconstruction must fit inside the device's limited (2-4GB) memory
(\sref{section:hardware}). Because some mobile devices lack sufficiently
powerful discrete graphics processing units (GPU), we choose not to rely on general
purpose GPU computing to make the problem tractable in either creating or
rendering the 3D reconstruction.
3D mapping algorithms involving occupancy grids \cite{Elfes1989}, keypoint
mapping \cite{KleinSparse} or point clouds \cite{RusinkiewiczPoints,
TanskanenMetric, WeiseScanning} already exist for mobile phones at small scale
-- but at the scales we are interested in, their reconstruction quality is
limited. Occupancy grids suffer from aliasing and are memory-dense, while
point-based methods cannot reproduce surface or volumetric features of the scene
without intensive post-processing.
Many state-of-the-art real-time 3D reconstruction algorithms \cite{Newcombe,
Whelan2013,WhelanLoopClose,Bylow2013,NiessnerHashing} compute a truncated
signed distance field (\TSDF) \cite{Curless1996} of the scene. The \TSDF stores a
discretized estimate of the distance to the nearest surface in the scene. While
allowing for very high-quality reconstructions, the \TSDF is very
memory-intensive. The size of the \TSDF needed to reconstruct an entire house
may be on the order of several gigabytes (\sref{section:memory}), and
rendering the resulting reconstruction in real-time would certainly overwhelm
the graphics capabilities of a mobile phone. Previous works \cite{Whelan2013,
WhelanLoopClose} have extended the \TSDF to larger scenes by storing a
\textit{moving} voxelization and throwing away data incrementally -- but we are
interested in preserving the volumetric data for later use.
Experimentally, we found that most ($\sim 93\%$) of the space in a typical
scene is in fact empty (\tabref{table:volumecount}). Iterating over empty space
for the purposes of reconstruction and rendering wastes computation time, and storing it
wastes memory. Noting this fact, we make use of a data structure introduced by
Niessner \etal \cite{NiessnerHashing}, the Dynamic spatially-hashed
\cite{SpatialHashing} \TSDF (\sref{section:spatialhash}), which stores the
distance field data as a two-level structure in which static 3D voxel
\textit{chunks} are dynamically allocated according to observations
of the scene. Because the data structure has $\mathcal{O}(1)$ access
performance, we are able to quickly identify which parts of the scene should be
rendered, updated, or deleted. This allows us to minimize the amount of time we
spend on each depth scan and keep the reconstruction running in real-time on the device.
For localization, we use a mix of visual-inertial odometry \cite{VINS, VINS2}
and sparse keypoint-based mapping \cite{FastSlam} as a black box
(\sref{section:pose}). These approaches use only the 2D wide-angle camera and
the IMU, and take in no information from the depth sensor. As a consequence,
our mapping system is \textit{open-loop} in the sense that it does not attempt
to explicitly estimate the pose of the sensor from the 3D reconstruction. In spite
of this limitation, we found that using open-loop visual odometry was
sufficient to create high-quality, house-scale reconstructions
(\figref{fig:night3}). Additionally, the user can post-process the model using
global optimization of the trajectory through bundle adjustment, which our
system can also perform on-device.
We provide qualitative and quantitative results on publicly available RGB-D
datasets \cite{FREIBURG}, and on datasets collected in real-time from two
devices (\sref{section:experiments}). We compare different approaches for
creating (\sref{section:scan_compare}) and storing (\sref{section:memory}) the
\TSDF in terms of memory efficiency, speed, and reconstruction quality. Our
system (called \chisel), is able to produce high-quality, large scale 3D
reconstructions of outdoor and indoor scenes in real-time onboard a mobile device.
\section{Related Work}
Mapping paradigms generally fall into one of two categories: landmark-based (or
\emph{sparse}) mapping, and high-resolution \emph{dense} mapping
\cite{FastSlam}. While sparse mapping generates a metrically consistent map of
landmarks based on key features in the environment, dense mapping globally
registers all sensor data into a high-resolution data structure. In this work,
we are concerned primarily with dense mapping, which is essential for high
quality 3D reconstruction.
Because mobile phones typically do not have depth sensors, previous works
\cite{TanskanenMetric, DTAM, LSDSlam} on dense reconstruction for mobile phones
have gone to great lengths to extract depth from a series of registered monocular camera
images. Since our work focuses on mobile devices with integrated depth sensors,
such as the Google \Tango devices \cite{Tango}, we do not need to perform
costly monocular stereo as a pre-requisite to dense reconstruction. This allows
us to save our memory and CPU budget for the 3D reconstruction itself.
One of the simplest means of dense 3D mapping is storing multiple registered
point clouds. These point-based methods \cite{RusinkiewiczPoints,
TanskanenMetric, WeiseScanning, LSDSlam} naturally convert depth data into
projected 3D points. While simple, point clouds fail to capture local scene
structure, are noisy, and fail to capture \emph{negative} (non-surface)
information about the scene. This information is crucial to scene reconstruction
under high levels of noise \cite{Klingensmith2014}.
Elfes \cite{Elfes1989} introduced Occupancy Grid Mapping, which divides the
world into a voxel grid containing occupancy probabilities. Occupancy grids
preserve local structure, and gracefully handle redundant and missing data.
While more robust than point clouds, occupancy grids suffer from aliasing, and
lack information about surface normals and the interior/exterior of obstacles.
Occupancy grid mapping has already been shown to perform in real-time on \Tango
\cite{Tango} devices using an Octree representation \cite{Wurm2010}.
% Attempts to extend occupancy grid maps to 3D have sometimes relied on octrees.
% Rather than storing a fixed-resolution grid, octrees store occupancy data in a
% spatially organized tree. In typical scenes, octrees reduce the required memory
% over occupancy grids by orders of magnitude. Octomap \cite{Wurm2010} is a
% popular example of the octree paradigm. However, octrees containing only
% occupancy probability suffer from many of the same problems as occupancy grids:
% they lack information about the interior and exterior of objects, and suffer
% from aliasing. Further, octrees suffer from logarithmic reading, writing, and
% iteration times, and have very poor memory locality characteristics.
Curless and Levoy \cite{Curless1996} created an alternative to occupancy
grids called the Truncated Signed Distance Field (\TSDF), which stores a
voxelization of the signed distance field of the scene. The \TSDF is negative
inside obstacles, and positive outside obstacles. The surface is given
implicitly as the zero isocontour of the \TSDF. While using more memory
than occupancy grids, the \TSDF creates much higher quality surface
reconstructions by preserving local structure.
In robotics, distance fields are used in motion planning
\cite{RatliffChomp}, mapping \cite{VandapelKA05}, and scene understanding.
Distance fields provide useful information to robots: the distance to the
nearest obstacle, and the gradient direction to take them away from obstacles.
A recent work by Wagner \etal \cite{WagnerICRA13} explores the direct use of a
\TSDF for robot arm motion planning; making use of the gradient information
implicitly stored in the \TSDF to locally optimize robot trajectories. We are
interested in similarly planning trajectories for autonomous flying robots by
directly using \TSDF data (\sref{section:robot}). Real-time onboard 3D mapping,
navigation and odometry has already been achieved on flying robots
\cite{FlyingNavigation, OSMAV} using occupancy grids. Our system is
complementary to this work.
\emph{Kinect Fusion} \cite{Newcombe} uses a \TSDF to simultaneously extract the
pose of a moving depth camera and scene geometry in real-time. Making
heavy use of the GPU for scan fusion and rendering, \textit{Fusion} is capable
of creating extremely high-quality, high-resolution surface reconstructions within
a small area. However, like occupancy grid mapping, the algorithm relies on a
single fixed-size 3D voxel grid, and thus is not suitable for reconstructing
very large scenes due to memory constraints. This limitation has generated
interest in extending \TSDF fusion to larger scenes.
Moving window approaches, such as \emph{Kintinuous} \cite{Whelan2013} extend
Kinect Fusion to larger scenes by storing a moving voxel grid in the GPU. As
the camera moves outside of the grid, areas which are no longer visible are
turned into a surface representation. Hence, distance field data is prematurely
thrown away to save memory. As we want to save distance field data so it can be
used later for post-processing, motion planning, and other applications, a
moving window approach is not suitable.
Recent works have focused on extending \TSDF fusion to larger scenes by
compressing the distance field to avoid storing and iterating
over empty space. Many have used \textit{hierarchical} data structures such as
octrees or KD-trees to store the \TSDF \cite{Zeng2012, Chen2012}. However, these
structures suffer from high complexity and complications with parallelism.
An approach by Niessner \etal \cite{NiessnerHashing} uses a two-layer
hierarchical data structure that uses spatial hashing \cite{SpatialHashing} to
store the \TSDF data. This approach avoids the needless complexity of other
hierarchical data structures, boasting $\mathcal{O}(1)$ queries, and avoids
storing or updating empty space far away from surfaces.
Our system adapts the spatially-hashed data structure of
Niessner \etal \cite{NiessnerHashing} to \Tango devices. By carefully
considering what parts of the space should be turned into distance fields at
each timestep, we avoid needless computation (\tabref{table:meshingtimes}) and
memory allocation (\figref{fig:memory_data}) in areas far away from the sensor.
Unlike \cite{NiessnerHashing}, we do not make use of any general purpose GPU
computing. All \TSDF fusion is performed on the mobile processor, and the
volumetric data structure is stored on the CPU. Instead of rendering the scene
via raycasting, we generate polygonal meshes incrementally for only the parts
of the scene that need to be rendered. Since the depth sensor found on the
\Tango device is significantly more noisy than other commercial depth
sensors, we reintroduce space carving \cite{Elfes1989} from occupancy grid
mapping (\sref{section:carving}) and dynamic truncation
(\sref{section:dynamic_trunc}) into the \TSDF fusion algorithm to improve
reconstruction quality under conditions of high noise. The space carving and
truncation algorithms are informed by a parametric noise model trained for the
sensor using the method of Nguyen \etal \cite{Nguyen2012}.
\section{System Implementation}
\subsection{Preliminaries}
\label{section:prelim}
Consider the geometry of the ideal pinhole depth sensor. Rays emanate from the
sensor origin to the scene. \figref{fig:raycast_diagram} is a diagram of a ray hitting a
surface from the camera, and a voxel that the ray passes through. Call the
origin of the ray $o$, and the endpoint of the ray $\mathbf{x}$.
The length of the ray is given by $z = \|\mathbf{o}
- \mathbf{x}\|$. The direction of the ray is given by $\mathbf{\hat{r}} =
\frac{\mathbf{o} - \mathbf{x}}{z}$. The endpoint of each ray represents a point
on the surface. We can also parameterize the ray by its direction and
endpoint, using an interpolating parameter $u$:
\begin{equation}
\mathbf{v}(u) = \mathbf{x} - u\mathbf{\hat{r}}
\end{equation}
At each timestep $t$ we have a set of rays $Z_t$. In practice, the rays are
corrupted by noise. Call $d$ the true distance from the origin of the sensor to
the surface along the ray. Then $z$ is actually a random variable drawn from a
distribution dependent on $d$ called the \textit{hit probability}
(\figref{fig:integration_diagram}). Assuming the hit probability is Gaussian, we
have:
\begin{equation}
\label{eqn:hitprobability}
z \sim \mathcal{N}(d, \sigma_d)
\end{equation}
\noindent where $\sigma_d$ is the standard deviation of the depth noise for a
true depth $d$. We train this model using the method of Nguyen \etal
\cite{Nguyen2012}.
Another complication is that since depth sensors are not ideal, an actual depth
reading corresponds to many possible rays through the scene. Therefore, each
depth reading represents a \textit{cone}, rather than a ray. The ray passing
through the center of the cone closely approximates it near the sensor, but the
approximation gets worse further away.
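For concreteness, the following is a minimal C++ sketch (not taken from our
implementation) of the ray quantities defined above: the depth $z$, the unit
direction $\mathbf{\hat{r}}$, and the parameterized point
$\mathbf{v}(u) = \mathbf{x} - u\mathbf{\hat{r}}$.
\begin{verbatim}
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) {
  return {a.x - b.x, a.y - b.y, a.z - b.z};
}
static Vec3 scale(Vec3 a, float s) {
  return {a.x * s, a.y * s, a.z * s};
}
static float norm(Vec3 a) {
  return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z);
}

// v(u) = x - u * r_hat: u = 0 lies on the surface,
// u > 0 moves back toward the sensor origin o.
Vec3 pointAlongRay(Vec3 o, Vec3 x, float u) {
  float z = norm(sub(x, o));              // z = ||o - x||
  Vec3 rHat = scale(sub(x, o), 1.0f / z); // r_hat = (x - o) / z
  return sub(x, scale(rHat, u));
}
\end{verbatim}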
% Rays are in practice stored in a 2D depth image at each timestep. We can
% calibrate the depth sensor as a pinhole camera with camera intrinsics $\{f_x,
% f_y, c_x, c_y\}$. Using the intrinsics, we can project onto the camera plane
% using the pinhole model. The function $\mathbf{Project} :
% \mathbf{R}^3\to\mathbf{R}^2$ accomplishes this. We can also bilinearly
% interpolate points in the depth image. The function $\mathbf{InterpolateDepth}
% : \mathbf{R}^2 \to \mathbf{R}^+$ does this.
\subsection{The Truncated Signed Distance Field}
\label{section:TSDF}
We model the world~\cite{Curless1996} as a volumetric signed distance field $\Phi: \mathbf{R}^3
\to \mathbf{R}$. For any point in the world $x$, $\Phi(x)$
is the distance to the nearest surface, \emph{signed} positive if the point is
outside of obstacles and negative otherwise. Thus, the zero isocontour $(\Phi =
0)$ encodes the surfaces of the scene.
Since we are mainly interested in reconstructing surfaces,
we use the Truncated Signed Distance Field (\TSDF) \cite{Curless1996}:
\begin{equation}
\Phi_{\tau}(\mathbf{x}) =
\begin{cases}
\Phi(\mathbf{x}) & \text{if } |\Phi(\mathbf{x})| < \tau \\
\text{undefined} & \text{otherwise}
\end{cases}
\end{equation}
where $\tau \in \mathbf{R}$ is the \emph{truncation distance}. Curless and
Levoy \cite{Curless1996} note that very near the endpoints of depth rays, the
\TSDF is closely approximated by the distance along the ray to the nearest observed
point.
% Furthermore, computing the first-order Taylor series approximation of $\Phi_{\tau}$
% about a point on the surface $\mathbf{x}_s$, we get:
% \begin{align}
% \label{eqn:atsdf}
% \Phi_{\tau}(\mathbf{x}) &\approx \Phi_{\tau}(\mathbf{x}_s) + \nabla \Phi(\mathbf{x}_s)^{T} (\mathbf{x} - \mathbf{x}_s) \nonumber \\
% &= \nabla \Phi(\mathbf{x}_s)^{T} (\mathbf{x} - \mathbf{x}_s) = \mathbf{\hat{n}}(\mathbf{x}_s)^{T}(\mathbf{x} - \mathbf{x}_s)
% \end{align}
% where we denote $\nabla \Phi(\mathbf{x}_s) = \mathbf{\hat{n}}(\mathbf{x}_s)$, the unit surface normal at $\mathbf{x}_s$,
% encoding the observation that $\Phi$ changes fastest along the surface normal, and $||\nabla \Phi|| = 1$.
%
% Now consider a ray from an ideal (noise-free) pinhole depth sensor with origin at $\mathbf{o}$ that
% hits the surface at $\mathbf{x}_s$. Not only does this ray provide the measurement $\mathbf{x}_s$,
% it also carves out empty space from $\mathbf{o}$ to $\mathbf{x}_s$. First, we parametrize the ray as
% \begin{equation}
% \mathbf{x}(u) = \mathbf{x}_s - u\mathbf{\hat{r}}
% \end{equation}
% where $u \in \mathbf{R}$ and $\mathbf{\hat{r}}$ is the unit vector from $\mathbf{o}$ to $\mathbf{x}_s$.
%
% Then, we use \eref{eqn:atsdf} to approximate $\Phi$ along the ray as (\figref{fig:raycast_diagram})
% \begin{equation}
% \Phi_{\tau}(\mathbf{x}(u)) \approx \mathbf{\hat{n}}(\mathbf{x}_s)^{T}(\mathbf{x}_s - u\mathbf{\hat{r}} - \mathbf{x}_s)
% = -u\mathbf{\hat{n}}(\mathbf{x}_s)^{T}\mathbf{\hat{r}}
% \end{equation}
%
% Some works \cite{Bylow2013} approximate
% $\mathbf{\hat{n}}$ by locally fitting a plane around $\mathbf{x}_s$, but this comes at the cost of
% computation. \ssnote{My rewrite stops here. Confused.}
%
% Consider the geometry of the ideal pinhole depth sensor. Rays emanate from the
% sensor origin to the scene. At each timstep $t$ we have a set of rays
% $Z_t$. Call the origin of the ray $o$, and the endpoint of the ray $\mathbf{x}$.
% The length of the ray is given by $z = \|\mathbf{o}
% - \mathbf{x}\|$. The direction of the ray is given by $\mathbf{\hat{r}} =
% \frac{\mathbf{o} - \mathbf{x}}{z}$. The endpoint of each ray represents a point
% on the surface. So if the sensor is perfect, we know that the signed distance
% function $\Phi(\mathbf{x}) = 0$ at that point. In practice, the ray is
% corrupted by noise. Call $d$ the true distance from the origin of the sensor to
% the surface along the ray. Then $z$ is actually a random variable drawn from a
% distribution dependent on $d$ called the \textit{hit probability}
% (\figref{fig:integration_diagram}).
%
% The ray also implies that all the points in space along the ray from the sensor
% origin up to the endpoint must be outside of any surface (or else the ray would
% have been occluded by a a surface). Because real objects have non-zero
% thickness, some part of the space along the ray beyond the endpoint will be
% inside an object.
%
% Further, very near the surface of the object, the signed distance field can
% be approximated by the distance along the ray to the endpoint of the ray
% (\figref{fig:raycast_diagram}). Near $\mathbf{x}$, the signed distance field
% is approximately the distance to $\mathbf{x}$. So, whenever $|u|$ is very
% small:
%
% \begin{equation}
% \label{eq:pointwise_tsdf}
% \Phi(\mathbf{x} - u\mathbf{\hat{r}}) \approx u
% \end{equation}
%
% The approximation is better whenever the ray is approximately
% perpendicular to the surface, and worst whenever the ray is parallel to the
% surface. Because of this, some works \cite{Bylow2013} instead approximate
% $\Phi$ by locally fitting a plane around $\mathbf{x}$ using the neighboring
% points and using the distance to that plane as a local approximation, \ie:
%
% \begin{equation}
% \label{eq:planar_tsdf}
% \Phi(\mathbf{x} - u\mathbf{\hat{r}}) \approx -u~\mathbf{\hat{r}} \cdot
% \mathbf{\hat{n}_x}
% \end{equation}
%
% \noindent where $\mathbf{\hat{n}_x}$ is the local surface normal around
% $\mathbf{x}$. In general, the planar approximation (\eref{eq:planar_tsdf})
% is much better than the pointwise approximation (\eref{eq:pointwise_tsdf}),
% especially when surfaces are nearly parallel to the sensor -- but computing surface normals is not
% always computationally feasible. In our work, we have chosen to use
% the simpler pointwise approximation.
\begin{figure*}
\begin{subfigure}{1.0\columnwidth}
\centering
\includegraphics[width=1.0\textwidth]{img/Raycast}
\caption{Raycasting through a voxel.}
\label{fig:raycast_diagram}
\end{subfigure}
\begin{subfigure}{1.0\columnwidth}
\centering
\includegraphics[ width=1.0\textwidth]{img/tsdf_integration.pdf}
\caption{Noise model and the space carving region.}
\label{fig:integration_diagram}
\end{subfigure}
\caption{Computing the Truncated Signed Distance Field (\TSDF).}
\end{figure*}
%
% This observation leads to the Truncated Signed Distance Field (TSDF)
% \cite{Curless1996}. The TSDF is defined as:
%
% \begin{equation}
% \Phi_{\tau}(\mathbf{x}) =
% \begin{cases}
% \Phi(\mathbf{x}) & \text{if} |\Phi(\mathbf{x})| < \tau \\
% \text{undefined} & \text{otherwise}
% \end{cases}
% \end{equation}
%
% \noindent where $\tau \in \mathbf{R}$ is the ``truncation distance'' in which
% the TSDF is defined. Following \cite{Curless1996}, to update the TSDF with a
% particular sensor measurement, we update each voxel along the ray from $u =
% -\tau$ to $\tau$, and take the weighted running average of distance field estimates over time. To do
% this, we can introduce a global weighting function $W : \mathbf{R}^3 \to
% \mathbf{R^{+}}$ that represents the confidence in the estimate of the TSDF at
% every point in space.
\begin{algorithm}[t!]
\caption{Truncated Signed Distance Field}
\label{alg:TSDF}
\begin{algorithmic}[1]
\For {$t \in [1 \ldots T]$} \Comment{\textit{For each timestep}}
\For {$\{\mathbf{o}_t, \mathbf{x}_t\} \in Z_t$} \Comment{\textit{For each ray in scan}}
\State {$z \gets \|\mathbf{o}_t - \mathbf{x}_t\|$}
\State {$\mathbf{r} \gets \frac{\mathbf{x}_t - \mathbf{o}_t}{z}$}
\State{$\tau \gets \mathcal{T} (z)$} \Comment{\textit{Dynamic truncation distance}}
\label{alg:line:dynamic_tsdf}
\For{$u \in [\tau + \epsilon, z]$} \Comment{\textit{Space carving region}}
\State {$\mathbf{v}_c \gets \mathbf{x}_t - u \mathbf{r}$}
\If {$\Phi_{\tau}(\mathbf{v}_c) \leq 0$ }
\State {$\Phi_{\tau}(\mathbf{v}_c) \gets \text{undefined}$}
\label{alg:line:voxel_carve}
\State {$W(\mathbf{v}_c) \gets 0$}
\EndIf
\EndFor
\For{$u \in [-\tau, \tau]$} \Comment{\textit{Hit region}}
\State {$\mathbf{v}_c \gets \mathbf{x}_t - u \mathbf{r}$}
\State {$\Phi_{\tau}(\mathbf{v}_c) \gets \frac{ W(\mathbf{v}_c)
\Phi_{\tau}(\mathbf{v}_c) + \alpha_{\tau}(u) u}{ W(\mathbf{v}_c)
+\alpha_{\tau}(u)}$}
\label{alg:line:tsdf_update}
\State {$W(\mathbf{v}_c) \gets W(\mathbf{v}_c)+ \alpha_{\tau}(u)$}
\EndFor
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
The algorithm for updating the \TSDF given a depth scan used in
\cite{Curless1996} is outlined in
\algoref{alg:TSDF}. For each voxel in the scene, we store a signed distance
value, and a weight $W : \mathbf{R}^3 \to \mathbf{R^{+}}$ representing the confidence
in the distance measurement. Curless and Levoy show that by taking a weighted
running average of the distance measurements over time, the resulting
zero-isosurface of the \TSDF minimizes the sum-squared distances to all the ray
endpoints.
We initialize the \TSDF to an undefined value with a
weight of $0$, then for each depth scan, we update the weight and \TSDF value
for all points along each ray within the truncation distance $\tau$. The weight
is updated according to the scale-invariant weighting function $\alpha_{\tau}(u)
:[-\tau,\tau]\to \mathbf{R^{+}} $.
% for which the following holds:
%
% \begin{equation}
% \forall \tau, \int_{-\tau}^{\tau} \alpha_{\tau}(u) du = K
% \end{equation}
%
% \noindent and $K$ is a constant. This is to ensure that changing the truncation
% distance does not change the speed at which the TSDF converges.
It is possible \cite{Nguyen2012} to directly compute the weighting function
$\alpha_{\tau}$ from the hit probability \eref{eqn:hitprobability} of the
sensor; but in favor of better performance, linear, exponential, and constant
approximations of $\alpha_{\tau}$ can be used \cite{Curless1996, Newcombe,
Whelan2013, Bylow2013}. We use the constant approximation
$\alpha_{\tau}(u) = \frac{1}{2 \tau}$. This results in poorer surface
reconstruction quality in areas of high noise than methods which more closely
approximate the hit probability of the sensor.
Generally, either the weight $\alpha_{\tau}$ \cite{Newcombe, Whelan2013}, or
the distance update of the ray (\algoref{alg:TSDF} line
\ref{alg:line:tsdf_update}) \cite{Curless1996, Bylow2013} ought to be scaled by
the dot product of the local surface normal (computed by differentiating the
depth image) and the direction of the ray to improve reconstruction of surfaces
that are not perpendicular to the image plane. However, we found computing
surface normals to be computationally infeasible on the mobile device, and so
neglect this in our implementation.
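As an illustration, the following C++ sketch (our reading of
\algoref{alg:TSDF}, not the code of the released system) shows the per-voxel
weighted running average for the hit region with the constant approximation
$\alpha_{\tau}(u) = \frac{1}{2 \tau}$.
\begin{verbatim}
// One voxel of the TSDF: a distance estimate and its weight.
// A weight of zero marks the voxel as undefined/unknown.
struct TsdfVoxel {
  float sdf;
  float weight;
};

// Fuse one observation u (signed distance along the ray).
void fuseHit(TsdfVoxel& v, float u, float tau) {
  float alpha = 1.0f / (2.0f * tau);       // constant weighting
  v.sdf = (v.weight * v.sdf + alpha * u) / // running average
          (v.weight + alpha);
  v.weight += alpha;
}
\end{verbatim}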
\subsection{Colorization}
\label{section:color}
As in \cite{Bylow2013, Whelan2013}, we create colored surface reconstructions by
directly storing color as volumetric data. Color is updated in exactly the same
manner as the \TSDF. We assume that each depth ray also corresponds to a color
in the RGB space. We step along the ray from $u \in [-\tau, \tau]$, and update
the color of each voxel and its weight:
\begin{align}
C(\mathbf{v}) &\gets \frac{W_c(\mathbf{v}) C(\mathbf{v}) +
\alpha_c(u) \mathbf{c}}{W_c(\mathbf{v}) + \alpha_c(u)}
\\
W_c(\mathbf{v}) &\gets W_c(\mathbf{v}) + \alpha_c(u)
\end{align}
\noindent where $\mathbf{v} = \mathbf{x} - u\mathbf{\hat{r}}$, $\mathbf{c}
\in \mathbf{R}^{3+}$ is the color of the ray in color space, and $\alpha_c$ is a
color weighting function. As in \cite{Bylow2013},
we have chosen RGB color space for the sake of simplicity, at the expense of
color consistency with changes in illumination.
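A minimal sketch of this color update (assumed code, mirroring the equations
above) is:
\begin{verbatim}
// Running average of RGB color per voxel, weighted by alpha_c.
struct ColorVoxel {
  float r, g, b;   // current color estimate C(v)
  float weight;    // color weight W_c(v)
};

void fuseColor(ColorVoxel& v, float cr, float cg, float cb,
               float alphaC) {
  float wNew = v.weight + alphaC;
  v.r = (v.weight * v.r + alphaC * cr) / wNew;
  v.g = (v.weight * v.g + alphaC * cg) / wNew;
  v.b = (v.weight * v.b + alphaC * cb) / wNew;
  v.weight = wNew;
}
\end{verbatim}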
% In both \cite{Bylow2013} and \cite{Whelan2013}, the
% color weighting function is proportional to the dot product of the ray's
% direction and an estimate of the surface normal. But since surface normal
% estimation is computationally expensive, we instead use the same weight for
% both distance and color.
% We must also deal with the fact that color images and depth data may be
% asynchronous. In our case, depth data is often delayed behind color data by as
% much as 30 milliseconds. So, for each depth scan, we project the endpoints of
% the rays onto the nearest color image in time to the depth image, disregarding
% occlusion, and use bilinear interpolation on the color image to determine the
% color of each ray.
\subsection{Dynamic Truncation Distance}
\label{section:dynamic_trunc}
As in \cite{Nguyen2012, NiessnerHashing}, we use a dynamic truncation distance based on the
noise model of the sensor rather than a fixed truncation distance to account for
noisy data far from the sensor. The truncation distance is given as a function
of depth:
\begin{equation} \mathcal{T} (z) = \beta\sigma_{z} \end{equation}
\noindent where $\sigma_{z}$ is the standard deviation of the noise for a depth
reading of $z$ \eref{eqn:hitprobability}, and $\beta$ is a scaling
parameter which represents the number of standard deviations of the noise we
are willing to consider. Algorithm \ref{alg:TSDF}, line
\ref{alg:line:dynamic_tsdf} shows how this is used.
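A minimal sketch of this computation is shown below; the quadratic noise model
in the usage comment is a hypothetical fit, not the trained model of our
sensor.
\begin{verbatim}
#include <functional>

// T(z) = beta * sigma(z): integrate beta standard deviations
// of the (trained) depth noise around the measurement.
float dynamicTruncation(float z, float beta,
                        const std::function<float(float)>& sigma) {
  return beta * sigma(z);
}

// Example usage with a hypothetical quadratic noise fit:
//   float tau = dynamicTruncation(z, 3.0f,
//       [](float d) { return 0.002f + 0.003f * d * d; });
\end{verbatim}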
%
% Using the dynamic truncation distance has the effect that further away from the
% sensor, where depth measurements are less certain and sparser, depth
% measurements are smoothed over a larger area of the distance field; while nearer
% to the sensor, where depth measurements are more certain and closer together,
% the truncation distance is smaller.
\subsection{Space Carving}
\label{section:carving}
When depth data is very noisy and sparse, the relative importance of negative
data (that is, information about what parts of the scene do not contain
surfaces) increases over positive data \cite{Elfes1989, Klingensmith2014}. Rays
can be viewed as \textit{constraints} on possible values of the distance field. Rays
passing through empty space constrain the distance field to positive values all
along the ray. The distance field is likely to be nonpositive only very near the
endpoints of rays.
We augment our \TSDF algorithm with a \textit{space carving}
\cite{Elfes1989} constraint. \figref{fig:integration_diagram} shows a hypothetical
plot of the hit probability of a ray, $P\left(z - d = u \right)$, and the pass probability,
$P\left(z - d < u\right)$, as functions of $u$, where $z$ is the measurement from the
sensor, and $d$ is the true distance from the sensor to the surface. When the
pass probability is much higher than the hit probability in a particular voxel,
it is very likely to be unoccupied. The hit integration region is inside
$[-\tau, \tau]$, whereas regions closer to the camera than $\tau + \epsilon$
are in the space carving region. Along each ray within the space carving region,
we \chisel away data that has a nonpositive stored SDF. Algorithm
\ref{alg:TSDF}, line \ref{alg:line:voxel_carve} shows how this is
accomplished.
Space carving gives us two advantages: first, it dramatically improves the
surface reconstruction in areas of very high noise (especially around the edges
of objects: see \figref{fig:dragon_closeup}), and second, it removes
some inconsistencies caused by moving objects and localization errors. For
instance, a person walking in front of the sensor will only briefly affect the
\TSDF before being removed by space carving.
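A minimal sketch (assumed code) of the carving test applied to each voxel in
the space carving region of \algoref{alg:TSDF} is:
\begin{verbatim}
// Inside the space carving region, a voxel whose stored
// distance is nonpositive is reset to the unknown state.
// A weight of zero marks the voxel as undefined.
void carve(float& sdf, float& weight) {
  if (weight > 0.0f && sdf <= 0.0f) {
    weight = 0.0f;
    sdf = 0.0f;
  }
}
\end{verbatim}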
\subsection{The Dynamic Spatially-Hashed \TSDF}
\label{section:spatialhash}
\begin{figure}[t!]
\begin{subfigure}[b]{0.45\linewidth} \centering
\includegraphics[width=1.0\textwidth]{img/chunks.pdf}
\caption{Spatial Hashing.}
\label{fig:chunks}
\end{subfigure}
\begin{subfigure}[b]{0.55\linewidth} \centering
\hspace{-5em}
\includegraphics[width=1.0\textwidth]{img/frustum_cull}
\caption{Frustum Culling.}
\label{fig:frustum_cull}
\end{subfigure}
\caption{\figref{fig:chunks}: chunks are spatially-hashed \cite{SpatialHashing} into a
dynamic hash map.
\figref{fig:frustum_cull}: only voxels within the frustum (in green) are stored in memory.}
\end{figure}
Each voxel contains an estimate of the signed distance field and an
associated weight. In our implementation, these are packed into a single 32-bit
integer. The first 16 bits are a fixed-point signed distance value, and the
last 16 bits are an unsigned integer weight. Color is similarly
stored as a 32 bit integer, with 8 bits per color channel, and an 8 bit weight.
A similar method is used in \cite{Newcombe, Whelan2013, Bylow2013, NiessnerHashing} to store
the \TSDF. As a baseline, we could consider simply storing all the required
voxels in a monolithic block of memory. Unfortunately, the amount of memory storage required
for a fixed grid of this type grows as $\mathcal{O}(N^3)$, where $N$ is the
number of voxels per side of the 3D voxel array. Additionally, if the size of
the scene isn't known beforehand, the memory block must be resized.
% For example, at a resolution of $3\text{cm}$, a $30\text{m}$ TSDF cube
% with color would occupy 8 Gigabytes of memory. Worse, most of that memory would be
% uselessly storing unseen free space.
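For concreteness, the following is a minimal sketch of the 32-bit voxel
packing described above; the fixed-point scale is an assumed choice, not the
value used in our implementation.
\begin{verbatim}
#include <algorithm>
#include <cstdint>

// Assumed fixed-point scale: 1 unit = 1 mm.
constexpr float kFixedScale = 1000.0f;

// Pack a signed distance and a 16-bit weight into 32 bits:
// high 16 bits = fixed-point distance, low 16 bits = weight.
uint32_t packVoxel(float sdf, uint16_t weight) {
  int32_t fixed = static_cast<int32_t>(sdf * kFixedScale);
  fixed = std::max<int32_t>(-32768,
          std::min<int32_t>(32767, fixed));
  uint16_t sdfBits =
      static_cast<uint16_t>(static_cast<int16_t>(fixed));
  return (static_cast<uint32_t>(sdfBits) << 16) | weight;
}

void unpackVoxel(uint32_t packed, float& sdf, uint16_t& weight) {
  int16_t fixed = static_cast<int16_t>(packed >> 16);
  sdf = static_cast<float>(fixed) / kFixedScale;
  weight = static_cast<uint16_t>(packed & 0xFFFFu);
}
\end{verbatim}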
For a large-scale real-time surface reconstruction application, a less
memory-intensive and more dynamic approach is needed. Confronted with this
problem, some works have either used octrees \cite{Wurm2010, Zeng2012,
Chen2012} or a moving volume \cite{Whelan2013}. Neither of these approaches
is desirable for our application. Octrees, while maximally memory efficient,
have significant drawbacks when it comes to accessing and iterating over the
volumetric data \cite{NiessnerHashing}. Every time an octree is queried, a
logarithmic $\mathcal{O}(M)$ cost is incurred, where $M$ is the depth of the
Octree. In contrast, queries in a flat array are $\mathcal{O}(1)$. An octree
stores pointers to child octants in each parent octant. The octants themselves
may be dynamically allocated on the heap. Each traversal through the tree
requires $\mathcal{O}(M)$ heap pointer dereferences in the worst case.
Even worse, adjacent voxels may not be adjacent in memory, resulting in very
poor caching performance \cite{CacheStructures}. Like \cite{NiessnerHashing},
we found that using an octree to store the \TSDF data reduced iteration
performance by an order of magnitude when compared to a fixed grid.
\begin{figure*}[t!]
\centering
\begin{subfigure}[b]{0.22\linewidth} \centering
\includegraphics[width=1.0\textwidth]{img/memoryusage.pdf}
\caption{Memory usage efficiency.}
\label{fig:memory_data}
\end{subfigure}
\begin{subfigure}[b]{0.22\linewidth} \centering
\includegraphics[width=1.0\textwidth]{img/memoryusage2.pdf}
\caption{Varying depth frustum.}
\label{fig:memory_data2}
\end{subfigure}
\begin{subfigure}[b]{0.27\linewidth} \centering
\includegraphics[width=1.0\textwidth]{img/freiburg_5m.png}
\caption{5m depth frustum.}
\label{fig:freiburg_5m}
\end{subfigure}
\begin{subfigure}[b]{0.27\linewidth} \centering
\includegraphics[width=1.0\textwidth]{img/freiburg_2m.png}
\caption{2m depth frustum.}
\label{fig:freiburg_2m}
\end{subfigure}
\caption{Spatial Hashing (SH) is more memory efficient than Fixed Grid (FG) and provides control over depth frustum.}
\end{figure*}
\begin{table}[t!]
\centering
\begin{tabular} {rrr}
\toprule
Voxel Class & Voxel Count & \% of Bounding Box \\
\midrule
Unknown Culled & 62,105,216 & 77.0 \\
Unknown & 12,538,559 & 15.5 \\
Outside & 3,303,795 & 4.1 \\
Inside & 2,708,430 & 3.4 \\
\bottomrule
\end{tabular}
\caption{Voxel statistics for the Freiburg 5m dataset.}
\label{table:volumecount}
\end{table}
Instead of using an Octree, moving volume, or a fixed grid, we use a hybrid data
structure introduced by Niessner \etal \cite{NiessnerHashing}. We divide
the world into a two-level tree. In the first level we have \emph{chunks} of
voxels. In \figref{fig:chunks}, voxel chunks (grey) are allocated near the
surface (blue) of the scene. Chunks are spatially-hashed \cite{SpatialHashing}
into a dynamic hash map. Each chunk consists of a fixed grid of ${N_v}^3$
voxels, which are stored in a monolithic memory block. Chunks are allocated
dynamically from a growing pool of heap memory as data is added, and are
indexed in a spatial 3D hash map \cite{SpatialHashing} by their spatial
coordinates. As in \cite{SpatialHashing, NiessnerHashing} we use the hash
function: $ \textbf{hash}(x, y, z) = p_1 x\oplus p_2 y \oplus p_3 z
~\text{mod}~n $, where $x, y, z$ are the 3D integer coordinates of the chunk,
$p_1, p_2, p_3$ are arbitrary large primes, $\oplus$ is the xor operator, and $n$ is
the maximum size of the hash map.
Since chunks are a fixed size, querying data from the chunked \TSDF involves
rounding (or bit-shifting, if $N_v$ is a power of two) a world coordinate to a
chunk and voxel coordinate, and then doing a hash-map lookup followed by an
array lookup. Hence, querying is $\mathcal{O}(1)$ \cite{NiessnerHashing}.
Further, since voxels within chunks are stored adjacent to one another in
memory, cache performance is improved while iterating through them. By carefully
selecting the size of chunks so that it corresponds to $\tau$, we only
allocate volumetric data near the zero isosurface, and do not waste as much
memory on empty or unknown voxels.
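To make the lookup path concrete, the following is a minimal sketch (with
assumed parameters: $N_v = 16$ voxels per chunk side and a 3~cm voxel
resolution) of the spatial hash and the world-coordinate to chunk and voxel
conversion; it is not the data structure of our released implementation.
\begin{verbatim}
#include <cmath>
#include <cstdint>
#include <unordered_map>

struct ChunkID {
  int32_t x, y, z;
  bool operator==(const ChunkID& o) const {
    return x == o.x && y == o.y && z == o.z;
  }
};

// hash(x, y, z) = p1*x XOR p2*y XOR p3*z; the mod n is applied
// internally by the hash map. p1, p2, p3 are large primes.
struct ChunkHasher {
  size_t operator()(const ChunkID& c) const {
    return (static_cast<size_t>(c.x) * 73856093u) ^
           (static_cast<size_t>(c.y) * 19349663u) ^
           (static_cast<size_t>(c.z) * 83492791u);
  }
};

constexpr int kChunkSize = 16;      // N_v (assumed)
constexpr float kVoxelRes = 0.03f;  // 3 cm voxels (assumed)

struct Chunk {
  uint32_t voxels[kChunkSize * kChunkSize * kChunkSize];
};
using ChunkMap = std::unordered_map<ChunkID, Chunk, ChunkHasher>;

// Round a world coordinate to a chunk ID and an intra-chunk
// voxel index; together with the O(1) hash lookup this gives
// O(1) access to any voxel.
void worldToGrid(float wx, float wy, float wz,
                 ChunkID& chunk, int& voxelIdx) {
  int vx = static_cast<int>(std::floor(wx / kVoxelRes));
  int vy = static_cast<int>(std::floor(wy / kVoxelRes));
  int vz = static_cast<int>(std::floor(wz / kVoxelRes));
  auto divDown = [](int v) {
    return (v >= 0) ? v / kChunkSize
                    : (v - kChunkSize + 1) / kChunkSize;
  };
  chunk = {divDown(vx), divDown(vy), divDown(vz)};
  int lx = vx - chunk.x * kChunkSize;
  int ly = vy - chunk.y * kChunkSize;
  int lz = vz - chunk.z * kChunkSize;
  voxelIdx = lx + kChunkSize * (ly + kChunkSize * lz);
}
\end{verbatim}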
\subsection{Frustum Culling and Garbage Collection}
\label{section:frustum}
To determine which chunks should be updated by a depth scan and drawn, we use
frustum culling, a well known technique in computer graphics. We create a
camera frustum with a far plane at the maximum depth reading of the camera, and
the near plane at the minimum depth of the camera. We then take the
axis-aligned bounding box of the camera frustum, and check each chunk inside the
axis-aligned bounding box for intersection with the camera frustum. Only those
which intersect the frustum are updated.
Since the frustum is a conservative approximation of the space that could be
updated by a depth scan, some of the chunks that are visible to the frustum will
have no depth data associated with them. \figref{fig:frustum_cull} shows all the
\TSDF chunks which intersect the depth camera frustum and are within $\tau$ of
a hit as green cubes. Only these chunks are updated when new data is received.
Chunks which do not get updated during a depth scan are garbage collected
(deleted from the hash map). Since the size of the camera frustum is in a fixed
range, the garbage collection process has $\mathcal{O}(1)$ performance. Niessner
\etal \cite{NiessnerHashing} use a similar approach for deciding which voxel
chunks should be updated using raycasting rather than geometric frustum
culling. We found frustum culling faster than raycasting on the mobile CPU.
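A minimal sketch of the chunk-versus-frustum test used for this selection
(a standard plane/AABB test; the \texttt{Frustum} and \texttt{AABB} types are
assumed helpers, not part of our published interface) is:
\begin{verbatim}
struct AABB { float min[3], max[3]; };

// Frustum as six inward-facing planes (a, b, c, d) with
// a*x + b*y + c*z + d >= 0 for points inside.
struct Frustum {
  float planes[6][4];

  bool intersects(const AABB& box) const {
    for (const auto& p : planes) {
      // Test the box corner farthest along the plane normal;
      // if even that corner is outside, the box is outside.
      float x = (p[0] >= 0.0f) ? box.max[0] : box.min[0];
      float y = (p[1] >= 0.0f) ? box.max[1] : box.min[1];
      float z = (p[2] >= 0.0f) ? box.max[2] : box.min[2];
      if (p[0] * x + p[1] * y + p[2] * z + p[3] < 0.0f)
        return false;
    }
    return true;
  }
};
// Chunks whose bounding box passes this test are updated by
// the scan; chunks that receive no update are garbage collected.
\end{verbatim}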
\subsection{Depth Scan Fusion}
\label{section:scan_integration}
To fuse a depth scan into the \TSDF (\algoref{alg:TSDF}), it is necessary
to consider each depth reading in each scan and update each
voxel intersecting its depth cone according to its distance to an observed
surface. The most straightforward way of doing this is to simply step along
each cone, and then use a fast rasterization algorithm \cite{RayTracing} to determine which voxels
intersect with the cone. Most works approximate the depth cone with a ray
through the center of the cone \cite{Newcombe, NiessnerHashing}. We will call
this the Raycasting approach. However, this approach has performance bounded by
$\mathcal{O}(N_{\text{ray}} \times l_{\text{ray}})$, where $N_{\text{ray}}$ is
the number of rays in a scan, and $l_{\text{ray}}$ is proportional to the
length of a ray being integrated. If we use a fixed-size truncation distance
and do not perform space carving, $l_{\text{ray}} = \tau$.
However, with space-carving (\sref{section:carving}), the length of a ray
being integrated is potentially unbounded.
A useful approximation of raycasting is \textit{projection mapping}. Used in
\cite{Newcombe,Nguyen2012, Bylow2013, Klingensmith2014}, projection mapping
works by projecting the visual hull of each voxel onto the depth image and
comparing the depth value there with the geometric distance from the voxel to
the camera plane. \algoref{alg:projection_mapping} describes projection mapping.
Instead of iterating over each ray, projection mapping iterates over each voxel.
Like most other works, we approximate each voxel's visual hull by a single point
at the center of the voxel.
Projection mapping's performance is bounded by $\mathcal{O}(N_v)$, where $N_v$
is the number of voxels affected by a depth scan. This value is nearly constant
over time, and depending on the resolution of the \TSDF, may be significantly
less than the number of rays in a scan. However, projection mapping suffers from
resolution-dependent aliasing errors, because the distance to the center of a
voxel may differ from the true length of the ray passing through the voxel by up
to the length of the diagonal of the voxel. Further, by ignoring the visual
hull of the voxel and only projecting its center, we ignore the fact that
multiple depth cones may intersect a voxel during each scan.
\sref{section:scan_compare} compares the performance and quality of these methods.
\begin{algorithm}
\caption{Projection Mapping}
\label{alg:projection_mapping}
\begin{algorithmic}[1]
\For {$\mathbf{v}_c \in V$} \Comment{For each voxel}
\State {$z \gets \|\mathbf{o}_t - \mathbf{v}_c\|$}
\State{$z_p \gets \mathbf{InterpolateDepth}(\mathbf{Project}(\mathbf{v}_c))$}
\State{$u \gets z_p - z$}
\State{$\tau \gets \mathcal{T} (z_p)$} \Comment{Dynamic truncation}
\If{$u \in [\tau + \epsilon, z_p]$} \Comment{Space carving region}
\If {$\Phi_{\tau}(\mathbf{v}_c) \leq 0$ }
\State {$\Phi_{\tau}(\mathbf{v}_c) \gets \text{undefined}$}
\State {$W(\mathbf{v}_c) \gets 0$}
\EndIf
\EndIf
\If{$u \in [-\tau, \tau]$} \Comment{Hit region}
\State {$\Phi_{\tau}(\mathbf{v}_c) \gets \frac{ W(\mathbf{v}_c)
\Phi_{\tau}(\mathbf{v}_c)+ \alpha(u) u}{W(\mathbf{v}_c) +\alpha(u)}$}
\State {$W(\mathbf{v}_c) \gets W(\mathbf{v}_c)+ \alpha(u)$}
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
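\algoref{alg:projection_mapping} relies on two operations, $\mathbf{Project}$
and $\mathbf{InterpolateDepth}$, which project a voxel center onto the depth
image under a pinhole model and bilinearly interpolate the depth there. A
minimal sketch of both follows; the intrinsics and image layout are
assumptions, not the device calibration.
\begin{verbatim}
#include <cmath>

struct DepthImage {
  int width, height;
  const float* depth;  // row-major depth in meters; <= 0 invalid
};

// Project a camera-frame point (z forward) to pixel coordinates.
void project(float x, float y, float z,
             float fx, float fy, float cx, float cy,
             float& u, float& v) {
  u = fx * (x / z) + cx;
  v = fy * (y / z) + cy;
}

// Bilinearly interpolate the depth image at (u, v).
// Returns a negative value if (u, v) falls outside the image.
float interpolateDepth(const DepthImage& img, float u, float v) {
  int u0 = static_cast<int>(std::floor(u));
  int v0 = static_cast<int>(std::floor(v));
  if (u0 < 0 || v0 < 0 ||
      u0 + 1 >= img.width || v0 + 1 >= img.height)
    return -1.0f;
  float du = u - u0, dv = v - v0;
  const float* d = img.depth;
  float d00 = d[v0 * img.width + u0];
  float d10 = d[v0 * img.width + u0 + 1];
  float d01 = d[(v0 + 1) * img.width + u0];
  float d11 = d[(v0 + 1) * img.width + u0 + 1];
  return (1 - du) * (1 - dv) * d00 + du * (1 - dv) * d10 +
         (1 - du) * dv * d01 + du * dv * d11;
}
\end{verbatim}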
\subsection{Rendering}
\label{section:render}
Most other \TSDF fusion methods \cite{Newcombe, NiessnerHashing} render the
scene by directly raytracing it on the GPU. Hardware limitations prevent us from doing
this. Instead, we use an approach from computer graphics for rendering large
terrains \cite{GPUGEMS3} using incremental Marching Cubes \cite{Lorensen1987}.
For each chunk, we store a triangle mesh. Meshing is done asynchronously with
the depth scan fusion, and is performed lazily (\ie, only when chunks need to
be rendered).
Triangle meshes are generated at the zero isosurface whenever the chunk has been
updated by a depth scan. As in \cite{Bylow2013, Whelan2013}, colors are
computed for each vertex by trilinear interpolation. At the borders of chunks,
vertices are duplicated. Only those meshes which are visible to the virtual camera
frustum are rendered. Meshes that are very far away from the camera are
rendered as a colored bounding box. Meshes far away from the virtual camera are
destroyed once the number of allocated meshes exceeds some threshold.
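A minimal sketch (assumed bookkeeping, reusing the \texttt{ChunkID},
\texttt{Chunk}, and \texttt{ChunkHasher} types from the spatial hashing sketch
above) of this lazy, incremental meshing strategy is:
\begin{verbatim}
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Mesh {
  std::vector<float> vertices, colors;
  std::vector<uint32_t> indices;
};

// Placeholder stub: the real code runs incremental Marching
// Cubes over the chunk to extract the zero isosurface.
Mesh marchingCubes(const Chunk& chunk) { return Mesh{}; }

struct MeshCache {
  std::unordered_map<ChunkID, Mesh, ChunkHasher> meshes;
  std::unordered_set<ChunkID, ChunkHasher> dirty;

  // Called whenever a chunk is updated by a depth scan.
  void markUpdated(const ChunkID& id) { dirty.insert(id); }

  // Called only for chunks visible to the virtual camera:
  // remesh lazily if the chunk is dirty or has never been meshed.
  const Mesh& renderableMesh(const ChunkID& id,
                             const Chunk& chunk) {
    if (dirty.erase(id) > 0 || meshes.count(id) == 0) {
      meshes[id] = marchingCubes(chunk);
    }
    return meshes[id];
  }
};
\end{verbatim}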
\subsection{Pose Estimation}
\label{section:pose}
We assume we have \textit{a priori} black-box pose estimation of the
sensor from a separate pose tracking system. Specifically, in our experiments,
pose data is estimated from an onboard visual-inertial odometry system that fuses
data from a wide-angle camera, an inertial measurement unit, and a gyroscope at
60 Hz using 2D feature tracking and an Extended Kalman Filter. A more detailed
discussion of this system can be found in \cite{VINS, VINS2}. Unlike
most other \TSDF fusion methods \cite{Newcombe, Whelan2013,
Bylow2013, NiessnerHashing} , our implementation makes no attempt to estimate
the pose of the sensor directly from depth data. Instead, we use depth data
only for surface reconstruction. Unfortunately, decoupling the pose estimation
and surface reconstruction results in failure to correct for drift in the pose
estimation using depth data, and can result in misalignment of depth scans.
\figref{fig:drift} shows a top-down view of a $\sim
175$ meter long corridor reconstructed by a user holding the device. The
left figure shows the reconstruction achieved by visual odometry alone,
with $\sim 5$ meter drift present at the end of the corridor. The right
figure shows the reconstruction after the trajectory has been globally
optimized offline using bundle adjustment, overlaid on the floorplan of
the building. We found direct scan-to-model ICP (as in
\cite{Newcombe}), used to augment visual odometry and sparse keypoint mapping, to be
too slow to run in real-time on the device alongside \TSDF fusion. This is not
to say that another approach combining sparse keypoints with occlusion data
from the \TSDF (as in \cite{StuecklerSparseDense}) would not work.
% Section removed to save space
% \subsubsection{Pose Interpolation}
% We will assume that pose estimation, depth sensing, and color sensing are
% asynchronously executed, so we may not have all three measurements at a given
% time. To deal with asynchronous sensing and pose estimation, we linearly
% interpolate poses over time to determine the pose of a given sensor at any given time.
%
% Call the global pose estimate of the sensor at time $t$, ${H_{s}}^t$.
% Additionally, assume we have apriori extrinsic pose estimates for the color
% sensor ${E_{c}}$ and the depth sensor ${E_{d}}$. We are given a discrete
% trajectory of pose estimates from the visual-inertial odometry system:
%
% $$ {H_{s}}^{t_1}, {H_{s}}^{t_2}, \ldots, {H_{s}}^{t_T} $$
%
% Suppose at time $t'$, we get a color image. Assume that $t'$ is not represented
% in the list of sensor body poses. Then, either $t' < t_1$, $t' > t_T$, or
% $t_a < t < t_b'$ for some $a, b \in [1, T]$. In the first case ($t' < t_1$), we
% disregard the color image. In the second case ($t' > t_T$), we wait until more data is
% acquired. In the third case, we would interpolate the pose of the color camera
% as:
%
% $$ {H_{c}}^{t'} \approx E_c \cdot \textbf{Interp}\left(H_{s}^{t_a}, H_{s}^{t_b},
% \frac{t' - t_a}{t_b - t_a}\right) $$
%
% Where $\textbf{Interp}$ is a function that interpolates between two
% poses by linearly interpolating translation while spherically interpolating
% rotation, using $\frac{t' - t_a}{t_b - t_a} \in [0, 1]$ as an interpolation
% parameter.
% Section removed to save space
% \subsection{User Interface}
% We developed a simple and intuitive prototype interface to test our approach on
% the device. Users were presented with options for controlling the resolution of
% the reconstruction, and were able to control when the system was active and
% inactive, allowing them to pause when moving objects obscured objects they were
% trying to model.
%
% We present the user with camera controls as well. A virtual first person