Commit 72ec939 (1 parent: 0ab0250)
Richard Zhao committed on May 10, 2017
Showing 1 changed file with 25 additions and 7 deletions.
@@ -15,6 +15,9 @@ and [Richard Zhao](mailto:[email protected]) (richardz)
![frames]({{ site.baseurl }}/public/img/temple_3.gif)
![flow]({{ site.baseurl }}/public/img/flow.gif)

+The top gif is an input frame sequence, from which we calculate an optical flow (bottom) that
+represents the movement of subjects in the input.

## Summary

We implement super-realtime (>30 fps), high-resolution optical flows on a mobile GPU platform. Fast
@@ -23,15 +26,22 @@ as object detection or image stabilization.

## Challenges

-Image pyramids
+The main technical challenges in this project involve optimizing the algorithm to run
+on the NVIDIA Jetson, which has a less powerful CPU and GPU than a traditional desktop machine.

-Number of patches increases
+Since copying memory between the device and the host is the main performance bottleneck, we designed
+the architecture as a pipeline that performs its copies only at the very beginning and the very end.
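To make that copy-at-the-ends structure concrete, here is a minimal CUDA sketch of how such a pipeline could be organized. It is an illustration added for this write-up, not code from this commit; `computeFlow`, the buffer layout, and the placeholder kernel names are all assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of the copy-at-the-ends pipeline: the two input frames
// are copied to the GPU once, every intermediate stage works on device-resident
// buffers, and only the final flow field is copied back to the host.
void computeFlow(const unsigned char* h_frame0, const unsigned char* h_frame1,
                 float* h_flow, int width, int height)
{
    size_t frameBytes = size_t(width) * height;                     // 8-bit grayscale frame
    size_t flowBytes  = size_t(width) * height * 2 * sizeof(float); // (u, v) per pixel

    unsigned char *d_frame0 = nullptr, *d_frame1 = nullptr;
    float *d_flow = nullptr;
    cudaMalloc(&d_frame0, frameBytes);
    cudaMalloc(&d_frame1, frameBytes);
    cudaMalloc(&d_flow, flowBytes);

    // Host -> device copies happen only here, at the start of the pipeline.
    cudaMemcpy(d_frame0, h_frame0, frameBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_frame1, h_frame1, frameBytes, cudaMemcpyHostToDevice);

    // All intermediate stages launch kernels on device memory only
    // (placeholder launches shown as comments):
    //   buildPyramid<<<grid, block>>>(d_frame0, ...);
    //   buildPyramid<<<grid, block>>>(d_frame1, ...);
    //   patchGradientDescent<<<grid, block>>>(..., d_flow);

    // Device -> host copy happens only here, at the end of the pipeline.
    cudaMemcpy(h_flow, d_flow, flowBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_frame0);
    cudaFree(d_frame1);
    cudaFree(d_flow);
}
```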
-Maintaining accuracy of the flow
+The most significant computational bottleneck in the original implementation was the construction
+of image pyramids (a series of downsampled images, and their gradients). We used CUDA kernels to
+significantly improve the performance of this step.
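As a rough illustration of the kind of kernel this step involves (again an assumption-laden sketch, not the commit's code), a single 2x downsampling level could look like the following; a full pyramid would launch it once per level, with similar kernels producing the gradients.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of one pyramid level: 2x downsampling by averaging a
// 2x2 neighborhood. One thread produces one output pixel.
__global__ void downsample2x(const float* src, float* dst,
                             int srcWidth, int srcHeight,
                             int dstWidth, int dstHeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstWidth || y >= dstHeight) return;

    // Clamp at the right/bottom border so odd-sized levels stay in bounds.
    int sx0 = 2 * x, sx1 = min(sx0 + 1, srcWidth - 1);
    int sy0 = 2 * y, sy1 = min(sy0 + 1, srcHeight - 1);

    float sum = src[sy0 * srcWidth + sx0] + src[sy0 * srcWidth + sx1]
              + src[sy1 * srcWidth + sx0] + src[sy1 * srcWidth + sx1];
    dst[y * dstWidth + x] = 0.25f * sum;
}
```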
-Memory management
+Additionally, during the gradient descent phase of the algorithm, which acts on local patches of
+the image, careful management of thread blocks is required to hide the system's memory latency.
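One common way to organize the thread blocks for this (a sketch of the general technique, not necessarily the approach used here) is to have each block stage its patch in shared memory once, so the repeated reads of the gradient descent iterations hit fast on-chip memory; `PATCH`, `patchOrigins`, and the kernel name are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define PATCH 8   // assumed patch size; launch with blockDim = (PATCH, PATCH)

// Hypothetical sketch: one thread block per patch. The block cooperatively
// stages its patch in shared memory, so the repeated reads made by the
// gradient descent iterations hit on-chip memory instead of DRAM.
__global__ void patchDescentStep(const float* image, int width, int height,
                                 const int2* patchOrigins, float2* flow)
{
    __shared__ float tile[PATCH][PATCH];

    int2 origin = patchOrigins[blockIdx.x];   // top-left corner of this block's patch
    int lx = threadIdx.x;                     // 0 .. PATCH-1
    int ly = threadIdx.y;                     // 0 .. PATCH-1

    int gx = min(origin.x + lx, width - 1);
    int gy = min(origin.y + ly, height - 1);
    tile[ly][lx] = image[gy * width + gx];
    __syncthreads();

    // ... the gradient descent iterations for this patch would read tile[][]
    // here and update flow[blockIdx.x] when they converge ...
}
```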
-Optimizing to Jetson (which has a lackluster CPU)
+Finally, all of our optimizations are done while preserving the accuracy of the computed flow. This
+makes our approach both fast and accurate enough for realtime use.
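For reference, flow accuracy is typically quantified as the average endpoint error against a reference flow; the helper below is a hypothetical host-side sketch of that check, not part of this commit.

```cuda
#include <cmath>

// Hypothetical host-side check: average endpoint error (EPE) between a computed
// flow field and a reference flow, both stored as interleaved (u, v) pairs.
// An optimized implementation should keep this value close to the original's.
float averageEndpointError(const float* flow, const float* reference,
                           int width, int height)
{
    double total = 0.0;
    int n = width * height;
    for (int i = 0; i < n; ++i) {
        double du = flow[2 * i]     - reference[2 * i];
        double dv = flow[2 * i + 1] - reference[2 * i + 1];
        total += std::sqrt(du * du + dv * dv);
    }
    return float(total / n);
}
```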
## Preliminary Results
@@ -40,8 +50,16 @@ All results are from our code running on an NVIDIA Jetson TX2.

### Optical Flow (total)

Using a hybrid GPU-CPU implementation, we achieve an end-to-end latency of roughly 10 ms. This
-is a speedup of roughly 10x.
+is a speedup of around **10x**. At roughly 10 ms per frame pair, this is on the order of 100 flow
+computations per second, comfortably above the >30 fps realtime target.
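Latency figures like this are commonly measured with CUDA events around the full pipeline call; the harness below is a hypothetical sketch that assumes the `computeFlow` entry point from the earlier pipeline sketch, not the authors' actual benchmarking code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed pipeline entry point (see the earlier pipeline sketch).
void computeFlow(const unsigned char* f0, const unsigned char* f1,
                 float* flow, int width, int height);

// Hypothetical timing harness: CUDA events bracket the full pipeline call,
// so the reported time includes the host<->device copies at both ends.
float timeFlowMs(const unsigned char* f0, const unsigned char* f1,
                 float* flow, int width, int height)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    computeFlow(f0, f1, flow, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the pipeline has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("end-to-end latency: %.2f ms\n", ms);
    return ms;
}
```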
### Image pyramid construction

-90 ms => 3 ms (30x speedup)
+The image pyramid construction step was optimized to run in just 3 ms, a speedup of
+**30x** over our optimized CPU version, which takes 90 ms.

+## Remaining Work

+Before the deadline, we still have some final tuning to do on the gradient descent algorithm, and we
+still need to finalize the video processing pipeline (currently, our pipeline operates on two images
+at a time). The performance figures should, however, remain roughly the same.