From 72ec93980c4de66f9909d773ff08e79c44bacf62 Mon Sep 17 00:00:00 2001
From: Richard Zhao
Date: Wed, 10 May 2017 15:03:09 -0400
Subject: [PATCH] Update prelim results

---
 docs/index.md | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 5233f6b..b217ffd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,6 +15,9 @@ and [Richard Zhao](mailto:richardz@andrew.cmu.edu) (richardz)
 ![frames]({{ site.baseurl }}/public/img/temple_3.gif)
 ![flow]({{ site.baseurl }}/public/img/flow.gif)
 
+The top gif is an input frame sequence, from which we calculate an optical flow (bottom) which
+represents movement of subjects in the input.
+
 ## Summary
 
 We implement super-realtime (>30fps), high resolution optical flows on a mobile GPU platform. Fast
@@ -23,15 +26,22 @@ as object detection or image stabilization.
 
 ## Challenges
 
-Image pyramids
+The main technical challenges associated with this project involve optimizing the algorithm to run
+on the NVIDIA Jetson, which has a less powerful CPU and GPU than traditional desktop machines.
 
-Number of patches increases
+Since copying memory between the device and host is the main performance bottleneck, we designed
+the architecture as a pipeline which essentially performs copies at just the beginning and end of
+the pipeline.
 
-Maintaining accuracy of the flow
+The most significant computational bottleneck in the original implementation was the construction
+of image pyramids (a series of downsampled images, and their gradients). We used CUDA kernels to
+significantly improve the performance of this step.
 
-Memory management
+Additionally, during the gradient descent phase of the algorithm, which acts on local patches of
+the image, careful management of thread blocks is required to hide the system's memory latency.
 
-Optimizing to Jetson (which has a lackluster CPU)
+Finally, all of our optimizations are done while preserving the accuracy of the computed flow. This
+makes our approach both fast and accurate enough for realtime use.
 
 ## Preliminary Results
 
@@ -40,8 +50,16 @@ All results are from our code running on an NVIDIA Jetson TX2.
 ### Optical Flow (total)
 
 Using a hybrid GPU-CPU implementation, we achieve an end-to-end latency of roughly 10ms. This
-is a speedup of roughly 10x.
+is a speedup of around **10x**.
 
 ### Image pyramid construction
 
-90 ms => 3 ms (30x speedup)
+The image pyramid construction step was optimized to run in just 3 ms, which is a speedup of
+**30x** over our optimized CPU version, which takes 90 ms.
+
+## Remaining Work
+
+Before the deadline, we still have some final tuning to do on the gradient descent algorithm,
+and we need to finalize the video processing pipeline (currently, our pipeline operates on two images at a
+time). The performance figures should remain, however, roughly the same.
+