From 72ec93980c4de66f9909d773ff08e79c44bacf62 Mon Sep 17 00:00:00 2001
From: Richard Zhao <richardz@andrew.cmu.edu>
Date: Wed, 10 May 2017 15:03:09 -0400
Subject: [PATCH] Update prelim results

---
 docs/index.md | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 5233f6b..b217ffd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,6 +15,9 @@ and [Richard Zhao](mailto:richardz@andrew.cmu.edu) (richardz)
 ![frames]({{ site.baseurl }}/public/img/temple_3.gif)
 ![flow]({{ site.baseurl }}/public/img/flow.gif)
 
+The top GIF is an input frame sequence, from which we calculate an optical flow (bottom) that
+represents the movement of subjects in the input.
+
 ## Summary
 
 We implement super-realtime (>30fps), high resolution optical flows on a mobile GPU platform. Fast
@@ -23,15 +26,22 @@ as object detection or image stabilization.
 
 ## Challenges
 
-Image pyramids
+The main technical challenges associated with this project involve optimizing the algorithm to run
+on the NVIDIA Jetson, which has a less powerful CPU and GPU than traditional desktop machines.
 
-Number of patches increases
+Since copying memory between the device and host is the main performance bottleneck, we designed
+the architecture as a pipeline that performs these copies only at the beginning and end of
+the pipeline.
 
-Maintaining accuracy of the flow
+The most significant computational bottleneck in the original implementation was the construction
+of image pyramids (a series of downsampled images and their gradients). We used CUDA kernels to
+significantly improve the performance of this step.
 
-Memory management
+Additionally, during the gradient descent phase of the algorithm, which acts on local patches of
+the image, careful management of thread blocks is required to hide the system's memory latency.
 
-Optimizing to Jetson (which has a lackluster CPU)
+Finally, all of our optimizations are done while preserving the accuracy of the computed flow. This
+makes our approach both fast and accurate enough for realtime use.
 
 ## Preliminary Results
 
@@ -40,8 +50,16 @@ All results are from our code running on an NVIDIA Jetson TX2.
 ### Optical Flow (total)
 
 Using a hybrid GPU-CPU implementation, we achieve an end-to-end latency of roughly 10ms. This
-is a speedup of roughly 10x.
+is a speedup of around **10x**.
 
 ### Image pyramid construction
 
-90 ms => 3 ms (30x speedup)
+The image pyramid construction step now runs in just 3 ms, a speedup of **30x** over our
+optimized CPU version, which takes 90 ms.
+
+## Remaining Work
+
+Before the deadline, we still have to do some final tuning of the gradient descent algorithm
+and finalize the video processing pipeline (currently, our pipeline operates on two images at a
+time). The performance figures should, however, remain roughly the same.
+