diff --git a/README.md b/README.md
index 5dd1785..e4ab739 100644
--- a/README.md
+++ b/README.md
@@ -54,18 +54,18 @@ pip install -e .
 
 **Pretrained models:**
 
-The models will be downloaded automatically when you run the demo script.
-
-| Model    | Download link | File size | MD5 checksum |
-| -------- | ------- | ------- | ------- |
-| Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M | af93cde404179f58e3919ac085b8033b |
-| Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_16k.pth">mmaudio_small_44k.pth</a> | 601M | babd74c884783d13701ea2820a5f5b6d |
-| Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_small_16k.pth">mmaudio_medium_44k.pth</a> | 2.4G | 5a56b6665e45a1e65ada534defa903d0 |
-| Flow prediction network, large 44.1kHz (recommended) | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_small_16k.pth">mmaudio_large_44k.pth</a> | 3.9G | fed96c325a6785b85ce75ae1aafd2673 |
-| 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M | 69f56803f59a549a1a507c93859fd4d7 |
-| 16kHz BigVGAN vocoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M | eeaf372a38a9c31c362120aba2dde292 |
-| 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G | fab020275fa44c6589820ce025191600 |
-| Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M | 5b2f5594b0730f70e41e549b7c94390c |
+The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
+
+| Model    | Download link | File size |
+| -------- | ------- | ------- |
+| Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
+| Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
+| Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
+| Flow prediction network, large 44.1kHz **(recommended)** | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
+| 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
+| 16kHz BigVGAN vocoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
+| 44.1kHz VAE |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
+| Synchformer visual encoder |<a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
 
 The 44.1kHz vocoder will be downloaded automatically.
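+
+If you download a checkpoint manually, you can verify it against the MD5 checksums in `mmaudio/utils/download_utils.py`. A minimal sketch (the `weights/` path below is an example; substitute wherever you saved the file):
+
+```python
+import hashlib
+
+def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
+    """Compute the MD5 hex digest of a file, reading in chunks."""
+    h = hashlib.md5()
+    with open(path, 'rb') as f:
+        for chunk in iter(lambda: f.read(chunk_size), b''):
+            h.update(chunk)
+    return h.hexdigest()
+
+# Compare against the expected checksum listed in download_utils.py, e.g.:
+# md5_of('weights/mmaudio_large_44k.pth') == 'fed96c325a6785b85ce75ae1aafd2673'
+```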
 
@@ -100,8 +100,8 @@ MMAudio
 
 ## Demo
 
-By default, these scripts uses the `large_44k` model. 
-In our experiments, it only uses around 6GB of GPU memory (in 16-bit mode) which should fit in most modern GPUs.
+By default, these scripts use the `large_44k` model.
+In our experiments, inference uses only around 6GB of GPU memory (in 16-bit mode), which should fit on most modern GPUs.
 
 ### Command-line interface
 
@@ -111,7 +111,7 @@ python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
 ```
 See the file for more options.
 Simply omit the `--video` option for text-to-audio synthesis.
-The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in lower quality.
+The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in lower quality.
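+
+For example, a text-to-audio invocation looks like the video-to-audio one above, just without `--video` (the prompt is illustrative):
+
+```bash
+python demo.py --duration=8 --prompt "your prompt"
+```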
 
 
 ### Gradio interface
@@ -132,3 +132,6 @@ We believe all of these three limitations can be addressed with more high-qualit
 
 ## Training
 Work in progress.
+
+## Evaluation
+Work in progress.