My personal insights using Rave (training compendium) #300
Hi, many thanks for your kind and very detailed review of your personal experience. It is a great pity that Ircam has so far never presented decent, well-organized documentation for this project. I have asked the developers for clues many times, posting here and contacting them via mail for additional insights about configuration strategies etc., but they are almost entirely unsupportive and, as we can see, they do not even answer the most basic questions over here. It is sad, but it is the plain reality. So your input is highly appreciated, many many thanks, and let's hope that in the near future things will change a bit on Ircam's side, if they intend to support this project for real, by providing more resources and content to Rave and its team.
Finally, good news.
Many many thanks for the update.
all the best
*Prof. Federico Placidi*
CAC and Live electronics
www.slmc.it
+39 06 48 700 17
whatsapp +39 3661019692
…On Sun, Mar 24, 2024 at 9:27 PM jchai.me ***@***.***> wrote:
Maybe worth mentioning is that Axel Chemla Romeu Santos gave a
presentation at the IRCAM forum this past week giving an update on RAVE.
Two of the main things he reported as coming soon were an updated RAVE VST
and detailed documentation videos.
NEW: Clore.AI tutorial - rent 4090 server and train A++ RAVE model in just 3-5 days for $12 - $20
NEW: There are 3 official tutorials now that explain quite a few bits and pieces that were previously unknown (one about training). Unfortunately, a lot of the questions that this guide wrestles with still remain unanswered. I have since rewritten my guide to reflect all of the recent insights from these and my own.
Watch the new tutorials:
I have raised a Github issue to improve documentation, and one of the suggestions was that community members could do this. This is why I want to post my personal experiences as a first step here.
I want to stress that I don't really have deeper insights into how Rave is programmed, or the more academic side of machine learning, and also not music generation in general. I am just a former sysadmin doing ML stuff as a hobby, basically to test my 3090. Therefore my understanding might be limited. And what I have experienced to be true doesn't necessarily reflect on what is truly the case, although I try my best to check if those things are universally so.
My experiences are more or less limited to the v2 model, 1/2 channels, wasserstein and an accidentally messed-up "discrete" model. I am not considering special applications such as low latency or low compute power. Also I will only talk about Pure Data to run nn~ and not the commercial alternatives.
Also as time passes, things might change, so what's now broken might be not some months later. I'm just trying to document what you can expect and where you have to be careful, and can run into problems. And maybe no amount of fiddling can fix it for you.
First, read the README.md of course, but also watch all the Youtube videos (second one is not in README.md!):
Then follow this guide (at minimum read "my command chain from start to finish"), which contains critical information not mentioned elsewhere.
Update: Also read those two. They are pretty short, and I overlooked them for years!
what to expect, quality of output
I would say the quality with v2 is really good, but you can still notice it isn't perfect. In terms of how impressive it is, I would say it is somewhat similar to going from 128kbps to 96kbps, or 256kbps to 128kbps if you will. But this doesn't pertain only, or even mainly, to the pure audio data, but generally to how the model learns the entire sound. It sounds quite good, but just by a notch, you can still notice it is slightly inferior to the original. With a lot of sounds, like drums and a lot of instruments, this loss of quality doesn't matter at all and might not even be audible. On the other hand, people have suggested simply training the model longer in both phases to yield quasi-indistinguishable quality (which I have not tested to the extreme). How true this is probably still depends somewhat on the sound.
I think RAVE is great, extremely usable, and just the state of the art. Probably Rave is the best thing out there so far.
I was asked why the model doesn't properly transform a humming sound (voice) into a violin sound. This is because the human voice sounds so radically different from a violin. The model only really understands its training data, and there is no special logic in Rave that makes the model detect melodies or musical patterns, other than whatever the model believes to make sense on its own, by pure magic of machine learning (which is often not really logical or intuitive in human terms). You can use nn~ and Pure Data and various filters on your voice input, so it essentially sounds more like a violin in a very crude kind of manner. And then the model will be more inclined to actually play the melody you hum, and it will of course then sound very much like a violin. But it will still transfer features from the input (which is what makes Rave great), so it doesn't sound identical to a real violin. If you know better methods, let me know. There are other projects like MusicGen that take different approaches, and those are designed around feeding it note sheets or text data and such. This is not something that Rave can do (but could potentially do in the future, if someone programmed a prior model for this). It is pretty much a very candid low-level kind of transformation, but one that can work and yields much better results than other approaches. And if you want something higher level, you have to build something on top of it (like simply input wav filtering, or more advanced new prior designs) or choose a different project.
From what I understand, to make a Rave model output more coherent patterns, you can train a prior, which operates on the latent space only. The latent space is essentially a choke point designed into the model (16 channels at 20Hz), where you have the opportunity to manipulate, between encoder and decoder, whatever the model distilled to be really meaningful inside its own mystery black box understanding about the input audio. The 16 different channels will randomly resemble some kind of (possibly ineffable) feature quality, like "loudness" or "pitchy" or "vibrato-ish". So the prior is an extension that works on top of your model in this dimensional space. From how I understand it, the prior can for example make your model speak (unintelligible) words and sentences, rather than just single syllables. I don't have much experience with that yet, unfortunately. But I believe that, as it is now out of the box, this is mainly useful to make your model output more coherent sounds when providing no input sounds, or not very definitive ones. The general experience seems to be that the output of the priors currently leaves much to be desired. But ostensibly you can potentially do all sorts of things with a prior. Maybe some day even feed it text commands and note sheets like in MusicGen, if someone does the programming for this.
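To get a feeling for how narrow this choke point is, here is a small arithmetic sketch. It only uses the figures quoted above (16 latent channels at 20Hz) plus a 44100 Hz mono input; the function names are made up for illustration and are not part of Rave.

```python
def latent_shape(duration_s, latent_channels=16, latent_rate_hz=20):
    """Return (channels, frames) of the latent sequence for a clip.

    The defaults are the figures quoted in this guide, not values
    read from a Rave config.
    """
    frames = int(duration_s * latent_rate_hz)
    return latent_channels, frames


def samples_per_latent_value(duration_s, sample_rate=44100,
                             latent_channels=16, latent_rate_hz=20):
    """How many raw mono samples are summarized by one latent number."""
    channels, frames = latent_shape(duration_s, latent_channels, latent_rate_hz)
    raw_samples = int(duration_s * sample_rate)
    return raw_samples / (channels * frames)


channels, frames = latent_shape(30)
print(channels, frames)                # 16 600
print(samples_per_latent_value(30))    # 137.8125
```

So a 30 second mono clip ends up as a 16 x 600 grid of numbers, and that grid is everything a prior (or your own latent manipulation) gets to work with.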
With msprior it should be true that currently only the prior configs "modulated_alibi" and "rwkv_semantic" give you a "semantic control" interface, as shown on the screenshot on the msprior Github page. You can then use joysticks or motion sensors to perform semantic manipulations (e.g. how entire sentences are phrased and shaped), rather than just skew the latent dimensions of the model directly (e.g. how the pitch of the voice is or how soft/harsh it sounds). I am not sure though if you need to pick "discrete" config in rave as architecture for this to work properly. Please read "the actual training" on how you cannot combine "discrete" with other configs.
Check out RAVE-Latent-Diffusion. It generates audio unconditionally on the latent space of Rave, and in their demo makes it output very well sounding and very coherent techno music. You can also easily combine this with latent manipulations as suggested in "python script" here. At the end of this guide there are audio examples I generated from both things.
hardware and time requirements, cost and recommendations
Any GPU less beefy than a 3090 is potentially problematic and too slow.
The initial v1 config has been designed for a machine with 16GB VRAM, but with later configs this got bigger and bigger. The default batch size is 8, which directly correlates with VRAM used, and "could" be lowered if your GPU has too little VRAM. So a batch size of 1 would be 2GB instead of 16GB ... however no one I spoke to knows if changing batch size, or changing it by a lot, affects (and potentially ruins) training results significantly (I sometimes switched to 4 in phase 2 because this phase uses more VRAM, with no apparent downside, but when I used 6 with v3 it actually destroyed results right away). Rave is both compute-heavy and VRAM-heavy. Changing batch size from 8 to 4 is about 5% slower.
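The batch-size rule of thumb above can be written down as a tiny estimator. This is only a sketch of the linear scaling described here (8 samples per batch corresponding to 16GB); real usage also depends on config, phase and sample window, so treat the result as a first guess. The function names are hypothetical.

```python
def estimate_vram_gb(batch_size, ref_batch=8, ref_vram_gb=16.0):
    """Scale VRAM linearly with batch size from one reference point."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return ref_vram_gb * batch_size / ref_batch


def largest_batch_that_fits(gpu_vram_gb, ref_batch=8, ref_vram_gb=16.0):
    """Largest batch size whose linear estimate stays within gpu_vram_gb."""
    per_sample_gb = ref_vram_gb / ref_batch
    return max(0, int(gpu_vram_gb // per_sample_gb))


print(estimate_vram_gb(1))            # 2.0
print(largest_batch_that_fits(10))    # 5
```

If you have measured your own run (for example with nvidia-smi), pass that measurement as ref_batch / ref_vram_gb instead of the defaults.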
Do not use Google Colab. They changed their plans a while back, and now you get like 10-20x less compute power for the same price. Before, you could do 3/4 of a run or 1/2 of a run in 1 month on the $10 plan (V100, almost half as fast as a 3090). Depending on your settings, a 6M run takes somewhere in the range of 150-250 hours on a V100, which currently costs about $400-$600 on Colab.
A used 3090 now costs $600 on Ebay and renting it is like $1-$3 per day on clore.ai (it varies, sometimes as little as $1 and sometimes more). In my country power costs €0.33/kWh, so I pay like €20-€40 (€28-€55 with wasserstein and increased num_signal) for a sort of "production-level" model (2 ch takes twice as long) and it takes like 6/12 days (1ch/2ch), but at minimum 3/6 days to make it sound somewhat usable/workable (which again would be more like 8/16 days and 4/8 days with wasserstein +num_signal). To test the tools I used mono in my first run. Then there was an issue due to lacking/ambiguous documentation and I had to discard the second run. So now with my third run, I am close to €100 in power cost alone.
I hope this guide helps you to avoid this extra cost, which is why I try to be as comprehensive and verbose as possible.
Since the 4090 has more than double the power of a 3090 and is easy to get at $4/day at the open Marketplace at clore.ai, I pretty much think nothing else makes sense anymore and it is reasonably fast as well. If you don't fool around, just push the prepared dataset-DB server-2-server, and then run rave training instantly, you could easily get it done close to perfection at the cost of maybe just $16 or so. Please read the section about clore.ai at the end.
On 4090 with heavily customized wasserstein.gin, I somehow get an astonishing 2 million steps per day in phase 1 (v2 1ch), but only 0.7M steps per day in phase 2. So to train to 3M total (which yielded good results for me in the past with v2 vanilla) I would need all in all with file upload a little less than 3 days ($12). And to get to 6M (= totally impossible to get better quality by running even longer) it would be 7 days rental time ($28).
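If you want to redo this kind of estimate with your own numbers, the arithmetic can be sketched like this. It takes the steps-per-day figures from Tensorboard for each phase and a daily rental price. The phase 1 length of 1M steps used in the example is an assumption (the v2 default mentioned in this guide), and the function names are made up for illustration. Upload time and failed servers are not included.

```python
def training_days(total_steps, phase1_steps,
                  phase1_steps_per_day, phase2_steps_per_day):
    """Days needed when the two phases run at different speeds."""
    phase2_steps = total_steps - phase1_steps
    if phase2_steps < 0:
        raise ValueError("phase1_steps cannot exceed total_steps")
    return (phase1_steps / phase1_steps_per_day
            + phase2_steps / phase2_steps_per_day)


def rental_cost(days, price_per_day):
    """Plain rental cost, ignoring upload time and failed servers."""
    return days * price_per_day


# Example with the 4090 speeds quoted above and $4 per day.
days = training_days(3_000_000, 1_000_000, 2_000_000, 700_000)
print(f"{days:.1f} days, about ${rental_cost(days, 4.0):.0f}")
```

Because phase 2 is so much slower, the total is dominated by how many steps you spend there, so a shorter or longer phase 1 shifts the result noticeably.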
On v2 2-channels wasserstein causal with 4x the sample window on 3090, I got only 250k steps per day in phase 1 and somewhat less than half that in phase 2 (without wasserstein training is about 40% faster overall, given that phase 1 is equal in length, which it is not by default). In phase 2 it uses 22GB VRAM with batch size 4. V2 without wasserstein, 1 channel and 2x sample window only took 20GB with batch size 8 (I think). Please note that I did increase num_signal (sample window) as mentioned, which inflates those numbers more or less by how much bigger the sample window was.
As I revised this guide, I am now inclined to recommend against using num_signal at all, because of such and maybe other unintended consequences.
When talking about speed, always use steps per day/hour as seen in Tensorboard, since the other measurements differ with different parameters.
running on Windows
If you use WSL2 on Windows, you basically get an Ubuntu Linux VM with CUDA integration, so it shouldn't be an issue to do all this under Windows as well. If there are any issues, let me know so I can update this guide. The only annoying thing I remember about WSL2 is that you have to download it from the Microsoft store, and that the store needs you to register an account and malfunctions upon first use (packages claimed to be unavailable).
general procedure
Follow the README.md or this guide to train and export your model.
During training your model will generate checkpoints, which can be used to either export a model or resume training. The best.ckpt is a safe and much older version. Conversely, with the epoch-epoch=XXXX.ckpt there are no 100% guarantees that it doesn't somewhat degrade your model somehow (a little bit only, I suppose). I don't know how important this really is, but I think it is not such a huge deal to use epoch-.. checkpoints a couple of times. Still, if you can avoid it, don't abort the training, or at least wait for a new best.ckpt before you do.
I highly recommend that you immediately export your very first checkpoint (& cancel training if necessary), then test it in nn~, generate your prior if desired (again only use first checkpoint to export it), and just test out if whatever tools you use actually work. Because a lot of stuff doesn't actually work and might never work. You don't want to train on "discrete" for 6 days and then notice it only produces silent output in nn~. Or that v3 config doesn't actually export. It is really important to test out the entire command chain before trying to produce anything of use. As a rule of thumb, mono is very well supported, but stereo is not really and sometimes creates issues, or doesn't allow you to progress. You certainly can create a normal 2 channel model though and prior (unknown if prior output is actually intelligible and it trains only on mono-audio, so not true stereo) that works in nn~. You can also use RAVE-Latent-Diffusion with 2 channel model via hack/fix. But as of right now, it will only work with mono wav files and produces mono output.
During training Tensorboard is very useful:
tensorboard --logdir=~/output/ --bind_all
In Tensorboard, three things are really important:
Rave always trains your model in two distinct phases:
Phase 1 is about teaching your model what the sounds are all about. In this phase, audio samples will always sound very very distorted and bad, stingy and offensive kind of bad. But you should very slowly notice very small improvements in how well the sound is reflected. With v2 the default length is 1M steps, with other configs like wasserstein the default is 200k steps. In this phase you should be able after a while to somewhat make out the sounds from the training data, somewhat as if listening to a very distorted analogue radio call. If you only hear totally pure noise or muted audio, then there is something wrong.
Phase 2, the adversarial phase, is when you will actually be able to notice much bigger improvements in audio quality. Although it will sound very noisy and bad for quite a while, if you reach some 500k steps of this phase it will sound better and better. After 1M steps or so, it might actually sound workable and good. People have recommended 3M steps total for good results, so 2M steps in this phase. If you run it longer, you have to watch out for bad effects such as overfitting. Make sure to compare your final model's output in Tensorboard as well as nn~ with various input wavs against a previous version that sounded almost as good. Also watch out for artifacts, sometimes those will be present from the beginning and never clear, so that's a bummer. I have often seen people use 6M total for best results.
As explained in the tutorial, fidelity_95 states how many dimensions the model believes it needs to explain 95% of the data. It can fluctuate up and down in the beginning, but if it plummets below 3 and stays there for like 100k steps or more and until the end of phase 1, then it indicates that your model has degraded (unless your audio data is super simple). I have had that happen, for example, from 100k to 200k steps with extremely noisy mashup breakcore music and --config noise.
Otherwise the curves usually just go up and down with some smoothing and make less and less progress. The exception is when the phase switches (which is at 1M with default + v2): then the curve radically flips.
Generally speaking, the training works very reliably, given that you have good input data. It is not as if you have to be afraid all the time that something is going wrong by chance. If it isn't broken right away (either muted audio in Tensorboard or audio that is 100% static noise), then whatever else it is will probably just iron out over time.
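To make the idea behind such a "95%" number concrete, here is a minimal sketch. It is not Rave's implementation. It assumes you already have one variance value per dimension (for example from a PCA of the latent space) and counts how many of the strongest dimensions are needed to reach 95% of the total. The function name and the example values are invented.

```python
def dims_for_fidelity(variances, threshold=0.95):
    """Smallest number of dimensions whose variance reaches `threshold`
    of the total, taking the strongest dimensions first."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive value")
    running = 0.0
    for count, var in enumerate(sorted(variances, reverse=True), start=1):
        running += var
        if running / total >= threshold:
            return count
    return len(variances)


healthy = [6.0, 3.0, 0.6, 0.2, 0.1, 0.1]   # made-up example values
collapsed = [9.8, 0.1, 0.05, 0.05]         # one dimension does almost everything
print(dims_for_fidelity(healthy))    # 3
print(dims_for_fidelity(collapsed))  # 1
```

The second case is the kind of situation the warning above is about: if nearly all of the variance sits in one or two dimensions, the count drops and stays low.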
my command chain from start to finish
Before installing rave with pip and using rave, make sure to use python-3.11 and pytorch-cuda, if your distribution offers multiple versions of pytorch. There might also be multiple -cuda and -rocm versions for other APIs, but I think you only need pytorch for Rave. You can also just use conda on Linux, which installs a custom python version and all the packages into a separate environment. I am using 3.10.9 in conda currently because Archlinux uses 3.12 and it doesn't work.
First, gather your input data. Ideally you want very clean studio-level recordings of your sounds (no noise, echos, etc.) and you want lots and lots of data. A minimum of 2-3 hours has been recommended, but the more the better. In order to not run into IO slowdowns, use either a few GB less than RAM size or read from NVMe/SSD. If you train on just a few minutes of data, results might be way too poor and also your training might fail. Rave will learn any and all sounds from the data, including noise.
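To check beforehand whether your converted wav data stays "a few GB less than RAM size", you can estimate its size from the total duration. This is a rough sketch that only accounts for raw 16-bit PCM (the pcm_s16le format used in this guide). The size of the preprocessed database may differ, and the 4 GB headroom is just an assumed stand-in for "a few GB". The function names are made up.

```python
def pcm16_size_bytes(hours, sample_rate=44100, channels=1):
    """Raw size of 16-bit PCM audio (2 bytes per sample per channel)."""
    bytes_per_sample = 2
    return int(hours * 3600 * sample_rate * channels * bytes_per_sample)


def fits_in_ram(dataset_bytes, ram_gb, headroom_gb=4):
    """True if the data leaves `headroom_gb` of RAM free (assumed margin)."""
    return dataset_bytes <= (ram_gb - headroom_gb) * 1024**3


three_hours_mono = pcm16_size_bytes(3)
print(three_hours_mono)                   # 952560000
print(fits_in_ram(three_hours_mono, 16))  # True
```

So the recommended minimum of 2-3 hours of mono audio is under 1 GB, while something like 100 hours of stereo is already around 60 GB and should be read from NVMe/SSD on most machines.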
Do not use --lazy, simply convert your sound data to pcm_s16le, with either two channels or one channel as desired for the model.
2 channels ffmpeg prepping
IFS=$'\n'; for i in `find ./ -maxdepth 1 -name '*.mp3'`; do ffmpeg -i "$i" -c:a pcm_s16le -ar 44100 -ac 2 -y "${i##*/}".wav && touch -r "$i" "${i##*/}".wav; done
For 1 channel downmix, change -ac 2 to -ac 1.
1 channel, left only:
IFS=$'\n'; for i in `find ./ -maxdepth 1 -name '*.mp3'`; do ffmpeg -i "$i" -c:a pcm_s16le -filter_complex '[0:a]channelsplit=channel_layout=stereo:channels=FL[left]' -map '[left]' -ar 44100 -y "${i##*/}".wav && touch -r "$i" "${i##*/}".wav; done
Now put all the wav files into a new folder "./raw_wav_files/".
run rave preprocess
rave preprocess --input_path ./raw_wav_files/ --output_path ./output_pp/ --channels 1 --sampling_rate 44100
It is only possible to alter the sampling rate to 22050, whereas 44100 is the default. Change channels as desired. I now recommend against using num_signal. From what I understand now, num_signal is the raw sample length into which your data is chopped up. It is unknown how that alters the model's behavior. If you want to change it, it is mandatory to use (some sort of) power of two for this. The default value is 131072, which is about 3 seconds at 44100 Hz. When you double this number, it halves the number of steps and hence doubles the length of your training. It also doubles the amount of VRAM required. This perhaps increases output quality and addresses issues such as longer-than-num_signal audio patterns being chopped up too much to be learned properly, or too much chopping resulting in a bad understanding. But in the end no one knows at this point. It could also mess things up, like altering batch size possibly could. I have trained multiple times with increased num_signal (since I had long coherent sound patterns that last 10-20 seconds) with good results though. Maybe someone with a better understanding can clarify this more.
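The relation between num_signal and seconds is simple division by the sampling rate, and the power-of-two requirement can be checked with one bit operation. A small sketch (helper names are made up, nothing here is Rave code):

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (a power of two has exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def window_seconds(num_signal, sample_rate=44100):
    """Length of one training chunk in seconds."""
    return num_signal / sample_rate


for n in (131072, 262144, 524288):
    print(n, is_power_of_two(n), round(window_seconds(n), 2))
```

At 44100 Hz this prints roughly 2.97, 5.94 and 11.89 seconds, so covering a 10-20 second pattern in one chunk already means 4x to 8x the default window, with the VRAM and training-time consequences described above.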
the actual training
rave train --config v2 --config wasserstein --override PHASE_1_DURATION=1000000 --db_path ./output_pp/ --out_path ./output/ --name Mymodel --gpu 0 --val_every 5000 --channels 1 --batch 8
Now it is paramount to understand, that you cannot wildly combine config parameters with each other, even if you have seen this done elsewhere, and it doesn't result in error messages. To be really sure, you have to actually open the config files in a text editor, and check if they contain conflicting information, or otherwise seem to make sense with each other. For example --config discrete overrides the encoder specified by --config wasserstein, so this will give you botched results, without being immediately obvious.
It is unfortunately not exactly documented on what configs work with each other and even what all the different configs do. There is for example a "discrete_v3". So this seems to suggest to me, that using just "discrete" with v3 can yield worse results, or is otherwise not functioning well, but with v2 there is no such issue. But it could also mean, that this is just the third version of the discrete config, and the second version of it was somehow trash and discarded. Someone in Discord, if I understood them correctly, also said that you can't combine discrete with v1/v2/v3 in general (even though people have done this in Discord and there are Colab notebooks which suggest it, even with wasserstein, which is quite certainly wrong). I think it is certainly so, that you can combine wasserstein with v2 and maybe v3, but then you have to be really really careful when adding other stuff on top of it. Check out the test_configs.py file to see what configs are certain to be safe to combine. This is not to say v3 wouldn't work with wasserstein because it isn't listed (?), but just that apparently this automated test doesn't account for it.
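Opening two config files side by side and looking for bindings that both of them set can be partly automated. The following is only a rough sketch: it understands simple one-line "name = value" bindings, ignores includes, imports and multi-line values, and the binding names in the example are hypothetical, not copied from Rave's configs. A hit does not prove that a combination is broken, and no hit does not prove it is safe, but it tells you where to look.

```python
import re

# One-line gin-style binding such as "scope/Class.param = value".
BINDING = re.compile(r"^([A-Za-z_][\w./]*)\s*=\s*(.+)$")


def parse_bindings(gin_text):
    """Collect simple one-line bindings; comments after '#' are dropped."""
    bindings = {}
    for line in gin_text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = BINDING.match(line)
        if match:
            bindings[match.group(1)] = match.group(2).strip()
    return bindings


def conflicting_keys(first_text, second_text):
    """Bindings set by both texts with different values."""
    first = parse_bindings(first_text)
    second = parse_bindings(second_text)
    return {key: (first[key], second[key])
            for key in first.keys() & second.keys()
            if first[key] != second[key]}


# Hypothetical file contents, for illustration only.
config_a = "LATENT_SIZE = 16\nExample.encoder = @EncoderA\n"
config_b = "Example.encoder = @EncoderB  # replaces the encoder\n"
print(conflicting_keys(config_a, config_b))
# {'Example.encoder': ('@EncoderA', '@EncoderB')}
```

To use it on real files, read each config with open(path).read() and pass the two strings in the same order as your --config flags.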
We have tried "discrete" and it always produces silent output (in Python script and nn~), no matter what you do and if you use it with msprior. Is this a bug, how does it work? The "discrete" encoder, it remains a mystery. In the tutorial it was confirmed that discrete is basically only good for msprior. But like I said, it was broken with msprior for us also. So I would say just don't use discrete.
DO NOT MESS WITH ANYTHING INSIDE .GIN FILES! I have done this several times under the impression that it probably makes a lot of sense, because I was initially thinking the same things that Grok & ChatGPT told me later when I fed them most of RAVE source code and paper... however in the end it just made everything worse and ruined results. It is not that easy unfortunately. For all I know the only sort of safe thing you can override is PHASE_1_DURATION to maybe 1M or 2M for large datasets(?).
In the tutorial it was explained that wasserstein increases quality at the penalty of decreased latent representation and generalization. So it doesn't seem like a good bargain to me.
Now the model will start training, and you see the steps in the progress bar per epoch. Check your progress in Tensorboard. After 5000 steps, it will give you a checkpoint that you can export, or use to resume training with --ckpt.
training on Clore.ai
To be smart about cost, you should do everything on your home PC first, including the execution of the training command to see if it produces errors. If you already have a VPS, free server from Oracle Cloud, etc. then it is advisable that you upload your "db_path" (./output_pp/) data first on this "second" server, because it will have much faster upload speeds and you never know if your instance suddenly gets deleted and such things and this can save you time just in case. My dataset was 70GB for example, and 40GB compressed with XZ, and it took me about 12 hours to upload. But because the Clore server I rented had a very slow DSL connection, it took about 6 hours to download this 40GB, so it saved me only 1 Euro compared to uploading it directly to the Clore server. Still even this often makes sense, considering on Clore the max duration is often set somewhat too low, e.g. 3 days not 4 days, and you might have very slow upload at home. But of course when your dataset is like only 1GB compressed, then you probably don't really need to bother to cache your dataset first on a second server ... I hope I am not overcomplicating it.
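Whether caching on a second server is worth it mostly comes down to transfer times, which you can estimate up front. A small sketch of that arithmetic, assuming decimal units (1 GB = 1000^3 bytes, 1 Mbit = 1000^2 bits) and a connection that actually sustains the given speed; the function names are made up.

```python
def transfer_hours(size_gb, speed_mbit_s):
    """Hours needed to move size_gb at a sustained speed in Mbit/s."""
    bits = size_gb * 8 * 1000**3
    return bits / (speed_mbit_s * 1000**2) / 3600


def average_speed_mbit_s(size_gb, hours):
    """Average speed implied by moving size_gb in the given time."""
    bits = size_gb * 8 * 1000**3
    return bits / (hours * 3600) / 1000**2


# 40 GB in 12 hours, as in the upload described above.
print(round(average_speed_mbit_s(40, 12), 1))   # 7.4
# The same 40 GB over an 80 Mbit/s link.
print(round(transfer_hours(40, 80), 2))         # 1.11
```

Multiply the hours by the hourly rental price of the GPU server to see how much paid idle time a direct upload from home would cost compared to pulling the data from a faster second server.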
Those are the SSH commands if you already have a second server to cache your data on, before renting the Clore server:
Like I have said before, on the open server Marketplace on clore.ai you can rent a 4090 for $4/day and then be finished in 3-6 days (which would take more like 6-12 days on a 3090 and allows you to get truly pristine results). Considering that a 3090 usually costs $10-20 per day on lots of ordinary platforms, and similarly the same compute power for a full training run on Google Colab is in the range of hundreds of dollars: clore.ai is absolutely amazing, and in my mind like a cornerstone for anyone to be able to train rave models at "almost" no cost.
Is it too good to be true? Maybe... On the one hand it is extremely easy to rent servers there. Just make an account, send over some Bitcoin into a hosted wallet, wait 2 hours for Bitcoin to confirm, and then click "rent" on a server from the list: done.
But here is where it started to get weird: 12 servers in a row were stuck "deploying" when I first tried. I tried waiting 5 minutes, 10 minutes, 2 hours, 6 hours even. I tried changing the options: Jupyter image, Pytorch image, whatever image. I tried different IPs and ports to connect, SSH or webadmin. I checked the logs: empty ... none of the servers actually worked. But the servers were burning Bitcoin while being stuck "deploying". I mean of course at $2 a day, it would take me hundreds of servers in a row to try, for the rental cost to amount to just $1, if I cancelled each one after 10 minutes. But I think there might be some sort of scam going on here, where some server providers put fake servers on the market and put them cheapest on the list, thus grabbing just a few cents from each new customer, which for the scammers amounts to something over longer timespans.
So tapping in the dark and not knowing how long to wait etc. and Google being zero help, it was sadly a VERY annoying process for me.
But in the end what I did was to wait for a server with >4 out of 5 rating, and I picked the "stable diffusion webui" image, which might be important. I think within 30 seconds the server could be reached via SSH and switched to "deployed" faster than I could check it (maybe 1-2 minutes?). The longest it ever took me to connect to a server (which was severely broken) was 3 minutes. This image was also used in the demo of Clore.ai on Youtube. Maybe the whole issue is not entirely due to scam, but just because some servers only support this "stable diffusion webui" image. Or maybe it is a combination of scam and other such invisible issues. This also put me in a very skeptical mood, that maybe some server providers will just kill my instance if I burn too many Watts per hour for them. And secretly they want to cherry pick stable diffusion end-users, who maybe only use the GPU 10% of the time while being busy clicking in the UI. However so far, once it worked, it just continued to work, and I never had such an experience. But who knows?! It is an open market and those servers are not managed by Clore directly. So you could encounter these or those people, there is no telling.
I hope Clore.AI fixes this in the future, by only charging BTC when the server is verified to work via SSH or web port. And then they could also allow customer comments on the server, so scams will be much easier to tell. And also induce a penalty where some amount of the earnings will be cashed in by Clore.AI if there are excessive amounts of disconnects, and servers that can't be reached (so it will charge scammers more coins than they can earn). Also currently both the "container" and "deployment" logs in "show logs" are always empty, no matter if it works or not, which is misleading.
Update: I just saw a 4090 for $3.5 with no ratings and it also works (used same options). The Webui loaded in just under 5 minutes, but I didn't try longer than 30 seconds to connect to SSH initially, so I can't tell how much faster SSH worked. The upload speed is now about 80Mbit/s (with the other server it was more like 20Mbit/s). I then trained a different config, and it all worked ... except for the fact that the card ran only at 40% the normal speed, despite producing identical hash rates in benchmark. This kind of behavior could only be explained by some kind of artificial bottleneck (=scam?), like maybe the PCIe/memory bus accidentally underclocked to the extreme ... I confirmed the PCIe lanes used were 16x, power limits and clock rates of GPU all correct, I couldn't find out what it really was... so I cancelled the server after wasting 9 hours. Then some other time I rented a 2.5/5 rated server and it was 3x slower than normal on "hashcat -b", and this server remained in the Marketplace list again and again the whole day because it was broken.
But good thing for you, now you know how you can do a simple test with "hashcat -b" that should take no more than 3 minutes to prevent this.
Here is what I recommend you do:
If you see the WebUI, then congratulations it worked!
Note down the Public address of Port 22 under "Forwarded ports", e.g. n1.c5.clorecloud.net:10050, and connect via SSH like so:
As mentioned, a 4090 should take almost exactly a few seconds under 3 minutes to finish the "hashcat -b" command, which is reported at the end. Sometimes it can take 4 minutes if the server is under load, but repeated testing should yield better results. Here is a link for a good vs bad result for comparison (hashcat v.5.1.0). This way if the server has been misconfigured/manipulated to run slower, you can now detect this and decide to simply cancel the order. Otherwise the server is probably good to use! I just tested a 3090 Ti by accident and it was 4 minutes for this card.
If you are a beginner to Linux, it is critical to understand that disconnecting via SSH will kill whatever you are running in the shell. This is why "screen" or "tmux" exist. They create shell sessions that automatically detach and you can reattach to. To create a screen session, simply type "screen bash". You can then list active sessions with "screen -ls" and use "screen -r [name]" to resume (use "screen -d [name]" to detach if attachment is stale and "echo $STY" to check if already inside screen). When running SSH uploads on a remote machine (rsync thankfully has resume functionality with -P option), or when running the rave training, and you forget to run this inside a screen session, this will abort your progress and can be very very annoying. So always make sure to run your training and remote commands inside "screen".
If you want to push your dataset DB to the server directly, use this SSH command:
OR if you already compressed and uploaded it to another server, use this command to decompress on the fly (no resume):
In the time that the upload is running, you can install conda, activate environment and install rave:
Now you are all set to run the training and check your Tensorboard (note: the forwarded public port differs, e.g. http://n1.c5.clorecloud.net:10050)!
I use this command to sync the training results every 15 minutes:
I hope this makes it a lot easier for you to set up a server!
This is just orders of magnitudes better and cheaper than Colab. Even with the stupid hassle caused by the "fake" servers on the Clore market.
exporting the model
If you have not chosen to train mono audio, you have to add this information to the `config.gin` in the output/ path: `model.RAVE.n_channels = 2`
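If you would rather not edit the file by hand, something like the following could do it. It is a minimal sketch that only works on the text of `config.gin`; the helper name `set_n_channels` is hypothetical and the only binding it knows about is the one quoted above.

```python
import re


def set_n_channels(config_text, channels=2):
    """Return the config text with model.RAVE.n_channels set to `channels`."""
    line = f"model.RAVE.n_channels = {channels}"
    pattern = re.compile(r"^model\.RAVE\.n_channels\s*=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace an existing binding instead of adding a duplicate.
        return pattern.sub(line, config_text)
    # Otherwise append the binding on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"
```

Read the `config.gin` from your run directory, pass its content through the function and write the result back before exporting.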
To export your checkpoint, run the export command like this:
rave export --run output/Mymodel_5b61af7ec4/ --channels 1 --sr 44100 --streaming
Now you can load your model into nn~ with Pure Data or use the following Python script (adapted from the 30m Rave demo video).
If you are not using the model with nn~, it might make sense (I am not sure) to export a second file with `--nostreaming` for that.

Unfortunately, `rave generate` currently works with neither mono nor stereo models.

python script
I have tried to fix this script to work with stereo, but somehow it only produces double-mono with my model (while in nn~ the same model does produce actual stereo). It doesn't seem to be that simple to do. At least it doesn't fail with a 2-channel model.
python generate.py --model Mymodel_2ch_wsreal_5b61af7ec4/version_3/checkpoints/Mymodel_2ch_wsreal_5b61af7ec4.ts --input something_else.wav --duration 30 && mplayer something_else.wav_out.wav
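To check whether an output file really is double-mono rather than just similar-sounding stereo, you can compare the two channels sample by sample. The sketch below uses only the Python standard library and assumes a 16-bit PCM stereo WAV; if your output is float or 24-bit, convert it first. The function names are made up for this example.

```python
import struct
import wave


def channel_difference(path):
    """Return the largest absolute left/right sample difference in a 16-bit stereo WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        frames = wav.readframes(wav.getnframes())
    # Samples are interleaved (L0, R0, L1, R1, ...) as little-endian int16.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    left, right = samples[0::2], samples[1::2]
    return max((abs(l - r) for l, r in zip(left, right)), default=0)


def is_double_mono(path, tolerance=0):
    """True if both channels are identical within `tolerance` sample steps."""
    return channel_difference(path) <= tolerance
```

A result of 0 from `channel_difference` means both channels are bit-identical, which is what I mean by double-mono.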
You can simply copy & paste those latent alterations in other audio generation projects, like RAVE-Latent-Diffusion, right before the rave.decode(z) step.
compiling nn~
If you are on Windows, it ships precompiled.
Unfortunately I don't really recall much about compiling nn~, but I think it was a little bumpy. I think the first issue was not having the -cuda version of pytorch installed, and the second issue was that I had to specify where it was installed, like so:
TORCH_INSTALL_PREFIX="/usr" cmake ../src/ -DCMAKE_BUILD_TYPE=Release
Then you have to put it into the proper PureData directory:
cp ./frontend/puredata/nn_tilde/nn~.pd_linux ~/.local/lib/pd/extra/nn\~.pd_linux
using PureData
PureData doesn't seem to be as well supported as Max. The Max software has a 30-day free trial, but it is not available for Linux (Windows and macOS only).
This program might seem quite awkward to use at first but it is actually not that bad.
Go to File->Preferences->Edit P.. and enter a new search path. This path should contain all your stuff, like input wavs and the model.ts file. Now hit File->New.
The following things are not really obvious at first:
The rest should be much more obvious, and you can see how to work with it from the YouTube video I posted at the end of the introduction, or from other videos.
What I found most helpful so far is `osc~`, `noise~`, `-~` and `*~`. For example, feed `noise~` to `*~` on the left side and `osc~ 0.5` to the right side, and it will oscillate the noise with 0.5 Hz. Then feed that signal to the left side of another `*~` and connect a slider on the right side. Right-click the slider and set the range between 0 and 1 => simple volume control. If you feed a signal to `-~` on the left and a `*~` with a slider on the right, it will subtract that scaled signal from the other signal. Then there are filters like `bp~` (bandpass). Simply connect the signal to the left of `bp~` and a slider (or Number, does the same thing) on the right, set it to like 0.02 and at the top another slider.

This is immediately useful to manipulate the latent space of an input wav and should give you quite an interesting experience. But Pure Data is capable of so much more, you should really check out what else it can do.
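If it helps to see what the noise patch computes, here is the same signal chain written as plain Python. It is only an illustration of the math, not how Pure Data works internally; the 44100 Hz sample rate and the function name are assumptions for this sketch.

```python
import math
import random

SAMPLE_RATE = 44100  # assumed sample rate


def modulated_noise(n_samples, lfo_hz=0.5, gain=1.0, seed=0):
    """Emulate noise~ -> *~ (right inlet: osc~ lfo_hz) -> *~ (right inlet: slider)."""
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        noise = rng.uniform(-1.0, 1.0)                            # noise~
        lfo = math.cos(2.0 * math.pi * lfo_hz * n / SAMPLE_RATE)  # osc~ (a cosine oscillator)
        out.append(noise * lfo * gain)                            # the two *~ objects
    return out
```

With `gain` between 0 and 1 this is exactly the slider volume control: every sample is scaled by the slider value.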
When it comes to actually loading your model with nn~, do it as suggested in the docs with two objects like so: "nn~ mymodel.ts encode 40000" and "nn~ mymodel.ts decode 40000". Connect everything, feed it input, and you should hear the output. Notice the 40000 value at the end, which is the buffer size. This buffer size is outrageously huge (many seconds). But I found that with a low buffer size, and also with the default buffer size, the output sounded very wobbly and distorted. So far I have never bothered to try to fix this. You should check whether a large buffer gives you the same remarkable improvement, and then lower it step by step until it becomes more usable.
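To get a feeling for how much delay a given buffer size adds, you can do the arithmetic yourself. This sketch assumes the value is a sample count and that audio runs at 44100 Hz; both are my assumptions, not something I verified in the nn~ source, and the delay you actually hear can be higher because the encode and decode objects each buffer separately.

```python
def buffer_latency_seconds(buffer_size, sample_rate=44100):
    """Time needed to fill a buffer of `buffer_size` samples."""
    return buffer_size / sample_rate


for size in (2048, 8192, 40000):
    print(f"{size:>6} samples -> {buffer_latency_seconds(size):.3f} s")
```

Use it to pick candidate values when you lower the buffer size step by step.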
For some reason nn~ didn't run in GPU mode for me and it needs like Ryzen 5 5XXX at least to run somewhat well in CPU mode. I hacked the source code to bypass the GPU check, which is trivial and maybe not required for you, so I won't explain this further. Just be aware that it can run in GPU mode and that stuttering etc. in CPU mode is normal, if you don't have a beefy and new CPU.
Now the model does accept various "messages" which you can connect at the top left, as you can do with readsf~. This is important to keep in mind when dealing with the prior. You have to check the source code what those messages are (documentation lacking), some are sometimes visible in the demo videos and in some example screenshot somewhere. For msprior there is for example "set temperature XXX" and "set listen true", "set listen false". Any nn~ model should also accept "gpu true" or "set gpu" or something like this, but this did not work for me. I think it only works in the Max plugin.
Generating and using the prior
From my experience with a simple v2 mono test run, I am fairly certain that you currently need to use msprior and that "rave train_prior" is abandoned/broken (or only works with plain v1?). "rave train_prior" didn't work with mono or stereo either way, no matter what I tried. But I was able to train msprior with a 2 channel model when I converted the input audio to mono in preprocessing! It remains to be seen though if the output is intelligible. To reflect stereo sound properly, it would necessarily also have to train on stereo, which is clearly not the case. There is also the question of whether using anything but v1 (like v2/v3 and also wasserstein) will influence prior training in a bad way. Other people told me that their prior output was not coherent and intelligible either. But personally I have probably not let it run long enough to reach firm conclusions.
If you follow the documentation, the process should be very simple. The only issue with msprior docs is, that the config files don't match and it is not obvious what the new configs correspond to. I have just randomly picked rwkv for a short test, but the test more or less yielded garbage-ish results and then I stopped caring. What you probably wanted in the first place is encoder_decoder ... so maybe the next best thing is "modulated_alibi" (not tried it) or "rwkv_semantic" (fails with error) now? I don't know. I can only tell you that it fails if you don't supply a config parameter.
If I used "set listen" it only generated pure noise, it just seemed defective. Then I actually supplied --continuous when exporting as required, but it was still only some kind of deep-ish noise, until I figured out to use "set temperature 200" and then "set reset". But what it generated didn't sound much different than what it generated without the prior just from silence or a little noise input. Even worse it generated (very random and incoherent) noises at a rate about 10x faster than what I felt was desirable for my purposes, and lowering temperature didn't really improve this, and in the lower "7x too fast" range, it was only this deep noise again. I also didn't get any of the "semantic control" inputs, which I now assume to be only provided by "modulated_alibi" or "rwkv_semantic".
Like in the Github issue I mentioned in the beginning, there are many questions about how to use this stuff properly, and how it interacts with different configs. So it might or might not be basically garbage with this and that combo, I don't know. For example msprior complains (but doesn't fail) if you don't use a "discrete" model, and that it limits functionality / "pretrained_embedding". But what exactly does this mean now to the end result? From what I understand, "discrete" is kind of bad for quality (and we were only able to produce silent output with it, no matter if prior was used or what not) so I rather take my chances without it. Does the prior work better or worse with "causal", no idea.
Considering it didn't really turn out well, those would be the commands I used:
As mentioned, it seems to me like the prior functionality is kind of neglected and it only really yields usable results with very specific config combinations that are undocumented or not explicitly documented, maybe in both the model training and the prior training. And maybe you have to make a lot of quality sacrifices for a prior to work (i.e. regress to plain v1-only), like you see in the demo video. The docs strongly suggest to me that you basically need to train your rave model on the "discrete" architecture to produce the more desirable and functional results. But whether or not that is really so is rather ambiguous and not explicitly stated. "Discrete" doesn't seem to work at all for me, nor for someone else.
I hope such questions can be addressed better in the future by more documentation.
Rave-Latent-Diffusion
I have briefly tried RAVE-Latent-Diffusion (unconditional audio generation) and it worked for my v2 mono test model. But sadly it doesn't seem to support 2 channels (see Github issue). I am no longer really trying to fix it.
Commands used:
Here is my fix to make it work with 2 channels. This will however only output double-sided (identical) mono, like the "python script". I don't understand why that is.
Here is the final output. I put two versions on Youtube with different temperature. Please note that I simply ran the generator twice and then combined the mono audios to stereo.
ASMR Rapunzel model: (click image)
Model files: https://mega.nz/file/ZfI1WCjT#UAu4I5HM_YIhfVFICrgTpIGLllauAsfs-iT-plJJnVQ
video converter
Here is some bash mumbo-jumbo that I have used to convert Youtube videos. The idea is to pass the sound through a bunch of filters, namely amplification + limiters + high and lowpass, so it fits better to whatever sounds the model understands. The script is really ugly trash with lots of deficits. But hey, it works. Use AMP to make quiet sounds louder, VOL to lower the volume and OUTLEVEL for the final volume level going into the model. SPEED is pitch, not actually speed. It is fiddly to make this turn out right. You basically want to raise OUTLEVEL to 1.0 and AMP to 2-6, then find the right pitch. But you will probably hear the original sound "punching through" the model. So you have to adjust OUTLEVEL again to something like 0.1-0.6, but not so low that the model no longer reflects the sound or only does so poorly. In theory the limiter should mostly accomplish this automatically, but it doesn't actually work this way for whatever reason.
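To illustrate the idea of such a filter chain (this is not my actual script), here is a small sketch that assembles an ffmpeg `-af` filter string. The filters `asetrate`, `aresample`, `volume`, `highpass`, `lowpass` and `alimiter` are standard ffmpeg audio filters, but the function name, the default cutoff frequencies, the limiter setting and the way AMP, OUTLEVEL and SPEED are mapped onto them are assumptions made for this example; VOL is left out.

```python
def build_filter_chain(amp=4.0, outlevel=0.5, speed=1.0,
                       highpass_hz=80, lowpass_hz=12000, sample_rate=44100):
    """Build an ffmpeg -af string: pitch shift, amplify, band-limit, limit, output level."""
    filters = [
        # Declaring a different sample rate shifts the pitch (and the duration),
        # aresample then brings the stream back to the model's sample rate.
        f"asetrate={round(sample_rate * speed)}",
        f"aresample={sample_rate}",
        f"volume={amp}",             # AMP: make quiet sounds louder
        f"highpass=f={highpass_hz}",
        f"lowpass=f={lowpass_hz}",
        "alimiter=limit=0.9",        # catch the peaks created by the amplification
        f"volume={outlevel}",        # OUTLEVEL: final level going into the model
    ]
    return ",".join(filters)


if __name__ == "__main__":
    print(build_filter_chain(amp=3.0, outlevel=0.4, speed=1.1))
```

You would pass the resulting string to ffmpeg like `ffmpeg -i input.mp4 -vn -af "<chain>" output.wav` and then feed the WAV to the model.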
ASMR Rapunzel model:
welding_example.out.out.mp4
wood_turning_example.mp4
Well, it is a work in progress... :D
Rave VST / Neutone VST
I have not tried Rave VST, which I think is experimental. But I have now successfully run Neutone VST in Fruity Loops Studio 20 via Wine on Linux (instructions here). Check out Neutone FX: you can easily download and run Rave models from there in your DAW, and this is probably the number 1 place to get your model to the market and to musicians. Neutone FX presumably can't run a plain Rave export directly; you have to use their SDK to embed it.
end
This is pretty much all I know so far. Please correct me if you find anything wrong or if you know something better.
Best of luck!