BarkProcessor voice_preset doesn't work #34634

etheryee · 2024-11-07T04:01:37Z

System Info

transformers version: 4.47.0.dev0
Platform: Windows-11-10.0.22631-SP0
Python version: 3.12.7
Huggingface_hub version: 0.26.2
Safetensors version: 0.4.5
Accelerate version: 1.1.0
Accelerate config: not found
PyTorch version (GPU?): 2.5.1 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA GeForce RTX 4080 SUPER

Who can help?

@ylacombe

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Code:
from bark import SAMPLE_RATE, generate_audio, preload_models
import sounddevice
from transformers import BarkModel, BarkProcessor
import torch
import numpy as np
from optimum.bettertransformer import BetterTransformer
from scipy.io.wavfile import write as write_wav
import re

def barkspeed(text_prompt):
    processor = BarkProcessor.from_pretrained("suno/bark-small")
    model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
    model = BetterTransformer.transform(model, keep_original_model=False)
    model.enable_cpu_offload()
    sentences = re.split(r'[.?!]', text_prompt)
    pieces = []
    for sentence in sentences:
        inp = processor(sentence.strip(), voice_preset=SPEAKER).to(device)
        audio = model.generate(**inp, do_sample=True, fine_temperature=0.4, coarse_temperature=0.5)
        audio = ((audio/torch.max(torch.abs(audio))).numpy(force=True).squeeze()*pow(2, 15)).astype(np.int16)
        pieces.append(audio)
    write_wav("bark_generation.wav", SAMPLE_RATE, np.concatenate(pieces))
    sounddevice.play(np.concatenate(pieces), samplerate=24000)
    sounddevice.wait()

Error Message:
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Traceback (most recent call last):
  File "F:\OllamaRAG\BarkUsage\BarkUsage.py", line 56, in <module>
    barkspeed("""Hey, have you heard about this new text-to-audio model called "Bark"?
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\OllamaRAG\BarkUsage\BarkUsage.py", line 47, in barkspeed
    audio = model.generate(**inp, do_sample=True, fine_temperature=0.4, coarse_temperature=0.5)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\transformers\models\bark\modeling_bark.py", line 1737, in generate
    coarse_output = self.coarse_acoustics.generate(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\transformers\models\bark\modeling_bark.py", line 1078, in generate
    semantic_output = torch.hstack([x_semantic_history, semantic_output])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

Expected behavior

I used this code to generate audio. Before I upgraded transformers and bark, the voice preset had no effect: Bark kept switching voices between generations. In the first half of the call function in BarkProcessor everything seemed fine, and the preset tensors were loaded properly. In the generate function, however, history_prompt was empty at first and was then filled entirely with the value 10000. After I upgraded transformers and bark, the error message above appeared. If I delete the voice_preset=SPEAKER argument, the code runs, but the voice still changes. Could anyone tell me how to get the preset to work?
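One way to narrow down a "found at least two devices" error like the one above is to list the device of every tensor in the processor output, including tensors nested under history_prompt. The helper below is only a sketch: collect_devices is a hypothetical name, and it assumes the processor output behaves like a mapping whose leaves expose a .device attribute (as torch tensors do).

```python
from collections.abc import Mapping


def collect_devices(obj, prefix=""):
    """Return {dotted_key: device_string} for every leaf with a `.device` attribute.

    Hypothetical diagnostic helper: it recurses into any Mapping, so tensors
    stored one level down (for example under 'history_prompt') are reported too.
    """
    found = {}
    if isinstance(obj, Mapping):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            found.update(collect_devices(value, path))
    elif hasattr(obj, "device"):
        found[prefix] = str(obj.device)
    return found
```

Calling print(collect_devices(inp)) right before model.generate(**inp) should show whether any entry is still on cpu while the others are on cuda:0.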


LysandreJik · 2024-11-15T14:22:46Z

cc @ylacombe

github-actions · 2024-12-10T08:04:05Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

aiwantaozi · 2024-12-18T05:25:07Z

Hello. I've tried the version 0.47 of Transformers. It seems that the specified voice_preset has taken effect. I'd like to ask whether this issue has been fixed or not.

Rocketknight1 · 2024-12-18T18:22:08Z

gentle ping @eustlb

foxumulder2017 · 2024-12-19T13:53:12Z

this worked for me:

apparently, not all tensors under inputs are moved to cuda when this is called:

inputs.to(device)

I moved the voice_preset tensors first to cuda before moving the inputs.

import numpy

# load the npz file and manually create the voice_preset dict
with numpy.load('bark/assets/prompts/en_speaker_6.npz') as data:
    voice_preset = {
        'fine_prompt': data['fine_prompt'],
        'coarse_prompt': data['coarse_prompt'],
        'semantic_prompt': data['semantic_prompt']
    }


text_prompt = "sample text prompt"
inputs = processor(text_prompt, voice_preset=voice_preset)
# move the nested voice preset tensors first, then the rest of the inputs
inputs['history_prompt'] = inputs['history_prompt'].to(device)
inputs.to(device)
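The same idea can be written as a small recursive helper, so that nested entries such as history_prompt do not have to be moved by hand. This is a sketch, not part of transformers: to_device_nested is a hypothetical name, it assumes the processor output is a mapping whose leaves have a .to(device) method, and it returns plain dicts, which should only matter if later code relies on the original container type.

```python
from collections.abc import Mapping


def to_device_nested(obj, device):
    """Recursively move every leaf that has a `.to` method to `device`.

    Hypothetical helper: mappings are rebuilt as plain dicts, leaves with
    `.to` are moved, and anything else is returned unchanged.
    """
    if isinstance(obj, Mapping):
        return {key: to_device_nested(value, device) for key, value in obj.items()}
    if hasattr(obj, "to"):
        return obj.to(device)
    return obj
```

Under those assumptions, usage would look like inputs = to_device_nested(processor(text_prompt, voice_preset=voice_preset), device), followed by model.generate(**inputs).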

etheryee · 2024-12-20T03:32:43Z

@foxumulder2017 I used your method and it works. Many thx. But it is confusing. Why on earth was it designed like this? And here is the code:

for sentence in sentences:
    inp = processor(sentence.strip(), voice_preset="v2/en_speaker_6")
    inp['history_prompt'] = inp['history_prompt'].to(device)
    inp.to(device)
    audio = model.generate(**inp, do_sample=True, fine_temperature=0.4, coarse_temperature=0.5)
    audio = ((audio/torch.max(torch.abs(audio))).numpy(force=True).squeeze()*pow(2, 15)).astype(np.int16)
    pieces.append(audio)

etheryee · 2024-12-20T03:35:41Z

> Hello. I've tried the version 0.47 of Transformers. It seems that the specified voice_preset has taken effect. I'd like to ask whether this issue has been fixed or not.

Hi. I used 4.47 from the beginning, and it didn't work. @foxumulder2017 showed the reason why my code didn't work.

github-actions · 2025-01-13T08:05:23Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

etheryee added the bug label Nov 7, 2024

LysandreJik added the Audio label Nov 15, 2024

aiwantaozi mentioned this issue Nov 28, 2024

Audio distortion generated using the bark model gpustack/vox-box#4

Open

github-actions bot closed this as completed Jan 21, 2025
