ChatTTS to replay WeChat group messages
Generate human, chat-style audio from text
A quick test of ChatTTS.
Project repo: github.com/2noise/ChatTTS
Utilities
A Conversation class that keeps each speaker's voice fixed across messages. Call talk_all() to convert a list of (name, message) tuples into an audio conversation.
class Conversation:
    def __init__(self, chattts_model):
        # Maps a speaker name to the speaker embedding sampled for it.
        self.speakers = {}
        self.chattts_model = chattts_model

    def talk_one(self, name, message):
        if name not in self.speakers:
            self.speakers[name] = self.chattts_model.sample_random_speaker()
            print(f'sample new speaker: {name}')
        else:
            print('existing speaker')
        spk_emb = self.speakers[name]
        params_infer_code = {'spk_emb': spk_emb}
        # Use the model passed to the constructor, not the global `chat`.
        wavs = self.chattts_model.infer([message], params_infer_code=params_infer_code)
        return wavs[0]

    def talk_all(self, chats):
        wavs = []
        for name, message in chats:
            print(name, message)
            wav = self.talk_one(name, message)
            wavs.append(wav)
        return wavs
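To check the speaker-reuse logic without loading the real model, a stand-in object can be passed to the class. The sketch below is only an illustration: FakeChat is a hypothetical stub (not part of ChatTTS) that returns labelled strings instead of audio, and the Conversation class is a condensed copy of the one above.

```python
class FakeChat:
    """Hypothetical stand-in for ChatTTS.Chat; returns labels instead of audio."""

    def __init__(self):
        self.sampled = 0

    def sample_random_speaker(self):
        self.sampled += 1
        return f'voice-{self.sampled}'

    def infer(self, texts, params_infer_code=None):
        voice = params_infer_code['spk_emb']
        return [f'{voice}: {text}' for text in texts]


class Conversation:
    """Condensed copy of the class above, without the logging."""

    def __init__(self, chattts_model):
        self.speakers = {}
        self.chattts_model = chattts_model

    def talk_one(self, name, message):
        if name not in self.speakers:
            self.speakers[name] = self.chattts_model.sample_random_speaker()
        params = {'spk_emb': self.speakers[name]}
        return self.chattts_model.infer([message], params_infer_code=params)[0]

    def talk_all(self, chats):
        return [self.talk_one(name, message) for name, message in chats]


conv = Conversation(FakeChat())
out = conv.talk_all([('Alice', 'hi'), ('Bob', 'hello'), ('Alice', 'bye')])
print(out)  # Alice's two messages share one voice label
```

Only two speakers are sampled for the three messages, which is the behaviour the class is meant to guarantee.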
Output audio files to WAV and MP3 given a file prefix:
import numpy as np
from scipy.io import wavfile
from pydub import AudioSegment

def output_audio(wavs, output_prefix):
    output_file = f'{output_prefix}.wav'
    # The code assumes each wav is a 2-D array of shape (1, n_samples),
    # so the clips are joined along the time axis.
    audio_data = np.concatenate(wavs, axis=1)
    # ChatTTS generates audio at 24 kHz.
    sample_rate = 24_000
    # Save the audio as a WAV file.
    wavfile.write(output_file, sample_rate, audio_data[0])
    # Convert the WAV file to MP3 (pydub needs ffmpeg for this step).
    audio = AudioSegment.from_wav(output_file)
    audio.export(f'{output_prefix}.mp3', format='mp3')
Parse HuluNote
I use hulunote.com to log messages in the WeChat groups that I manage. It can export messages in JSON format. Some tweaking is needed before passing the text to ChatTTS.
Functions to parse HuluNote JSON:
import re

# The HuluNote 葫芦笔记 string format.
def parse_chat_log(chat_log):
    # pattern = r"- (.*?):: (.*?)$"
    pattern = r"^(.*?):: (.*?)$"
    matches = re.findall(pattern, chat_log, re.MULTILINE)
    parsed_messages = []
    for match in matches:
        name = match[0]
        message = match[1]
        parsed_messages.append((name, message))
    return parsed_messages

def refine_chat_text(line):
    replace_texts = {
        '[偷笑]': '[laugh]'
    }
    for k, v in replace_texts.items():
        line = line.replace(k, v)
    return line
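As a quick illustration of what these two helpers do, the sketch below runs them on a made-up log string. The sample text and speaker names are hypothetical, and the helpers are condensed copies of the functions above. Lines without the "name:: message" shape are simply skipped by the pattern.

```python
import re

def parse_chat_log(chat_log):
    # One (name, message) tuple per line shaped like "name:: message".
    return re.findall(r"^(.*?):: (.*?)$", chat_log, re.MULTILINE)

def refine_chat_text(line):
    # Map the WeChat emoji tag to the '[laugh]' token.
    return line.replace('[偷笑]', '[laugh]')

# Hypothetical sample; a real export may contain other kinds of lines.
sample = "Alice:: Good morning\nBob:: Morning! [偷笑]\nthis line has no speaker"

chats = [(name, refine_chat_text(message))
         for name, message in parse_chat_log(sample)]
print(chats)
```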
Application
Parse HuluNote:
import json

fn = '5-31-2024,导出葫芦笔记数据.json'
with open(fn, encoding='utf-8') as f:
    raw = json.load(f)
len(raw)

# Pick the chat log of a single day.
for day in raw:
    if day['title'] == '2024-05-27':
        chatlog = day['children']
        break

chats = []
for c in chatlog:
    # Skip blocks that contain both 'fromusername' and 'tousername'.
    if ('fromusername' in c['string']) and ('tousername' in c['string']):
        continue
    # print(c['string'])
    chats.extend(parse_chat_log(c['string']))
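The loop above leaves chatlog undefined when no day matches the title. A minimal sketch of a stricter lookup follows. The sample data is a hypothetical miniature of the export whose shape is inferred only from the keys used above ('title', 'children', 'string'), and blocks_for_day and is_system_block are illustrative names, not part of HuluNote.

```python
import json

# Hypothetical miniature of the export: a list of days, each with a 'title'
# and a list of 'children' blocks whose 'string' holds the raw text.
sample_export = json.loads('''
[
  {"title": "2024-05-26", "children": [{"string": "Alice:: hi"}]},
  {"title": "2024-05-27", "children": [
      {"string": "Bob:: hello"},
      {"string": "<fromusername>x</fromusername><tousername>y</tousername>"}
  ]}
]
''')

def blocks_for_day(raw, title):
    """Return the child blocks of the day with this title, or raise KeyError."""
    for day in raw:
        if day.get('title') == title:
            return day.get('children', [])
    raise KeyError(f'no day titled {title!r} in export')

def is_system_block(text):
    # Same test as the loop above: both markers must be present.
    return 'fromusername' in text and 'tousername' in text

texts = [block['string']
         for block in blocks_for_day(sample_export, '2024-05-27')
         if not is_system_block(block['string'])]
print(texts)
```

Raising an error on a missing date makes a mistyped title fail at the lookup rather than later in the loop.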
Init model:
import torch
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')
import ChatTTS
chat = ChatTTS.Chat()
chat.load_models()
Put them all together:
conv = Conversation(chat)
wavs = conv.talk_all(chats)
output_audio(wavs, 'complaints-group-consistent-speakers')
Example and Glitches
Here is one example I generated from a WeChat group chat log.
There are some glitches in my initial test:
- Punctuation is not handled well. The model tries to pronounce "(", "[", and "...".
- Numbers are not handled well. Audio for text such as "100" or "20%" is garbled.
- When a piece of text is too long, the latter portion somehow becomes pure noise. Consider breaking the text into shorter sentences.
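One possible workaround for the punctuation and length glitches is to clean and split each message before synthesis. The sketch below is untested against ChatTTS itself. strip_unspoken, split_message, KEEP_TOKENS, and the 50-character default are all hypothetical choices, and numbers are left untouched.

```python
import re

# Bracketed tokens to keep; '[laugh]' is the one produced by refine_chat_text.
KEEP_TOKENS = {'[laugh]'}
TOKEN_RE = re.compile(r'(\[[^\[\]]*\])')

def strip_unspoken(text):
    """Drop other [..] tags, stray brackets, and ellipses that may be read aloud."""
    cleaned = []
    for part in TOKEN_RE.split(text):
        if TOKEN_RE.fullmatch(part):
            if part in KEEP_TOKENS:
                cleaned.append(part)
            continue
        part = re.sub(r'\.{2,}|…+', ',', part)    # ellipsis -> comma
        part = re.sub(r'[()（）\[\]]', '', part)   # leftover brackets
        cleaned.append(part)
    return ''.join(cleaned).strip()

def split_message(text, max_chars=50):
    """Split at sentence-ending punctuation, then pack sentences into chunks.

    A single sentence longer than max_chars is kept whole.
    """
    # Zero-width splits: after CJK/ASCII sentence marks, or after '.' before a space.
    pieces = re.split(r'(?<=[。！？!?；;])|(?<=\.)(?=\s)', text)
    chunks, current = [], ''
    for sentence in (p for p in pieces if p.strip()):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ''
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

print(strip_unspoken('ok (really)... [laugh] [捂脸]'))
print(split_message('Hi there. How are you? Fine.', max_chars=12))
```

Each chunk could then be passed to talk_one() under the same speaker name, so the voice stays consistent across the pieces.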