Generate human-chat-style audio from text


A quick test of ChatTTS.

Project repo:


A Conversation class that keeps each speaker's voice fixed across messages. Call talk_all() to convert a list of (name, message) tuples into an audio conversation.

class Conversation:
    def __init__(self, chattts_model):
        # Maps a speaker name to the speaker embedding sampled for that name.
        self.speakers = {}
        self.chattts_model = chattts_model

    def talk_one(self, name, message):
        if name not in self.speakers:
            self.speakers[name] = self.chattts_model.sample_random_speaker()
            print(f'sample new speaker: {name}')
        else:
            print('existing speaker')
        # Reuse the stored embedding so the same name always gets the same voice.
        spk_emb = self.speakers[name]
        params_infer_code = {'spk_emb': spk_emb}
        wavs = self.chattts_model.infer([message], params_infer_code=params_infer_code)
        return wavs[0]

    def talk_all(self, chats):
        wavs = []
        for name, message in chats:
            print(name, message)
            wav = self.talk_one(name, message)
            wavs.append(wav)
        return wavs

Output audio files to WAV and MP3 given a file prefix:

import numpy as np
from scipy.io import wavfile
from pydub import AudioSegment

def output_audio(wavs, output_prefix):
    output_file = f'{output_prefix}.wav'
    # Each item in wavs is treated as a 2-D array (channel, samples);
    # join them along the sample axis to form one continuous track.
    audio_data = np.concatenate(wavs, axis=1)
    # ChatTTS generates audio at 24 kHz.
    sample_rate = 24_000
    # Save the first (only) channel as a WAV file
    wavfile.write(output_file, sample_rate, audio_data[0])
    # Convert the WAV file to MP3 (pydub relies on ffmpeg for this step)
    audio = AudioSegment.from_wav(output_file)
    audio.export(f'{output_prefix}.mp3', format='mp3')

Parse HuluNote

I use HuluNote to log messages in the WeChat groups that I manage. It can export messages in JSON format. Some tweaking is needed before passing the text to ChatTTS.

Functions to parse HuluNote JSON:

import re

# The HuluNote 葫芦笔记 string format.
def parse_chat_log(chat_log):
    # pattern = r"- (.*?):: (.*?)$"
    pattern = r"^(.*?):: (.*?)$"
    matches = re.findall(pattern, chat_log, re.MULTILINE)

    parsed_messages = []
    for match in matches:
        name = match[0]
        message = match[1]
        parsed_messages.append((name, message))

    return parsed_messages

def refine_chat_text(line):
    # Map WeChat emoji codes to ChatTTS control tokens.
    replace_texts = {
        '[偷笑]': '[laugh]',
    }
    for k, v in replace_texts.items():
        line = line.replace(k, v)
    return line
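
As a quick sanity check, here is a toy run of the two helpers. The sample log is made up, and the helpers are condensed copies of the ones above so that the snippet runs on its own; the only assumption is the "name:: message" line format that the regex already expects.

```python
import re

def parse_chat_log(chat_log):
    # One message per line, written as "name:: message".
    pattern = r"^(.*?):: (.*?)$"
    return re.findall(pattern, chat_log, re.MULTILINE)

def refine_chat_text(line):
    # Replace a WeChat emoji code with a token ChatTTS understands.
    return line.replace('[偷笑]', '[laugh]')

# Made-up sample, not taken from a real export.
sample = "Alice:: That was funny [偷笑]\nBob:: Agreed"

parsed = parse_chat_log(refine_chat_text(sample))
print(parsed)
```

Each tuple in the result has the (name, message) shape that talk_all() consumes.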


Parse HuluNote:

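import json

fn = '5-31-2024,导出葫芦笔记数据.json'
with open(fn, encoding='utf-8') as f:
    raw = json.load(f)

# Pick the chat log of a single day.
chatlog = []
for day in raw:
    if day['title'] == '2024-05-27':
        chatlog = day['children']

chats = []
for c in chatlog:
    # Assumed intent: skip entries that carry raw message metadata
    # (sender/receiver fields) instead of chat text.
    if ('fromusername' in c['string']) and ('tousername' in c['string']):
        continue
    # print(c['string'])
    chats.extend(parse_chat_log(refine_chat_text(c['string'])))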
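
The loop above reads only three keys from the export: `title` on each day, `children` under it, and `string` on each child. The sketch below spells out that assumed shape with a made-up sample. The shape is inferred from the keys used here, not from any HuluNote documentation, and `select_day` and `is_metadata` are hypothetical helper names.

```python
import json

# Made-up sample that mirrors the keys the parsing loop reads.
sample_export = """
[
  {"title": "2024-05-26", "children": [{"string": "Alice:: hi"}]},
  {"title": "2024-05-27", "children": [
    {"string": "Alice:: hello"},
    {"string": "<msg fromusername='a' tousername='b'/>"},
    {"string": "Bob:: hey"}
  ]}
]
"""

def select_day(raw, title):
    """Return the children of the day whose title matches, or an empty list."""
    for day in raw:
        if day.get("title") == title:
            return day.get("children", [])
    return []

def is_metadata(entry):
    # Assumption: entries containing both markers are metadata, not chat text.
    s = entry["string"]
    return "fromusername" in s and "tousername" in s

raw = json.loads(sample_export)
entries = [c["string"] for c in select_day(raw, "2024-05-27") if not is_metadata(c)]
print(entries)
```

Returning an empty list for a missing date avoids the NameError that the plain loop would raise when no day matches.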

Init model:

import torch
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True

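import ChatTTS
chat = ChatTTS.Chat()
# Load the model weights before inference. The method name depends on the
# ChatTTS release: early releases use load_models(), newer ones use load().
chat.load_models()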

Put them all together:

conv = Conversation(chat)
wavs = conv.talk_all(chats)
output_audio(wavs, 'complaints-group-consistent-speakers')

Example and Glitches

Here's one example I generated from a WeChat group chat log.

There are some glitches in my initial test:

  • Punctuation is not handled well. The model tries to pronounce "(", "[", and "...".
  • Numbers are not handled well. The audio for text such as "100" or "20%" is garbled.
  • When a piece of text is too long, the latter portion somehow turns into pure noise. Consider breaking the text into shorter sentences.
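
A possible workaround for the first and third glitches is to clean and split the text before inference. The sketch below is not part of ChatTTS and I have not measured how much it helps. `strip_noisy_punctuation` and `split_text` are hypothetical helper names, the 80-character limit is an arbitrary guess, and the rule that lowercase bracketed tokens such as `[laugh]` are control tokens worth keeping is my assumption.

```python
import re

# Assumption: lowercase ASCII tokens in square brackets are control tokens.
CONTROL_TOKEN = re.compile(r"(\[[a-z0-9_]+\])")
# Brackets, parentheses and ellipses that the model tended to read aloud.
NOISY = re.compile(r"[()\[\]（）【】]|\.{2,}|…+")
# Split after Chinese sentence marks, or after ASCII marks followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？；])|(?<=[.!?;])\s+")

def strip_noisy_punctuation(text):
    parts = CONTROL_TOKEN.split(text)
    # Odd indices hold the captured control tokens; leave those untouched.
    cleaned = [p if i % 2 else NOISY.sub(" ", p) for i, p in enumerate(parts)]
    return re.sub(r"\s+", " ", "".join(cleaned)).strip()

def split_text(text, max_chars=80):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as one chunk.
    """
    sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(strip_noisy_punctuation("Well (sigh)... ok [laugh] [偷笑]"))
print(split_text("One. Two three! Four five six?", max_chars=20))
```

Each chunk could then go through talk_one() under the same speaker name, so the voice stays the same across chunks. Numbers are left alone here; spelling them out as words would need a separate step.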

