ChatTTS to replay WeChat group messages

Generate human chat style audio from texts

A quick test of ChatTTS.

Project repo: github.com/2noise/ChatTTS

Utilities

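A Conversation class that keeps each speaker's voice consistent: the first time a name appears, it samples a random speaker embedding and reuses it for all later messages from that name. Call talk_all() to convert a list of (name, message) tuples into a list of audio clips.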

class Conversation:
    def __init__(self, chattts_model):
        self.speakers = {}
        self.chattts_model = chattts_model
    def talk_one(self, name, message):
        if name not in self.speakers:
            self.speakers[name] = self.chattts_model.sample_random_speaker()
            print(f'sample new speaker: {name}')
        else:
            print('existing speaker')
        spk_emb = self.speakers[name]
        params_infer_code = {'spk_emb': spk_emb}
        # Use the model passed to the constructor, not a global variable.
        wavs = self.chattts_model.infer([message], params_infer_code=params_infer_code)
        return wavs[0]
    def talk_all(self, chats):
        wavs = []
        for name, message in chats:
            print(name, message)
            wav = self.talk_one(name, message)
            wavs.append(wav)
        return wavs

Output audio files to WAV and MP3 given a file prefix:

import numpy as np
from scipy.io import wavfile
from pydub import AudioSegment

def output_audio(wavs, output_prefix):
    output_file = f'{output_prefix}.wav'
    # Join the per-message clips along the time axis into one long clip.
    audio_data = np.concatenate(wavs, axis=1)
    # Sample rate used for the ChatTTS output.
    sample_rate = 24_000
    # Save the audio as a WAV file.
    wavfile.write(output_file, sample_rate, audio_data[0])
    # Convert the WAV file to MP3.
    audio = AudioSegment.from_wav(output_file)
    audio.export(f'{output_prefix}.mp3', format='mp3')
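To make the concatenation step concrete, here is a minimal sketch with placeholder arrays instead of real model output. It assumes each clip is a NumPy array of shape (1, n_samples), which is what the axis=1 concatenation and the audio_data[0] indexing above imply; the clip lengths are made up.

```python
import numpy as np

# Stand-ins for model output: three silent clips, each shaped (1, n_samples).
wavs = [
    np.zeros((1, 24_000), dtype=np.float32),
    np.zeros((1, 12_000), dtype=np.float32),
    np.zeros((1, 6_000), dtype=np.float32),
]

# Joining along axis=1 appends the clips one after another in time.
audio_data = np.concatenate(wavs, axis=1)
print(audio_data.shape)  # (1, 42000)

# Row 0 is the mono track that gets written to disk.
sample_rate = 24_000
duration_s = audio_data[0].shape[0] / sample_rate
print(duration_s)  # 1.75
```

The clips are joined back to back with no pause in between, so the result plays like one continuous recording.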

Parse HuluNote

I use hulunote.com to log messages in the WeChat groups that I manage. It can export messages in JSON format. Some tweaking is needed before passing the text to ChatTTS.

Functions to parse HuluNote JSON:

import re

# The HuluNote 葫芦笔记 string format.
def parse_chat_log(chat_log):
    # pattern = r"- (.*?):: (.*?)$"
    pattern = r"^(.*?):: (.*?)$"
    matches = re.findall(pattern, chat_log, re.MULTILINE)

    parsed_messages = []
    for match in matches:
        name = match[0]
        message = match[1]
        parsed_messages.append((name, message))

    return parsed_messages

def refine_chat_text(line):
    replace_texts = {
        '[偷笑]': '[laugh]'
    }
    for k, v in replace_texts.items():
        line = line.replace(k, v)
    return line
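As a quick check of the parsing logic, the sketch below applies the same regular expression and the same replacement to a made-up chat log. The sample text and names are invented for illustration and may not match a real HuluNote export exactly; the pattern and the replacement table are restated so the snippet runs on its own.

```python
import re

# Invented sample in the "name:: message" shape the pattern expects.
sample = "Alice:: hello there\nBob:: fine [偷笑]\nnot a chat line"

# Same pattern as parse_chat_log: one (name, message) pair per matching line.
pattern = r"^(.*?):: (.*?)$"
pairs = re.findall(pattern, sample, re.MULTILINE)
print(pairs)
# [('Alice', 'hello there'), ('Bob', 'fine [偷笑]')]

# Same replacement as refine_chat_text, applied to each message.
replace_texts = {'[偷笑]': '[laugh]'}
refined = []
for name, message in pairs:
    for k, v in replace_texts.items():
        message = message.replace(k, v)
    refined.append((name, message))
print(refined)
# [('Alice', 'hello there'), ('Bob', 'fine [laugh]')]
```

Lines without the ":: " separator are simply skipped, which is why the last line of the sample does not show up in the result.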

Application

Parse HuluNote:

import json

fn = '5-31-2024,导出葫芦笔记数据.json'
with open(fn, encoding='utf-8') as f:
    raw = json.load(f)

# Pick the chat log of a single day.
chatlog = []
for day in raw:
    if day['title'] == '2024-05-27':
        chatlog = day['children']
        break

chats = []
for c in chatlog:
    if ('fromusername' in c['string']) and ('tousername' in c['string']):
        continue
    # print(c['string'])
    chats.extend(parse_chat_log(c['string']))

Init model:

import torch
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')

import ChatTTS
chat = ChatTTS.Chat()
chat.load_models()

Put them all together:

conv = Conversation(chat)
wavs = conv.talk_all(chats)
output_audio(wavs, 'complaints-group-consistent-speakers')

Example and Glitches

Here's one example I generated from a WeChat group chat log.

There are some glitches in my initial test:

  • Punctuation is not handled well. The model tries to pronounce "(", "[", and "...".
  • Numbers are not handled well. The audio for text like "100" or "20%" is garbled.
  • When a piece of text is too long, the latter portion of the audio turns into pure noise. Breaking the text into shorter sentences may help.
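One possible workaround for the punctuation and length glitches is to preprocess each message before synthesis. The sketch below is untested against ChatTTS itself: clean_text, split_text, and the max_chars value of 50 are hypothetical names and settings I made up for illustration. It leaves square brackets alone so that tokens such as [laugh] survive, and it does not attempt to fix numbers.

```python
import re

def clean_text(text):
    # Hypothetical helper: drop ellipses and parentheses, keep the words inside.
    text = re.sub(r"\.{2,}|…+", " ", text)
    text = re.sub(r"[()（）]", " ", text)
    # Collapse the leftover runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

def split_text(text, max_chars=50):
    # Hypothetical helper: cut after sentence-ending punctuation, then pack
    # consecutive sentences into chunks of at most max_chars characters.
    # A single sentence longer than max_chars is kept as one chunk.
    sentences = [s for s in re.split(r"(?<=[。！？!?；;])", text) if s]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks]

print(clean_text("wait (really)... ok"))
# wait really ok
print(split_text("一二三。四五六！七八九？", max_chars=8))
# ['一二三。四五六！', '七八九？']
```

Each chunk could then be passed to talk_one() under the same speaker name, so a long message becomes several short clips in the same voice.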
