How It Works

Open Avatar Chat uses a modular Handler architecture, combining different ASR, LLM, TTS, and Avatar modules through configuration files.

Architecture Overview

The system consists of these core modules:

Client Handler: Manages WebRTC audio/video stream connections
VAD Handler: Voice Activity Detection, identifying speech start/end
ASR Handler: Automatic Speech Recognition, converting speech to text
LLM Handler: Language model inference, generating dialogue responses
Agent Handler: Multi-turn tool-calling Agent, replacing traditional LLM Handler (Beta)
TTS Handler: Text-to-Speech, synthesizing response text into speech
Avatar Handler: Digital human driver, generating facial animations from speech

Performance

Using a PC with an i9-13900KF processor and Nvidia RTX 4090, the average response delay is about 2.2 seconds after ten tests.

The delay is measured from the end of user speech to the start of the digital human's speech, including RTC round-trip time, VAD stop delay, and computation time.

Data Flow

User sends audio/video stream via browser (WebRTC)
VAD detects whether the user is speaking
ASR converts speech to text
LLM/Agent generates response text
TTS converts text to speech
Avatar generates facial animation from speech
Synthesized audio/video stream returns to user via WebRTC

How It Works ​

Architecture Overview ​

Performance ​

Data Flow ​

How It Works

Architecture Overview

Performance

Data Flow