**long Xue is qualified to endorse.

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

**long Xue: Is registered as an author of this paper.
Can endorse for cs.AI, cs.CL, cs.LG, cs.SD, eess.AS.

Yayue Deng, Yicheng Han, Yingming Gao and Ya Li are not registered as owners of this paper.