Questions about scripting with the Festival text-to-speech engine

I’m trying to script a series of spoken messages using the Festival text-to-speech engine, the text2wave command in particular. I’ve looked at the basics of how Festival scripting works via scm and xml files, but there are a few things I can’t find any useful information on. If anyone is familiar with the software, I’d like to ask how it’s meant to be used for this.

What I essentially want is to have messages spoken at different points in the resulting audio, using different voices if possible. Something along the lines of: wait 5 seconds, say “foo” in voice X, wait 10 seconds, say “bar” in voice Y. Is it possible to script this in a single scm / xml definition, and are there any examples of how to do it?
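To make the timing concrete, here is a rough Python sketch of the timeline I have in mind. None of it is Festival API: the `Cue` class, the `start_offsets` helper and the one-second speech durations are all made up for illustration, and I’m assuming each wait starts when the previous message ends.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    wait_s: float  # silence before this message starts, in seconds
    text: str      # what should be spoken
    voice: str     # placeholder voice name, not a real Festival voice


def start_offsets(cues, durations_s):
    """Return the start time of each cue in the final audio.

    durations_s holds an assumed spoken length per cue (hypothetical
    values; the real lengths depend on the synthesized speech).
    """
    offsets = []
    t = 0.0
    for cue, duration in zip(cues, durations_s):
        t += cue.wait_s      # wait before speaking
        offsets.append(t)    # message starts here
        t += duration        # advance past the spoken message
    return offsets


cues = [Cue(5, "foo", "X"), Cue(10, "bar", "Y")]
# Assuming each message takes about one second to speak.
print(start_offsets(cues, [1.0, 1.0]))
```

So “foo” would start 5 seconds in, and “bar” would start 10 seconds after “foo” has finished.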

I’d also like to mix in other sounds. Can the script passed to the text2wave command take another wav / ogg file and mix it in with the spoken voices? Overlap is okay… I was thinking of using this to add music without having to do a separate pass with an ffmpeg command.

In addition: is there a way to change the pitch of a voice? I only found a way to set the speed in the scm, using the line (Parameter.set 'Duration_Stretch 1). Do I need to make my own variant of a voice to change the pitch, and if so, how is this done?