Speech synthesis has many applications, and it is the constant demand for better and more realistic representations of speech that drives the research forward at such a rate. Speech synthesis is used extensively in automated telephone answering services - BT estimates that using it to read out numbers in Directory Enquiries saves them millions of pounds a year.
Speech synthesis involves modelling a combination of the vocal folds (like a string instrument...) and the air breathed out (like a wind instrument...), together with the resonances of the various cavities, which alter the sound and shift the values of the formants, thus affecting the timbre of the sound. It is a complicated system, which is why it is difficult to synthesise, but despite the problems good progress is being made.
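To make this concrete, here is a minimal sketch in Python of the classic source-filter view the paragraph describes: an impulse train stands in for the vocal-fold pulses, and one second-order resonator per formant stands in for the cavities. The sample rate, pitch, and formant figures are illustrative assumptions (rough textbook values for an "ah" vowel), not taken from any particular system.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate in Hz (assumed)
f0 = 120                         # pitch of the glottal source (vocal folds)
duration = 0.5                   # seconds of sound to generate

# Source: an impulse train standing in for the vocal-fold pulses
n = int(fs * duration)
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: one resonator per formant, playing the role of the cavities.
# (centre frequency Hz, bandwidth Hz) - rough figures for an "ah" vowel.
formants = [(700, 130), (1220, 170), (2600, 250)]
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)              # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs             # pole angle from centre frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]  # resonator denominator
    signal = lfilter([1.0 - r], a, signal)    # crude gain scaling

signal /= np.abs(signal).max()                # normalise before playback
```

Changing the formant table changes the vowel while the same source keeps buzzing underneath, which is exactly the division of labour between folds and cavities described above.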
One of the problems facing speech synthesisers is to do with the large number of ways that certain phonemes (the most basic sounds in a language - there are 44 in English) are connected - for example my mouth does not hold the same position for the b in bed as for the b in bid, since I am already starting to shape the following e or i, and this is reflected in the sound of the b. If I were to take samples so that I could just connect the phonemes together, I would need many more than 44 to cover all the subtle variations caused by the preceding and following sounds.
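One common answer is to store transitions rather than single phonemes: so-called diphone units that run from the middle of one phoneme to the middle of the next, so the b-into-e and b-into-i variants are kept apart. Below is a toy sketch of the idea; the inventory, unit names, and synthesise function are all invented for illustration, with random noise standing in for recorded speech.

```python
import numpy as np

# Toy diphone inventory: each key names the *transition* between two
# phonemes, so the "b" moving into an "e" and the "b" moving into an "i"
# are separate recordings. Real units would be cut from recorded speech.
rng = np.random.default_rng(0)
inventory = {
    "b-e": rng.standard_normal(800),
    "e-d": rng.standard_normal(800),
    "b-i": rng.standard_normal(800),
    "i-d": rng.standard_normal(800),
}

def synthesise(phonemes):
    """Concatenate the diphone covering each adjacent phoneme pair."""
    pairs = [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]
    return np.concatenate([inventory[p] for p in pairs])

bed = synthesise(["b", "e", "d"])   # uses the b-e unit, never a bare b
bid = synthesise(["b", "i", "d"])   # uses b-i: a different b entirely
```

The cost is a much larger inventory - on the order of the square of the phoneme count rather than 44 units - which is precisely the explosion of "subtle variations" the paragraph points to.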
Another major problem is the emphasis a human will give to words - by raising or lowering the pitch, by slowing down or speeding up, or by making some words louder than others. Even more difficult to synthesise are the whoops of delight or groans of defeat that we all make, which seem perfectly simple to humans but are a very different matter for computers.
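A crude sketch of word-level emphasis, assuming we already have one word's samples separately; the gain and stretch figures are arbitrary. Notably, the naive resampling used here also shifts the pitch (like playing a tape slower), which hints at why pitch, timing, and loudness are so hard to control independently.

```python
import numpy as np

def emphasise(word_audio, gain=1.5, stretch=1.3):
    """Crudely emphasise one word: make it louder and stretch it out.
    Linear resampling lengthens the word but also lowers its pitch -
    a proper system would need to change timing and pitch separately."""
    n = int(len(word_audio) * stretch)
    slowed = np.interp(np.linspace(0, len(word_audio) - 1, n),
                       np.arange(len(word_audio)), word_audio)
    return gain * slowed

rng = np.random.default_rng(0)
word = rng.standard_normal(1600)   # stand-in for one word's samples
stressed = emphasise(word)         # longer and louder, but pitch drops too
```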