Wednesday, December 14, 2011

Future of Apps: Text to Speech


Text to speech is a technology whose roots go back to the mechanical speaking machines of the 1700s. A text-to-speech system turns text into speech using a speech synthesizer, which today is usually a piece of software. In short, a text-to-speech synthesizer reads out loud whatever text is fed into it.
A text-to-speech engine has two major components: the front-end and the back-end.
The front-end has two main jobs. The first is text normalization, which converts digits, numbers and abbreviations into written-out words. The second is text-to-phoneme conversion, which assigns a phonetic transcription to each written-out word and divides the text into phrases, clauses and sentences. Together, these form the symbolic linguistic representation, which is the output of the front-end.
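The two front-end steps can be sketched in a few lines of Python. This is only a toy illustration, not how any particular engine works: the abbreviation table, the 0 to 99 number range and the small phoneme dictionary (written in ARPAbet-style symbols) are all assumptions made up for the example, and a real front-end would use far larger dictionaries and rules.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Assumed ARPAbet-style transcriptions for a handful of words.
PHONEMES = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "forty": ["F", "AO", "R", "T", "IY"],
    "two": ["T", "UW"],
}


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalize(text):
    """Step 1: turn abbreviations and small numbers into written-out words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token) and int(token) < 100:
            words.extend(number_to_words(int(token)).split())
        else:
            # Strip punctuation from ordinary words.
            cleaned = re.sub(r"[^a-z']", "", token)
            if cleaned:
                words.append(cleaned)
    return words


def to_phonemes(words):
    """Step 2: look up a phonetic transcription for each word."""
    # Words missing from the toy dictionary are marked as unknown.
    return [PHONEMES.get(word, ["<unk>"]) for word in words]


words = normalize("Dr. Smith lives at 42 Elm St.")
print(words)
print(to_phonemes(words))
```

Words that are missing from the toy dictionary come back as `<unk>`. A real system would fall back on letter-to-sound rules at that point instead of giving up.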
Next, the back-end takes over. This is where the symbolic linguistic representation is converted into sound. Here, the phrases, clauses and sentences are given stress and emphasis, and the software computes the pitch and duration of the sounds that make up the speech. That is how the text becomes speech.
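To make the pitch and duration step more concrete, here is a minimal Python sketch that attaches a duration and a pitch target to each phoneme. Every number in it is an assumption chosen for illustration (60 ms consonants, 120 ms vowels, a pitch that falls from 140 Hz to 100 Hz across the phrase, and a boost for stressed sounds), and the function name `assign_prosody` is hypothetical. Real back-ends use much richer models.

```python
# Assumed set of vowel symbols (ARPAbet-style), for illustration only.
VOWELS = {"AA", "AO", "ER", "IY", "UW"}


def assign_prosody(phonemes, stressed=(), start_hz=140.0, end_hz=100.0):
    """Return a (phoneme, duration_ms, pitch_hz) target for each phoneme.

    `stressed` holds the positions of the phonemes that carry stress.
    """
    targets = []
    last = max(len(phonemes) - 1, 1)  # avoid dividing by zero for one phoneme
    for i, phoneme in enumerate(phonemes):
        # Vowels are held longer than consonants.
        duration_ms = 120 if phoneme in VOWELS else 60
        # Pitch falls in a straight line from start_hz to end_hz.
        pitch_hz = start_hz + (end_hz - start_hz) * i / last
        if i in stressed:
            # Stressed sounds are made longer and higher.
            duration_ms = int(duration_ms * 1.5)
            pitch_hz += 20.0
        targets.append((phoneme, duration_ms, round(pitch_hz, 1)))
    return targets


# The word "two" with stress on its vowel.
for target in assign_prosody(["T", "UW"], stressed={1}):
    print(target)
```

A waveform generator would then take these targets and produce the actual audio.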
Two criteria determine the quality of a speech synthesizer: naturalness and intelligibility. Naturalness describes how closely the synthesized speech resembles human speech, whereas intelligibility measures how easily it can be understood by a listener. An ideal speech synthesizer is both natural and intelligible. The two main technologies that aim to maximize these qualities are concatenative synthesis and formant synthesis.