Text-to-speech is a technology whose origins date back to the 1700s.
A text-to-speech system converts text into speech using a speech
synthesizer, which is usually implemented in software. In short, a
text-to-speech synthesizer reads aloud whatever text is entered into
the system.
A text-to-speech engine has two major components: the front-end and
the back-end.
The front-end performs two major tasks. The first is text
normalization, which converts digits, numbers, and abbreviations into
their written-out word forms. The second is text-to-phoneme
conversion, which assigns a phonetic transcription to each of the
written-out words and divides the text into phrases, clauses, and
sentences. The output of the front-end is the symbolic linguistic
representation.
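The two front-end tasks can be illustrated with a minimal Python
sketch. The lookup tables, the ARPAbet-style phoneme symbols, and the
function name front_end are all assumptions made for illustration; a
real engine relies on much larger dictionaries and rule sets, and
would read "42" as "forty-two" rather than digit by digit.

```python
# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
# Toy pronunciation dictionary with ARPAbet-style symbols (assumed entries).
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}
PUNCTUATION = ".,;:!?"


def front_end(text):
    """Return a list of phrases; each phrase is a list of (word, phonemes)."""
    phrases, current = [], []
    for token in text.lower().split():
        # Step 1: text normalization (abbreviations and digits become words).
        if token in ABBREVIATIONS:
            current.append(ABBREVIATIONS[token])
            continue
        ends_phrase = token[-1] in PUNCTUATION
        core = token.strip(PUNCTUATION + "\"'")
        if core.isdecimal():
            # Simplification: spell numbers out digit by digit.
            current.extend(DIGIT_WORDS[int(d)] for d in core)
        elif core:
            current.append(core)
        # Phrase division: punctuation closes the current phrase.
        if ends_phrase and current:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    # Step 2: text-to-phoneme lookup; unknown words get a placeholder.
    return [[(w, LEXICON.get(w, ["<unk>"])) for w in phrase]
            for phrase in phrases]


if __name__ == "__main__":
    for phrase in front_end("Dr. Smith has 2 cats, really."):
        print(phrase)
```

The nested list returned here stands in for the symbolic linguistic
representation: words grouped into phrases, each word paired with its
phonetic transcription.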
Next, the back-end takes over. Here the symbolic linguistic
representation is converted into sound. The phrases, clauses, and
sentences are given stress and emphasis, and the software also
computes the pitch and duration of the speech. That is how the text
is converted into speech.
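As a rough sketch of the back-end's stress, pitch, and duration
step, the Python example below annotates each phoneme of one phrase.
The Segment class, the assign_prosody function, the rule that the
first vowel of each word is stressed, and every numeric value
(durations in milliseconds, pitch in hertz) are assumptions chosen
for illustration, not figures from any particular synthesizer. The
final step of generating the waveform is not shown.

```python
from dataclasses import dataclass

# Assumed subset of ARPAbet-style vowel symbols.
VOWELS = {"AA", "AE", "ER", "IH", "UW"}


@dataclass
class Segment:
    phoneme: str
    duration_ms: int
    pitch_hz: float
    stressed: bool


def assign_prosody(phrase, base_pitch=120.0, pitch_drop=20.0):
    """Annotate each phoneme in a phrase with duration, pitch, and stress.

    phrase is a list of (word, phonemes) pairs. Pitch falls linearly
    from base_pitch to base_pitch - pitch_drop across the phrase.
    """
    total = sum(len(phones) for _, phones in phrase)
    last = max(total - 1, 1)  # avoid division by zero for tiny phrases
    segments = []
    index = 0
    for _, phones in phrase:
        stress_assigned = False
        for p in phones:
            is_vowel = p in VOWELS
            # Toy rule: the first vowel of each word carries the stress.
            stressed = is_vowel and not stress_assigned
            if stressed:
                stress_assigned = True
            if stressed:
                duration = 140
            elif is_vowel:
                duration = 100
            else:
                duration = 60
            pitch = round(base_pitch - pitch_drop * index / last, 1)
            segments.append(Segment(p, duration, pitch, stressed))
            index += 1
    return segments


if __name__ == "__main__":
    phrase = [("two", ["T", "UW"]), ("cats", ["K", "AE", "T", "S"])]
    for seg in assign_prosody(phrase):
        print(seg)
```

A gradually falling pitch is used here only because it is easy to
compute; a real back-end derives pitch and duration from much richer
models of the phrase.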
The quality of a speech synthesizer is judged by two criteria:
naturalness and intelligibility. An ideal speech synthesizer is both
natural and intelligible. Naturalness describes how closely the
synthesized speech resembles human speech, whereas intelligibility
measures how easily the synthesized speech can be understood by a
human listener. The two main technologies that aim to maximize both
qualities are concatenative synthesis and formant synthesis.