Technology

Overview

Lessac Technologies has developed a unique, patented, automated, and paradigm-shifting method for converting plain text files into human-quality speech. Unlike the monotone voices of AT&T, IBM, Nuance, and others, the Lessac text-to-speech (TTS) engine, or voice synthesizer, automatically synthesizes a specific prosodic male or female voice with typical human-sounding expressive changes in pitch, energy, and pacing, while relying on plain, un-annotated text as its sole input. The Lessac automated narrator system parses plain text into words, sentences, and paragraphs to be spoken expressively and with meaning, while remaining consistently recognizable as speech from the same "person."

Lessac Technology

The Company's unique technology is based on more than 50 years of voice research conducted by the late renowned speech professor and practitioner Arthur Lessac. Arthur's kinesensic methods yield prosodic speech pronunciation that emphasizes the inherent musicality and expressiveness of human speech. Beginning in the 1930s, he investigated how the human body and voice function naturally and instinctively, and he was among the most highly regarded teachers of voice, speech, singing, and movement. Among the three generations of actors, singers, and dancers touched by his teaching are Morris Carnovsky, Irene Dailey, Michael Douglas, Faye Dunaway, Nina Foch, George Grizzard, Carol Haney, Linda Hunt, Frank Langella, Christopher Lloyd, Catherine Malfitano, Michael O'Keefe, Peter Scolari, Martin Sheen, and Beatrice Straight. His two books, The Use and Training of the Human Voice: A Bio-Dynamic Approach to Vocal Life, and Body Wisdom: The Use and Training of the Human Body, have become required reading for countless students and remain a lasting contribution to the field of acting and performing. Arthur established the Lessac Training and Research Institute to extend his legacy.

Reproducible Results

Professor Lessac developed his annotation system to identify where pronunciations of the text could be altered by the speaker so as to convey additional expressive information through pitch change, added tonality, and rhythmic articulation, all directed toward improving the listener's perception and understanding of the speaker's intended message. This has proven extremely valuable to actors, singers, and orators, and has also been shown to be a reliably reproducible method for enhancing communication. Reproducible means that for three individuals A, B, and C, each trained in his methods, A may annotate a text to be produced in a particular style, B may pronounce the text according to A's annotations without further instructions, and C may recover A's original annotation with high accuracy from listening to B's speech. The decisions annotator A makes about a specific annotation of the text remove the possible ambiguities in words, phrases, and complex sentences. The speaker B, by pronouncing the text according to these annotations, conveys an unambiguous message to the listener C, so that C can accurately decode the meaning and expressive form intended by the annotator A. The proven reproducibility of Arthur Lessac's basic work is the starting point for the Company's quantitative modeling of an individual's voice to be used in speech synthesis.
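
As a purely illustrative aid, the A-to-B-to-C round trip can be thought of as an agreement measurement between A's original annotations and the annotations C recovers by ear. The short Python sketch below uses invented label names; it is not Arthur Lessac's actual notation.

    # Illustrative only: scoring how accurately listener C recovers
    # annotator A's markup after speaker B performs it. The label names
    # are hypothetical, not the Lessac notation itself.

    def annotation_recovery_rate(original, recovered):
        """Fraction of A's annotation decisions that C recovered exactly."""
        if len(original) != len(recovered):
            raise ValueError("annotation sequences must cover the same text")
        matches = sum(a == c for a, c in zip(original, recovered))
        return matches / len(original)

    # A annotates a phrase; B speaks it; C writes down what he hears.
    a_marks = ["rising-pitch", "sustain", "neutral", "falling-pitch"]
    c_marks = ["rising-pitch", "sustain", "sustain", "falling-pitch"]
    print(f"recovery rate: {annotation_recovery_rate(a_marks, c_marks):.0%}")  # 75%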

Lessemes - More Useful Than Phonemes

Successful production of natural-sounding synthesized speech requires a sufficiently accurate and complete set of graphic symbols relating the input text to be pronounced to the corresponding speech utterances heard by listeners. For several decades, research and development efforts in text-to-speech synthesis systems have focused on the use of phonemes to represent the text to be synthesized. The Company, rather than adopting phonemes, has derived a new set of graphic symbols that we call Lessemes. When the Lesseme system is used for annotating text to be synthesized, it explicitly captures the musicality of speech. The fundamental concept underlying Lessac Technologies is that the manually applied phonosensory symbol system for expressive speech, as conceived by Arthur Lessac and as enhanced and extended by the Company, may be accomplished with fully automated processes. In text to speech, the Company's development work has centered on two major software components (a brief code sketch follows the list):

  1. a linguistic front-end which takes plain text in digital form as input, and outputs a sequence of prosodic and phonosensory graphic symbols (Lessemes) representing how the text is to be synthesized as speech; and
  2. a signal processing back-end which takes prosodic and phonosensory graphic symbols as input to produce near human-sounding synthesized speech as output to be rendered or played (e.g. streaming audio, or a .wav or .mp3 file).
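
A minimal sketch of this two-stage split is shown below. All names and the placeholder logic are hypothetical illustrations of the division of labor, not the Company's actual interfaces.

    # Hypothetical sketch of the front-end/back-end split described above.

    from typing import List

    Lesseme = str  # placeholder; each Lesseme also carries a feature bundle

    def linguistic_front_end(plain_text: str) -> List[Lesseme]:
        """Parse plain text into a Lesseme stream (placeholder logic)."""
        # The real front-end applies phonetic, syntactic, and discourse
        # rules at the paragraph level; splitting on whitespace merely
        # stands in for that analysis.
        return [word.lower() for word in plain_text.split()]

    def signal_processing_back_end(lessemes: List[Lesseme]) -> bytes:
        """Select and concatenate speech snippets; return rendered audio."""
        # The real back-end performs unit selection over an acoustic
        # database; an empty byte string stands in for .wav/.mp3 output.
        return b""

    audio = signal_processing_back_end(linguistic_front_end("Hello, world."))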

As mentioned above, typical TTS engines rely on a phoneme- and context-labeled database of speech snippets. There are 54 phonemes in the IPA (International Phonetic Alphabet) phoneme set for American English. Current TTS engines rely on this limited phoneme set and relatively uninformed classification methods, such as CART (classification and regression tree) analysis, to sub-classify the speech snippets belonging to a specific phoneme subset of the acoustic database into separate selection bins. When used to synthesize speech, this algorithmically classified database is searched, and the selected speech snippets are concatenated in a new and different order. To produce tolerable speech, providers of presently available TTS synthesis have found it necessary to constrain the intonational range and rhythms found in natural speech. The result has been synthesis that is flat, stilted, and robotic.
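
For a rough sense of this conventional approach, the sketch below uses a scikit-learn regression tree to bin recorded snippets of a single phoneme by simple context features. The features and data are invented for the example.

    # Toy illustration of CART-style context binning used by conventional
    # TTS engines; the data and feature choices are invented.

    from sklearn.tree import DecisionTreeRegressor

    # Each row: [is_stressed, is_phrase_final, preceding_sound_is_vowel]
    contexts = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
    durations_ms = [95.0, 60.0, 140.0, 110.0]  # measured snippet durations

    # The fitted tree partitions the snippets into leaf "bins" by asking
    # yes/no questions about context, then predicts an acoustic property.
    tree = DecisionTreeRegressor(max_depth=2).fit(contexts, durations_ms)
    print(tree.predict([[1, 1, 1]]))  # predicted duration for a new context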

The Lessac TTS engine is different in that our speech snippet labels include prosodic information directly. In our synthesizer there are over 800 different Lessemes in use. This enables each acoustic unit to be more precisely aligned with, and prosodically labeled according to, the text to be synthesized as speech. In addition to the machine-readable output used by the signal processing back-end, our linguistic front-end produces a human-readable graphic output stream that can be thought of as notated text plus a musical score. Lessac used this human-readable form as the combined script and score for the Lessac-trained voice model during the recording of the first acoustic database, thus creating a direct mapping between each speech snippet in the acoustic database and its Lesseme label, rather than relying on algorithmic classification models. For subsequent TTS voices, we have used existing audio books as the speech source. These audio books were recorded by voice actors who had not been trained in the Lessac methods. Our front-end linguistic model has proven sufficiently robust to Lesseme-label these audio book recordings, and the resulting text-to-speech voices have proven to be quite good. A Lessac TTS voice can be made from a sufficient quantity of any pre-existing high-quality voice recordings from the same speaker.

Linguistic Front-end

The linguistic front-end is a rules-based system that derives prosody labeling (Lessemes) from plain text input. The prosody labeling rules currently operate at the paragraph level and are based on expert linguistic knowledge from a wide variety of fields, including phonetics, phonology, morphology, syntax, light semantics, and discourse. Put simply, the Lessac front-end predicts prosody and in turn labels the text, building from, at the lowest level, letters, spaces, and punctuation marks. These letters, spaces, and punctuation marks are interpreted by the front-end and assembled into syllables, words, phrases, sentences, and paragraphs to be spoken, along with context-aware prosodic labeling for appropriate intonation, emphasis, inflection, co-articulations, and pauses. Once this assembly process has been completed, the predicted speech is rendered as a Lesseme stream to be spoken by the synthesizer. Each Lesseme in this stream also has a linguistic feature bundle associated with it. The feature bundle contains all of the associated linguistic information that the front-end rule system is able to extract directly or indirectly from the text, including items such as specific sound class (e.g., nasal, vowel, voiceless fricative), local context (e.g., the preceding and following Lesseme within the text), and global context (e.g., suprasegmental intonations and phrase-final effects).
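
As a data structure, a labeled Lesseme might look something like the sketch below; the field names simply mirror the categories listed above and are not taken from the Company's implementation.

    # Hypothetical data-structure sketch of a Lesseme and its feature
    # bundle; field names mirror the categories described in the text.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class FeatureBundle:
        sound_class: str                    # e.g., "nasal", "vowel", "voiceless fricative"
        prev_lesseme: Optional[str] = None  # local context: preceding Lesseme
        next_lesseme: Optional[str] = None  # local context: following Lesseme
        global_context: dict = field(default_factory=dict)
        # global_context holds, e.g., suprasegmental intonation
        # and phrase-final effects.

    @dataclass
    class LabeledLesseme:
        symbol: str              # one of the 800+ Lesseme labels
        features: FeatureBundle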

The Lesseme stream and feature bundles constitute the linguistic front-end's speech prediction and are used to drive the signal processing back-end. All of the information provided to drive the signal processing back-end is categorical: it represents our understanding of the speech sounds, but it is not directly measurable in the way that pitch, energy, duration, and other acoustic features can be measured. The Lessac linguistic front-end makes categorical and relative predictions about acoustic feature values. In other text-to-speech engines, pitch and duration are often computed from explicit pitch and duration models, most frequently using CART or other similar methods. By contrast, the Lessac front-end provides a very rich and specific categorical set of context-aware Lesseme labels. There are over 800 Lesseme labels for reportorial American English, with extensive further context specification for each Lesseme label provided by its associated linguistic feature bundle.

Signal Processing Back-end

The Lessac signal processing back-end uses conventional unit selection and concatenation approaches melded with Lessac-specific technology to synthesize speech. It consists of four major components:

  1. A hierarchical target costing model based on the linguistic front-end Lesseme labels and associated feature bundles. This model estimates how close a specific candidate speech snippet in the acoustic database is to the ideal speech snippet implied by the front-end linguistic categorization. This is a Lessac-specific approach.
  2. A multi-dimensional join costing model which estimates how well a specific speech snippet candidate would fit with each potential adjacent speech snippet candidate once concatenated. The acoustic speech signal characteristics considered and weighted in this join cost model include F0, energy, duration, MFCCs, and the first and second derivatives of each. These characteristics are directly derived from each speech snippet and serve as proxy measures of auditory features as perceived by the human ear/body/brain system. For instance, MFCCs approximate a spectral representation believed to exist as excitation patterns on the basilar membrane, while F0 and energy are physical proxies for the sensations of pitch and loudness.
  3. An optimal trajectory module, based on a hierarchical mixtures-of-experts model, which derives from the candidate speech snippets the ideal prosodic speech path that most closely approximates the lowest weighted target costs. Specific speech snippets that are both low in overall weighted cost (target and join) and close to the optimal trajectory are selected as the speech snippets for concatenation. This is also a Lessac-specific approach.
  4. A concatenation module that morphs and pastes together the specific speech snippets comprising the optimal trajectory (a generic sketch of the selection step follows this list).
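
To make components 1 through 3 concrete, the sketch below shows a generic dynamic-programming unit selection loop that minimizes the sum of target and join costs. The cost functions are toy stand-ins; the Company's hierarchical target model and mixtures-of-experts trajectory module are substantially richer, so this shows only the general shape of the computation.

    # Generic unit-selection sketch: choose one snippet per position so that
    # the total of target costs (distance from the ideal unit) and join
    # costs (mismatch at each concatenation point) is minimized. Toy code,
    # not the Lessac implementation.

    def select_units(candidates, target_cost, join_cost):
        """candidates[i] is the list of candidate snippets for position i."""
        n = len(candidates)
        # best[i][j] = (lowest cost ending at candidate j of position i, backpointer)
        best = [[(target_cost(0, c), None) for c in candidates[0]]]
        for i in range(1, n):
            row = []
            for c in candidates[i]:
                prev = [best[i - 1][k][0] + join_cost(p, c)
                        for k, p in enumerate(candidates[i - 1])]
                k_best = min(range(len(prev)), key=prev.__getitem__)
                row.append((prev[k_best] + target_cost(i, c), k_best))
            best.append(row)
        # Backtrack from the cheapest final candidate.
        j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
        path = [candidates[-1][j]]
        for i in range(n - 1, 0, -1):
            j = best[i][j][1]
            path.append(candidates[i - 1][j])
        return list(reversed(path))

    # Toy usage: snippets are (F0, energy) pairs; the target cost is uniform,
    # and the join cost penalizes pitch and energy discontinuities.
    cands = [[(100, 1.0), (140, 0.8)], [(110, 0.9), (90, 1.1)]]
    best_path = select_units(cands,
                             target_cost=lambda i, c: 0.0,
                             join_cost=lambda p, c: abs(p[0] - c[0]) + abs(p[1] - c[1]))
    print(best_path)  # [(100, 1.0), (110, 0.9)]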

A more detailed technical description of the Lessac automated narrator system is available in Lessac's Blizzard Challenge white papers for 2010 and 2011, as well as in the list of patents below.

Patents

The following U.S. patents have been granted:

  1. Expressive Parsing in Computerized Conversion of Text to Speech; U.S. Patent No. 6,847,931 B2; January 25, 2005.
  2. Text to Speech; U.S. Patent No. 6,865,533 B2; March 8, 2005.
  3. Speech Training Method With Alternative Proper Pronunciation Database; U.S. Patent No. 6,963,841 B2; November 8, 2005.
  4. Method of Recognizing Spoken Language with Recognition of Language Color; U.S. Patent No. 7,280,964 B2; October 9, 2007.
  5. Prosodic Speech Text Codes and Their Use in Computerized Speech Systems; U.S. Patent No. 7,877,259; January 25, 2011.
  6. System-Effected Text Annotation for Expressive Prosody in Speech Synthesis and Recognition; U.S. Patent No. 8,175,879; May 8, 2012.
  7. Computerized Speech Synthesizer for Synthesizing Speech from Text; U.S. Patent No. 8,219,398; July 10, 2012.

The Company also holds additional international patents.

256 Highland Street   West Newton, MA 02465   +1-617-548-1944
info@lessactech.com