=== Deep learning-based synthesis ===
{{Main|Deep learning speech synthesis}}
[[File:Larynx-HiFi-GAN speech sample.wav|thumb|Speech synthesis example using the HiFi-GAN neural vocoder]]
Deep learning speech synthesis uses [[deep neural network]]s (DNNs) to produce artificial speech from text (text-to-speech) or from a spectrogram (as a vocoder). The networks are trained on large amounts of recorded speech and, in the case of a text-to-speech system, on the associated labels and/or input text.

[[15.ai]] uses a ''multi-speaker model'': hundreds of voices are trained concurrently rather than sequentially, which decreases the required training time and enables the model to learn and generalize shared emotional context, even for voices that were never exposed to such emotional context.<ref>{{cite web |last=Temitope |first=Yusuf |date=December 10, 2024 |title=15.ai Creator reveals journey from MIT Project to internet phenomenon |url=https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |access-date=December 25, 2024 |website=[[The Guardian (Nigeria)|The Guardian]] |quote= |archive-url=https://web.archive.org/web/20241228152312/https://guardian.ng/technology/15-ai-creator-reveals-journey-from-mit-project-to-internet-phenomenon/ |archive-date=December 28, 2024}}</ref> The [[deep learning]] model used by the application is [[Nondeterministic algorithm|nondeterministic]]: each time speech is generated from the same string of text, the intonation will be slightly different.
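This nondeterminism can be illustrated with a minimal sketch (hypothetical names, not any particular system's implementation): a probabilistic model samples from a distribution over prosodic parameters, so repeated synthesis of the same text produces slightly different pitch contours.

```python
import random

def synthesize_pitch_contour(text, rng):
    """Toy stand-in for a nondeterministic TTS model's prosody prediction.

    A deterministic "acoustic model" maps the text to a mean pitch contour
    (in Hz); sampling Gaussian noise around it, as a probabilistic model
    would, makes each synthesis of the same text slightly different.
    """
    n_frames = 10 * len(text)
    # Deterministic base contour derived from the text (illustrative only).
    base = [120.0 + (i % 7) for i in range(n_frames)]
    # Stochastic component: the sampled deviation changes on every call.
    return [f0 + rng.gauss(0.0, 3.0) for f0 in base]

rng = random.Random()
a = synthesize_pitch_contour("Hello world", rng)
b = synthesize_pitch_contour("Hello world", rng)
print(a == b)  # almost surely False: same text, different intonation
```

A real system samples in a far higher-dimensional space (latent variables, attention alignments, vocoder noise), but the effect on the output is the same: intonation varies between runs.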
The application also supports manually altering the [[Emotional prosody|emotion]] of a generated line using ''emotional contextualizers'' (a term coined by 15.ai): a sentence or phrase conveying the emotion of the take, which serves as a guide for the model during inference.<ref name="automaton2">{{cite web |last=Kurosawa |first=Yuki |date=2021-01-19 |title=Game character speech synthesis software "15.ai" released. You can have characters from ''Undertale'' and ''Portal'' say any lines you like |url=https://automaton-media.com/articles/newsjp/20210119-149494/ |url-status=live |archive-url=https://web.archive.org/web/20210119103031/https://automaton-media.com/articles/newsjp/20210119-149494/ |archive-date=2021-01-19 |access-date=2021-01-19 |website=AUTOMATON |quote=}}</ref><ref name="Denfaminicogamer2">{{cite web |last=Yoshiyuki |first=Furushima |date=2021-01-18 |title=GLaDOS from ''Portal'' and Sans from ''UNDERTALE'' will read your text aloud. "15.ai", a service that aims to reproduce even the emotion contained in a sentence, is attracting attention |url=https://news.denfaminicogamer.jp/news/210118f |url-status=live |archive-url=https://web.archive.org/web/20210118051321/https://news.denfaminicogamer.jp/news/210118f |archive-date=2021-01-18 |access-date=2021-01-18 |website=Denfaminicogamer |quote=}}</ref>

[[ElevenLabs]] is primarily known for its [[browser-based]], AI-assisted text-to-speech software, Speech Synthesis, which can produce lifelike speech by synthesizing [[vocal emotion]] and [[Intonation (linguistics)|intonation]].<ref>{{Cite web |date=January 23, 2023 |title=Generative AI comes for cinema dubbing: Audio AI startup ElevenLabs raises pre-seed |url=https://sifted.eu/articles/generative-ai-audio-elevenlabs/ |access-date=2023-02-03 |website=Sifted |language=en-US}}</ref> The company states that its software is built to adjust the intonation and pacing of delivery based on the context of the language input.<ref name=":13">{{Cite magazine |last=Ashworth |first=Boone |date=April 12, 2023 |title=AI Can Clone Your Favorite Podcast Host's Voice |url=https://www.wired.com/story/ai-podcasts-podcastle-revoice-descript/
|magazine=Wired |language=en-US |access-date=2023-04-25}}</ref> It uses algorithms to analyze the contextual aspects of text, aiming to detect emotions such as anger, sadness, happiness, or alarm, which enables the system to understand the user's sentiment<ref>{{Cite magazine |author=WIRED Staff |title=This Podcast Is Not Hosted by AI Voice Clones. We Swear |url=https://www.wired.com/story/gadget-lab-podcast-594/ |magazine=Wired |language=en-US |issn=1059-1028 |access-date=2023-07-25}}</ref> and produce more realistic, human-like inflection. Other features include multilingual speech generation and long-form content creation with contextually aware voices.<ref name=":34">{{Cite web |last=Wiggers |first=Kyle |date=2023-06-20 |title=Voice-generating platform ElevenLabs raises $19M, launches detection tool |url=https://techcrunch.com/2023/06/20/voice-generating-platform-elevenlabs-raises-19m-launches-detection-tool/ |access-date=2023-07-25 |website=TechCrunch |language=en-US}}</ref><ref>{{Cite web |last=Bonk |first=Lawrence |title=ElevenLabs' Powerful New AI Tool Lets You Make a Full Audiobook in Minutes |url=https://www.lifewire.com/elevenlabs-new-audiobook-ai-tool-7550061 |access-date=2023-07-25 |website=Lifewire |language=en}}</ref>

DNN-based speech synthesizers are approaching the naturalness of the human voice. Disadvantages of the method include low robustness when training data are insufficient, a lack of controllability, and poor performance in auto-regressive models.
For tonal languages such as Mandarin Chinese or Taiwanese Hokkien, different levels of [[tone sandhi]] are required, and the output of a speech synthesizer can sometimes contain tone sandhi errors.<ref>{{Cite journal |last=Zhu |first=Jian |date=2020-05-25 |title=Probing the phonetic and phonological knowledge of tones in Mandarin TTS models |url=http://dx.doi.org/10.21437/speechprosody.2020-190 |journal=Speech Prosody 2020 |pages=930–934 |location=ISCA |publisher=ISCA |doi=10.21437/speechprosody.2020-190 |arxiv=1912.10915 |s2cid=209444942}}</ref>
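As a simplified illustration of what a synthesizer must get right, the basic Mandarin third-tone sandhi rule can be sketched as follows (a deliberately reduced rule-based version; real systems learn this from data and handle many more contexts):

```python
def apply_third_tone_sandhi(tones):
    """Apply the basic Mandarin third-tone sandhi rule.

    A third tone immediately followed by another third tone is realized
    as a second tone (e.g. "ni3 hao3" is pronounced "ni2 hao3").
    Tones are integers 1-4; the neutral tone is omitted for simplicity.
    """
    out = list(tones)
    # Applied left to right, so a chain 3 3 3 becomes 2 2 3,
    # one common realization of three consecutive third tones.
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out

print(apply_third_tone_sandhi([3, 3]))     # [2, 3]
print(apply_third_tone_sandhi([3, 1, 3]))  # unchanged: [3, 1, 3]
```

A synthesizer that reads tones off the lexicon without modeling this context-dependent rule produces exactly the class of errors described above.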