Is Voice Cloning the call of the hour?

Guest Column: Niraj Ruparel, Head of Mobile & Emerging Tech, GroupM India; Head of Voice, WPP India shares how voice synthesis (cloning) technology can boost customer engagement and do much more

by Niraj Ruparel
Published: Nov 4, 2020 7:50 AM | 5 min read

The concept of voice synthesis (cloning) is not new. Such products can be offered as on-premise software or deployed in the cloud. Everyone will agree that customers drive the business. Whatever the business, customers decide whether the boat sails or sinks. Hence every business pays a lot of attention to customer service, which takes hours of emailing customers, resolving their queries and explaining processes through long documents with supporting screenshots. Interactive synthesized (cloned) audio can handle much of this work in place of the long hours spent attending to customers.

Interactive virtual assistants are widely accepted as a way to take businesses to another level. These assistants have opened doors for businesses in sectors ranging from education, healthcare, eCommerce, entertainment, telecom, and travel and hospitality to banking and finance, defence and government.

Voice assistants can boost customer engagement with audible product descriptions. They can also enhance the customer experience by interacting with customers in personalized voices. Voice synthesis can be paired with digital avatars that stand in for a person in online meetings, so that the person does not have to be present. Working in teams can then become much easier, with strong communication even in the absence of face-to-face contact. The same approach can be useful for delivering a professional speech.

Due to the outbreak of COVID-19 all over the world, online learning has become popular among students and working professionals. The new normal has generated demand for high-quality digital content in the form of recorded sessions, notes or ebooks that make learning easy. Voice synthesis (cloning) can reduce the burden on educators of recording audio notes for every new session, and can spare them from re-recording when they make a mistake. This can significantly improve the process of imparting knowledge. Teaching and learning can be much better when recorded lectures provide interactive content for online learning. The technology can also be useful for employee grooming and for product training for clients, producing interactive content at minimal operational cost.

The technology is even more relevant for people in the healthcare sector, especially during these times. Voice plays a crucial role in building a trustworthy relationship between healthcare professionals and their patients. A familiar voice can be more comforting for patients.

Voice syncing is an AI technique that allows software to learn the nuances of a person's voice from that person's voice data, which can range from a few seconds to dozens of hours depending on the quality and use case desired. This is a potential game changer for conversational AI and for commerce in general. After all, who wouldn't want to buy a product if Mr Bachchan himself asked them to?

If you are a stakeholder in a business, this voice synthesis technology is for you, whether you belong to healthcare, education, banking and finance or any other industry. The operational cost of professionally recorded audio can be reduced to a large extent, which improves the profitability of your business.

But this is just the start of what would be a paradigm shift in media in general. In his book Life 3.0, author and physicist Max Tegmark foresees a scenario in the not-too-distant future in which computers are able to produce feature-length movies and audio-based content that is hyper-personalized and locally relevant to everyone living on the planet.

This has awe-inspiring applications for education, media, entertainment and communication in general. Being able to produce content in one's own voice at scale (without any recording equipment) and to communicate dynamically in a voice of one's choice (without the owner of that voice actually speaking) is the closest to science fiction we can get today in the digital domain.

Voice syncing works by learning millions of features of your voice and building a template of it, which can then be used to produce any content from just a text input. The AI automatically takes care of features like language, accent, pitch and even expression. Once the AI has learned your voice, it can produce countless hours of audio content in the same style, saving huge amounts of time and cost.

But like any new technology and innovation, this too is a double-edged sword. It has great implications in the hands of creators but dire ones in the hands of hackers and scammers, where its output is aptly named deepfakes. Getting a call from the CEO of your company asking you to transfer large sums of money, when the CEO is actually somewhere else entirely, is just one example. To counter these kinds of problems, some companies have put in place a proper and legally sound onboarding process for any voice, to ensure its moral and ethical use. Creators can now sync their voice securely and produce content at scale by harnessing the power of this form of artificial intelligence.

This technology is still in its infancy, but it definitely invites companies across diverse use cases to start building on this new layer of communication and content production. Apart from the obvious benefit of saving enormous time and cost, it has deeper implications for enhancing communication and making it more intimate.