In the book The Age of Spiritual Machines: When Computers Exceed Human Intelligence, author, philosopher, and inventor Ray Kurzweil details his theory on the evolution of computer technology, in which computers eventually surpass human intelligence and become self-aware (the popular movie The Matrix later used a similar storyline).
Kurzweil is also known for inventing some of the first speech-to-text software. He eventually sold that software to the company Lernout & Hauspie, which entered into a partnership with Microsoft in 1997. Kurzweil's original idea was based on his experiences while on an airplane with a blind man, when he realized the benefit of having books translated to audio for blind people. Today speech recognition has blossomed into an entire subsection of the software industry.
Speech recognition software has come a long way since the 1990s, but it is only recently that the largest software vendor in the universe has made a legitimate entry into this marketplace. Microsoft has now strategically positioned its Microsoft Speech Server 2004 to take advantage of speech recognition technologies and put them to use for businesses.
I'm going to outline the significant features that Speech Server offers, discuss some of the hardware and software requirements for running it, present two real-world examples of where Speech Server has been implemented, and provide a look at the pricing of the product.
Core features of Speech Server
Microsoft Speech Server (MSS) is designed to translate text to speech and speech to text. These are some of the key features of Speech Server:
- Uses Microsoft Speech Application SDK (SASDK). This is arguably the most important part of the Speech Server feature set: the SDK gives developers the tools to build their own speech applications, and that is what sets Speech Server apart.
- Uses industry standards. Speech Application Language Tags (SALT) is designed to extend existing Web markup languages (like HTML) by adding speech recognition and prompt functionality to those Web applications.
- Runs on Windows Server 2003, but can be accessed by PCs, cell phones, telephones, Tablet PCs, Pocket PCs, and other types of clients.
- Integrates telephony, the Web, and speech into one technology in order to deploy full-featured voice-only and multimodal applications—all tightly woven together. Speech Server also enables the convergence of your existing telephony and Web infrastructures in order to support a unified speech and telephony application model.
- Designed for mid-size to enterprise customers.
- Provides support for both touch-tone (DTMF) and speech-enabled applications.
- Offers interoperability with your existing telephony infrastructures.
- Allows for third-party integration and extensibility.
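To make the touch-tone and speech bullet concrete, here is a minimal sketch of the idea in Python. It is not Speech Server or SASDK code: the `CallerInput` class, the `MENU` table, and the `resolve_option` function are all hypothetical names invented for illustration. It only shows how an application might treat a pressed key and a recognized phrase as two routes to the same menu option.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallerInput:
    """One piece of caller input, from either the keypad or the recognizer."""
    mode: str   # "dtmf" or "speech"
    value: str  # the digit pressed, or the recognized text


# Hypothetical menu: each option lists the key and the phrases that select it.
MENU = {
    "billing": {"key": "1", "phrases": {"billing", "my bill"}},
    "support": {"key": "2", "phrases": {"support", "technical support"}},
    "operator": {"key": "0", "phrases": {"operator", "agent"}},
}


def resolve_option(caller_input: CallerInput) -> Optional[str]:
    """Map DTMF or speech input to the same menu option, or None if no match."""
    if caller_input.mode == "dtmf":
        for option, spec in MENU.items():
            if spec["key"] == caller_input.value:
                return option
        return None
    if caller_input.mode == "speech":
        # Normalize the recognized text before comparing it to known phrases.
        text = caller_input.value.strip().lower()
        for option, spec in MENU.items():
            if text in spec["phrases"]:
                return option
        return None
    raise ValueError(f"unknown input mode: {caller_input.mode!r}")
```

Under these assumptions, `resolve_option(CallerInput("dtmf", "2"))` and `resolve_option(CallerInput("speech", "technical support"))` both lead to the same `"support"` branch, so the rest of the application does not need to know how the caller answered.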
All software has minimum requirements and they are generally fairly palatable; however, because of the additional audio demands that MSS must deal with, the hardware requirements are slightly higher than the norm. The hardware and software minimum requirements (based on Microsoft's recommendations) are listed in Table A and Table B.
Table A: Minimum hardware requirements
|Component|Minimum requirement|
|---|---|
|CPU|1 GHz or greater|
|Hard drive|660 MB of available space, plus log file storage space|
|Video|Windows Server 2003-compatible video adapter capable of 800x600 resolution|
|Other hardware|Mouse, telephony card|

Table B: Minimum software requirements
|Component|Minimum requirement|
|---|---|
|Operating system(s)|Windows Server 2003 Standard or Enterprise Edition|
|Other software|.NET Framework, Microsoft Enterprise Instrumentation|
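If you are sizing up several candidate servers, the listed minimums are easy to turn into a quick checklist. The Python sketch below is only an illustration: the `ServerSpec` class and `unmet_requirements` function are hypothetical, the values are typed in by hand rather than detected from the machine, and the operating-system strings are plain labels, not anything reported by Windows.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Minimums copied from the requirements listed above.
MIN_CPU_GHZ = 1.0
MIN_FREE_DISK_MB = 660  # log file storage space comes on top of this
MIN_RESOLUTION = (800, 600)
# Illustrative labels only; not values returned by any Windows API.
SUPPORTED_OS = {
    "Windows Server 2003 Standard",
    "Windows Server 2003 Enterprise",
}


@dataclass
class ServerSpec:
    """Hand-entered description of a candidate server (hypothetical type)."""
    cpu_ghz: float
    free_disk_mb: int
    resolution: Tuple[int, int]
    operating_system: str


def unmet_requirements(spec: ServerSpec) -> List[str]:
    """Return a description of every listed minimum the server fails to meet."""
    problems = []
    if spec.cpu_ghz < MIN_CPU_GHZ:
        problems.append("CPU slower than 1 GHz")
    if spec.free_disk_mb < MIN_FREE_DISK_MB:
        problems.append("less than 660 MB of free disk space")
    width, height = spec.resolution
    if width < MIN_RESOLUTION[0] or height < MIN_RESOLUTION[1]:
        problems.append("video resolution below 800x600")
    if spec.operating_system not in SUPPORTED_OS:
        problems.append("unsupported operating system")
    return problems
```

An empty list means the server clears every minimum listed here. The sketch does not cover the mouse or telephony card, or the .NET Framework and Microsoft Enterprise Instrumentation prerequisites, which still need to be confirmed separately.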
To see how speech recognition technology is being used in the real world, I thought we should take a look at some examples of where Microsoft Speech Server 2004 is already being utilized. Keep in mind that the following examples are two of the most prominent ones that Microsoft touts on its Web site, and so they are probably two of the more trouble-free implementations of Speech Server that have been achieved.
JetBlue
Microsoft had previously established strategic relationships with New York-based airline JetBlue in areas such as beta rollouts of Windows Server 2003 and the .NET Framework, so it made sense for the two companies to extend those successes into a Speech Server partnership. Like most other companies, JetBlue was very interested in making its operations more efficient. JetBlue markets itself as a "tech-savvy" organization and states that 75 percent of its bookings occur online.
However, JetBlue still has a call center that fields requests for tickets. These requests come from both internal callers (JetBlue employees and partners) and external customers. It was the handling of the internal, non-revenue callers that JetBlue wanted to make as efficient as possible. By implementing Speech Server to handle all non-revenue calls, JetBlue freed its call center personnel to handle external (revenue-generating) ticketing requests more quickly, which improved response time for those external customers. JetBlue has indicated that this success, by reducing costs significantly, is critical to the company's future. JetBlue currently plans to expand its use of Speech Server to include options for checking in to flights and sending flight alerts and updates to customers.
New York City Department of Education
When New York City, perhaps the most high-profile city in the world, does almost anything, it's a big deal, and software is no exception. Heralded as the largest school system in the nation, the New York City Department of Education claims enrollment of upwards of 1.2 million children. Based on the theory that kids can be more successful when parents become more involved in their education, the political leaders of New York decided to invest in Speech Server technology so that parents can call a number and use voice commands to verify their children's attendance and grades, and even check the lunch menu. This information (which is also available on the NYDOE Web site) is now available by phone 24 hours a day and does not require intervention from guidance counselors, receptionists, or other staff members every time a parent requests it.
How to get MSS
At the time of publication, Microsoft is offering a "deal" called the Speech Starter Kit, which allows you to purchase the appropriate telephony components, the software, and an evaluation version of Microsoft Speech Server 2004. Unfortunately, this offer is currently available only from Intel distributors and their channel members.
Revolutionary is one word that could be used to describe the significance of Speech Server's entry into this previously niche marketplace. With Speech Server, Microsoft has the opportunity to bring speech recognition software into the mainstream and to accomplish what many science fiction writers could only dream about. It is too early to determine whether speech recognition is ready for the big time, but Speech Server has been much awaited and should spark considerable interest in the IT arena.
Jeremy L. Smith, CISSP, is a cybersecurity and public safety professional who has worked with a variety of agencies to improve the security of their call centers and execute their public safety initiatives more effectively, including 911 call taking, cyber security, mass notification, and more. As the former chair of the NENA Security Working Group, he helped lead the development and creation of the public safety industry's first cyber security standards, NG-SEC. He is currently the general manager of the Mass Notification Division of Airbus DS Communications, a leader in the public safety market.