A Supervised Factorial Acoustic Model for Simultaneous Multiparticipant Vocal Activity Detection in Close-Talk Microphone Recordings of Meetings
Source: Carnegie Mellon University
The authors have implemented a supervised acoustic model (AM) for vocal activity detection (VAD) in conversations with an arbitrary number of participants, and analyzed its performance relative to an unsupervised AM baseline. The analysis explored several parameters. Two of them are explicitly intended to limit the deleterious effect of crosstalk: the inclusion of NLED features, and decoding constraints on the maximum allowed number of simultaneously vocalizing participants. The other parameters analyzed were the number of Gaussians per mixture and the frame step. The authors' findings show that the unsupervised AM baseline outperforms a supervised AM system that uses standard MFCC front-end features, but that this ordering is reversed when NLED features are included.
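To make the decoding constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each participant is either vocalizing (1) or not (0) in a given frame, so that the unconstrained joint state space over N participants has 2^N states, and it enumerates only the joint states in which at most a chosen number of participants vocalize at once. The function name `constrained_joint_states` and its parameters are hypothetical and introduced here for illustration only.

```python
from itertools import combinations


def constrained_joint_states(num_participants, max_simultaneous):
    """Enumerate joint on/off states with at most `max_simultaneous` ones.

    Hypothetical helper for illustration; each state is a tuple with one
    entry per participant (1 = vocalizing, 0 = not vocalizing).
    """
    states = []
    for k in range(max_simultaneous + 1):
        # Choose which k participants are vocalizing in this joint state.
        for active in combinations(range(num_participants), k):
            active_set = set(active)
            state = tuple(1 if i in active_set else 0
                          for i in range(num_participants))
            states.append(state)
    return states


if __name__ == "__main__":
    # With 4 participants and at most 2 simultaneous talkers,
    # 1 + 4 + 6 = 11 of the 16 unconstrained joint states remain.
    print(len(constrained_joint_states(4, 2)))
```

Under this assumption, a tighter limit shrinks the set of joint states a decoder would need to consider, and it rules out hypotheses in which many participants are judged to vocalize at once, which is the kind of error crosstalk tends to produce.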