Speaker diarization is a topical problem and is especially useful as a preprocessing step for conversational speech applications. The objective of this article is two-fold: (i) to initialize segments so that speaker information is distributed uniformly across them, and (ii) to incorporate speaker discriminative features within the unsupervised diarization framework. In the first part of the work, we propose a varying-length segment initialization technique for an Information Bottleneck (IB) based speaker diarization system that uses phoneme rate as side information. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB based clustering. In the second part, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker discriminative features during diarization and improves over the baseline IB based system. In the first pass of the TPIB system, a coarse segmentation is performed using IB based clustering. The resulting alignments are used to generate speaker discriminative features with a shallow feed-forward neural network and linear discriminant analysis. These features are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This combines the advantages of better segment initialization and speaker discriminative features, and yields an additional improvement in performance. An evaluation on standard meeting datasets shows significant absolute improvements of 3.9% and 4.7% on the NIST and AMI datasets, respectively.
Given an audio signal, speaker diarization involves answering the question of “Who spoke When?” [1]. A speaker diarization system annotates audio with relative speaker labels. The task involves estimating the number of speakers and assigning speech segments to different speakers. Speaker diarization has been used in various domains, such as telephone conversations, broadcast news, and meetings [1]. Diarization of conversational audio meetings is considered to be a challenging task owing to the spontaneity in the conversation. Diarization systems are often used as front-ends in applications that include automatic speech recognition, spoken keyword spotting, and speaker recognition [2].
The Information Bottleneck (IB) based approach to speaker diarization has shown competitive performance on meeting recordings [3]–[5]. Owing to its non-parametric nature, IB based diarization has a very low Real Time Factor (RTF) [3], [6]. RTF is the ratio of processing time to the duration of the input audio, that is, the time a system takes to process 1 second of speech data. Since diarization systems are mostly used as a pre-processing stage in conversational speech applications, a low RTF is desirable.
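As a rough illustration of the RTF definition above, the following minimal sketch divides the measured processing time by the audio duration. The function names `real_time_factor` and `measure_rtf`, and the `process` callable standing in for a diarization system, are hypothetical and are not part of the systems cited here.

```python
import time
from typing import Any, Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process: Callable[[Any], Any], audio: Any, audio_seconds: float) -> float:
    """Time one call of a (hypothetical) diarization function and return its RTF."""
    start = time.perf_counter()
    process(audio)  # placeholder for the actual diarization system
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

For example, a system that needs 30 seconds to process a 10-minute (600-second) recording has an RTF of 30 / 600 = 0.05. An RTF below 1 means the system runs faster than real time.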