An algorithm for voice segregation




Texas Tech University


The ability to focus on one voice or sound of interest, the target voice, in the presence of one, several or many interfering sounds is known as the cocktail party effect. Many hearing disabilities limit the ability of the patient to isolate the target voice. Voice segregation is the isolation of the target voice by artificial/computational means. Of the many computational automatic voice segregation (AVS) methods available, spectral subtraction (SS) is one whose simplicity and efficiency make it practical to use for a hearing aid. In SS, the spectrum of an estimate of the interfering sounds is subtracted from the spectrum of the input signal to obtain an estimate of the target voice. A drawback of SS is that it can only isolate the target voice effectively when the interfering sounds are stationary. However, in many everyday situations, e.g., at a cafeteria or in the vicinity of ventilation fans, the interfering sounds are stationary.
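The subtraction step described above can be illustrated with a minimal sketch for a single frame. It is not the thesis's MATLAB implementation: the function names (`dft`, `idft`, `spectral_subtract`), the `floor` parameter, the use of magnitude (rather than power) subtraction, and the reuse of the noisy phase are all assumptions made here for illustration, and a naive DFT stands in for an FFT to keep the example self-contained.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); an FFT would be used in practice."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude spectrum from one frame.

    frame     -- time-domain samples of the noisy input frame
    noise_mag -- estimated magnitude of the interfering sounds, one value per bin
    floor     -- hypothetical spectral floor, as a fraction of the noisy magnitude
    """
    cleaned = []
    for bin_value, n_mag in zip(dft(frame), noise_mag):
        mag = abs(bin_value)
        # Clamp so that the subtracted magnitude never becomes negative.
        clean_mag = max(mag - n_mag, floor * mag)
        # Assumption: the phase of the noisy input is kept unchanged.
        cleaned.append(cmath.rect(clean_mag, cmath.phase(bin_value)))
    return idft(cleaned)
```

In this sketch, `noise_mag` would be taken from frames in which the target voice is judged to be absent, which is why the quality of the estimate depends on the interfering sounds being stationary.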

This thesis presents the design, implementation and assessment of a quasi real-time, single-channel, frame-by-frame SS-based AVS algorithm that could be used for a hearing aid. Two methods of voice activity detection (VAD) for the algorithm are assessed in this document: energy/zero-crossing-rate VAD (EZVAD) and entropy VAD (EVAD). Since SS can only estimate the spectrum of the interfering sounds during breaks in the target voice, VAD is necessary to detect periods in the input signal where the target voice may be absent. A computationally simple fundamental frequency estimator (FoE) that tracks the gradient of the fundamental frequency, f0, of each frame is also designed and tested. The function of the FoE is to facilitate voice segregation when the interfering sounds are pitched sounds.

Tests on a MATLAB implementation of the design showed that SS performs well provided the interfering sounds are stationary. A problem is the persistence of "musical noise" in the output of the system. Techniques that significantly reduce musical noise can only be implemented on non-real-time SS systems. Also, EVAD was found to be feasible only when the system is non-real-time. The FoE was able to track the target voice of an utterance that was recorded in isolation, but only under certain constraints that are not practical in everyday situations. Hence, the final design comprised just an SS algorithm and an EZVAD. The conclusion drawn by this thesis is that a simple SS-based AVS algorithm that uses EZVAD can significantly reduce near-random stationary interfering sounds, as well as interfering sounds that consist of pure tones.



Hearing aids -- Design and construction, Speech processing systems, Noise -- Measurement -- Mathematical models