A noise reduction system based on hybrid noise estimation technique and post-filtering in arbitrary noise environments


Junfeng, Li; Masato, Akagi

Speech Communication 48(2): 111-126

2006


School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Nomigun, Ishikawa 923-1292, Japan

Received 10 August 2004; received in revised form 21 June 2005; accepted 22 June 2005
Abstract
In this paper, we propose a novel noise reduction system, using a hybrid noise estimation technique and post-filtering, to suppress both localized noises and non-localized noise simultaneously in arbitrary noise environments. To estimate localized noises, we present a hybrid noise estimation technique which combines a multi-channel estimation approach we previously proposed and a soft-decision single-channel estimation approach. The final estimation accuracy for localized noises is significantly improved by incorporating a robust and accurate speech absence probability (RA-SAP) estimator, which considers the strong correlation of SAPs between adjacent frequency bins and consecutive frames and makes full use of the high estimation accuracy of the multi-channel approach. The estimated spectra of the localized noises are subtracted from those of the noisy observations by spectral subtraction. Non-localized noise is further reduced by a multi-channel post-filter based on the optimally modified log-spectral amplitude (OM-LSA) estimator. With the assumption of a diffuse noise field, we propose an estimator for the a priori SAP based on the coherence characteristic of the noise field at the spectral subtraction output (high coherence at low frequencies and low coherence at high frequencies), improving the spectral enhancement of the desired speech signal. Experimental results demonstrate the effectiveness and superiority of the proposed noise estimation/reduction methods in terms of objective and subjective measures in various noise conditions.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Hybrid noise estimation; Post-filtering; Coherence function; Speech presence uncertainty
1. Introduction

* Corresponding author. Tel.: +81 761 51 1236; fax: +81 761 51 1149. E-mail addresses: junfeng@jaist.ac.jp (J. Li), akagi@jaist.ac.jp (M. Akagi).
doi:10.1016/j.specom.2005.06.013

In recent years, noise reduction has been in great demand for an increasing number of speech applications, such as automatic speech recognition (ASR) systems and cellular telephony. Although ASR systems have achieved high recognition accuracy in laboratory environments, their performance degrades seriously in real-world environments due to various kinds of noise (Rabiner and Juang, 1993). Adverse environments also deteriorate the quality of speech transmitted in speech communication systems (Elko, 1996).
One direct solution to this problem is to use a headset or hand-held equipment; however, this is inconvenient for users. One potential solution is to construct a noise reduction system as a front-end processor for these systems.
So far, a variety of noise reduction algorithms have been proposed in the literature (Akagi and Kago, 2002; Boll, 1979; Bitzer and Simmer, 2001; Frost, 1972; Griffiths and Jim, 1982; Elko, 1996; Gannot et al., 2001). Generally speaking, all of these algorithms can be classified into two categories, single-channel techniques and multi-channel techniques, according to the number of sensors they need. Compared to the single-channel technique, the multi-channel technique is substantially superior in reducing noise and enhancing speech, due to its spatial filtering capability of suppressing interfering signals arriving from directions other than the specified look-direction (Bitzer and Simmer, 2001). Therefore, multi-channel noise reduction approaches have attracted increasing research interest.
The linearly constrained adaptive beamformer, first presented by Frost, keeps the signals arriving from the desired look-direction distortionless while suppressing the signals from other directions by minimizing the output power of the beamformer (Frost, 1972). The generalized sidelobe canceller (GSC) beamformer, an alternative implementation of the Frost beamformer, has been widely researched (Griffiths and Jim, 1982). In the Frost and GSC beamformers, adaptive signal processing is normally used to avoid cancellation of the desired speech signal (Frost, 1972; Griffiths and Jim, 1982). The problem with these algorithms is that adaptive signal processing decreases the stability of the noise reduction system under practical conditions. A small-scale subtractive beamformer-based noise reduction algorithm has recently been proposed (Akagi and Mizumachi, 1997; Mizumachi and Akagi, 1999; Akagi and Kago, 2002). Its strengths are that no adaptive signal processing is adopted and that it performs well in reducing sudden noise; its weakness lies in the assumption that only localized noises exist in the environment.
Among multi-channel noise reduction systems, post-filtering is normally needed to improve the overall performance in practical applications (Simmer et al., 2001). A multi-channel post-filter was first presented by Zelinski under the assumption of zero cross-correlation between the noise signals on different microphones (Zelinski, 1988). Recently, it has been extended to a generalized expression based on a priori knowledge of the noise field (McCowan and Bourlard, 2003). Moreover, Bitzer et al. showed that neither the GSC nor the Wiener post-filter can work well at low frequencies in a diffuse noise field (Bitzer et al., 1999). An alternative solution, proposed by Meyer and Simmer (1997), applies spectral subtraction at low frequencies and a Wiener filter at high frequencies at the beamformer output. The two main drawbacks of these post-filtering approaches lie in their inability to suppress spatially correlated noises, and in the required voice activity detector (VAD) or a priori knowledge of the noise field.
In this paper, we propose a novel noise reduction system to deal with the problem of suppressing noises in arbitrary environments. A generalized signal model, consisting of localized noises and non-localized noise, is first introduced. To estimate localized noises, we present a hybrid noise estimation technique which effectively combines the multi-channel estimation approach we previously proposed (Akagi and Kago, 2002) and a soft-decision single-channel approach. A robust and accurate speech absence probability (RA-SAP) estimator is then developed which makes full use of the high estimation accuracy of the multi-channel estimation approach. By incorporating this RA-SAP estimator, the hybrid estimation technique produces much more accurate estimates for the localized noises, which are then reduced by spectral subtraction. Non-localized noise is reduced by a post-filter based on the OM-LSA estimator. Under the assumption of a diffuse noise model for non-localized noise, we propose an estimator for the a priori SAP based on the coherence characteristics of the noise field at the spectral subtraction output, further improving the noise reduction ability of the post-filtering. The performance of the proposed noise estimation/reduction methods is evaluated and is shown to yield significant improvements over comparative methods in various noise environments.
The remainder of this paper is organized as follows. In Section 2, a generalized signal model is introduced along with an overview of the proposed noise reduction system. In Section 3, by using the multi- and single-channel approaches and a RA-SAP estimator, the hybrid estimation technique is proposed, which gives more accurate spectral estimates for the localized noises; the estimates are then subtracted from the noisy observations. In Section 4, non-localized noise is reduced by a post-filter based on the OM-LSA estimator, whose performance is further improved by a new estimator for the a priori SAP. The superiority of the proposed noise estimation/reduction methods is verified in various noise conditions in Section 5. Finally, some conclusions are drawn in Section 6.
2. An overview of the proposed noise reduction system

In this section, a generalized signal model is introduced and an overview of the proposed noise reduction system is given.
Consider a microphone array with three linearly and equidistantly distributed omni-directional microphones in a noisy environment, as shown in Fig. 1. A generalized signal model is assumed in which the observed signals consist of three components. The first is the desired speech signal s(t), arriving from a direction such that the difference in arrival time between the two edge microphones is 2τ. The second is the localized noise signals n_k^c(t), arriving from directions such that the time differences are 2δ_k (k = 1, 2, ..., K), and the third is the non-localized noise signal n^{nc}(t), modelled as diffuse noise propagating in all directions simultaneously. Thus, the observed signals impinging on the three microphones (left, center and right), denoted by l(t), c(t) and r(t), can be given by:

l(t) = s(t − τ) + Σ_{k=1}^{K} n_k^c(t − δ_k) + n_l^{nc}(t),   (1)
c(t) = s(t) + Σ_{k=1}^{K} n_k^c(t) + n_c^{nc}(t),   (2)
r(t) = s(t + τ) + Σ_{k=1}^{K} n_k^c(t + δ_k) + n_r^{nc}(t).   (3)

Fig. 1. Relationship between microphone array and acoustic signals.
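As a rough illustration of Eqs. (1)-(3), the following sketch builds the three observed signals with integer-sample delays (the function names are ours, not the paper's; real arrays need fractional-delay filtering):

```python
def delayed(x, d, n):
    """x shifted by d samples (positive d = delay), zero-padded to length n."""
    return [x[i - d] if 0 <= i - d < len(x) else 0.0 for i in range(n)]

def observe(s, loc_noises, deltas, diffuse, tau):
    """Toy version of Eqs. (1)-(3): left/center/right observations from the
    desired speech s, localized noises with per-microphone delays +/-delta_k,
    and per-microphone diffuse noise components."""
    n = len(s)
    l = delayed(s, tau, n)       # s(t - tau) at the left microphone
    c = list(s)                  # s(t) at the center microphone
    r = delayed(s, -tau, n)      # s(t + tau) at the right microphone
    for nk, dk in zip(loc_noises, deltas):
        nl, nr_ = delayed(nk, dk, n), delayed(nk, -dk, n)
        for i in range(n):
            l[i] += nl[i]; c[i] += nk[i]; r[i] += nr_[i]
    for i in range(n):
        l[i] += diffuse[0][i]; c[i] += diffuse[1][i]; r[i] += diffuse[2][i]
    return l, c, r
```

With zero noise, the three outputs are simply time-shifted copies of s(t), which is what the time delay compensation module of Section 2 undoes.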
Based on this generalized signal model, the intention of our work is to reduce both the localized and the non-localized noises while keeping the desired signal distortionless. To implement this idea, a noise reduction system is constructed as shown in Fig. 2, which mainly consists of the following modules:
Time delay compensation: This module compensates for the effect of propagation between the speech source and the microphones on the desired speech signal. It is assumed that the output signals of this module are perfectly time-aligned; for notational simplicity, they are represented by the same symbols, with τ = 0 in Eqs. (1)-(3).
Spectral analysis: The time-aligned signals are analyzed by the short-time Fourier transform (STFT), yielding the amplitude spectra and phase spectra.
Localized noise suppression: To suppress the localized noise signals, their spectra are first estimated by a hybrid noise estimation approach which effectively combines a multi-channel technique and a single-channel technique in a parallel structure. Moreover, we develop a RA-SAP estimator by considering the correlation of SAPs and making full use of the high estimation accuracy of the multi-channel estimation approach, enhancing the final estimation accuracy for the localized noises. The estimated localized noises are then subtracted from the noisy observations by spectral subtraction.

Fig. 2. Block diagram of the proposed noise reduction system.
Non-localized noise suppression: To further suppress the non-localized diffuse noise, a multi-channel post-filter based on the OM-LSA estimator is adopted. Moreover, we present an estimator for the a priori SAP based on the coherence characteristic of the noise field at the spectral subtraction output (high coherence at low frequencies and low coherence at high frequencies), enhancing the noise reduction performance of this post-filter.

Spectral synthesis: The enhanced speech signal is synthesized by using the inverse STFT and the overlap-and-add technique.
The two main modules mentioned above, localized noise suppression and non-localized noise suppression, will be discussed in detail in Sections 3 and 4.
3. Localized noise suppression using a hybrid noise estimation technique and spectral subtraction
In this section, we focus on suppressing the localized noise components. To do this, we first propose a high-performance hybrid estimation technique to obtain accurate spectral estimates of the localized noises, which are then reduced by spectral subtraction.
3.1. A hybrid noise estimation technique

In this subsection, a hybrid estimation technique, combining the multi-channel and single-channel estimation approaches, is presented, and its performance is further enhanced by integrating a RA-SAP estimator.
3.1.1. The multi-channel noise estimation approach

Based on the generalized signal model, the multi-channel noise estimation approach we previously proposed (Akagi and Mizumachi, 1997; Akagi and Kago, 2002) can be reformulated as follows. The time-aligned signals l(t), c(t) and r(t) are shifted by ±τ in the time domain (τ ≠ 0), and two subtractive beamformers in the time domain are constructed as (Akagi and Mizumachi, 1997):

g_lr(t) = (1/4){[l(t + τ) − l(t − τ)] − [r(t + τ) − r(t − τ)]},   (4)
g_cr(t) = (1/4){[c(t + τ) − c(t − τ)] − [r(t + τ) − r(t − τ)]}.   (5)
In order to simplify the implementation, the differences of the non-localized noises at the different microphones are assumed to be small enough to be ignored. The spectra of the localized noises can then be estimated from the outputs of the beamformers, represented in the time-frequency domain (with τ = δ in Eq. (4) and τ = δ/2 in Eq. (5)) as:

N̂_m^c(λ,ω) = |G_lr(λ,ω)| / sin²(ωδ),              if sin²(ωδ) > ε₁,
             |G_cr(λ,ω) e^{−jωδ/2}| / sin²(ωδ/2),  if sin²(ωδ) ≤ ε₁ and sin²(ωδ/2) > ε₂,
             |G_cr(λ,ω)| / ε₂,                     otherwise,   (6)

where (i) N̂_m^c(λ,ω) indicates the spectral estimate of the localized noise by this multi-channel approach in the λth frame and ωth frequency bin; (ii) G_lr(λ,ω) and G_cr(λ,ω) are the STFTs of g_lr(t) and g_cr(t); (iii) ε₁ and ε₂ are two threshold values determined experimentally; (iv) δ represents the virtual direction of arrival (DOA) of the integrated localized noise signal, which is computed from the outputs of the two beamformers by a cross-correlation-based DOA estimation algorithm (Mizumachi and Akagi, 1999). It should be noted that the multi-channel approach produces highly accurate spectral estimates for the localized noises through this analytical estimation scheme.
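A minimal sketch of the time-domain beamformers of Eqs. (4)-(5) and the selection rule of Eq. (6) could look as follows (our own naming; the ε₁ = 0.5, ε₂ = 0.1 defaults are the values from Table 1):

```python
import math

def subtractive_beamformer(x, y, tau):
    """Eqs. (4)-(5): g(t) = (1/4){[x(t+tau) - x(t-tau)] - [y(t+tau) - y(t-tau)]},
    with integer-sample shifts and zero padding at the edges."""
    n = len(x)
    at = lambda z, i: z[i] if 0 <= i < n else 0.0
    return [0.25 * ((at(x, i + tau) - at(x, i - tau))
                    - (at(y, i + tau) - at(y, i - tau))) for i in range(n)]

def localized_noise_estimate(g_lr_mag, g_cr_mag, omega, delta, eps1=0.5, eps2=0.1):
    """Eq. (6): pick the beamformer output whose sin^2 factor is reliable."""
    s1 = math.sin(omega * delta) ** 2
    s2 = math.sin(omega * delta / 2.0) ** 2
    if s1 > eps1:
        return g_lr_mag / s1
    if s2 > eps2:
        return g_cr_mag / s2
    return g_cr_mag / eps2
```

Note that identical inputs cancel exactly, which is how the desired (time-aligned) speech drops out of both beamformer outputs.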
3.1.2. Proposed hybrid noise estimation technique

The multi-channel noise estimation approach performs well in most cases. However, when the condition ωδ = 2kπ holds, the multi-channel approach fails to estimate the noises, since the beamformers do not output any signal. This phenomenon corresponds to the grating sidelobes of the subtractive beamformers with small-size arrays.
To deal with this problem, we propose a hybrid noise estimation technique in which a single-channel estimation approach is employed when the multi-channel estimation approach fails. In this hybrid estimation technique, the values of sin²(ωδ) and sin²(ωδ/2) in Eq. (6) determine whether the output of the single-channel approach or that of the multi-channel approach should be the final output of the hybrid structure. When the maximum of sin²(ωδ) and sin²(ωδ/2) is larger than a threshold ε (an empirical constant), the output of the multi-channel approach is more accurate and preferable as the final output; otherwise, the output of the single-channel approach is. Thus, the final spectral estimates of the localized noises by the proposed hybrid estimation technique can be given by (Li and Akagi, 2004):

N̂^c(λ,ω) = N̂_m^c(λ,ω),  if max(sin²(ωδ), sin²(ωδ/2)) > ε,
           N̂_s^c(λ,ω),  otherwise,   (7)
where N̂_m^c(λ,ω) and N̂_s^c(λ,ω) represent the spectral estimates of the localized noises by the multi-channel technique, given by Eq. (6), and by the single-channel technique detailed in the following subsection. It is of interest to note that the hybrid estimation technique is expected to cope both with the inherent grating sidelobes of the subtractive beamformer with small-size arrays and with the inability of the single-channel approach to estimate highly non-stationary noises.
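The selection rule of Eq. (7) is a one-line switch; a hedged sketch (function name ours, ε = 0.1 taken from Table 1):

```python
import math

def hybrid_estimate(n_multi, n_single, omega, delta, eps=0.1):
    """Eq. (7): use the multi-channel estimate unless the subtractive
    beamformers sit near a grating sidelobe (omega * delta close to 2*k*pi),
    in which case fall back to the single-channel estimate."""
    if max(math.sin(omega * delta) ** 2, math.sin(omega * delta / 2.0) ** 2) > eps:
        return n_multi
    return n_single
```

At ωδ = 2π both sine factors vanish, so the fallback fires exactly in the grating-sidelobe case described above.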
3.1.3. The single-channel noise estimation approach

The spectra of the localized noises, N̂_s^c(λ,ω) in Eq. (7), must also be computed by a single-channel estimation approach. In this work, a soft-decision single-channel approach is adopted (Cohen and Berdugo, 2001). Under speech presence uncertainty, this method updates the noise spectral estimates during speech pauses, and holds the estimates obtained in the previous pauses during speech-active periods. However, the spectral estimation capability of this single-channel approach depends significantly on the success or failure of the SAP estimator (Cohen and Berdugo, 2001). Its performance can therefore be enhanced by integrating a RA-SAP estimator.
3.1.4. Further enhancing the hybrid estimation technique with a RA-SAP estimator

In this subsection, we further enhance the proposed hybrid noise estimation technique by integrating a RA-SAP estimator. Considering the strong correlation of SAPs between adjacent frequency bins and consecutive frames, and making full use of the high estimation accuracy of the multi-channel approach, a RA-SAP estimator is developed, improving the performance of the hybrid estimation technique by combining the multi-channel and single-channel approaches in an effective way.
Under the assumption of a complex Gaussian statistical model, and applying the Bayes rule and the total probability theorem, the SAP, which is the conditional probability of the speech-absent state given the noisy observations (denoted by q(λ,ω) for notational simplicity), can be given by (Ephraim and Malah, 1984):

q(λ,ω) = {1 + [(1 − q′(λ,ω)) / q′(λ,ω)] · [1 / (1 + ξ(λ,ω))] · exp(γ(λ,ω)ξ(λ,ω) / (1 + ξ(λ,ω)))}^{−1},   (8)

where (i) q′(λ,ω) is the a priori SAP; (ii) ξ(λ,ω) = Γ_s(λ,ω)/Γ_n(λ,ω) and γ(λ,ω) = |C(λ,ω)|²/Γ_n(λ,ω) are the a priori SNR and the a posteriori SNR, as named in (Ephraim and Malah, 1984), and Γ_s(λ,ω) represents the variance of the speech signal. Eq. (8) demonstrates that, for a given a priori SAP q′(λ,ω), the SAP q(λ,ω) depends on the a priori SNR ξ(λ,ω) and the a posteriori SNR γ(λ,ω). Accurate and robust SAP estimates can be obtained only when ξ(λ,ω) and γ(λ,ω) are sufficiently accurate and robust. Consequently, we now turn to the issue of improving the accuracy and robustness of the a priori SNR ξ(λ,ω) and a posteriori SNR γ(λ,ω) estimates.
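For one time-frequency bin, Eq. (8) reduces to a few arithmetic operations; a sketch under our reconstruction of the formula:

```python
import math

def speech_absence_prob(q_prior, xi, gamma):
    """Eq. (8): speech-absence probability from the a priori SAP q',
    the a priori SNR xi and the a posteriori SNR gamma."""
    ratio = ((1.0 - q_prior) / q_prior) * math.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
    return 1.0 / (1.0 + ratio)
```

With ξ = 0 the exponent vanishes and q collapses to the a priori SAP, while large ξ and γ drive q toward zero (speech almost surely present), matching the qualitative behaviour described above.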
Taking into account the strong correlation of SAPs in adjacent frequency bins and consecutive frames, we calculate the a priori SNR ξ(λ,ω) and the a posteriori SNR γ(λ,ω) in a time-frequency smoothing manner. In the frequency domain, the estimates of ξ(λ,ω) and γ(λ,ω) are smoothed by applying a normalized window b of size 2D + 1:

ξ̃(λ,ω) = Σ_{k=ω−D}^{ω+D} b(k) ξ(λ,k),   (9)
γ̃(λ,ω) = Σ_{k=ω−D}^{ω+D} b(k) γ(λ,k).   (10)
The estimation accuracy of ξ(λ,ω) and γ(λ,ω) is improved because the noise spectra in adjacent frequencies are likely to be estimated by the multi-channel estimation approach with high accuracy. Furthermore, this frequency-smoothing procedure eliminates fluctuations of the a priori SNR ξ(λ,ω) and a posteriori SNR γ(λ,ω) estimates along the frequency axis of the time-frequency plane, which results in more robust SNR estimates.
In the time domain, the frequency-smoothed estimates of the a priori SNR ξ̃(λ,ω) and a posteriori SNR γ̃(λ,ω) are further processed based on the previous values:

ξ̂(λ,ω) = α |Ŝ(λ−1,ω)|² / Γ_n(λ−1,ω) + (1 − α) max[γ̃(λ,ω) − 1, 0],   (11)
γ̂(λ,ω) = γ̃(λ,ω),   (12)

where α (0 < α < 1) is a forgetting factor and Ŝ(λ−1,ω) is the enhanced speech signal in the previous frame at the spectral subtraction output. Eq. (11) is simply the decision-directed scheme detailed in (Ephraim and Malah, 1984). It should be noted that no smoothing operation in the time domain is carried out for the a posteriori SNR, since it should be calculated from the current observations, independent of the previous observations.
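A compact sketch of the time-frequency smoothing of Eqs. (9)-(11) (our own naming; α = 0.98 is the Table 1 value, and bins outside the spectrum are treated as zero):

```python
def smooth_freq(values, window):
    """Eqs. (9)-(10): smooth per-bin SNR estimates with a normalized
    window of size 2D+1, sliding along the frequency axis."""
    D = len(window) // 2
    n = len(values)
    out = []
    for w in range(n):
        acc = 0.0
        for j, b in enumerate(window):
            k = w - D + j
            if 0 <= k < n:
                acc += b * values[k]
        out.append(acc)
    return out

def decision_directed(prev_speech_pow, prev_noise_var, gamma_s, alpha=0.98):
    """Eq. (11): decision-directed a priori SNR from the previous enhanced
    frame and the frequency-smoothed a posteriori SNR."""
    return [alpha * s / v + (1.0 - alpha) * max(g - 1.0, 0.0)
            for s, v, g in zip(prev_speech_pow, prev_noise_var, gamma_s)]
```

Per Eq. (12), the a posteriori SNR is passed on with frequency smoothing only, so no time-domain counterpart is needed for it.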
By substituting the time-frequency smoothed a priori SNR ξ̂(λ,ω) and a posteriori SNR γ̂(λ,ω) into Eq. (8), a RA-SAP estimator is derived, which improves the estimation accuracy of the proposed hybrid estimation technique.
A similar time-frequency smoothing procedure has been presented by Cohen et al. in a single-channel algorithm to estimate the a priori SAP (Cohen and Berdugo, 2001). Cohen's algorithm improves the robustness of the a priori SAP estimates; however, it cannot improve their estimation accuracy, since the estimation accuracy of the noise spectra is not improved by the time-frequency smoothing procedure. In the proposed hybrid estimation technique, both the robustness and the accuracy of the SAP estimates are improved by exploiting the newly presented time-frequency smoothing scheme when multiple microphones are available.
Furthermore, the RA-SAP estimator provides much higher noise estimation accuracy for the hybrid estimation technique in several respects. Firstly, as shown in Eq. (7), the final spectral estimates are computed by the multi-channel approach, which produces highly accurate spectral estimates, in most cases. Secondly, the estimation accuracy of the single-channel approach is significantly improved, which is attributed to the multi-channel estimation approach and this RA-SAP estimator: since accurate spectral estimates by the multi-channel approach are likely to be distributed around those determined by the single-channel approach, the accuracy and robustness of the SAP estimator can be ensured by applying the time-frequency smoothed a priori SNR and a posteriori SNR, which are more accurate. Finally, the improved single-channel approach contributes to enhancing the final estimation accuracy of the hybrid noise estimation technique.
3.2. Suppressing localized noise with spectral subtraction

The proposed hybrid noise estimation technique gives accurate spectral estimates for the localized noises. The estimated spectra are subsequently subtracted from those of the observed noisy signals by spectral subtraction (Berouti et al., 1979):

Ŝ(λ,ω) = C(λ,ω) − α N̂^c(λ,ω),  if C(λ,ω) > α N̂^c(λ,ω),
         β C(λ,ω),              otherwise,   (13)

where α and β are the overestimation factor and the spectral floor factor. Since the spectral estimates of the localized noises are highly accurate, α = 1 is set to avoid distorting the speech signal, and β is determined experimentally.
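The spectral subtraction rule of Eq. (13) with an overestimation factor and a spectral floor can be sketched per bin as follows (our own naming; α = 1 and β = 0.001 as set in the experiments):

```python
def spectral_subtract(c_mag, n_est, alpha=1.0, beta=0.001):
    """Eq. (13): subtract the estimated localized-noise magnitude, with
    overestimation factor alpha; bins that would go non-positive are
    floored at beta times the observed magnitude to limit musical noise."""
    return [c - alpha * n if c > alpha * n else beta * c
            for c, n in zip(c_mag, n_est)]
```

The flooring branch is what keeps the output spectrum strictly positive even when the noise estimate overshoots the observation.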
4. Non-localized noise suppression with post-filtering

In this section, we address the problem of suppressing non-localized noise by employing a multi-channel post-filter. The well-known OM-LSA estimator is adopted as the post-filter due to its superiority in eliminating the "musical noise" (Cohen and Berdugo, 2001) caused by the spectral subtraction in this work. To further enhance the noise reduction ability of this post-filter, we propose an estimator for the a priori SAP based on the coherence characteristic of the noise field at the spectral subtraction output.

4.1. OM-LSA estimator

The basic idea of the OM-LSA estimator is to minimize the mean square error between the log-spectra of the desired speech signals and their optimal estimates. Under speech presence uncertainty, and based on the assumption of a complex Gaussian statistical model, the OM-LSA estimator is given by (Cohen and Berdugo, 2001):

G(λ,ω) = {G_{H1}(λ,ω)}^{1 − q_p(λ,ω)} · G_min^{q_p(λ,ω)},   (14)

where (i) G_min is an empirical constraint constant; (ii) q_p(λ,ω) is the SAP for post-filtering, calculated at the spectral subtraction output; (iii) G_{H1}(λ,ω) is the gain function of the traditional MMSE-LSA estimator when speech is surely present (Ephraim and Malah, 1985). As shown in Eqs. (8) and (14), the OM-LSA estimator depends on the SAP q_p(λ,ω) (calculated by Eq. (8)) and further on the a priori SAP q′_p(λ,ω) at the spectral subtraction output. An estimator for the a priori SAP q′_p(λ,ω) has been presented in the single-channel scenario (Cohen and Berdugo, 2001), given by:

q′_p(λ,ω) = 1 − P_local(λ,ω) P_global(λ,ω) P_frame(λ,ω),   (15)
where P_local(λ,ω), P_global(λ,ω) and P_frame(λ,ω) are energy-based speech measures on the current frame for a local frequency window, a larger frequency window, and the whole frame, respectively, and P denotes probability. However, the energy distributions have been changed by the spectral subtraction, resulting in the failure of these energy-based schemes (e.g., the algorithm in (Cohen and Berdugo, 2001)). Therefore, we should develop an estimator for the a priori SAP q′_p(λ,ω) based on the unchanged characteristics of the noise field at the spectral subtraction output.
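The OM-LSA gain of Eq. (14) is a geometric interpolation between the speech-present gain and the floor; a hedged one-liner (our own naming; G_min = 0.005 from Table 1, and G_H1 is assumed to be supplied by an MMSE-LSA routine not shown here):

```python
def om_lsa_gain(g_h1, q_p, g_min=0.005):
    """Eq. (14): geometric interpolation between the speech-present
    MMSE-LSA gain G_H1 and the floor G_min, weighted by the SAP q_p."""
    return (g_h1 ** (1.0 - q_p)) * (g_min ** q_p)
```

With q_p = 0 the filter applies the full MMSE-LSA gain, and with q_p = 1 it attenuates the bin all the way down to G_min, which is what suppresses residual non-localized noise without musical artifacts.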
4.2. Analysis of the noise field at the spectral subtraction output

To characterize a noise field, a widely used measure is the magnitude-squared coherence (MSC), more commonly referred to simply as coherence, defined as:

Γ_xy(λ,ω) = |Φ_xy(λ,ω)|² / [Φ_xx(λ,ω) Φ_yy(λ,ω)],   (16)
where (i) Φ_xy(λ,ω) is the cross-spectral density between the two signals x(t) and y(t); (ii) Φ_xx(λ,ω) and Φ_yy(λ,ω) are the auto-spectral densities of x(t) and y(t), respectively. In the generalized signal model given by Eqs. (1)-(3), the non-localized noise is modelled as diffuse noise, characterized by the following MSC function:

Γ(ω) = [sin(ωd/v) / (ωd/v)]²,   (17)

where d and v are the distance between the microphones and the velocity of sound.
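Eq. (17) and the transition frequency f = v/(2d) used later in Section 4.3 can be sketched directly (function names ours; d in metres, v = 340 m/s as in Table 1):

```python
import math

def diffuse_msc(freq_hz, d, v=340.0):
    """Eq. (17): theoretical MSC of a diffuse noise field for microphone
    spacing d [m] and sound velocity v [m/s]; omega*d/v = 2*pi*f*d/v."""
    x = 2.0 * math.pi * freq_hz * d / v
    return 1.0 if x == 0.0 else (math.sin(x) / x) ** 2

def transition_freq(d, v=340.0):
    """First minimum of Eq. (17): f = v / (2d), the boundary between the
    high-coherence (low-frequency) and low-coherence (high-frequency) regions."""
    return v / (2.0 * d)
```

For the 10 cm spacing used in Figs. 3 and 4 this places the transition at 1.7 kHz, below which diffuse noise is strongly coherent across the array.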
Fig. 3 illustrates the MSCs computed from real-world car noises together with the theoretical MSC of a diffuse noise field. The measured MSCs clearly follow the trend of the theoretical MSC of a diffuse noise field, with some variance. Furthermore, at the spectral subtraction output, the diffuse characteristic of the non-localized noise is maintained, since the localized noise suppression described in the last section has little influence on the non-localized noise. This fact is confirmed by the MSCs, shown in Fig. 4, computed from the inputs and outputs of the spectral subtraction when only diffuse noise exists. This motivates us to propose an estimator for the a priori SAP based on the unchanged coherence characteristic at the spectral subtraction output.
4.3. An estimator for the a priori SAP

This subsection deals with the problem of detecting the desired speech signal based on the coherence characteristic of the noise field at the spectral subtraction output. Fig. 5 shows an example of the average MSC over all frequencies at the spectral subtraction output when the speech signals are spatially strongly correlated and the noises are spatially weakly correlated. Evidently, the MSC provides useful information for detecting the speech signal. Based on this fact, and on the assumptions of zero correlation between the speech and noise signals and of a diffuse noise field, we present an estimator for the a priori SAP in the following.
Fig. 3. Magnitude-squared coherence in car environments: theoretical MSC (solid line) and measured MSCs in a car environment at speeds of 50 km/h (dotted line) and 100 km/h (dashed line). The distance between adjacent microphones is 10 cm.

Fig. 4. Magnitude-squared coherence in car environments: theoretical MSC (solid line) and measured MSCs at the input of the system (dashdot line) and at the output of the spectral subtraction (dashed line). The distance between adjacent microphones is 10 cm.
Fig. 5. Average MSCs over all frequencies in a car environment at various SNRs: SNR = −5 dB (dashdot line); SNR = 0 dB (dashed line); SNR = 5 dB (solid line). Noise condition: car environment at a speed of 100 km/h.
At the spectral subtraction output, the MSC function Γ(λ,ω) is first calculated from the output signals. The output signals at different frequencies are characterized by different coherence values, as shown in Fig. 4. This observation motivates dividing the MSC spectra at the spectral subtraction output into two parts: the low-frequency region with high noise coherence and the high-frequency region with low noise coherence. The transition frequency between the two regions is the first minimum of the MSC function of a diffuse noise field, given by f = v/(2d), where d and v are the distance between adjacent microphones and the sound velocity. To determine the a priori SAPs in the high- and low-frequency regions, we propose two different schemes, described as follows.
In the high-frequency region, the MSC spectra are divided into E sub-bands of a feasible bandwidth BW and averaged over the frequencies in each sub-band, giving the averaged MSC Γ̄_e(λ,ω) (e = 1, 2, ..., E) in the eth sub-band. If a high averaged coherence (higher than a threshold T_max^e) is detected, a speech-present state is presumed; if a low averaged coherence (lower than a threshold T_min^e) is detected, a speech-absent state is presumed. Note that the a priori SAP decreases as the MSC increases. For MSCs in the range [T_min^e, T_max^e], the a priori SAPs are determined by linear interpolation. Thus, the a priori SAP in the high-frequency region, q′_{p,h}(λ,ω), is given by:

q′_{p,h}(λ,ω) = 0,  if Γ̄_e(λ,ω) > T_max^e,
               1,  if Γ̄_e(λ,ω) < T_min^e,
               [T_max^e − Γ̄_e(λ,ω)] / [T_max^e − T_min^e],  otherwise,   (18)

for ω ∈ [ω_e^{low}, ω_e^{high}], where ω_e^{low} and ω_e^{high} are the low and high boundaries of the eth sub-band.
In the low-frequency region, the MSCs computed in this region fail to detect the speech signal, since both speech and noise are strongly correlated. Based on the MSC value Γ̄(λ,ω), computed and averaged over the frequencies higher than the transition frequency, and following the same concept used in the high-frequency region, we derive an estimator for the a priori SAP in the low-frequency region, q′_{p,l}(λ,ω), given by:

q′_{p,l}(λ,ω) = 0,  if Γ̄(λ,ω) > T_max,
               1,  if Γ̄(λ,ω) < T_min,
               [T_max − Γ̄(λ,ω)] / [T_max − T_min],  otherwise,   (19)

where

Γ̄(λ,ω) = (1/E) Σ_{e=1}^{E} Γ̄_e(λ,ω).   (20)
The a priori SAP estimates given by Eqs. (18) and (19) are then incorporated into the post-filtering, further improving its noise reduction performance.
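The threshold-and-interpolate mapping shared by Eqs. (18) and (19) can be sketched as a single function (our own naming; the default thresholds 0.2/0.7 are the T values listed in Table 1):

```python
def a_priori_sap(msc, t_min=0.2, t_max=0.7):
    """Eqs. (18)-(19): map an averaged MSC to an a priori SAP. High
    coherence means speech is likely present (SAP 0), low coherence means
    speech is likely absent (SAP 1); in between, linear interpolation."""
    if msc > t_max:
        return 0.0
    if msc < t_min:
        return 1.0
    return (t_max - msc) / (t_max - t_min)
```

In the high-frequency scheme this is applied per sub-band to Γ̄_e(λ,ω); in the low-frequency scheme it is applied once to the high-frequency average Γ̄(λ,ω) of Eq. (20).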
5. Experiments and discussions
Table 1
Parameters set in the experiments

M = 256        ε₁ = 0.5        ε₂ = 0.1       ε = 0.1        D = 2
α = 0.98       β = 0.001       BW = 32        E = 12         v = 340 m/s
T_max^e = 0.7  T_min^e = 0.2   T_max = 0.7    T_min = 0.2    G_min = 0.005
In this section, two sets of experiments were conducted to evaluate and compare the performance of the proposed noise estimation/reduction methods with that of other traditional noise estimation/reduction methods. The first set of experiments was devoted to showing the pure contribution of the hybrid noise estimation technique. The second set evaluated the performance of the proposed noise reduction method as a complete system, further compared to several conventional noise reduction algorithms in simulated and real-world noise environments. All parameters set in our experiments are listed in Table 1.
5.1.
Evaluation
of
proposed
hybrid
noise
estimation
technique
In
the
first
set
of
experiments,
we
concentrated
on
the
improvement
in
estimation
accuracy
of
the
proposed
hybrid
noise
estimation
technique
for
localized noises
compared
to
the
corresponding
single-channel
and
multi-channel
estimation
approaches.
5.1.1.
Sound
data
To
objectively
evaluate
the
performance
of
the
proposed
hybrid
noise
estimation
technique,
54
clean
speech
sentences,
selected
from
ATR
database
and
uttered
by
three
male
and
three
female
speakers,
were
used.
The
tested
noises
consisted
of
synthesized
noises
(white
Gaussian
noise
and
pink
noise)
and
real-world
car
noise.
The
speech
data
and
noise
data
were
first
resampled
to
12
kHz
and
linearly
quantized
at
16
bits.
The
noisy
signals
were
generated
by
mixing
the
clean
speech
signals
with
the
localized
tested
noises
with directions of arrival (DOAs) of 10–80° to the right.
5.1.2.
Evaluation
measure
The
performance
of
the
proposed
hybrid
noise
estimation
technique
was
evaluated
and
compared
to
the
corresponding
single-channel
and
multi-channel
approaches
in
terms
of
Normalized
Estimation
Error
(NEE),
defined
as
$$
\mathrm{NEE} = \frac{1}{L} \sum_{\lambda=1}^{L} 20 \log_{10} \frac{\sum_{\omega=0}^{M-1} \bigl| |\hat{N}_c(\lambda,\omega)| - |N_c(\lambda,\omega)| \bigr|}{\sum_{\omega=0}^{M-1} |N_c(\lambda,\omega)|}
\tag{21}
$$
where $\hat{N}_c(\lambda,\omega)$ and $N_c(\lambda,\omega)$ are the estimated noise spectrum and the "ideal" noise spectrum, respectively; M and L are the length of the STFT and the number of frames.
It should be noted that a smaller NEE indicates a more accurate noise estimate obtained by the tested estimation technique.
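Eq. (21) can be computed directly from the two noise spectrograms; a minimal sketch, assuming (L, M) complex STFT arrays (frames × bins; the function name is ours):

```python
import numpy as np

def nee(est_noise_spec, ideal_noise_spec):
    """Normalized Estimation Error, Eq. (21): per-frame dB ratio of the
    summed magnitude estimation error to the summed ideal noise
    magnitude, averaged over all L frames.
    Both inputs are (L, M) complex STFT matrices (frames x bins)."""
    err = np.abs(np.abs(est_noise_spec) - np.abs(ideal_noise_spec)).sum(axis=1)
    ref = np.abs(ideal_noise_spec).sum(axis=1)
    return np.mean(20.0 * np.log10(err / ref))

rng = np.random.default_rng(0)
ideal = rng.standard_normal((50, 256)) + 1j * rng.standard_normal((50, 256))
print(nee(0.9 * ideal, ideal))   # a uniform 10% magnitude error gives -20 dB
```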
5.1.3.
Evaluation
results
The
average
NEEs
over
the
localized
noise
signals
with
different
DOAs
in
the
tested
noise
conditions
are
listed
in
Table
2.
Table
2
demonstrates
that
the
normalized
noise
estimation
error
is
consistently
decreased
for
all
the
tested
noise
condi-
tions,
especially
for
localized
car
noise,
when
the
proposed
hybrid
estimation
technique
is
used.
This
improvement
amounts
to
3
dB
compared
to
the
single-channel
estimation
approach
alone
and
5
dB
compared
to
the
multi-channel
estimation
technique
alone
in the localized car noise environment.
Fig.
6
illustrates
the
typical
examples
of
the
NEE
comparisons
of
the
single-channel,
multi-channel
and
hybrid
noise
estimation
techniques
in
localized
white
and
car
noise
environments.
All
the
observations
obtained
from
Table
2
and
Fig.
6
verify
the
superiority
of
our
proposed
hybrid
noise
estimation
technique
compared
to
Table 2
Average NEEs [dB] in various noise conditions

                  White       Pink        Car
Single-channel    −5.2842     −4.6423     −6.7910
Multi-channel     −12.8905    −10.2486    −4.5205
Hybrid            −13.3378    −11.1818    −9.7014
Fig.
6.
Normalised
noise
estimation
error
(dB)
for
signals
processed
by
single-channel
technique
(dashdot),
multi-channel
technique
(dashed)
and
hybrid
technique
(solid)
under
(a)
white
noise
conditions
and
(b)
car
noise
conditions.
the multi-channel-only estimation approach and the single-channel-only estimation approach.
5.2.
Evaluation
of
proposed
noise
reduction
system
The
second
set
of
experiments
was
conducted
to
evaluate
the
performance
of
the
proposed
noise
reduction
algorithm
as
a
complete
system
in
the
simulated
and
real-world
noise
environments.
Furthermore,
its
performance
was
compared
with
that
of
other
conventional
noise
reduction
algorithms,
including
multi-channel
algorithms:
delay-and-
sum
beamformer,
delay-and-sum
beamformer
with
multi-channel
Wiener
post-filter
(Simmer
and
Wasiljeff,
1992)
and
the
subtractive
beamformer
alone
based
algorithm
(Akagi
and
Kago,
2002),
and
single-channel
algorithm
based
on
OM-LSA
estimator
(Cohen
and
Berdugo,
2001),
under
various
noise
conditions
in
terms
of
objective
and
subjective
evaluation
measures.
5.2.1.
Sound
data
In
the
simulated
noise
environment,
the
input
noise
signals
consisted
of
localized
noise
signal
(directional
car
noise
with
DOA
of
40°
to
the
right)
and
non-localized
noise
signal
(diffuse
noise).
The
diffuse
noise
field
was
generated
by
placing
18
independent
pink
noise
sources
around
the
microphone
array.
The noisy data were obtained by adding the recorded input noise signals to the clean speech signals, which were the same as those used in the first set of experiments, at various global SNR levels in the range [−5, 20] dB.
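Mixing clean speech and noise at a prescribed global SNR amounts to rescaling the noise before adding it; a minimal sketch of such mixing (the function name is ours):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the global speech-to-noise power ratio of
    the mixture equals `snr_db`, then return speech + scaled noise."""
    ps = np.mean(speech ** 2)          # speech power
    pn = np.mean(noise ** 2)           # noise power
    gain = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(1)
s = rng.standard_normal(12000)
n = rng.standard_normal(12000)
x = mix_at_snr(s, n, 5.0)             # noisy signal at 5 dB global SNR
```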
In
the
real-world
noise
environment,
an
equispaced
linear
array,
consisting
of
three
microphones
with
inter-element
interval
of
10
cm,
was
mounted
above
the
windshield
in
the
car.
Car
noise
signals
were
recorded
while
the
car
was
running
on
the
highway
at
the
speed
of
50
km/h
and
100
km/h.
Speech
data
were
the same
as
those
used
in
the
first
set
of
experiments
selected
from
ATR
database.
The
first
set
of
input
microphone
signals
were
generated
by
mixing
the
clean
speech
signals
and
car
noise
signals
at
various
global
SNR
levels
in
the
range
[-5,
20]
dB.
The second set of input noisy signals was generated by mixing the clean speech signals, the car noise recorded at 100 km/h, and another interfering speech voice (localized noise) with a DOA of 60° to the right.
This
interfering
speech
voice
was
used
to
imitate
the
passenger's
voice
in a car environment.
5.2.2.
Objective
evaluation
measures
The objective evaluation measure used in our experiments is the segmental SNR (SEGSNR), a widely used objective evaluation criterion for speech enhancement and noise reduction algorithms since it correlates well with subjective results (Quackenbush et al., 1988). SEGSNR is defined as the ratio of the power of the "ideal" clean speech to that of the noise
signal embedded in a noisy signal or in an enhanced speech signal produced by the tested algorithms, averaged over all frames, given by:
$$
\mathrm{SEGSNR} = \frac{1}{L} \sum_{\lambda=0}^{L-1} 10 \log_{10} \frac{\sum_{i=0}^{M-1} s^2(\lambda M + i)}{\sum_{i=0}^{M-1} \bigl[ \hat{s}(\lambda M + i) - s(\lambda M + i) \bigr]^2}
\tag{22}
$$
where (i) $s(\cdot)$ and $\hat{s}(\cdot)$ are the reference speech signal and the noisy or enhanced signal processed by the tested algorithms; (ii) L and M represent the number of frames in the signal and the number of samples per frame (equal to the length of the STFT).
Note that a higher SEGSNR value corresponds to higher speech quality of the enhanced signal.
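Eq. (22) can be computed directly in the time domain; a minimal sketch, assuming equal-length reference and enhanced signals (the function name is ours):

```python
import numpy as np

def segsnr(ref, enhanced, frame_len=256):
    """Segmental SNR, Eq. (22): frame-wise 10*log10 of clean power over
    error power, averaged over all complete frames."""
    n_frames = min(len(ref), len(enhanced)) // frame_len
    vals = []
    for lam in range(n_frames):
        seg = slice(lam * frame_len, (lam + 1) * frame_len)
        num = np.sum(ref[seg] ** 2)
        den = np.sum((enhanced[seg] - ref[seg]) ** 2)
        vals.append(10.0 * np.log10(num / den))
    return float(np.mean(vals))

rng = np.random.default_rng(2)
clean = rng.standard_normal(256 * 40)
noisy = clean + 0.1 * rng.standard_normal(256 * 40)
print(segsnr(clean, noisy))   # roughly 20 dB for a 0.1-std error
```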
5.2.3.
Objective evaluation
results
Fig.
7
shows
the
experimental
results
of
the
tested
noise
reduction
algorithms
in
terms
of
SEGSNR
in
the
simulated
and
real-world
noise
environments
at
various
noise
levels.
Significant
noise
suppression
performance
is
achieved
consistently
by
employing
the
proposed
noise
reduction
system
in
the
tested
conditions.
Compared to the noisy inputs and the signals enhanced by the multi-channel algorithms (delay-and-sum beamformer with/without post-filter), the
SEGSNR
improvements
of
the
proposed
system
amount
to
about
20
dB
in
low
SNRs
and
about
15
dB
in
high
SNRs
for
all
the
tested
noise
conditions.
These
are
larger
Fig. 7. Average segmental SNR (dB) at delay-and-sum beamformer output (○), delay-and-sum beamformer with post-filter output (△), single-channel OM-LSA output (*), spectral subtraction output (◇) and proposed system output (□), in various noise conditions: (a) simulated condition; (b) car environment with a speed of 50 km/h; (c) car environment with a speed of 100 km/h; (d) car environment with a speed of 100 km/h and passenger's interfering voice.
Fig.
8.
Speech
spectrograms.
(a)
Clean
signal
at
center
microphone;
(b)
noisy
signal
at
center
microphone;
(c)
delay-and-sum
beamformer
output;
(d)
delay-and-sum
beamformer
with
post-filter
output;
(e)
single-channel
OM-LSA
output;
(f)
spectral
subtraction
output;
(g)
proposed
system
output.
Noise
condition:
car
environment
with
a
speed
of
100
km/h
and
passenger's
interfering
voice.
than
those
of
the
enhanced
signals
from
the
subtractive
beamformer
alone
based
multi-channel
algorithm
and
the
OM-LSA
based
single-channel
algorithm,
especially
when
the
passenger
is
speaking.
The
significant
SEGSNR
improvement
of
the
proposed
noise
reduction
system
is
attributed
to
its
Fig. 9. Scheffé's paired comparison (a seven-grade scale from −3, "less clean", to 3, "more clean").
sufficient
capability
in
suppressing
both
the
local-
ized
and
non-localized
noises
simultaneously.
5.2.4.
Subjective
evaluations
Subjective
evaluations
and
comparisons
of
the
tested
noise
reduction
algorithms
were
performed
using
speech
spectrograms
and
listening
tests.
Typical
examples
of
speech
spectrograms
are
illustrated
in
Fig.
8
for
the
real-world
car
noise
environment
with
the
speed
of
100
km/h
and
interfering
passenger's
voice.
Fig.
8(c)
and
(d)
demonstrate
that
the
outputs
of
the
conventional
multi-channel
algorithms
(delay-and-sum
beamformer
without/with
post-filter)
are
characterized
by
a high level of noise due to the array's small physical
Fig.
10.
Listening
test
results.
Comparisons
of
clean
speech
(Clean),
noisy
speech
(Noisy)
and
enhanced
speeches
by
delay-and-sum
beamformer
(DS),
delay-and-sum
beamformer
with
post-filter
(DS
+
Post),
single-channel
OM-LSA
(OM-LSA),
the
subtractive
beamformer
based
system
(SS)
and
proposed
noise
reduction
system
(Proposed)
for
listening
in
various
SNRs.
(a)
SNR
=
-5
dB;
(b)
SNR
=
0
dB;
(c)
SNR
=
5
dB;
(d)
SNR
=
10
dB;
(e)
SNR
=
15
dB;
(f)
SNR
=
20
dB.
Noise
condition:
car
environment
with
the
speed
of
100
km/h
and
passenger's
interfering
voice.
size
(only
three
microphones
are
used
in
this
work)
and
incapability
of
reducing
the
correlated
noise
in
low
frequencies.
The
single-channel
OM-LSA
estimator
suppresses
a
large
amount
of
diffuse
noise
components
while
there
is
less
suppression
for
the
spatially
correlated
interfering
passenger's
voice,
as
shown
in
Fig.
8(e).
Fig.
8(f)
illustrates
that
the
subtractive
beamformer
alone
based
multi-channel
algorithm
succeeds
in
suppressing
the
localized
passenger's
interfering
voice
but
fails
in
suppressing
the
non-localized
diffuse
noise.
The
further
spectral
enhancement,
shown
in
Fig.
8(g),
is
achieved
by
the
proposed
noise
reduction
system
which
is
effective
in
suppressing
both
the
localized
and
non-localized
noise
components
simultaneously.
This improvement is attributed to the fact that not only the statistical characteristics of the signals but also the spatial characteristics of the noise field are taken into account in the proposed noise reduction system, which thus provides more possibilities for distinguishing the desired speech from undesired signals, including localized and non-localized noises.
For the listening tests, the sound data (clean, noisy and enhanced speech sounds), in the car environment with a speed of 100 km/h and interfering passenger's voice, were first grouped into various pairs, each of which was then randomly presented to eight subjects through binaural headphones at a comfortable loudness level in a sound-proof room.
Scheffé's paired comparison method was used to evaluate the preference of the enhanced speech in terms of seven-grade scores [−3, 3]. Each subject was asked to select the cleaner sound of the two in each pair and give a score based on his/her preference.
The
paired
comparison
for
sound
pair
(A,
B)
is
shown
in
Fig.
9.
If the first signal A was judged "very clean" compared to the second signal B in the pair, a score of −3 was given; if the second signal B was judged "very clean" compared to the first signal A, a score of 3 was given; and if the two signals in the pair were perceived to be approximately the same, a score of 0 was given.
Note that each score in [−3, 3] represents the degree of relative cleanness (not absolute cleanness) of the tested signals.
Assume that the number of subjects is $NUM_{sub}$, the number of tested sounds is $NUM_{data}$, and the score for the pair (i, j) given by the k-th subject is $x^k_{ij}$. The overall mean score $A_i$ for the i-th sound was then calculated by
$$
A_i = \frac{1}{2 \times NUM_{sub} \times NUM_{data}} \sum_{j} \sum_{k} \bigl( x^k_{ji} - x^k_{ij} \bigr),
$$
so that scores for pairs in which sound i appears second (and is thus favored by positive scores) count positively. Obviously, the preference of the i-th speech sound is a function of the population.
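Under the sign convention described above (positive scores favor the second sound of a pair), the averaging can be sketched as follows; the array layout and function name are our assumptions:

```python
import numpy as np

def mean_scores(x):
    """Overall mean preference scores from Scheffe-style paired-comparison
    data. x has shape (NUM_sub, NUM_data, NUM_data); x[k, i, j] is subject
    k's score in [-3, 3] for the ordered pair (i, j), positive favoring
    the second sound j. Returns A_i for each sound i."""
    x = np.asarray(x, dtype=float)
    num_sub, num_data = x.shape[0], x.shape[1]
    favored = x.sum(axis=(0, 1))     # sums over pairs where i is second
    opposed = x.sum(axis=(0, 2))     # sums over pairs where i is first
    return (favored - opposed) / (2.0 * num_sub * num_data)

# One subject, two sounds: sound 1 judged cleaner in both presentations.
x = np.zeros((1, 2, 2))
x[0, 0, 1] = 2.0     # pair (0, 1): second sound favored
x[0, 1, 0] = -2.0    # pair (1, 0): first sound favored
print(mean_scores(x))    # sound 1 gets the higher score
```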
And
the
F-test
results
further
confirmed
that
there
are
significant
differences
between
the
enhanced
signals
processed
by
the
proposed
method
and
those
by
other
comparative
methods
at
a
5%
significance
level.
The experimental results at various SNRs, illustrated in Fig. 10, show that our proposed noise reduction algorithm also yields the highest scores, corresponding to the "cleanest" speech, compared to the other multi-channel and single-channel algorithms, which further verifies the superiority of the proposed noise reduction system in the listening sense as well.
6.
Conclusions
In
this
paper,
we
presented
a
novel
noise
reduction
system
using
a
hybrid
noise
estimation
technique
and
post-filtering.
The
hybrid
estimation
technique
effectively
combined
a
multi-channel
estimation
approach
and
a
single-channel
estimation
approach
in
a
parallel
structure
to
estimate
localized
noises.
The
final
estimation
accuracy
of
this
hybrid
estimation
technique
was
improved
by
incorporating
an
RA-SAP
estimator.
The
estimated
localized
noise
spectra
were
then
suppressed
by
spectral
subtraction.
At
spectral
subtraction
output,
we
proposed
a
novel
estimator
for
the
a
priori
SAP,
which
is
based
on
the
coherence
characteristic
of
the
noise
field
at
spectral
subtraction
output,
further
enhancing
the
noise
reduction
performance
of
the
OM-LSA
based
post-filter
for
non-localized
noise.
The
proposed
noise
estimation/reduction
methods
were
evaluated
and
further
compared
to
other
multi-channel
and
single-channel
estimation/
reduction
algorithms
in
various
noise
conditions.
Experimental
results
demonstrated that
the
proposed
hybrid
estimation
technique
yields
more accurate
spectral
estimates
for
localized noises
than
the
corresponding
multi-channel
and
single-channel
estimation
algorithms;
the
proposed
noise
reduction
system
was
able
to
reduce
both
localized
and
non-localized
noises
simultaneously,
achieving
the
best
noise
reduction
performance
among
the
tested
multi-channel
and
single-channel
noise
reduction
algorithms
in
various
noise
conditions
consistently.
Acknowledgments
The
authors
would
like
to
thank
all
the
subjects
who
attended
the
listening
tests,
especially
thank
Mr.
Takeshi
Saitou
for
his
help
in
the
listening
tests,
and
acknowledge
professor
Joerg
Bitzer
of
University
of
Applied
Sciences
in
Oldenburg,
Ger-
many,
for
his
useful
discussions
during
this
work,
and
the
anonymous
reviewers
for
their
great
com-
ments
to
improve
this
paper.
References
Akagi,
M.,
Kago,
T.,
2002.
Noise
reduction
using
a
small-scale
microphone
array
in
multi
noise
source
environment.
In:
Proc.
IEEE
Internat.
Conf.
on
Acoustic,
Speech
Signal
Processing,
ICASSP-2002,
pp.
909-912.
Akagi,
M.,
Mizumachi,
M.,
1997.
Noise
reduction
by
paired
microphones.
In:
Proc.
EUROSPEECH97,
pp.
335-338.
Boll,
S.F.,
1979.
Suppression
of
acoustic
noise
in
speech
using
spectral
subtraction.
IEEE
Trans.
Acoustic
Speech
Signal
Process.
ASSP-27
(2),
113-120.
Berouti,
M.,
Schwartz,
R.,
Makhoul,
J.,
1979.
Enhancement
of
speech
corrupted
by
additive
noise.
In:
Proc.
ICASSP-1979,
pp.
208-211.
Bitzer,
J.,
Simmer,
K.U.,
2001.
Superdirective
Microphone
Arrays.
Microphone
Arrays
Signal
Processing
Techniques
and
Applications.
Springer,
Berlin,
pp.
19-38.
Bitzer,
J.,
Simmer,
K.U.,
Kammeyer,
K.-D.,
1999.
Multi-microphone
noise
reduction
by
post-filter
and
superdirective
beamformer.
In:
International
Workshop
on
Acoustic
Echo
and
Noise
Control,
Pocono
Manor,
US,
pp.
27-30.
Cohen,
I.,
Berdugo,
B.,
2001.
Speech
enhancement
for
non-stationary
noise
environments.
Signal
Process.
81
(11),
2403-2418.
Ephraim,
Y.,
Malah,
D.,
1984.
Speech
enhancement
using
a
minimum
mean-square
error
short-time
spectral
amplitude
estimator.
IEEE
Trans.
Acoustic
Speech
Signal
Process.
32
(6),
1109-1121.
Ephraim,
Y.,
Malah,
D.,
1985.
Speech
enhancement
using
a
minimum
mean-square
error
log-spectral
amplitude
estimator.
IEEE
Trans.
Acoustic
Speech
Signal
Process.
33
(2),
443-445.
Elko,
G.W.,
1996.
Microphone
array
systems
for
hands-free
telecommunication.
Speech
Commun.
20
(3-4),
229-240.
Frost,
O.L.,
1972.
An
algorithm
for
linearly
constrained
adaptive
array
processing.
In:
Proc.
IEEE
60,
pp.
926-935.
Griffiths,
L.J.,
Jim,
C.W.,
1982.
An
alternative
approach
to
linearly
constrained
adaptive
beamforming.
IEEE
Trans.
Antennas
Propagat.
AP-30,
27-34.
Gannot, S., Burshtein, D., Weinstein, E., 2001. Signal enhancement using beamforming and nonstationarity with applications to speech.
IEEE
Trans.
Signal
Process.
49
(8),
1614-1626.
Li,
J.,
Akagi,
M.,
2004.
Noise
reduction
using
hybrid
noise
estimation
technique
and
post-filtering.
In:
Proc.
Internat.
Conf.
on
Spoken
Language
Processing,
ICSLP-2004,
Korea,
pp.
2705-2708.
Meyer,
J.,
Simmer,
K.U.,
1997.
Multi-channel
speech
enhancement
in
a
car
environment
using
Wiener
filtering
and
spectral
subtraction.
In
Proc.
22nd
IEEE
Internat.
Conf.
on
Acoustic,
Speech
Signal
Processing,
ICASSP-97,
Munich,
Germany,
pp.
21-24.
Mizumachi,
M.,
Akagi,
M.,
1999.
Noise
reduction
method
that
is
equipped
for
a
robust
direction
finder
in
adverse
environments.
In:
Proc.
IEEE
Workshop
on
Robust
Method
for
Speech
Recognition
in
Adverse
Conditions,
Tampere,
Finland,
pp.
179-182.
McCowan,
I.A.,
Bourlard,
H.,
2003.
Microphone
array
post-filter
based
on
noise
field
coherence.
IEEE
Trans.
Speech
Audio
Process.
11
(6),
709-716.
Quackenbush,
S.R.,
Barnwell,
T.P.,
Clements,
M.A.,
1988.
Objective
Measures
of
Speech
Quality.
Prentice-Hall,
Inc.,
Englewood
Cliffs,
New
Jersey.
Rabiner,
L.,
Juang,
B.-H.,
1993.
Speech
Recognition
System
Design
and
Implementation
Issues.
Fundamental
of
Speech
Recognition.
Prentice-Hall,
Inc.,
Englewood
Cliffs,
New
Jersey.
Simmer,
K.U.,
Wasiljeff,
A.,
1992.
Adaptive
microphone
arrays
for
noise
suppression
in
the
frequency
domain.
In:
Proc.
Workshop
on
Adaptive
Algorithms
in
Communications,
Bordeaux,
France,
pp.
185-194.
Simmer,
K.U.,
Bitzer,
J.,
Marro,
C.,
2001.
Post-Filtering
Techniques.
Microphone
Arrays
Signal
Processing
Techniques
and
Applications.
Springer,
Berlin,
pp.
39-60.
Zelinski,
R.,
1988.
A
microphone
array
with
adaptive
post-filtering
for
noise
reduction
in
reverberant
rooms.
In:
Proc.
of
ICASSP-88,
Vol.
5,
pp.
2578-2581.