Ribonucleic acid sequencing differential gene expression analysis of isolated perfused bovine udders experimentally inoculated with Streptococcus agalactiae


Sbardella, A.P.; Weller, M.M.D.C.A.; Fonseca, I.; Stafuzza, N.B.; Bernardes, P.A.; E Silva, F.F.; da Silva, M.V.G.B.; Martins, M.F.; Munari, D.P.

Journal of Dairy Science 102(2): 1761-1767

2018


The aim of this study was to elucidate the differential gene expression in the RNA sequencing transcriptome of isolated perfused udders collected from 4 slaughtered Holstein × Zebu crossbred dairy cows experimentally inoculated with Streptococcus agalactiae. We studied 3 different statistical tools (edgeR, baySeq, and Cuffdiff 2). In summary, 2 quarters of each udder were experimentally inoculated with Strep. agalactiae and the other 2 were used as a control. Mammary tissue biopsies were collected at times 0 and 3 h after infection. The total RNA was extracted and sequenced on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA). Transcripts were assembled from the reads aligned to the bovine UMD 3.1 reference genome, and the statistical analyses were performed using the previously mentioned tools (edgeR, baySeq, and Cuffdiff 2). Finally, the identified genes were submitted to pathway enrichment analysis. A total of 1,756, 1,161, and 3,389 genes with differential gene expression were identified when using edgeR, baySeq, and Cuffdiff 2, respectively. A total of 122 genes were identified by the overlapping of the 3 methods; however, only the platelet activation presented a significantly enriched pathway. From the results, we suggest the FCER1G, GNAI2, ORAI1, and VASP genes shared among the 3 methods in this pathway for posterior biological validation.

41111
4V
„,,scLek,
i
J.
Dairy
Sci.
102:1-7
°
https://doi.org/10.3168/jds.2018-15516
©
American
Dairy
Science
Association
®
,
2019.
Ribonucleic
acid
sequencing
differential
gene
expression
analysis
of
isolated
perfused
bovine
udders
experimentally
inoculated
with
Streptococcus
agalactiae
A.
P.
Sbardella,'
M.
M.
D.
C.
A.
Weller,
2
I.
Fonseca,
2
N.
B.
Stafuzza,'
P.
A.
Bernardes,'
F. F.
e
Silva,
3
M.
V.
G.
B.
da
Silva,
2
M.
F.
Martins,
2
and
D.
P.
Munari
l
*
1
Departamento
de
Ciencias
Exatas,
Universidade
Estadual
Paulista—UNESP/FCAV,
Jaboticabal
14884-900,
Brazil
2
Empresa
Brasileira
de
Pesquisa
Agropecuaria—Embrapa
Gado
de
Leite,
Juiz
de
Fora
36038-330,
Brazil
3
Departamento
de
Zootecnia,
Universidade
Federal
de
Vigosa,
Vigosa
36570-000,
Brazil
ABSTRACT
The
aim
of
this
study
was
to
elucidate
the
differential
gene
expression
in
the
RNA
sequencing
transcriptome
of
isolated
perfused
udders
collected
from
4
slaughtered
Holstein
x
Zebu
crossbred
dairy
cows
experimentally
inoculated
with
Streptococcus
agalactiae.
We
studied
3
different
statistical
tools
(edgeR,
baySeq,
and
Cuff-
diff
2).
In
summary,
2
quarters
of
each
udder
were
experimentally
inoculated
with
Strep.
agalactiae
and
the
other
2
were
used
as
a
control.
Mammary
tissue
biopsies
were
collected
at
times
0
and
3
h
after
infec-
tion.
The
total
RNA
was
extracted
and
sequenced
on
an
Illumina
HiSeq
2000
(Illumina
Inc.,
San
Diego,
CA).
Transcripts
were
assembled
from
the
reads
aligned
to
the
bovine
UMD
3.1
reference
genome,
and
the
sta-
tistical
analyses
were
performed
using
the
previously
mentioned
tools
(edgeR,
baySeq,
and
Cuffdiff
2). Fi-
nally,
the
identified
genes
were
submitted
to
pathway
enrichment
analysis.
A
total
of
1,756,
1,161,
and
3,389
genes
with
differential
gene
expression
were
identified
when
using
edgeR,
baySeq,
and
Cuffdiff
2,
respectively.
A
total
of
122
genes
were
identified
by
the
overlapping
of
the
3
methods;
however,
only
the
platelet
activa-
tion
presented
a
significantly
enriched
pathway.
From
the
results,
we
suggest
the
FCERIG,
GNAI2,
ORAI1,
and
VASP
genes
shared
among
the
3
methods
in
this
pathway
for
posterior
biological
validation.
Key
words:
baySeq,
crossbred
dairy
cow,
Cuffdiff
2,
edgeR
INTRODUCTION
The
identification
of
genes
expressed
in
different
tissues
leads
to
a
better
understanding
of
molecular
mechanisms
that
are
relevant
for
animal
production.
Received
August
8,
2018.
Accepted
October
28,
2018.
*Corresponding
author:
Ribonucleic
acid
sequencing
(RNA-seq)
allows
a
wider
gene
expression
analysis
because
the
total
RNA
from
a
given
tissue
can
be
sequenced
(Wang
at
al.,
2009).
Different
statistical
methods
are
available
for
perform-
ing
RNA-seq
differential
gene
expression
analysis
in
a
range
of
biological
experiments
(Hardcastle
and
Kelly,
2010;
Robinson
et
al.,
2010a).
The
edgeR
(Robinson
et
al.,
2010a)
and
baySeq
(Hard-
castle
and
Kelly,
2010)
packages
from
Bioconductor/R
software
and
the
Cuffdiff
2
(Trapnell
et
al.,
2013)
pack-
age
implemented
in
Cufflinks
software
have
been
widely
used
to
perform
statistical
analysis
of
RNA-seq
data
(Soneson
and
Delorenzi,
2013;
Seyednasrollah
et
al.,
2015).
These
tools
were
proposed
to
detect
differential
gene
expression
based
on
different
statistical
assump-
tions,
leading
to
distinct
results.
Thus,
the
evaluation
of
all
these
methods
in
RNA-seq
experimental
designs
is
recommended
(Seyednasrollah
et
al.,
2015).
This
evaluation
is
used
to
improve
the
selection
of
candidate
genes
for
posterior
biological
validation
based
on
path-
way
enrichment
analysis.
The
aim
of
this
study
was
to
elucidate
the
differential
gene
expression
in
the
RNA-
seq
transcriptome
of
isolated
perfused
udders,
collected
from
Holstein
x
Zebu
crossbred
dairy
cows
experimen-
tally
inoculated
with
Streptococcus
agalactiae,
using
the
previously
mentioned
statistical
methods.
MATERIALS
AND
METHODS
All
animal care
and
handling
procedures
followed
regulations
approved
by
the
animal
research
ethics
committee
of
Embrapa
Dairy
Cattle
(Juiz
de
Fora,
MG,
Brazil;
license
no.
11/2011).
Data
Collection
The
samples
were
obtained
from
4
Holstein
x
Zebu
crossbred
dairy
cows
slaughtered
in
the
experimental
center
"Campo
Experimental
Jose
Henrique
Bruschi"
at
Embrapa
Dairy
Cattle.
Their
udders
were
extracted
1
2
SBARDELLA
ET
AL.
and
kept
in
an
isolated
perfused
system.
Two
quarters
of
each
udder
were
experimentally
inoculated
with
Strep.
agalactiae,
and
the
other
2
quarters
were
used
as
a
control.
Mammary
tissue
biopsies
were
collected
at
times
0
and
3
h
after
infection.
RNA
Extraction
The
RNA
extraction
was
performed
in
the
Molecular
Genetics
Laboratory
at
Embrapa
Dairy
Cattle.
The
to-
tal
RNA
was
extracted
in
2
replicates
using
the
RNeasy
Mini
Kit
(Qiagen,
Valencia,
CA).
The
protocol
is
based
on
cell
lysis,
followed
by
a
DNase
treatment
to
avoid
DNA
contamination,
binding
of
RNA
to
the
silica
gel
membrane,
purification,
and
subsequent
elution
of
to-
tal
RNA.
The
total
isolated
RNA
was
quantified
by
spectrophotometry
(NanoDrop,
Wilmington,
DE),
and
its
integrity
was
evaluated
by
RNA
integrity
number
score
on
a
Bioanalyzer
2100
(Agilent
Technologies,
Santa
Clara,
CA).
The
average
RNA
integrity
number
±
standard
deviation
was
6.69
±
0.46.
For
each
udder,
a
pool
of
RNA
sample
from
the
2
inoculated
quarters
was
obtained
and
another
pool
of
RNA
sample
was
collected
from
the
2
control
quarters.
These
collections
were
made
2
times
(0
and
3
h)
for
all
4
animals,
total-
ing
16
samples
(4
biological
replicates
x
2
collections
x
2
conditions).
RNA
Sequencing
The
RNA
was
sequenced
by
a
commercial
provider
that
prepared
cDNA
libraries
using
a
TruSeq
RNA
sample
preparation
kit
(Illumina,
San
Diego,
CA),
generating
about
19
million
100-bp
reads.
The
libraries
were
quantified
by
real-time
PCR
and
sequenced
(2
x
100
paired-end
reads)
on
a
HiSeq
2000
sequencer
(Il-
lumina)
using
the
TruSeq
SBS
Kit
version
3
(Illumina).
Quality
Control
and
Alignment
Quality
control
was
performed
using
FastQC
(ver-
sion
0.65;
Afgan
et
al.,
2016;
https://usegalaxy.org/)
.
Adapter
sequences
were
removed
from
reads;
however,
low-quality
or
ambiguous
reads
were
not
trimmed
be-
cause
a
preliminary
test
showed
that
they
did
not
align
to
the
reference
genome.
Reads
were
mapped
to
the
bovine
genome
UMD
3.1
assembly
(http://www.ncbi.nlm.nih.gov/genome?term
=
bos%20taurus)
using
TopHat
(version
2.1.0;
Trapnell
et
al.,
2009)
with
default
settings
and
following
the
au-
thors'
protocol.
The
unmapped
reads
were
subjected
to
a
BLAST
search
on
a
nonredundant
protein
database
using
NCBI
BLAST
+
blastx
(version
0.1.07;
Camacho
et
al.,
2009).
Following
the
alignment,
Cufflinks
(version
2.2.1.0)
with
default
settings
(Trapnell
et
al.,
2010)
was
used
to
assemble
the
transcripts.
The
abundance
of
the
transcripts
was
reported
by
Cufflinks
as
fragments
per
kilobase
of
transcript
per
million
fragments
sequenced,
which
was
the
normalized
expression
level,
later
used
by
the
Cuffdiff
2.
Gene
expression
was
then
calculated
by
counting
the
reads
mapped
to
exons
using
htseq-count
(version
0.6.1;
Anders
et
al.,
2015).
The
normalized
gene
expres-
sion
data
were
summarized
in
an
expression
matrix,
which
was
later
used
by
edgeR
and
baySeq.
A
multi-
dimensional
scaling
method
was
used
to
represent
the
similarities
between
samples
because
it
reproduces
the
observed
distance
matrix
in
a
space
with
a
reduced
number
of
chosen
dimensions
(Hair
et
al.,
2009).
Statistical
Analyses
Gene
expression
was
compared
between
control
and
inoculated
samples
at
times
0
and
3
h
after
inoculation
using
3
statistical
methods
(Table
1).
The
version
of
the
gene
model
used
for
this
study
was
Ensembl
Gene
92
(Aken
et
al.,
2016).
Three
different
tools
were
used,
and
these
were
chosen
to
study
distinct
and
nonover-
lapping
methodologies.
EdgeR
Tool.
The
method
implemented
in
the
edg-
eR
package
(Galaxy
tool
version
0.0.2;
Robinson
et
al.,
2010a)
uses
the
quantile-adjusted
conditional
maximum
likelihood
procedure
(Robinson
and
Smyth,
2008)
in
an
exact
test
similar
to
the
Fisher
exact
test
to
determine
differential
gene
expression
for
pairwise
comparisons.
It
assumes
a
negative
binomial
(NB)
distribution
for
the
data
in
a
balanced
incomplete
block
design
as
follows:
Table
1.
Description
of
the
statistical
assumptions
used
in
the
methods
for
RNA
sequencing
differential
gene
expression
analysis
Method'
Normalization
2
Distribution
Differential
expression
test
edgeR
TMM
Negative
binomial
Exact
test
baySeq
RLE
Negative
binomial
Posterior
probabilities
Cuffdiff
2
Geometric
classified
by
FPKM
0-negative
binomial
t-test
'edgeR:
Robinson
et
al.
(2010a).
baySeq:
Hardcastle
and
Kelly
(2010).
Cuffdiff
2:
Trapnell
et
al.
(2013).
2
TMM
=
trimmed
mean
of
M-values;
RLE
=
relative
log
expression;
FPKM
=
fragments
per
kilobase
of
tran-
script
per
million
fragments
sequenced.
Journal
of
Dairy
Science
Vol.
102
No.
2,
2019
RNA
SEQUENCING
DIFFERENTIAL
GENE
EXPRESSION
ANALYSIS
3
Y
o
NB(M
i
p
o
,
0
9
),
where
g
is
the
gene
and
i
is
the
sample,
M
i
is
the
library
size
(total
number
of
reads),
0
9
is
the
dispersion,
and
p
o
is
the
relative
abundance
of
the
gene
g
in
the
experi-
mental
group
j
to
which
the
sample
i
belongs.
The
NB
parameterization
where
the
mean
was
µ
9
,
=
M
i
p
o
and
the
variance
was
/2
9
,(1
9
Q
0
9
)
was
used.
The
normal-
ization
factor
named
trimmed
mean
of
M-values
was
estimated
between
each
pair
of
samples
(Robinson
and
Oshlack,
2010b).
Individual
gene
expression
was
cal-
culated
as
the
mean
expression
of
each
gene
averaged
over
all
samples
of
each
group
and
presented
as
the
logarithm
of
counts
per
million
reads.
The
Benjamini-
Hochberg
procedure,
proposed
by
Benjamini
and
Ho-
chberg
(1995),
was
used
to
control
the
false
discovery
rate
(FDR),
and
a
cut-off
criterion
of
FDR-adjusted
P
<
0.01
was
applied
to
identify
differentially
expressed
genes.
BaySeq
Tool.
The
method
applied
in
the
baySeq
R/
Bioconductor
package
(version
1.16.0;
Hardcastle
and
Kelly,
2010)
uses
prior
information
to
fit
a
nonlinear
model
that
quantifies
biases
and
uncertainties
in
the
es-
timates
(Liu
and
Logvinenko,
2003).
It
also
assumes
an
NB
distribution
for
the
data
and
derives
an
empirically
determined
prior
distribution
from
the
entire
data
set
by
maximum
likelihood
(Hardcastle
and
Kelly,
2010),
represented
as
follows:
Y
ii
NB(M
oi
,
0,),
where
Y
u
is
the
number
of
reads
for
the
gene
i
in
the
jth
sample,
M
g/3
is
the
mean,
and
O
q
is
the
dispersion.
The
counting
data
from
the
data
set
with
n
=
16
samples
A
=
{A
1
,
...,
A
16
}
are
given
by
(uic,
•••,
unc),
where
u
ic
is
the
counting
for
a
given
gene
c
in
sample
i.
For
each
sample
A
i
,
the
library
size
was
obtained
from
a
scale
factor
4.
Thus,
for
each
gene
(c),
the
data
(D)
were
considered
as
follows:
D
c
=
Kuic,
•••
unc),
(
1
1,
,
46)]
Two
models
were
defined.
In
the
first
model,
the
entire
data
set,
inoculated
(I)
and
control
(C'),
was
considered
as
a
single
treatment:
Dpc
=
[(Un,
Un,
U13,
Upi,
U63,
U62,
u03,
UcA)
(
in,
1
n,
6
114,
1J1,
/c2,
la)],
whereas
in
the
alternative
model,
the
differences
be-
tween
inoculated
(I)
and
control
(C')
data
sets
were
considered:
DP
=
[(u11,
u12,
u13,
u14),
(1n,
6
1
13,
1
14)]
D
c
=
[(UC1,
U62,
U6
,
3,
Uc4),
(la,
162, 103,
l
c4
)].
Last,
a
normalization
factor
named
relative
log
expres-
sion,
described
by
Anders
and
Huber
(2010),
was
used
to
compare
both
models.
Posterior
probabilities
and
FDR
for
each
gene
were
calculated
instead
of
signifi-
cance
values.
A
cut-off
criterion
of
posterior
probability
<0.01
was
applied
to
identify
differentially
expressed
genes.
Cuffdiff
2
Tool.
The
Cuffdiff
2
method
(Trapnell
et
al.,
2013),
implemented
in
Cufflinks
software
(Trap-
nell
et
al.,
2010),
was
used
with
default
settings.
It
uses
a
model
for
read
counts
based
on
a
(3-NB
distri-
bution
and
a
t-test
to
identify
differentially
expressed
genes
and
allows
differential
gene
expression
analysis
between
treatment
and
control
based
on
a
pairwise
analysis
without
replicates.
A
geometric
normaliza-
tion
procedure,
classified
by
fragments
per
kilobase
of
transcript
per
million
fragments
sequenced,
was
applied
to
determine
the
minimum
number
of
reads
needed
to
reconstruct
the
entire
sequence.
Assumptions
are
x
t
i
-
BNB
(p
l
i
,o-
t
i
2
),
where
xi
is
the
number
of
fragments
in
the
replicate
j
from
the
transcript
t,
BNB
is
the
0-NB,
µ
is
the
mean,
and
o
-2
is
the
variance.
The
Benjamini-Hochberg
proce-
dure
was
used
as
previously
described
in
the
baySeq
method
section.
Identification
of
Metabolic
Pathways
The
differentially
expressed
genes
identified
by
each
of
the
3
methods
were
analyzed
independently
using
DAVID
Bioinformatic
Resources
6.8
(Huang
et
al.,
2009a,b),
and
pathways
with
P
<
0.05
were
considered
enriched.
RESULTS
AND
DISCUSSION
The
total
and
the
average
number
of
reads
generated
by
the
RNA-seq
were
842,604,108
and
26,331,378,
re-
spectively.
The
quality
control
performed
using
FastQC
provided
an
overview
of
the
variation
in
base
quality
scores
at
each
position
of
the
input
file.
The
samples
were
of
good
quality
based
on
the
Phred
scores
(prob-
ability
of
incorrect
base
call).
Given
the
good
quality
of
the
cDNA
libraries
and
the
fact
that
mapping
algorithms
use
only
reads
that
perfectly
match
a
position
on
the
reference
genome,
the
analyses
were
carried
out
without
trimming
the
reads.
On
average,
88.91
(Table
2)
and
88.89%
of
the
Journal
of
Dairy
Science
Vol.
102
No.
2,
2019
4
SBARDELLA
ET
AL.
Table
2.
Total
and
proportion
of
aligned
reads
per
sample
Treatment
and
sample
label'
Left
reads
Right
reads
Aligned
pairs
Total
no.
Total
no.
Total
no.
Control
G7TO
30,692,458
89.9
30,419,535
89.1
29,164,959
84.2
G7T3
23,756,098
87.9
23,934,459
88.6
22,638,978
82.6
G8TO
21,164,058
90.4
21,137,250
90.3
20,248,224
85.3
G8T3
23,096,338
89.0
23,253,768
89.6
21,943,088
83.9
G9TO
23,054,518
87.5
23,235,820
88.2
21,811,906
82.3
G9T3
22,784,883
90.2
22,690,679
89.9
21,607,226
85.0
G1OTO
21,290,821
88.6
21,216,345
88.3
20,123,296
82.6
G1OT3
19,639,786
89.2
19,736,210
89.7
18,642,344
83.9
Inoculated
G7TO
22,510,799
86.3
22,658,924
86.9
21,280,138
80.9
G7T3
23,227,834
89.8
23,142,353
89.5
22,129,511
84.2
G8TO
21,707,055
88.3
21,809,049
88.7
20,653,152
82.8
G8T3
22,543,993
89.5
22,395,590
88.9
21,408,569
83.8
G9TO
33,428,420
90.4
33,434,970
90.4
31,943,198
85.1
G9T3
22,334,160
88.4
22,479,819
89.0
21,294,357
83.2
G1OTO
21,647,272
86.5
21,777,734
87.0
20,470,883
80.5
G1OT3
21,589,981
89.5
21,473,579
89.0
20,466,780
83.8
'G7
to
G10:
sample
identification;
TO
and
T3:
time
(h)
after
inoculation.
trimmed
and
nontrimmed
reads,
respectively, were
mapped
to
the
reference
genome.
This
difference
is
due
to
the
lower
number
of
reads
in
the
trimmed
data
set.
Similar
results
were
reported
by
Cui
et
al.
(2014),
who
also
aligned
sequenced
mammary
tissue
samples
using
TopHat.
The
unmapped
reads
subjected
to
the
BLAST
search
were
aligned
to
other
members
of
the
Bovidae
fam-
ily,
and
several
of
these
regions
are
related
to
immune
system
processes,
suggesting
that
the
alleles
are
highly
polymorphic
and
have
not
yet
been
annotated
in
the
Bos
taurus
genome.
During
the
multidimensional
scal-
ing
analysis,
the
control
sample
G7T3
was
clustered
together
with
the
inoculated
samples
(Figure
1),
and
it
was
removed
from
further
analysis.
There
are
26,453
annotated
genes
in
the
UMD
3.1
bovine
genome
assembly.
A
total
of
27,404
sequences
were
expressed
in
the
mammary
tissue,
of
which
21,691
were
annotated
genes.
The
remaining
5,821
sequences
were
labeled
as
XLOC.
A
larger
number
of
differentially
expressed
genes
was
reported
by
Cuffdiff
2,
followed
by
edgeR
and
baySeq
(Table
3).
The
variation
in
the
number
of
differentially
ex-
pressed
genes
is
explained
by
the
different
statistical
as-
sumptions
that
each
method
is
based
on
(Table
1).
The
different
normalization
methods
that
are
used
for
these
analyses
may
be
largely
responsible
for
the
variability
in
the
results,
having
more
of
an
effect
on
the
detection
of
genes
differentially
expressed
than
the
statistical
test
used
(Bullard
et
al.,
2010;
Maza
et
al.,
2013).
The
fragments
per
kilobase
of
transcript
per
million
fragments
sequenced
(FPKM)
normalization
method
is
associated
with
the
greatest
number
of
false
dis-
coveries
compared
with
relative
log
expression
and
trimmed
mean
of
M-values
methods;
this
also
occurs
in
the
production
of
mean
squared
errors,
so
the
FPKM
normalization
method
may
not
be
the
most
suitable
for
differential
gene
expression
analyses
(Maza
et
al.,
2013).
The
trimmed
mean
of
M-values
normalization
method
is
satisfactory
for
maintaining
a
reasonable
false-positive
rate
without
any
loss
of
power
(Dillies
et
al.,
2013)
and
performs
very
similar
to
relative
log
G9T3
G8T3
GI
OT36
G8T3N
G1OT3
67T31
G7T3
G9T3
-0.05
0.00
M1
0.05
0.10
Figure
1.
Multidimensional
scaling
plot
of
the
samples
in
which
distances
correspond
to
normalized
gene
expression
between
pairs
of
samples.
G7
to
G10
=
sample
identification;
T3
=
time
3
h
after
in-
oculation;
=
inoculated;
=
control.
0.050
0.025
0.000
-
0.025
-
0.050
-
0.005
Journal
or
Dairy
Science
Vol.
102
No.
2,
2019
RNA
SEQUENCING
DIFFERENTIAL
GENE
EXPRESSION
ANALYSIS
5
expression,
once
both
can
provide
equal
results
in
some
simpler
experiments
(Maza
et
al.,
2013;
Maza,
2016).
These
methods
perform
well
for
the
analysis
of
differen-
tially
expressed
genes
(Maza
et
al.,
2013).
The
baySeq
assumes
an
NB
distribution;
however,
it
estimates
posterior
probabilities
and
the
parameter
distributions
based
on
an
empirical
Bayesian
method
(Hardcastle
and
Kelly,
2010).
Soneson
and
Delorenzi
(2013)
and
Rapaport
et
al.
(2013)
observed
that
bay-
Seq
was
computationally
slower
but
had
a
better
con-
trol
of
FDR
with
a
large
sample
size
and
no
outliers
(Soneson
and
Delorenzi,
2013).
In
addition,
baySeq
it
is
known
to
be
more
conservative
than
the
other
methods
(Rapaport
et
al.,
2013;
Soneson
and
Delorenzi,
2013;
Seyednasrollah
et
al.,
2015),
which
was
corroborated
in
this
study
by
the
lower
number
of
differentially
ex-
pressed
genes
identified
(1,161).
In
contrast,
Cuffdiff
2
is
considered
less
conservative
and
reported
to
have
a
higher
FDR
(Rapaport
et
al.,
2013).
The
fact
that
the
method
is
less
conservative
was
also
observed
in
the
present
study:
3,389
differ-
entially
expressed
genes
were
identified
by
Cuffdiff
2.
According
to
Rapaport
et
al.
(2013),
Cuffdiff
2
uses
a
Poisson
model
to
control
variability
in
sequencing
depth,
biological
noise,
and
splicing
structure
by
cal-
culating
the
mean
count
across
replicates
and
P-values
for
any
observed
changes
in
a
transcript's
fragment
count
(Trapnell
et
al.,
2013).
The
traditional
statisti-
cal
methods,
as
t-test,
cannot
be
reliably
applied
to
genomic
data
(Law
et
al.,
2014).
This
method
makes
assumptions
of
normally
distributed
continuous
data,
whereas
the
reads
are
count
based
and
cannot
be
nor-
mally
distributed.
In
the
present
study,
the
number
of
differentially
expressed
genes
identified
using
edgeR
was
intermedi-
ary
(1,756)
between
the
other
2
methods.
The
edgeR
implements
a
frequentist
method
that
uses
an
empiri-
cal
Bayes
procedure
to
adjust
the
overdispersion
across
genes
by
using
the
information
of
other
genes
(Robinson
and
Smyth,
2007,
2008).
Soneson
and
Delorenzi
(2013)
considered
the
edgeR
to
be
computationally
efficient
but
less
conservative
compared
with
other
methods,
reporting
a
larger
FDR.
Rapaport
et
al.
(2013)
also
re-
ported
variable
FDR
when
comparing
edgeR,
baySeq,
and
Cuffdiff
2;
however,
both
baySeq
and
edgeR
had
lower
FDR.
The
number
of
differentially
expressed
genes
iden-
tified
in
the
present
study
is
in
accordance
with
the
literature
because
other
studies
have
reported
more
genes
detected
by
Cuffdiff
2,
followed
by
edgeR
and
baySeq.
The
present
results
suggest
that
baySeq
is
more
conservative
than
edgeR
and
Cuffdiff
2
and
that
the
identification
of
differentially
expressed
genes
by
Cuffdiff
2
may
be
inflated
(Table
3).
An
exception
found
in
the
literature,
a
study
by
Seyednasrollah
et
al.
(2015),
presented
Cuffdiff
2
as
conservative,
reporting
lower
FDR
compared
with
other
methods.
Although
the
results
from
the
3
statistical
tools
were
different
(Figure
2),
they
seemed
to
be
suitable
for
RNA-seq
differential
gene
expression
analysis
in
this
study.
There
were
only
122
shared
genes
among
the
methods.
The
most
similar
results
were
obtained
between
edgeR
and
baySeq,
which
shared
900
differen-
tially
expressed
genes.
In
contrast,
there
were
only
221
shared
genes
between
Cuffdiff
2
and
edgeR
and
only
145
shared
genes
between
Cuffdiff
2
and
baySeq.
A
total
of
5,519
transcripts
were
reported
as
differ-
entially
expressed,
out
of
which
4,214
were
annotated
genes
in
UMD
3.1
genome
assembly
and
were
subjected
to
a
pathway
enrichment
analysis
(Figure
3).
A
total
of
8, 8,
and
64
pathways
were
significantly
enriched
(P
<
0.05)
for
edgeR,
baySeq,
and
Cuffdiff
2,
respectively.
Table
4
presents
the
shared
enriched
pathways
when
studying
the
results
from
the
3
statistical
methods.
The
platelet
activation
(bta04611)
was
the
only
pathway
found
to
be
enriched
by
all
3
methods,
denoting
that
these
methods
resulted
in
different
sets
of
differentially
expressed
genes.
Four
genes
[Fc
fragment
of
IgE
recep-
tor
Ig
(FCER1G),
G
protein
subunit
a
i2
(GNAI2),
ORAI
calcium
release-activated
calcium
modulator
1
(ORAI1),
and
vasodilator-stimulated
phosphoprotein
(VASP)]
shared
among
the
3
methods
studied
were
enriched
in
the
pathway
platelet
activation;
therefore,
these
genes
should
be
indicated
for
future
validation.
Using
only
the
122
shared
differentially
expressed
genes
among
the
3
methods
(Figure
2)
could
reduce
the
false-positive
genes
in
the
further
analysis,
supported
by
evidence
of
true
differential
expression
in
the
tissue.
Table
3.
Total
and
shared
differentially
expressed
genes
for
the
statistical
contrasts
according
to
the
method
used
Method
2
Shared
Contrast'
edgeR
baySeq
Cuffdiff
2
genes
(no.)
Inoculated
TO
x
control
TO
0 0 0 0
Inoculated
T3
x
control
T3
1,756
1,161
3,389
122
'TO
and
T3:
at
time
0
and
3
h
after
inoculation,
respectively.
2
edgeR:
Robinson
et
al.
(2010a).
baySeq:
Hardcastle
and
Kelly
(2010).
Cuffdiff
2:
Trapnell
et
al.
(2013).
Journal
of
Dairy
Science
Vol.
102
No.
2,
2019
6
SBARDELLA
ET
AL.
Table
4.
Shared
enriched
pathways
(P
<
0.05)
associated
with
the
differentially
expressed
gene
identified
by
3
statistical
methods'
edgeR
and
baySeq
edgeR
and
Cuffdiff
2
baySeq
and
Cuffdiff
2
bta05012:
Parkinson's
disease
bta04611:
platelet
activation
bta04611:
platelet
activation
bta04310:
Wnt
signaling
pathway
bta05323:
rheumatoid
arthritis
bta04610:
complement
and
coagulation
cascades
bta04550:
signaling
pathways
regulating
pluripotency
of
stem
cells
bta04360:
axon
guidance
bta04611:
platelet
activation
'edgeR:
Robinson
et
al.
(2010a).
baySeq:
Hardcastle
and
Kelly
(2010).
Cuffdiff
2:
Trapnell
et
al.
(2013).
However,
the
genes
identified
by
one
method
or
another
are
not
always
false
positives
because
the
experimental
design
can
affect
the
effectiveness
of
the
methods.
Such
approach
would
reduce
the
number
of
genes
and
the
statistical
power
of
the
following
enrichment
analysis.
Thus,
using
only
the
shared
differentially
expressed
genes
is
not
of
interest
for
functional
enrichment
analy-
sis.
Instead,
the
shared
genes
could
be
used
for
valida-
tion
by
real-time
PCR.
CONCLUSIONS
The
results
from
the
evaluated
tools
were
different
given
their
specific
statistical
assumptions.
The
use
of
the
3
methods
for
the
detection
of
differentially
ex-
pressed
genes
helped
select
the
genes
to
be
indicated
for
validation.
From
the
genes
identified
by
the
over-
lapping
of
the
3
methods,
only
platelet
activation
pre-
sented
a
significantly
enriched
pathway.
We
indicated
the
FCERIG,
GNAI2,
ORAI1,
and
VASP
genes
shared
among
the
3
methods
in
this
pathway
for
posterior
bio-
logical
validation.
ACKNOWLEDGMENTS
The
authors
acknowledge
the
Brazilian
Agricultural
Research
Corporation—Embrapa
Dairy
Cattle
(Juiz
de
Fora,
Minas
Gerais,
Brazil)
for
providing
the
data
used
in
this
study
and
the
Bioinformatics
Multiuser
Laboratory
(LMB)—Embrapa
Agricultural
Informatics
edgeR
baySeq
edge
R
baySeq
753
778
4
i
t
ti
1
\
122
N
99
\
23
3
2
3145
Cuffdiff
2
Figure
2.
Venn
diagram
of
the
shared
differentially
expressed
genes
based
on
RNA
seqencing
data
through
3
statistical
methods:
edgeR
(Robinson
et
al.,
2010a),
baySeq
(Hardcastle
and
Kelly,
2010),
and
Cuffdiff
2
(Trapnell
et
al.,
2013).
58
Cuffdiff
2
Figure
3.
Venn
diagram
of
the
total
number
of
significantly
en-
riched
pathways
(P
<
0.05)
through
3
statistical
methods:
edgeR
(Robinson
et
al.,
2010a),
baySeq
(Hardcastle
and
Kelly,
2010),
and
Cuffdiff
2
(Trapnell
et
al.,
2013).
Journal
of
Dairy
Science
Vol.
102
No.
2,
2019
RNA
SEQUENCING
DIFFERENTIAL
GENE
EXPRESSION
ANALYSIS
7
(Campinas,
Sao
Paulo,
Brazil)
for
providing
compu-
tational
structure
for
data
analysis.
The
first
author
received
a
Coordination
for
the
Improvement
of
Higher
Education
Personnel
(Coordenaca
-
o
de
Aperfeicoa-
mento
de
Pessoal
de
Nivel
Superior,
CAPES,
Brazil)/
Embrapa
scholarship.
The
experiment
was
financially
supported
by
the
Foundation
for
Research
Support
of
the
State
of
Minas
Gerais
(Fundacao
de
Amparo
a
Pes-
quisa
do
Estado
de
Minas
Gerais,
FAPEMIG,
Minas
Gerais,
Brazil;
project
number
APQ-00095-15)
and
the
National
Council
for
Scientific
and
Technological
Development
(Conselho
Nacional
de
Desenvolvimento
Cientifico
e
TecnolOgico,
CNPq,
Brazil;
project
number
473414/2011-2).
Priscila
Arrigucci
Bernardes
received
a
scholarship
from
the
Sao
Paulo
Research
Foundation
(Fundaca
-
o
de
Amparo
a
Pesquisa
do
Estado
de
Sao
Paulo-FAPESP,
Sao
Paulo,
Brazil;
process
numbers
2016/22940-3
and
2015/25096-6).
REFERENCES
Afgan,
E.,
D.
Baker,
M.
Van Den
Beek,
D.
Blankenberg,
D.
Bouvier,
M.
tech,
J.
Chilton,
D.
Clements,
N.
Coraor,
C.
Eberhard,
B.
Griining,
A.
Guerler,
J.
Hillman-Jackson,
G.
Von
Kuster,
E.
Ra-
sche,
N.
Soranzo,
N.
Turaga,
J.
Taylor,
A.
Nekrutenko,
and
A.
Goecks.
2016.
The
Galaxy
platform
for
accessible,
reproducible
and
collaborative
biomedical
analyses:
2016
update.
Nucleic
Acids
Res.
44:W3-W10.
https://doi.org/10.1093/nar/gkw343.
Aken,
B.
L.,
S.
Ayling,
D.
Barrell,
L.
Clarke,
V.
Curwen,
S.
Fairley,
J.
F.
Banet,
K.
Billis,
C.
G.
Giron,
T.
Hourlier,
K.
Howe,
A.
Kahari,
F.
Kokocinski,
F.
J.
Martin,
D.
N.
Murphy,
R.
Nag,
M.
Ruffier,
M.
Schuster,
Y.
A.
Tang,
J.-H.
Vogel,
S.
White,
A.
Zadissa,
P.
Flicek,
and
S.
M.
J.
Searle.
2016.
The
Ensembl
gene
annotation
system.
Database
(Oxford)
2016:1-19.
https://doi.org/10.1093/database/
baw093.
Anders,
S.,
P.
T.
Pyl,
and
W.
Huber.
2015.
HTSeq-A
Python
frame-
work
to
work
with
high-throughput
sequencing
data.
Bioinformat-
ics
31:166-169.
Anders,
S.,
and
W.
Huber.
2010.
Differential
expression
analysis
for
sequence
count
data.
Genome
Biol.
11:R106.
Benjamini,
Y.,
and
Y.
Hochberg.
1995.
Controlling
the
false
discovery
rate:
A
practical
and
powerful
approach
to
multiple
testing.
J.
R.
Stat.
Soc.
Series
B
Stat.
Methodol.
57:289-300.
Bullard,
J.
H.,
E.
Purdom,
K.
D.
Hansen,
and
S.
Dudoit.
2010.
Evalu-
ation
of
statistical
methods
for
normalization
and
differential
ex-
pression
in
mRNA-Seq
experiments.
BMC
Bioinformatics
11:94.
https://doi.org/10.1186/1471-2105-11-94.
Camacho,
C.,
G.
Coulouris,
V.
Avagyan,
N.
Ma,
J.
Papadopoulos,
K.
Bealer,
and
T.
L.
Madden.
2009.
BLAST+:
Architecture
and
applications.
BMC
Bioinformatics
10:421-429.
https://doi.org/10
.1186/1471-2105-10-421.
Cui,
X.,
Y.
Hou,
S.
Yang,
Y.
Xie,
S.
Zhang,
Y.
Zhang,
Q.
Zhang,
X.
Lu,
G.
E.
Liu,
and
D.
Sun.
2014.
Transcriptional
profiling
of
mam-
mary
gland
in
Holstein
cows
with
extremely
different
milk
protein
and
fat
percentage
using
RNA
sequencing.
BMC
Genomics
15:226.
Dillies,
M.-A.,
A.
Rau,
J.
Aubert,
C.
Hennequet-Antier,
M.
Jeanmou-
gin,
N.
Servant,
C.
Keime,
G.
Marot,
D.
Castel,
J.
Estelle,
G.
Guernec,
B.
Jagla,
L.
Jouneau,
D.
Laloe,
C.
L.
Gall,
B.
Schaeffer,
S.
L.
Crom,
M.
Guedj,
and
F.
Jaffrezic.French
StatOmique
Con-
sortium.
2013.
A
comprehensive
evaluation
of
normalization
meth-
ods
for
Illumina
high-throughput
RNA
sequencing
data
analysis.
Brief.
Bioinform.
14:671-683.
https://doi.org/10.1093/bib/bbs046.
Hair,
J.
F.,
W.
C.
Black,
B.
J.
Badin,
R.
E.
Anderson,
and
R.
L.
Tatham.
2009.
Analise
Multivariada
de
Dados.
6th
ed.
Artmed
Editora
S.
A.,
Porto
Alegre,
RS,
Brazil.
Hardcastle,
T.
J.,
and
K.
A.
Kelly.
2010.
baySeq:
Empirical
Bayesian
methods
for
identifying
differential
expression
in
sequence
count
data.
BMC
Bioinformatics
11:422.
Huang,
W.,
B.
T.
Sherman,
and
R.
A.
Lempicki.
2009a.
Systematic
and
integrative
analysis
of
large
gene
lists
using
DAVID
Bioinfor-
matics
Resources.
Nat.
Protoc.
4:44-57.
https://doi.org/10.1038/
nprot.2008.211.
Huang,
W.,
B.
T.
Sherman,
and
R.
A.
Lempicki.
2009b.
Bioinformat-
ics
enrichment
tools:
Paths
toward
the
comprehensive
functional
analysis
of
large
gene
lists.
Nucleic
Acids
Res.
37:1-13.
https://doi
.org/10.1093/nar/gkn923.
Law,
C.
W.,
Y.
Chen,
W.
Shi,
and
G.
K.
Smyth.
2014.
Voom:
Preci-
sion
weights
unlock
linear
model
analysis
tools
for
RNA-seq
read
counts.
Genome
Biol.
15:R29.
Liu,
J.
S.,
and
T.
Logvinenko.
2003.
Bayesian
methods
in
biological
sequence
analysis.
Pages
66-93
in
Handbook
of
Statistical
Genet-
ics.
2nd
ed.
Wiley,
Chichester,
UK.
Maza,
E.
2016.
In
papyro
comparison
of
TMM
(edgeR),
RLE
(DE-
Seq2),
and
MRN
normalization
methods
for
a
simple
two-con-
ditions-without-replicates
RNA-Seq
experimental
design.
Front.
Genet.
7:164.
https://doi.org/10.3389/fgene.2016.00164.
Maza,
E.,
P.
Frasse,
P.
Senin,
M.
Bouzayen,
and
M.
Zouine.
2013.
Comparison
of
normalization
methods
for
differential
gene
expres-
sion
analysis
in
RNA-Seq
experiments.
Commun.
Integr.
Biol.
6:e25849.
https://doi.org/10.4161/cib.25849.
Rapaport,
F.,
R.
Khanin,
Y.
Liang,
M.
Pirum,
A.
Krek,
P.
Zumbo,
C.
E.
Mason,
N.
D.
Socci,
and
D.
Betel.
2013.
Comprehensive
evalu-
ation
of
differential
gene
expression
analysis
methods
for
RNA-seq
data.
Genome
Biol.
14:R95.
Robinson,
M.
D.,
D.
J.
McCarthy,
and
G.
K.
Smyth.
2010a.
edgeR:
A
Bioconductor
package
for
differential
expression
analysis
of
digital
gene
expression
data.
Bioinformatics
26:139-140.
Robinson,
M.
D.,
and
A.
Oshlack.
2010b.
A
scaling
normalization
method
for
differential
expression
analysis
of
RNA-seq
data.
Ge-
nome
Biol.
11:R25.
Robinson,
M.
D.,
and
G.
K.
Smyth.
2007.
Moderated
statistical
tests
for
assessing
differences
in
tag
abundance.
Bioinformatics
23:2881-
2887.
Robinson,
M.
D.,
and
G.
K.
Smyth.
2008.
Small-sample
estimation
of
negative
binomial
dispersion,
with
applications
to
SAGE
data.
Biostatistics
9:321-332.
Seyednasrollah,
F.,
A.
Laiho,
and
L.
L.
Elo.
2015.
Comparison
of
soft-
ware
packages
for
detecting
differential
expression
in
RNA-seq
studies.
Brief.
Bioinform.
16:59-70.
https://doi.org/10.1093/bib/
bbt086.
Soneson,
C.,
and
M.
Delorenzi.
2013.
A
comparison
of
methods
for
differential
expression
analysis
of
RNA-seq
data.
BMC
Bioinfor-
matics
14:91.
Trapnell,
C.,
D.
G.
Hendrickson,
M.
Sauvageau,
L.
Goff,
J.
L.
Rinn,
and
L.
Pachter.
2013.
Differential
analysis
of
gene
regulation
at
transcript
resolution
with
RNA-seq.
Nat.
Biotechnol.
31:46-53.
Trapnell,
C.,
L.
Pachter,
and
S.
L.
Salzberg.
2009.
TopHat:
Discover-
ing
splice
junctions
with
RNA-Seq.
Bioinformatics
25:1105-1111.
Trapnell,
C.,
B.
A.
Williams,
G.
Pertea,
A.
Mortazavi,
G.
Kwan,
M.
J.
V.
Boren,
S.
L.
Salzberg,
B.
J.
Wold,
and
L.
Pachter.
2010.
Transcript
assembly
and
quantification
by
RNA-Seq
reveals
unan-
notated
transcripts
and
isoform
switching
during
cell
differentia-
tion.
Nat.
Biotechnol.
28:511-515.
Wang,
Z.,
M.
Gerstein,
and
M.
Snyder.
2009.
RNA-Seq:
A
revolution-
ary
tool
for
transcriptomics.
Nat.
Rev.
Genet.
10:57-63.
Journal
of
Dairy
Science
Vol.
102
No.
2,
2019