edgeRDESeq2分析RNAseq差异表达Word下载.docx
《edgeRDESeq2分析RNAseq差异表达Word下载.docx》由会员分享,可在线阅读,更多相关《edgeRDESeq2分析RNAseq差异表达Word下载.docx(15页珍藏版)》请在冰点文库上搜索。
CC_1"
CC_2"
CC_3"
names(mydata)[7:
12]<
-sampleNames
head(mydata)
GeneidChrStartEndStrandLengthCA_1CA_2CA_3CC_1CC_2CC_3
1gene1314NW_139421.112571745+489000000
2gene1315NW_139421.121153452+1338000000
3gene1316NW_139421.138564680+825000000
4gene1317NW_139421.148665435-570000000
5gene1318NW_139421.160666836-771000000
6gene1319NW_139421.172949483+2190000000
∙在这里我们只是需要Geneid和后6列的样本的count信息来组成矩阵,所以要处理下
countMatrix<
-as.matrix(mydata[7:
12])
rownames(countMatrix)<
-mydata$Geneid
head(countMatrix)
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
gene1319000000
*要导入的矩阵由3v3样本组成(三组生物学重复)
创建DEGlist
group<
-factor(c("
CA"
CC"
))
y<
-DGEList(counts=countMatrix,group=group)
y
Anobjectofclass"
DGEList"
$counts
14212morerows...
$samples
grouplib.sizenorm.factors
CA_1CA_117885371
CA_2CA_218255461
CA_3CA_319030171
CC_1CC_118260421
CC_2CC_221244681
CC_3CC_320250631
过滤
∙过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用,过滤又两点好处:
1可以减少内存的压力2可以减少计算的压力
keep<
-rowSums(cpm(y)>
1)>
=2
-y[keep,,keep.lib.sizes=FALSE]
gene1321161138129218194220
gene1322231133
gene1323202733475146
gene132460877986100132
gene1325322921587556
3877morerows...
CA_1CA_117883621
CA_2CA_218253081
CA_3CA_319027961
CC_1CC_118258891
CC_2CC_221241551
CC_3CC_320247861
标准化处理
∙edgeR采用的是TMM方法进行标准化处理,只有标准化处理后的数据才又可比性
-calcNormFactors(y)
CA_1CA_117883620.9553769
CA_2CA_218253080.9052539
CA_3CA_319027960.9686232
CC_1CC_118258890.9923455
CC_2CC_221241551.1275178
CC_3CC_320247861.0668754
设计矩阵
∙为什么要一个设计矩阵呢,道理很简单,有了一个设计矩阵才能够更好的分组分析
subGroup<
-factor(substring(colnames(countMatrix),4,4))
design<
-model.matrix(~subGroup+group)
rownames(design)<
-colnames(y)
design
(Intercept)subGroup2subGroup3groupCC
CA_11000
CA_21100
CA_31010
CC_11001
CC_21101
CC_31011
attr(,"
assign"
[1]0112
contrasts"
)$subGroup
[1]"
contr.treatment"
)$group
评估离散度
-estimateDisp(y,design,robust=TRUE)
y$common.dispersion
[1]0.02683622
#plot
plotBCV(y)
差异表达基因
fit<
-glmQLFit(y,design,robust=TRUE)
qlf<
-glmQLFTest(fit)
topTags(qlf)
Coefficient:
groupCC
logFClogCPMFPValueFDR
gene7024-5.5156489.612809594.92326.431484e-442.496702e-40
gene66125.1302828.451143468.20601.557517e-393.023140e-36
gene27434.3774925.586773208.02683.488383e-264.513967e-23
gene120324.7343835.098148192.93784.359649e-254.231040e-22
gene491-2.73391010.412673190.98396.104188e-254.739291e-22
gene89412.9971856.839106177.76146.332836e-244.097345e-21
gene2611-2.8469247.216173174.73321.099339e-236.096619e-21
gene62422.5291259.897771169.26583.022914e-231.466869e-20
gene72523.7323156.137670188.20943.890569e-231.678132e-20
gene61252.8754236.569935160.31891.656083e-226.428914e-20
查看差异表达基因原始的CMP
top<
-rownames(topTags(qlf))
cpm(y)[top,]
gene70241711.3830021405.8618991480.12111533.1141837.1604029.62696
gene661217.55864912.10384826.585753403.99298582.457961044.35046
gene27434.6823061.8155775.96823062.9169487.26431114.34156
gene120321.7558652.4207702.71283265.6764647.5987275.45617
gene4912811.1397272059.4696692222.351938444.83381385.38258253.68087
gene894123.99682024.81288824.415488131.35291244.67410225.90560
gene2611245.821088310.463691225.16505243.0484326.3045539.81123
gene6242231.188880299.570228298.4115151348.298991343.619882191.93237
gene72529.36461313.3142325.42566492.71970108.55847181.92807
gene612523.41153214.52461729.841152145.70239160.75005185.16852
查看上调和下调基因的数目
summary(dt<
-decideTestsDGE(qlf))
[,1]
-1536
02793
1553
挑选出差异表达基因的名字
isDE<
-as.logical(dt)
DEnames<
-rownames(y)[isDE]
head(DEnames)
gene1325"
"
gene1326"
gene1327"
gene1331"
gene1340"
gene1343"
差异表达基因画图
plotSmear(qlf,de.tags=DEnames)
abline(h=c(-1,1),col="
blue"
DESeq2包的安装
##tryhttp:
DESeq2"
∙导入count矩阵,导入数据的方式很多这里直接导入count矩阵
library(DESeq2)
table2<
-data.frame(name=c("
),condition=("
rownames(table2)<
∙把count矩阵转化为DESeq2 的数据格式
dds<
-DESeqDataSetFromMatrix(countMatrix,colData=table2,design=~condition)
dds
class:
DESeqDataSet
dim:
142176
metadata(0):
assays
(1):
counts
rownames(14217):
gene1314gene1315...gene6710gene6709
rowRangesmetadatacolumnnames(0):
colnames(6):
CA_1CA_2...CC_2CC_3
colDatanames
(2):
namecondition
∙过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用
-dds[rowSums(counts(dds))>
1,]
dds
41906
rownames(4190):
gene1321gene1322...gene6712gene6710
PCA分析
rld<
-rlog(dds)
plotPCA(rld,intgroup=c("
name"
condition"
∙当然也可以使用ggplot2 来画 PCA图
library(ggplot2)
data<
-plotPCA(rld,intgroup=c("
"
),returnData=TRUE)
percentVar<
-round(100*attr(data,"
percentVar"
p<
-ggplot(data,aes(PC1,PC2,color=condition,shape=name))+
geom_point(size=3)+
xlab(paste0("
PC1:
percentVar[1],"
%variance"
))+
ylab(paste0("
PC2:
percentVar[2],"
p
∙注意在进行PCA分析前不要
library(DESeq)
否则无法进行PCA分析
差异表达基因分析
分析结果输出
library(DESeq)
-DESeq(dds)
res<
-results(dds)
write.table(res,"
result.csv"
sep="
row.names=TRUE)
head(res)
log2foldchange(MAP):
conditionCCvsCA
Waldtestp-value:
DataFramewith6rowsand6columns
baseMeanlog2FoldChangelfcSEstatpvalue
<
numeric>
gene1321173.2886810.262679590.20499831.28137422.000623e-01
gene13222.118367-0.052379520.4989589-0.10497769.163936e-01
gene132335.9737010.500545800.30380961.64756419.944215e-02
gene132488.4216610.176776050.24027270.73573094.618945e-01
gene132543.0018280.811431040.29193962.77944865.445127e-03
gene1326662.136259-1.053561050.1752230-6.01268801.824720e-09
padj
gene13213.790396e-01
gene13229.559679e-01
gene13232.337858e-01
gene13246.565731e-01
gene13252.447141e-02
gene13264.520861e-08
∙注:
(1)rownames:
基因ID
(2)baseMean:
所有样本矫正后的平均reads数(3)log2FoldChange:
取log2后的表达量差异(4)pvalue:
统计学差异显著性检验指标(5)padj:
校正后的pvalue,padj越小,表示基因表达差异越显著
∙summary 查看整体分析结果
summary(res)
outof4190withnonzerototalreadcount
adjustedp-value<
0.1
LFC>
0(up):
595,14%
LFC<
0(down):
644,15%
outliers[1]:
0,0%
lowcounts[2]:
325,7.8%
(meancount<
1)
[1]see'
cooksCutoff'
argumentof?
results
[2]see'
independentFiltering'
MA图
library(geneplotter)
plotMA(res,main="
ylim=c(-2,2))
Heatmap图
sum(res$padj<
0.1,na.rm=TRUE)
library("
pheatmap"
select<
-order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:
1000]
nt<
-normTransform(dds)#defaultstolog2(x+1)
log2.norm.counts<
-assay(nt)[select,]
df<
-as.data.frame(colData(dds)[,c("
)])
pdf('
heatmap1000.pdf'
width=6,height=7)
pheatmap(log2.norm.counts,cluster_rows=TRUE,show_rownames=FALSE,
cluster_cols=TRUE,annotation_col=df)
dev.off()