
In SCI papers we often see figures like the one below, used to present somatic mutation data.

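This figure is called a waterfall plot. It shows the types of mutation found in each sample, including missense mutations, frameshift mutations, nonsense mutations, insertions and deletions, and so on. To draw it, we first have to prepare the data. Today we will walk through how to download somatic mutation data from the TCGA database.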

1. Open the TCGA website and enter the tumor type you want to download

https://pic1.zhimg.com/v2-bd86495635c42fda94345d8e4ad7d250_r.jpg

2. Click the number 51 next to WXS

https://pic1.zhimg.com/v2-16dc8b29d8e6c28d27925d4c42e1d300_r.jpg

3. Click File in the upper-left corner

https://pic3.zhimg.com/v2-ee1bce705e92785b9f710d19a965755a_r.jpg

4. Select WXS, Masked Somatic Mutation, maf, simple nucleotide variation, and Aliquot Ensemble Somatic Variant Merging and Masking, then click Add All Files to Cart

https://pic2.zhimg.com/v2-cf3643d65abdee53768f149ef0f3c209_r.jpg

5. The 51 files are now in the cart in the upper-right corner

https://pic4.zhimg.com/v2-1c658e447c2c489bf04440815ce743db_r.jpg

6. Open the Download drop-down and choose Cart

https://pic1.zhimg.com/v2-28a39ab7aacd1a2eeebcfef35eb7ee9c_r.jpg

This gives you a file named gdc_download_****.tar.gz.

7. Extract the archive
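If you prefer to stay inside R, the archive can also be extracted with the base `untar()` function. Below is a minimal sketch; `extract_cart`, `archive` and `out_dir` are names made up for illustration, and the real archive name (with its timestamp) has to be filled in by hand.

```r
# Hypothetical helper: extract a downloaded cart archive and
# list the MAF files found inside it.
extract_cart <- function(archive, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  utils::untar(archive, exdir = out_dir)
  # pattern is a regular expression; keep only files ending in .maf.gz
  list.files(out_dir, pattern = "\\.maf\\.gz$",
             recursive = TRUE, full.names = TRUE)
}

# Usage (replace the placeholder with the actual file name):
# maf_files <- extract_cart("gdc_download_****.tar.gz", "gdc_download")
```

Each MAF file sits in its own subfolder of the archive, which is why the listing is recursive.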

8. Merge all the data

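library(maftools)  # read.maf(), oncoplot(), tmb()
library(dplyr)     # %>%, mutate(), arrange()

# Change this to the folder where you extracted the archive
setwd("G:\\test\\gdc_download_20221025_103238.659115")
# pattern is a regular expression, not a glob: match the per-sample *.maf.gz files
files <- list.files(pattern = "\\.maf\\.gz$", recursive = TRUE)
# Read each MAF (skipping the 7 comment lines at the top; adjust if your files differ)
# and stack them all into a single data frame
all_mut <- do.call(rbind, lapply(files, function(file) {
  read.delim(file, skip = 7, header = TRUE, fill = TRUE, sep = "\t")
}))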
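
The number of comment lines at the top of a MAF file may differ between data releases, so a hard-coded `skip` value can silently misread the header. One way around this, sketched below under the assumption that the column header row begins with `Hugo_Symbol` (the first column of a standard MAF), is to locate that row first. `read_one_maf` is a made-up helper name, not part of maftools.

```r
# Hypothetical helper: read one MAF file without hard-coding the number
# of comment lines, by finding the header row that starts with "Hugo_Symbol".
read_one_maf <- function(path) {
  lines <- readLines(path)  # gzip-compressed files are read transparently
  header_row <- grep("^Hugo_Symbol\t", lines)[1]
  if (is.na(header_row)) stop("No Hugo_Symbol header found in ", path)
  read.delim(text = lines[header_row:length(lines)], sep = "\t",
             header = TRUE, quote = "", stringsAsFactors = FALSE)
}

# Usage with the `files` vector from this step:
# all_mut <- do.call(rbind, lapply(files, read_one_maf))
```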

9. Organize the data

all_mut <- read.maf(all_mut)

# Keep gene, mutation type and sample; trim barcodes to the 12-character patient ID
a <- all_mut@data %>%
  .[,c("Hugo_Symbol","Variant_Classification","Tumor_Sample_Barcode")] %>%
  as.data.frame() %>%
  mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 12))

gene <- as.character(unique(a$Hugo_Symbol))
sample <- as.character(unique(a$Tumor_Sample_Barcode))

mat <- as.data.frame(matrix("",length(gene),length(sample),
                            dimnames = list(gene,sample)))
mat_0_1 <- as.data.frame(matrix(0,length(gene),length(sample),
                                dimnames = list(gene,sample)))

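# Fill both matrices in one pass: mutation type in mat, presence flag (1) in mat_0_1.
# If a gene has several mutations in one sample, the last record wins in mat.
for (i in seq_len(nrow(a))) {
  g <- as.character(a[i, "Hugo_Symbol"])
  s <- as.character(a[i, "Tumor_Sample_Barcode"])
  mat[g, s] <- as.character(a[i, "Variant_Classification"])
  mat_0_1[g, s] <- 1
}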

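# Summarize and rank mutation counts across all samples
gene_count <- data.frame(gene=rownames(mat_0_1),
                         count=as.numeric(apply(mat_0_1,1,sum))) %>%
  arrange(desc(count))
gene_top <- gene_count$gene[1:20] # change the number to set how many top genes to keep, or pick your own genes of interest
## Save the results
save(mat,mat_0_1,file = "TMB.rda") ## save as RData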
write.csv(mat,"all_mut_type.csv")
write.csv(mat_0_1,"all_mut_01.csv")
mat

mat_0_1
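
The row-by-row loops above are easy to follow but slow down when there are many mutation records. As an alternative, the 0/1 matrix can be built in one step by cross-tabulating genes against samples. The sketch below uses a tiny made-up table, `a_demo`, in place of the real `a`; all names ending in `_demo` are illustrative only.

```r
# Toy stand-in for the data frame `a` (values are made up)
a_demo <- data.frame(
  Hugo_Symbol = c("TP53", "TP53", "KRAS", "TP53"),
  Variant_Classification = c("Missense_Mutation", "Nonsense_Mutation",
                             "Missense_Mutation", "Missense_Mutation"),
  Tumor_Sample_Barcode = c("TCGA-AA-0001", "TCGA-AA-0002",
                           "TCGA-AA-0001", "TCGA-AA-0001"),
  stringsAsFactors = FALSE
)

# Cross-tabulate gene x sample, then collapse counts to presence/absence
counts <- table(a_demo$Hugo_Symbol, a_demo$Tumor_Sample_Barcode)
mat_0_1_demo <- (unclass(counts) > 0) * 1

# Number of mutated samples per gene, most frequent first
gene_count_demo <- sort(rowSums(mat_0_1_demo), decreasing = TRUE)
```

On the real data you would substitute `a` for `a_demo`. The result is a numeric matrix rather than a data frame, so wrap it in `as.data.frame()` if later code expects one.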

10. Draw the waterfall plot (oncoplot)

oncoplot(maf = all_mut,
         top = 30, # show the 30 most frequently mutated genes
         fontSize = 0.6, # font size
         showTumorSampleBarcodes = FALSE) # hide the sample barcodes

11. Calculate TMB

tmb_table = tmb(maf = all_mut)   # by default the plot uses log10-transformed TMB
tmb_table = tmb(maf = all_mut,logScale = FALSE)   # no log transformation
write.csv(tmb_table,"tmb_results.csv")
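
To make the TMB numbers easier to interpret: as far as we understand it, `tmb()` reports mutations per megabase, that is, each sample's mutation count divided by the capture size in Mb (the `captureSize` argument, which we believe defaults to 50). The base-R sketch below reproduces that arithmetic on made-up barcodes. It is an illustration only, not the maftools implementation, and the column names are assumptions modeled on the `tmb()` output.

```r
# Made-up sample barcodes, one entry per mutation record
barcodes <- c("TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002")
capture_size_mb <- 50  # assumed capture size in Mb

# Count mutations per sample, then divide by the capture size
mut_counts <- table(barcodes)
tmb_demo <- data.frame(
  Tumor_Sample_Barcode = names(mut_counts),
  total = as.integer(mut_counts),
  total_perMB = as.integer(mut_counts) / capture_size_mb
)
tmb_demo$total_perMB_log <- log10(tmb_demo$total_perMB)
```

If your sequencing kit covers a different region size, pass the matching value to `captureSize` so that TMB values are comparable across studies.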
