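Data Analysis for the Life Sciences is the textbook for Harvard's PH525x series, Biomedical Data Science. The course uses R throughout, both for teaching statistical theory and for hands-on practice. The book is written in R Markdown, which keeps it easy to read while making every analysis reproducible, in line with current expectations for reproducible computation in science. Readers can systematically learn the statistics a biologist needs, implement the analyses in code themselves, and work toward the reproducible-code standards expected for publication in Cell, Nature, and Science (CNS).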
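Traditional statistics texts focus on mathematical principles. This book instead focuses on carrying out data analysis on a computer. It explains the mathematics through worked examples and provides code so that readers can run each analysis themselves. The whole text is written in R Markdown, so readers can reproduce every analysis from start to finish.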
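About the authors:
Rafael A Irizarry is a professor of biostatistics and computational biology at the Harvard School of Public Health and the Dana-Farber Cancer Institute, with 17 years of experience analyzing genomic data.
Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he develops open-source statistical software for Bioconductor.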
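Course source code: https://github.com/genomicsclass/labs (all course source code, test data, and results)
Web version: https://genomicsclass.github.io/book/ (rendered pages for each Rmd file, plus per-section navigation and download links for the Rmd sources)
E-book: https://leanpub.com/dataanalysisforthelifesciences/ (downloadable in several formats, convenient for reading on mobile devices)
Interestingly, you can choose to get the book for free or pay the authors up to $80.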
Course outline
https://genomicsclass.github.io/book/
PH525x series – Biomedical Data Science
Links and resources
- R markdown source files
- ePub version on Leanpub
- Links to the HarvardX class pages
- External resources and books
- Finding more help for data analysis
Chapter 0 – Introduction
- Introduction [Rmd]
- Getting started [Rmd]
- Getting started exercises
- dplyr introduction [Rmd]
- dplyr introduction exercises
- Mathematical notation [Rmd]
Chapter 1 – Inference
- Random variables [Rmd]
- Random variables exercises
- Populations and samples [Rmd]
- Populations and samples exercises
- CLT and t-distribution [Rmd]
- CLT and t-distribution exercises
- CLT in practice [Rmd]
- CLT in practice exercises
- t-test in practice [Rmd]
- Confidence intervals [Rmd]
- Power calculations [Rmd]
- Power calculations exercises
- Monte carlo [Rmd]
- Monte carlo exercises
- Permutation tests [Rmd]
- Permutation tests exercises
- Association tests [Rmd]
- Association tests exercises
Chapter 2 – Exploratory Data Analysis
- Exploratory data analysis [Rmd]
- Plots to avoid [Rmd]
- Exploratory data analysis exercises
Chapter 3 – Robust Statistics
- Robust summaries [Rmd]
- Rank tests [Rmd]
- Robust summaries exercises
Chapter 4 – Matrix Algebra
- Introduction to using regression [Rmd]
- Introduction to using regression exercises
- Matrix notation [Rmd]
- Matrix notation exercises
- Matrix operations [Rmd]
- Matrix operations exercises
- Matrix algebra examples [Rmd]
- Matrix algebra examples exercises
Chapter 5 – Linear Models
- Linear models introduction [Rmd]
- Linear models introduction exercises
- Expressing design formula [Rmd]
- Expressing design formula exercises
- Linear models in practice [Rmd]
- Linear models in practice exercises
- Standard errors [Rmd]
- Standard errors exercises
- Interactions and contrasts [Rmd]
- Interactions and contrasts exercises
- Collinearity [Rmd]
- Collinearity exercises
- QR and regression [Rmd]
- Linear models going further [Rmd]
Chapter 6 – Inference for High-Dimensional Data
- Introduction to high-throughput data [Rmd]
- Introduction to high-throughput data exercises
- Inference for high-throughput data [Rmd]
- Inference for high-throughput data exercises
- Multiple testing [Rmd]
- Multiple testing exercises
- EDA for high-throughput data [Rmd]
- EDA for high-throughput data exercises
Chapter 7 – Statistical Modeling
- Modeling [Rmd]
- Modeling exercises
- Bayes theorem [Rmd]
- Bayes theorem exercises
- Hierarchical models [Rmd]
- Hierarchical models exercises
Chapter 8 – Distance and Dimension Reduction
- Distance [Rmd]
- Distance exercises
- PCA motivation [Rmd]
- SVD [Rmd]
- SVD exercises
- Projections [Rmd]
- Rotations [Rmd]
- MDS [Rmd]
- MDS exercises
- PCA [Rmd]
Chapter 9 – Practical Machine Learning
- Clustering and heatmaps [Rmd]
- Clustering and heatmaps exercises
- Conditional expectation [Rmd]
- Conditional expectation exercises
- Smoothing [Rmd]
- Smoothing exercises
- Machine learning [Rmd]
- Crossvalidation [Rmd]
- Crossvalidation exercises
Chapter 10 – Batch Effects
- Introduction to batch effects [Rmd]
- Confounding [Rmd]
- Confounding exercises
- EDA with PCA [Rmd]
- EDA with PCA exercises
- Adjusting with linear models [Rmd]
- Adjusting with linear models exercises
- Factor analysis [Rmd]
- Factor analysis exercises
- Adjusting with factor analysis [Rmd]
- Adjusting with factor analysis exercises
Chapter 11 – Introduction to Bioconductor
- Mike Love’s general reference card
- Motivations and core values (optional)
- Installing Bioconductor and finding help [Rmd]
- Data structure and management for genome scale experiments [Rmd]
- Coordinating multiple tables: ExpressionSet
- Institutional archives: GEO, ArrayExpress
- Interlude: Working with general genomic features using GenomicRanges
- IRanges introduced
- Intra-range operations
- Inter-range operations
- GRanges
- Calculating overlaps
- Range-oriented solutions for current experimental paradigms
- SummarizedExperiment: for RNA-seq and 450k methylation
- External storage for very large assays
- GenomicFiles for families of BAM or BED
- DNA Variants: VCF handling with VariantAnnotation and VariantTools
- Handling multiomic archives like TCGA
- Cloud-oriented solutions: e.g., Google BigQuery
- Short read mapping/alignment software (optional) [Rmd]
Chapter 12 – Genomic Annotation with Bioconductor
- More details on GRanges [Rmd]
- Run-length encoding, views
- Application to genomic landmarks
- Application to 450k methylation array visualization
- General overview of Bioconductor annotation [Rmd]
- Levels: reference sequence, regions of interest, pathways
- Discovering reference sequence
- A build of the human genome
- Gene/Transcript/Exon catalogs from UCSC and Ensembl
- Importing and exporting regions and scores
- AnnotationHub: brokering thousands of annotation resources
- OrgDb: simple interface to annotation databases
- Finding and managing gene sets
- OrganismDb: unifying diverse annotation
- Cheat sheet on Bioconductor annotation [Rmd]
- Translating addresses between genome builds: liftOver [Rmd]
Chapter 13 – Genome-scale hypothesis testing with Bioconductor
- Distinguishing biological and technical variability [Rmd]
- An experiment with pooled and individual samples
- Measuring technical variation
- Measuring biological variation
- Interpretation
- Multiple comparisons with genewise t-tests [Rmd]
- Gene-wise testing
- Naive enumeration of genes
- Demonstrating danger of multiple testing with a set of sham comparisons
- Adjusting for multiplicity with qvalue
- Adjusted counts in the sham case
- Moderated t tests via limma [Rmd]
- A spike-in dataset
- Naive t-tests
- Three steps with limma: lmFit, eBayes, topTable
- Exposing the spiked-in genes
- A view of the shrinkage of variance estimates
- Introducing gene sets and gene set analysis [Rmd]
- Identifier remapping
- Categorical testing
- Statistical summaries for sets: Wilcoxon
- Statistical summaries for sets: t statistics
- A dataset for comparing expression by gender
- Finding surrogate variables/batch effect correction
- Data wrangling
- The Broad Institute MsigDb
- Adjusting for within-set correlation
- A permutation procedure
Chapter 14 – Visualization of genome scale data
- A basic overview of visualization tasks and strategies [Rmd]
- Gene models
- Gene models plus data
- Driving visualizations with functions
- Using the browser to drive visualization functions via shiny
- Queriable dynamic displays with plotly
- Annotation-oriented visualizations
- Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]
- Plotting data in the context of genomic features with Gviz [Rmd]
- Visualizing NGS data [Rmd]
- Interactive visualization
- Graphical user interfaces for multivariate data with shiny [Rmd]
- Clustering gene expression data with shiny [Rmd]
- Final remarks on visualization [Rmd]
Chapter 15 – Pursuing scalability in genomic analysis: parallelism and out-of-memory data
- Parallel computing with R and Bioconductor [Rmd]
- Demonstrating simple speedup in multicore environments
- Implicit parallelism with BiocParallel and GenomicAlignments
- External data: data interfaces that spare RAM [Rmd]
- SQLite for annotation
- Tabix-indexed BAM
- HDF5
- An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]
- Benchmarking various out-of-memory solutions [Rmd]
- Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]
- Sharded GRanges for scalable integrative analysis [Rmd]
Chapter 16 – Multi-omic data integration
- Basic examples of multi-omic integration [Rmd]
- Transcription factor (TF) binding and gene coexpression in yeast
- TF binding and GWAS hits in humans
- Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]
- Associating tumor stage with expression patterns
- Linking DNA methylation with expression patterns
- Defining a severity marker
- Extracting survival times
- Basic data acquisition
- Working with clinical data
- Working with mutations
- Curation tasks for discrepant identifier formats
- Working with expression data
- Application to visualization: kataegis and rainfall plot [Rmd]
Chapter 17 – Fostering reproducible genome-scale analysis
- Overview of unit on reproducibility [Rmd]
- Basic definitions
- Infrastructure requirements
- Statistical aspects of reproducibility
- Analysis of reproducibility probability (Boos and Stefanski 2011)
- Costs of highly reproducible designs
- Package structure, creation, installation, management [Rmd]
- create() to set up folders and DESCRIPTION
- Composing documentation plus code
- document(), install()
- What is a package?
- Using package.skeleton
- Using makeOrganismPackage
- Using devtools
- Conclusions, including a link to a recent Nature Toolbox article on Bioconductor
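How to study
We chose to read the web version online and practice alongside it with the source code.
Work through https://genomicsclass.github.io/book/ section by section. There is a lot of material, so readers can simply pick the chapters that suit them.
Every hands-on section comes with Rmd source code; download it and open it in a local RStudio.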
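For readers who prefer the command line to RStudio, a downloaded Rmd file can also be rendered with the rmarkdown package. The following is only a minimal sketch and not part of the course material: the helper name `render_rmd` is hypothetical, it assumes R and the rmarkdown package are already installed, and a given section will only render if the R packages that section loads are installed as well.

```shell
#!/bin/sh
# render_rmd FILE: render one R Markdown file to its default output format.
# Hypothetical helper; assumes Rscript and the rmarkdown package are available.
render_rmd() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 2
    fi
    # The path is passed as a trailing argument, so it needs no quoting inside the R code.
    Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$file"
}

# Usage (replace the argument with the path of a section you downloaded):
#   render_rmd some_section.Rmd
```

The function checks for the file first and for `Rscript` second, so a wrong path is reported before any R-related problem.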
Batch-downloading all resources
Windows download: https://github.com/genomicsclass/labs/archive/master.zip
On Linux, download with git or wget
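# Method 1: unzipping produces the labs-master directory
wget -c https://github.com/genomicsclass/labs/archive/master.zip
unzip master.zip
# Method 2: clones into the labs directory (the HTTPS URL works without SSH keys set up on GitHub)
git clone https://github.com/genomicsclass/labs.git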
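After either download method finishes, it can be worth checking that the Rmd sources actually arrived. The sketch below is an illustration rather than anything from the course itself: the helper name `count_rmd` is hypothetical, and it assumes the download ended up in `labs-master` (wget and unzip) or `labs` (git clone) under the current directory.

```shell
#!/bin/sh
# count_rmd DIR: print how many .Rmd files exist anywhere under DIR.
# Hypothetical helper, not part of the course repository.
count_rmd() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "directory not found: $dir" >&2
        return 1
    fi
    # find prints one path per line; wc -l counts them; tr strips padding some wc versions add.
    find "$dir" -type f -name '*.Rmd' | wc -l | tr -d ' '
}

# Report on whichever download directory is present.
for d in labs-master labs; do
    if [ -d "$d" ]; then
        echo "$d: $(count_rmd "$d") Rmd files"
    fi
done
```

A count of zero, or no output at all, suggests the archive was not unpacked or the clone did not complete.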