BIOS 555: High-throughput data analysis using R and Bioconductor



Class Information


Summary

This course covers the basics of analyzing genomic data from high-throughput technologies, mainly microarrays and second-generation sequencing. Topics include the biological motivations, experimental procedures, and statistical methods for the different technologies. The use of existing software packages (mainly in R/Bioconductor) for analyzing various types of genomic data will be introduced.

This class emphasizes applications over statistical theory. Upon completion of the class, students are expected to be able to:

  1. Understand the biological motivations and technological procedures of high-throughput experiments, including different types of microarrays and second-generation sequencing.
  2. Understand statistical challenges and existing methods for analyzing the data generated from high-throughput experiments.
  3. Analyze high-throughput data using R/Bioconductor and other open source software.

Prerequisites: BIOS 501 or equivalent. Basic programming experience in R. Some experience with a command-line-driven OS (DOS or Linux).

Grading: Four sets of homework, each worth 15%. Final project 30%. Class participation 10%.

Lab: Bring a laptop to the labs.


Reading Materials:

Here are some reading materials related to the class. You are NOT required to read them all, but they will help you better understand certain topics.

  1. R programming: R for beginners is highly recommended.
  2. Bioconductor: several books are listed on the Bioconductor website, but they are fairly outdated. The best way to learn is to read the manual or "vignette" for each package.
  3. Important papers:

Final project:

The final project could be (but is not limited to) exploratory analysis, statistical modeling, or analytical software development for any type of genomic data. Some project ideas include:

Talk to the instructor and/or TA for ideas or for help obtaining a suitable dataset.

Students need to submit a short report for the final project, along with any related programs. The report should NOT exceed 6 pages (single-spaced, 11-point font, 1-inch margins), including figures. There is no minimum page requirement.


Class schedule

| Date | Lecture Title | Description | Homework | Suggested Reading |
|---|---|---|---|---|
| 8/29 (Wed) | Lecture 1: Introduction [PDF] | Brief introduction to molecular biology, high-throughput experiments, R, and Bioconductor. | | Wikipedia pages for gene, genome, microarray, and sequencing. |
| 9/3 (Mon) | Labor Day, no class | | | |
| 9/5 (Wed) | Lab 1: Simple genomic analysis using R [PDF, R] | Exploratory analysis of human RefSeq genes. | Homework1 | |
| 9/10 (Mon) | Lecture 2: Gene expression microarray I [PDF] | Experimental procedures and data pre-processing methods for gene expression microarrays. | | The microarray review article, RMA and GCRMA papers. |
| 9/12 (Wed) | Lecture 3: Gene expression microarray II, tiling arrays [PDF] | Differential expression from GE arrays. Batch effects. | | SAM and Limma papers. |
| 9/17 (Mon) | Lab 2: Analyzing gene expression array data from MAQC [PDF][R] | Using gene expression microarrays generated by the MAQC project, we will explore data produced from different array designs and compare them to the gold standard. The gold-standard TaqMan data can be found here. | Homework2 | |
| 9/19 (Wed) | Lecture 4: Handling genome data using Bioconductor I [PDF][R] | Introduces the Biostrings and BSgenome Bioconductor packages. | | PLoS CB paper, package vignettes for Biostrings and BSgenome. |
| 9/24 (Mon) | Lecture 5: Handling genome data using Bioconductor II [PDF][R] | Introduces the GenomicRanges and GenomicFeatures Bioconductor packages. | | PLoS CB paper, package vignettes for GenomicRanges and GenomicFeatures. |
| 9/26 (Wed) | Lab 3: Analyzing human genome [PDF][R] | Study the sequence composition of the human genome. Look at overlaps of CpG islands and gene promoters. The list of CpG islands can be downloaded here. | Homework3 | |
| 10/1 (Mon) | Lecture 6: Introduction to second-generation sequencing [PDF] | Introduces second-generation sequencing technologies and software for alignment, variant calling, and visualization. | | |
| 10/3 (Wed) | Lecture 7: RNA-seq [PDF] | Experimental procedure and data analysis for RNA-seq data. Normalization and differential expression detection. DESeq and edgeR Bioconductor packages. | | edgeR, DESeq, and Cufflinks papers |
| 10/8 (Mon) | Fall break, no class | | | |
| 10/10 (Wed) | Lecture 8: ChIP-seq [PDF] | Experimental procedure of ChIP-seq. Peak calling methods. Comparison of multiple ChIP-seq datasets. Joint analysis of ChIP-seq and RNA-seq. | | MACS and PeakSeq papers |
| 10/15 (Mon) | Lab 4: Handling second-generation sequencing data, RNA- and ChIP-seq analyses [PDF][R] | This lab practices the material covered in the three preceding lectures: (1) sequence alignment with bowtie and manipulation with samtools and Rsamtools; (2) RNA-seq analysis using DESeq and edgeR; (3) joint analysis of ChIP-seq and RNA-seq. Data can be downloaded here. | Homework4 | |
| 10/17 (Wed) | Lecture 9: Bisulfite sequencing [PDF] | Experimental procedure of bisulfite sequencing. Differential methylation. DNA methylation and protein binding. | | BSmooth and DSS papers |
| 10/22 (Mon) | Lecture 10: Single-cell sequencing [PDF] | Brief introduction to single-cell sequencing technologies and data analysis, with emphasis on single-cell RNA sequencing (scRNA-seq). | | Monocle, Wanderlust, RaceID, and MAST papers |