BIOS 555: High-throughput data analysis using R and Bioconductor

Class Information


This course covers the basics of analyses of genomic data from high-throughput technologies, mainly microarray and second-generation sequencing. Topics include the biological motivations, experimental procedures and statistical methods for different technologies. Usage of existing software packages (mainly in R/Bioconductor) for analyzing various genomic data will be introduced.

This class puts more emphasis on applications than on statistical theory. Upon completion of the class, students are expected to be able to:

  1. Understand the biological motivations and technological procedures of high-throughput experiments, including different types of microarrays and second-generation sequencing.
  2. Understand statistical challenges and existing methods for analyzing the data generated from high-throughput experiments.
  3. Analyze high-throughput data using R/Bioconductor and other open source software.

Prerequisite: BIOS 501 or equivalent. Basic programming experience in R. Some experience with a command-line-driven OS (DOS or Linux).

Grading: Four sets of homework, each worth 15%. Final project 30%. Class participation 10%.

Lab: Bring a laptop to the labs.

Reading Materials:

Here are some reading materials related to the class. You are NOT required to read them all, but they will help you better understand certain topics.

  1. R programming: R for beginners is highly recommended.
  2. Bioconductor: Several books are listed on the Bioconductor website here, but they are fairly outdated. The best way to learn is to read the manual or "vignette" for each package.
  3. Important papers:

Final project:

Possible topics

The final project could be (but is not limited to) exploratory analysis, statistical modeling, or analytical software development for any type of genomic data. Some project ideas include:

Talk to the instructor and/or TA for ideas or for help in obtaining a suitable dataset.

Requirement for the report

Students need to submit a short report for the final project. The report must be a single docx or pdf file named NAME_BIOS555_finalproject.docx (or .pdf), where NAME is your full name. The R code should be submitted as an appendix in the same file. The report should contain the following sections:

Use tables and figures if needed. Make sure to number the tables and figures properly and to give them captions. The report should NOT exceed 6 pages with single spacing, 11-point font, and 1-inch margins. R code does not count toward the page limit. There is no minimum page requirement.

Submit the report

The report is due on Oct 20, 2022 at midnight, a few days after the class ends. Please send the report (as a single attachment) to the instructor and cc the TA.

Class schedule

Date | Lecture Title | Description | Homework | Suggested Reading
8/24 (Wed) | Lecture 1: Introduction [PDF] | Brief introduction of molecular biology, high-throughput experiments, R and Bioconductor. | | Wikipedia pages for gene, genome, microarray and sequencing.
8/29 (Mon) | Lab 1: Simple genomic analysis using R [PDF, R] | Exploratory analysis of human RefSeq genes. | Homework1 |
8/31 (Wed) | Lecture 2: Gene expression microarray I [PDF] | Experimental procedures and data pre-processing methods for gene expression microarrays. | | The microarray review article, RMA and GCRMA papers.
9/5 (Mon) | Labor Day, no class | | |
9/7 (Wed) | Lecture 3: Gene expression microarray II [PDF] | Differential expression from GE arrays. Biological and technical artifacts: batch effects, cell type mixture. | | SAM and Limma papers.
9/12 (Mon) | Lab 2: Analyzing gene expression array data from MAQC [PDF][R] | Using gene expression microarrays generated by the MAQC project, we will explore data produced from different array designs and compare them to the gold standard. The gold standard TaqMan data can be found here. | Homework2 |
9/14 (Wed) | Lecture 4: Handling genome data using Bioconductor I [PDF][R] | Introduce the Biostrings and BSgenome Bioconductor packages. | | PLoS CB paper, package vignettes for Biostrings and BSgenome.
9/19 (Mon) | Lecture 5: Handling genome data using Bioconductor II [PDF][R] | Introduce the GenomicRanges and GenomicFeatures Bioconductor packages. | | PLoS CB paper, package vignettes for GenomicRanges and GenomicFeatures.
9/21 (Wed) | Lab 3: Analyzing human genome [PDF][R] | Study the sequence composition of the human genome. Look at overlaps of CpG islands and gene promoters. | Homework3 |
9/26 (Mon) | Lecture 6: Introduction to second-generation sequencing [PDF] | Introduce second-generation sequencing technologies and software for alignment, variant calling and visualization. | |
9/28 (Wed) | Lecture 7: RNA-seq [PDF] | Experimental procedure and data analysis for RNA-seq data. Normalization and differential expression detection. DESeq and edgeR Bioconductor packages. | | edgeR, DESeq, and Cufflinks papers.
10/3 (Mon) | Lecture 8: ChIP-seq [PDF] | Experimental procedure of ChIP-seq. Peak calling methods. Comparison of multiple ChIP-seq datasets. Joint analysis of ChIP-seq and RNA-seq. | | MACS and PeakSeq papers.
10/5 (Wed) | Lab 4: Handling second-generation sequencing data, RNA- and ChIP-seq analyses [PDF][R] | This lab will practice the materials covered in three lectures: (1) sequence alignment and manipulation; (2) joint analysis of ChIP-seq and RNA-seq; (3) RNA-seq analysis using DESeq and edgeR. | Homework4 |
10/10 (Mon) | Fall break, no class | | |
10/12 (Wed) | Lecture 9: Bisulfite sequencing [PDF] | Experimental procedure of bisulfite sequencing. Differential methylation. DNA methylation and protein binding. | | BSmooth and DSS papers.
10/17 (Mon) | Lecture 10: Single-cell sequencing [PDF] | Briefly introduce single-cell sequencing technologies and data analysis, with emphasis on single-cell RNA sequencing (scRNA-seq). | | Monocle, MAST, and the review papers.