Model-Based CpG Islands
Last updated on July 9, 2020.
Lists of CpG islands
Below are lists of model-based CpG islands for a number of different species
created using the method described in Wu et al. (2010) Biostatistics.
All lists were generated using a posterior probability threshold of 0.99,
except for D. melanogaster (fruit fly), which used 0.975.
All files are tab-delimited text.
- H. sapiens (human) hg38 [islands]
- H. sapiens (human) hg19 [islands]
- H. sapiens (human) hg18 [islands]
- M. musculus (mouse) mm10 [islands]
- M. musculus (mouse) mm9 [islands]
- M. musculus (mouse) mm8 [islands]
- R. norvegicus (rat) rn4 [islands]
- P. troglodytes (chimpanzee) panTro2 [islands]
- M. mulatta (rhesus macaque) rheMac2 [islands]
- P. abelii (orangutan) ponAbe2 [islands]
- B. taurus (cow) bosTau4 [islands]
- C. familiaris (dog) canFam2 [islands]
- E. caballus (horse) equCab2 [islands]
- G. gallus (chicken) galGal3 [islands]
- D. rerio (zebrafish) danRer6 [islands]
- A. mellifera (bee) BeeBase assembly4 [islands]
- D. melanogaster (fruit fly) dm3 [islands]
- C. elegans (worm) ce2 [islands]
- A. thaliana (Arabidopsis) [islands]
Model-based CpG islands are now available in the UCSC Genome Browser as custom tracks
(link).
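Because the island lists are tab-delimited text, they can be read into R with base functions. The following is a minimal sketch only: the small demo file and its column names (chr, start, end) are assumptions made for illustration and are not taken from the actual island files, so check the header of a downloaded file before relying on them.

```r
# Write a tiny stand-in file so the example is self-contained.
# The columns chr, start and end are hypothetical; a real island file
# may use different names or carry additional columns.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("chr\tstart\tend",
             "chr1\t100\t350",
             "chr1\t900\t1500",
             "chr2\t40\t260"),
           demo_path)

# read.delim() defaults to tab as the field separator.
islands <- read.delim(demo_path, header = TRUE, stringsAsFactors = FALSE)

# Distance between the two coordinates; whether 1 should be added
# depends on the coordinate convention of the file, which is not stated here.
islands$span <- islands$end - islands$start

# Number of islands per chromosome.
islands_per_chr <- table(islands$chr)
```

For a real list, replace demo_path with the path of the downloaded file.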
R software package
makeCGI is an R software package for obtaining CpG islands (CGIs) from a genome.
It iteratively fits two hidden Markov models (HMMs), on GC content and on the observed-to-expected CpG ratio, and obtains posterior probabilities
of genomic regions being CpG islands. The CpG islands are then defined by thresholding the posterior probabilities.
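To make the two quantities and the thresholding step concrete, here is a rough base-R sketch. It is not the makeCGI implementation and fits no HMM: it only computes GC content and the conventional observed-to-expected CpG ratio (number of CpGs times length, divided by the product of the C and G counts) for one sequence, and then turns a made-up vector of posterior probabilities into contiguous regions. The function names cpg_stats and call_regions, and the example values, are hypothetical.

```r
# GC content and observed-to-expected CpG ratio for a single sequence.
cpg_stats <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  n <- length(bases)
  n_c <- sum(bases == "C")
  n_g <- sum(bases == "G")
  # A CpG is a C immediately followed by a G.
  n_cpg <- sum(bases[-n] == "C" & bases[-1] == "G")
  gc <- (n_c + n_g) / n
  # The ratio is undefined when the sequence has no C or no G.
  oe <- if (n_c > 0 && n_g > 0) n_cpg * n / (n_c * n_g) else NA_real_
  list(length = n, gc_content = gc, cpg_oe = oe)
}

# Group consecutive positions whose posterior probability meets the
# threshold into regions, reported as start and end indices.
call_regions <- function(post, threshold = 0.99) {
  runs <- rle(post >= threshold)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

stats <- cpg_stats("ACGCGTTACGGC")

# Made-up posterior probabilities for six consecutive segments.
post <- c(0.10, 0.995, 0.999, 0.50, 0.992, 0.20)
regions <- call_regions(post, threshold = 0.99)
```

With a threshold of 0.99, segments 2 to 3 and segment 5 form two separate regions. Lowering the threshold (for example to 0.975, as used for the fruit fly list) can only keep or enlarge the set of called regions.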
Download
The software package can be downloaded from here.
It depends on the BSgenome and Biostrings Bioconductor packages.
The input DNA sequence can be either a BSgenome package or a text file in FASTA (.fa) format.
Use the software
Follow the steps below to use the package:
- Load in the library:
library(makeCGI)
- Set up default parameters:
.CGIoptions=CGIoptions()
- Start running:
makeCGI(.CGIoptions)
Three folders, "counts", "rawdata", and "result", will be created under the
current working directory to save intermediate result files. The final results
will be written to a text file in the current directory named "CGI-[species name].txt",
e.g., "CGI-Hsapiens.txt" when the species name is Hsapiens.
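After a run, the output locations described above can be checked from R. This is a small sketch using base R only. The helper name expected_outputs is hypothetical, and reading the result with read.delim() assumes that the file is tab-delimited with a header row, which is not confirmed here.

```r
# Build the paths that a run is described as producing for a given
# species name, relative to a directory (the working directory by default).
expected_outputs <- function(species, dir = getwd()) {
  list(
    folders = file.path(dir, c("counts", "rawdata", "result")),
    result_file = file.path(dir, paste0("CGI-", species, ".txt"))
  )
}

out <- expected_outputs("Hsapiens")

# TRUE for each intermediate folder that is present.
folders_present <- dir.exists(out$folders)

# Read the final result only if it exists.
# Assumption: tab-delimited text with a header row.
if (file.exists(out$result_file)) {
  islands <- read.delim(out$result_file, header = TRUE)
  str(islands)
}
```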
The program can require a large amount of memory, depending on
the size of the genome being analyzed, and the computation time
can be substantial.
References
- Wu H, Caffo B, Jaffe HA, Feinberg AP, Irizarry RA (2010)
Redefining CpG Islands Using a Hierarchical Hidden Markov Model. Biostatistics 11(3): 499-514.
- Irizarry RA, Wu H, Feinberg AP (2009)
A Species-Generalized Probabilistic Model-Based Definition of CpG Islands.
Mammalian Genome 20(9-10): 674-680.