Mouse MSigDB Collections: Details and Acknowledgments

General notes

Gene identifiers are mapped to their MGI-approved gene symbol and NCBI Gene ID using annotations extracted from Ensembl's BioMart data service. The mappings are updated at each MSigDB release using the latest available version of Ensembl.

MH collection: hallmark gene sets

MSigDB's hallmark collection has long served as a starting point for exploration of the MSigDB resource and GSEA in human datasets. In recognition of its broad utility as an analytic starting point, an orthology-converted version of these sets is provided here to allow analysis in the mouse gene space alongside other, mouse-native sets.

Hallmark gene sets summarize and represent specific, well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better-delineated biological space for GSEA.

We refer to the original overlapping gene sets from which a hallmark is derived as its 'founder' sets. Each MH gene set page links to the corresponding human gene set page, which in turn links to the corresponding founder sets for more in-depth exploration, as well as to the microarray data used to refine and validate the hallmark set.

To cite your use of the collection, and for further information, please refer to Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015 Dec 23;1(6):417-425.

M1 collection: positional gene sets

Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from Ensembl BioMart and reflect the gene architecture as represented on the primary assembly. Decimals in cytogenetic bands were ignored. For example, chr1A1.1 was considered chr1A1. Therefore, genes annotated as chr1A1.2 and those annotated as chr1A1.3 were both placed in the same set, chr1A1. These gene sets can be helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.
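The decimal-dropping rule described above can be sketched in Python. This is a minimal illustration rather than the MSigDB pipeline itself: the function names `band_to_set_name` and `group_by_band`, the record layout, and the sample gene symbols are all hypothetical.

```python
from collections import defaultdict


def band_to_set_name(chromosome: str, band: str) -> str:
    """Build a positional set name, ignoring any decimal part of the band."""
    major_band = band.split(".", 1)[0]  # e.g. "A1.2" -> "A1"
    return f"chr{chromosome}{major_band}"


def group_by_band(annotations):
    """Group gene symbols by their decimal-free cytogenetic band."""
    gene_sets = defaultdict(set)
    for symbol, chromosome, band in annotations:
        gene_sets[band_to_set_name(chromosome, band)].add(symbol)
    return dict(gene_sets)


# Hypothetical (symbol, chromosome, band) records, for illustration only.
example = [
    ("GeneA", "1", "A1.2"),
    ("GeneB", "1", "A1.3"),
    ("GeneC", "1", "A2"),
]
positional_sets = group_by_band(example)
```

Here `GeneA` and `GeneB` both end up in the `chr1A1` set, while `GeneC` is placed in `chr1A2`.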

Note that GRCm39, adopted in Ensembl 103, has not been annotated for cytogenetic bands. These data are therefore based on band annotations retrieved from the GRCm38-based Ensembl 102 release. The GRCm38-based annotations were retrieved in the namespace of Ensembl gene IDs, and the current GRCm39-based Ensembl gene ID to gene symbol mapping tables were used to construct the gene sets.
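One way to picture this two-step construction is as a join on Ensembl gene IDs: band annotations from the older release are carried over to gene symbols through the current ID-to-symbol table. The sketch below assumes both tables are available as plain dictionaries. The function name, the placeholder IDs, and the decision to skip IDs missing from the current table are assumptions of this example, not documented MSigDB behavior.

```python
def bands_to_symbols(band_by_ensembl_id, symbol_by_ensembl_id):
    """Carry band annotations over to current gene symbols via Ensembl gene IDs.

    IDs absent from the current mapping are skipped (an assumption of this sketch).
    """
    band_by_symbol = {}
    for ensembl_id, band in band_by_ensembl_id.items():
        symbol = symbol_by_ensembl_id.get(ensembl_id)
        if symbol is not None:
            band_by_symbol[symbol] = band
    return band_by_symbol


# Placeholder IDs and symbols, for illustration only.
older_release_bands = {"ID_0001": "chr1A1", "ID_0002": "chr1A2", "ID_0003": "chr2B"}
current_symbols = {"ID_0001": "GeneA", "ID_0002": "GeneB"}

current_bands = bands_to_symbols(older_release_bands, current_symbols)
```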

M2 collection: curated gene sets

Gene sets in this collection are curated from various sources, including online pathway databases and the biomedical literature. Many sets are also contributed by individual domain experts. The gene set page for each gene set lists its source. The M2 collection is divided into the following two subcollections: Chemical and genetic perturbations (CGP) and Canonical pathways (CP).

> M2 subcollection CGP: Chemical and genetic perturbations

Gene sets that represent expression signatures of genetic and chemical perturbations.

The majority of the CGP subcollection represents data curated from the biomedical literature. Microarray and sequencing studies have identified signatures of many important biological and clinical states (e.g., cancer metastasis, stem cell characteristics, drug resistance). Unlike a pathway database, which is designed to give a generic accounting of cellular processes, CGP aims to provide specific, targeted signatures, largely from perturbation experiments. A number of these gene sets come in pairs: xxx_UP and xxx_DN gene sets representing genes induced and repressed, respectively, by the perturbation. The majority of CGP sets were curated from publications and include links to the PubMed citation, the exact source of the set (e.g., Table 1), and any corresponding raw data in the GEO or ArrayExpress repositories. When the gene set involves a genetic perturbation, the set's brief description includes a link to the gene's entry in the NCBI (Entrez) Gene database. When the gene set involves a chemical perturbation, the set's brief description includes a link to the chemical's entry in the NCBI PubChem Compound database.
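The _UP/_DN naming convention makes it straightforward to match the two halves of a paired signature. The following is a minimal sketch, assuming the gene set names are available as a list of strings. The function name `pair_up_dn` and the example set names are hypothetical.

```python
def pair_up_dn(set_names):
    """Return {base_name: (up_name, dn_name)} for sets that have both members."""
    names = set(set_names)
    pairs = {}
    for name in sorted(names):
        if name.endswith("_UP"):
            base = name[: -len("_UP")]
            partner = base + "_DN"
            if partner in names:
                pairs[base] = (name, partner)
    return pairs


# Hypothetical gene set names, for illustration only.
names = [
    "EXAMPLE_KNOCKOUT_UP",
    "EXAMPLE_KNOCKOUT_DN",
    "EXAMPLE_TREATMENT_UP",  # no matching _DN set, so it is left unpaired
    "EXAMPLE_UNPAIRED_SIGNATURE",
]
pairs = pair_up_dn(names)
```

Sets without a partner are left out of the result, since not every CGP set comes as a pair.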

A number of individuals have contributed gene sets to this collection. The gene set annotation includes a "contributor" field that acknowledges the contributor by name/affiliation.

> M2 subcollection CP: Canonical pathways

The pathway gene sets are curated from the following online databases:

M3 collection: regulatory target gene sets

Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist of genes grouped by their shared regulatory element. The motifs represent known or likely cis-regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in an expression profiling experiment to a putative cis-regulatory element. The M3 collection is divided into two subcollections: microRNA targets (MIR) and transcription factor targets (TFT).

> M3 subcollection miRDB microRNA targets

These sets consist of computationally predicted mouse gene targets of miRNAs using the MirTarget algorithm (Liu and Wang, 2019). Data was curated from miRDB v6.0 (mirdb.org, Chen and Wang, 2020) target predictions with MirTarget scores >80 (high confidence predictions). miRNAs catalogued in miRDB v6.0 are derived from miRBase v22 (March 2018).
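The score cutoff described above amounts to a simple filter over prediction records. The sketch below assumes predictions are available as (miRNA, gene, score) tuples. That layout, the function name, and the sample records are assumptions of this example and do not reflect the actual miRDB file format.

```python
from collections import defaultdict


def high_confidence_targets(predictions, threshold=80.0):
    """Group target genes by miRNA, keeping predictions scoring above the threshold."""
    targets = defaultdict(set)
    for mirna, gene, score in predictions:
        if score > threshold:  # strictly greater than 80, as stated in the text
            targets[mirna].add(gene)
    return dict(targets)


# Hypothetical prediction records, for illustration only.
predictions = [
    ("mir-example-1", "GeneA", 95.2),
    ("mir-example-1", "GeneB", 80.0),  # not above the cutoff, so excluded
    ("mir-example-2", "GeneC", 81.5),
]
mir_sets = high_confidence_targets(predictions)
```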

Liu W, Wang X. Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data. Genome Biol. 2019 Jan 22;20(1):18.

Chen Y, Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic Acids Res. 2020 Jan 8;48(D1):D127-D131.

> M3 subcollection GTRD transcription factor targets

Sets of mouse genes predicted to contain transcription factor binding sites in their promoter regions (-1000,+100 bp around the transcription start site) for the indicated transcription factor. Gene sets are derived from the Gene Transcription Regulation Database (GTRD, gtrd.biouml.org) uniform processing pipeline and represent a candidate list of potential regulatory targets for each transcription factor (see MSigDB release notes for the current included GTRD version).
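To make the promoter window concrete, the sketch below computes the (-1000, +100) interval around a transcription start site and tests whether a binding site overlaps it. Orienting the window by strand and counting any overlap as a hit are assumptions of this example. The text above does not specify how GTRD handles either, and the function names are hypothetical.

```python
def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of the promoter region around a transcription start site.

    Upstream is taken relative to the direction of transcription, so the
    window is mirrored for genes on the minus strand (an assumption here).
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError(f"unknown strand: {strand!r}")


def site_in_promoter(site_start, site_end, tss, strand):
    """True if the binding site interval overlaps the promoter window."""
    start, end = promoter_window(tss, strand)
    return site_start <= end and site_end >= start


plus_window = promoter_window(5000, "+")
minus_window = promoter_window(5000, "-")
```

For a TSS at position 5000, the window spans 4000 to 5100 on the plus strand and 4900 to 6000 on the minus strand.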

Kolmykov S, Yevshin I, Kulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F. GTRD: an integrated view of transcription regulation. Nucleic Acids Res. 2021 Jan 8;49(D1):D104-D111. doi: 10.1093/nar/gkaa1057.

M4 collection

The M4 collection is reserved for future use.

M5 collection: ontology gene sets

Gene sets in this collection are derived from ontology resources. The collection is divided into four subcollections derived from ontology annotations, which were curated from databases maintained by their respective authorities.

Ontology terms for very broad categories that would produce extremely large gene sets (greater than 2000 members) and ontology terms that produced gene sets with fewer than 5 members have been omitted. Additionally, each subcollection goes through a redundancy filtering procedure to ensure there are no identical or highly similar sets. (See MSigDB release notes for the current versions, and more information on specific procedures.)
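The size limits above can be expressed as a small filter. The sketch below also removes exactly identical sets as a stand-in for the first part of the redundancy step. The procedure for detecting "highly similar" sets is described in the release notes rather than here, so it is not reproduced. The function names and the rule of keeping the alphabetically first name among identical sets are assumptions of this example.

```python
def filter_by_size(gene_sets, min_size=5, max_size=2000):
    """Keep sets with at least min_size and at most max_size members."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


def drop_identical(gene_sets):
    """Keep one set per distinct membership (first name in sorted order wins)."""
    seen = set()
    kept = {}
    for name in sorted(gene_sets):
        members = frozenset(gene_sets[name])
        if members not in seen:
            seen.add(members)
            kept[name] = gene_sets[name]
    return kept


def make_set(n):
    """Build a hypothetical gene set with n placeholder members."""
    return {f"g{i}" for i in range(n)}


# Hypothetical term names and memberships, for illustration only.
candidates = {
    "TERM_TOO_SMALL": make_set(4),
    "TERM_MIN": make_set(5),
    "TERM_MIN_COPY": make_set(5),
    "TERM_MAX": make_set(2000),
    "TERM_TOO_LARGE": make_set(2001),
}
kept = drop_identical(filter_by_size(candidates))
```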

Note to GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes.

> M5 subcollection GO: Gene Ontology

The M5:GO subcollection is divided into three components (BP, CC, and MF) derived from the Gene Ontology (GO). Each component represents GO terms belonging to one of the three root GO ontologies: biological process (BP), cellular component (CC), or molecular function (MF), respectively.

GO is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and their products. A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation also includes an evidence code to indicate how the annotation to a particular term is supported (http://geneontology.org/page/guide-go-evidence-codes). Gene sets in this subcollection are prefixed with "GOBP" (Biological Process), "GOMF" (Molecular Function), or "GOCC" (Cellular Component) to indicate their source ontology.
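Because the source ontology is encoded in the name prefix, a gene set can be assigned to its ontology by inspecting the name alone. The sketch below assumes the prefix is separated from the rest of the name by an underscore. That separator, the function name, and the example names are assumptions of this illustration.

```python
GO_PREFIXES = {
    "GOBP": "biological process",
    "GOMF": "molecular function",
    "GOCC": "cellular component",
}


def go_ontology(set_name):
    """Return the root ontology implied by a gene set name, or None if not a GO set."""
    prefix = set_name.split("_", 1)[0]
    return GO_PREFIXES.get(prefix)


# Hypothetical gene set names, for illustration only.
labels = {
    name: go_ontology(name)
    for name in ["GOBP_EXAMPLE_TERM", "GOCC_EXAMPLE_TERM", "MP_EXAMPLE_TERM"]
}
```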

> M5 subcollection MPT: Mammalian Phenotype Tumor Ontology

The MPT subcollection consists of ontology terms related to tumor phenotypes mined from the Mammalian Phenotype Ontology database. The Mammalian Phenotype Ontology database provides a standardized vocabulary of phenotypic abnormalities encountered in mouse models (http://www.informatics.jax.org/vocab/mp_ontology). Each term in the vocabulary is annotated with the association between the phenotypic abnormality and a set of genes known to be involved in the development of that abnormality, established through expert curation of the peer-reviewed scientific literature. The published sources of the annotations for the genes in these gene sets are available from MGI. Gene sets in this subcollection are prefixed with "MP" to indicate their source ontology.

Smith CL, Eppig JT. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdiscip Rev Syst Biol Med. 2009 Nov-Dec;1(3):390-399. doi: 10.1002/wsbm.44.

M6 collection

The M6 collection is reserved for future use.

M7 collection

The M7 collection is reserved for future use.

M8 collection: cell type signature gene sets

Gene sets that contain cluster marker genes for cell types identified in single-cell sequencing studies of mouse tissue. These gene sets are intended to facilitate the assignment of cell types in datasets such as those from experiments developing organoid models.