Help with Investigating Gene Sets

Input Gene Identifiers

The tools on the Investigate Gene Sets page all take a list of genes as input. Enter a list of gene identifiers in the box provided and specify the appropriate species; human, mouse, and rat are supported. Ensembl Gene IDs and NCBI (Entrez) Gene IDs are accepted, as are HGNC (HUGO) IDs and Symbols, MGI IDs and Symbols, and RGD Symbols. Identifiers are case sensitive (e.g. egfr is not the same as EGFR). For Ensembl identifiers, remove any version suffix (e.g. use ENSG00000141510 instead of ENSG00000141510.17); transcript-level IDs are not accepted. For identifiers from other species, we recommend using BioMart to convert them into HGNC Gene Symbols or Human/Mouse/Rat NCBI Gene IDs.
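As an illustration of the preparation steps described above, the following minimal Python sketch cleans a list of identifiers before it is pasted into the input box. The function name `clean_gene_ids` and the regular expressions are hypothetical and are not part of MSigDB; the Ensembl prefixes shown (human, mouse, rat) are an assumption made for this example.

```python
import re

# Assumed patterns for human/mouse/rat Ensembl IDs (illustrative only).
# Gene IDs look like ENSG00000141510, optionally followed by ".<version>".
_ENSEMBL_GENE = re.compile(r"^(ENS(?:MUS|RNO)?G\d{11})(?:\.\d+)?$")
_ENSEMBL_TRANSCRIPT = re.compile(r"^ENS(?:MUS|RNO)?T\d{11}(?:\.\d+)?$")


def clean_gene_ids(raw_ids):
    """Return (accepted, rejected) identifier lists.

    - surrounding whitespace is stripped and blank entries are dropped
    - Ensembl gene IDs lose any version suffix
    - Ensembl transcript IDs are rejected
    - case is never changed, because identifiers are case sensitive
    - duplicates are removed, keeping the first occurrence
    """
    accepted, rejected, seen = [], [], set()
    for raw in raw_ids:
        token = raw.strip()
        if not token:
            continue
        if _ENSEMBL_TRANSCRIPT.match(token):
            rejected.append(token)
            continue
        match = _ENSEMBL_GENE.match(token)
        if match:
            token = match.group(1)  # drop the ".17"-style version suffix
        if token not in seen:
            seen.add(token)
            accepted.append(token)
    return accepted, rejected


accepted, rejected = clean_gene_ids(
    ["ENSG00000141510.17", " EGFR ", "egfr", "ENST00000269305.9", ""]
)
print(accepted)
print(rejected)
```

Note that the sketch deliberately keeps both egfr and EGFR, since it cannot know which spelling the user intended; deciding that is left to the person preparing the list.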

Beginning in MSigDB 7.0, we are using Ensembl as the platform annotation authority. Identifiers for genes are mapped to approved gene symbols and NCBI Gene IDs through annotations extracted from Ensembl's BioMart data service, and the mappings will be updated at each MSigDB release with the latest available version of Ensembl. See the Release Notes for the current MSigDB release for a link to the version of Ensembl BioMart used in mapping.

Compute Overlaps

When gene sets share genes, examining how they overlap can highlight common processes, pathways, and underlying biological themes. This tool evaluates the overlap between a user-provided gene set and the gene sets in as many MSigDB collections as you choose, and provides an estimate of the statistical significance of each overlap. Note: this simple overlap evaluation is not the same as the full gene set enrichment analysis provided by the GSEA desktop application.

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Overlap results are presented using the gene symbols and NCBI Gene IDs specific to the species of the target gene set database; any required conversion (i.e. orthology) is done automatically by the tool. Due to the characteristics of the hypergeometric distribution, there is a limit to how large the user-provided gene set can be while still producing meaningful significance estimates. At most 500 genes are allowed; larger gene sets will be rejected.

Click on the "compute overlaps" button to display the results, including:

  • Statistics:
    • # overlaps shown lists the number of overlapping gene sets displayed in the report
      By default, the report displays the 10 gene sets in the collection that best overlap with your gene set. If you compute overlaps from the Investigate Gene Sets page, you can choose the number of overlapping gene sets to display in the report.
      You can choose to filter out gene sets with a number of members above or below given size thresholds. By default, no thresholds are used.
    • # gene sets in collection lists the total number of gene sets being analyzed
    • # genes in comparison lists the number of genes in your gene set
    • # genes in collection lists the number of unique genes in the gene sets being analyzed
  • Descriptions of the overlapping gene sets, including
    • Link to the gene set page
    • Number of genes in the gene set
    • Description of the gene set
    • Number of genes in the overlap between this gene set and your gene set
    • p-value from the hypergeometric distribution for (k-1, K, N - K, n), where
      k is the number of genes in the intersection of the query set with a set from MSigDB
      K is the number of genes in the set from MSigDB
      N is the total number of genes in the gene universe (all known human gene symbols, or all known mouse gene symbols)
      n is the number of genes in the query set
      You can read the Wikipedia article on the hypergeometric distribution for more information on how p-values are determined.
    • FDR q-value. This is the false discovery rate analog of hypergeometric p-value after correction for multiple hypothesis testing according to Benjamini and Hochberg.
      You can read the Wikipedia article on the false discovery rate for more information on how q-values are determined.
    • Color bar shading from light green to black, where lighter colors indicate more significant q-values (< 0.05) and black indicates less significant q-values (≥ 0.05).
  • Overlap matrix showing the genes in the overlapping gene sets
    • Rows list the genes in your gene set, with gene descriptions and links to gene annotations
    • Columns list the overlapping gene sets, with links to the gene set pages
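The p-value and FDR q-value described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the tuple (k-1, K, N - K, n) is interpreted here as the upper-tail probability of seeing at least k overlapping genes, the function names `overlap_p_value` and `benjamini_hochberg` are hypothetical, and all numbers in the example are made up.

```python
from math import comb


def overlap_p_value(k, K, N, n):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes shared by the query set and the MSigDB set
    K: genes in the MSigDB set
    N: genes in the gene universe
    n: genes in the query set
    """
    upper = min(K, n)
    if k > upper:
        return 0.0
    total = comb(N, n)
    # Sum the probability mass for overlaps of size k, k+1, ..., min(K, n).
    tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # never increase as p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Made-up example: a 50-gene query against three gene sets in a
# universe of 20000 genes. Each tuple is (k, K).
gene_sets = [(8, 200), (2, 150), (1, 40)]
p_values = [overlap_p_value(k, K, N=20000, n=50) for k, K in gene_sets]
q_values = benjamini_hochberg(p_values)
for (k, K), p, q in zip(gene_sets, p_values, q_values):
    print(f"k={k} K={K} p={p:.3g} q={q:.3g}")
```

Exact integer binomials from `math.comb` are used so that the sketch needs no third-party packages; a statistics library would normally be preferred for speed on large inputs.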

Our thanks to GATHER: Gene Annotation Tool to Help Explain Relationships (Chang & Nevins, Bioinformatics, 2006; https://changlab.uth.tmc.edu/gather/) for inspiring the output format of our gene set overlap tool.

Compendia Expression Profiles

You can display a heat map of the expression levels of the genes in your gene list in the samples of any one of the following compendia of expression data:


For data in the human gene space:

  • GTEx v8 (GTEx). Median gene-level TPM profiles by tissue from the GTEx Portal. Following the GTEx analysis pipeline, median expression was calculated from the file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz. Full data is available from https://gtexportal.org/home/datasets.
  • Human tissue compendium (Novartis). Gene expression profiles from the Novartis normal tissue compendium, as published in Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) Proc. Natl. Acad. Sci. USA 101, 6062-6067.
  • Global Cancer Map (Broad Institute). Gene expression profiles from the global cancer map, as published in Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 15149-15154.
  • NCI-60 cell lines (National Cancer Institute). Gene expression profiles from the NCI 60 data set downloaded from the Developmental Therapeutics Program web site (https://wiki.nci.nih.gov/display/NCIDTPdata/Molecular+Target+Data). No preprocessing was done other than collapsing probe IDs to gene symbols.

For data in the mouse gene space:

  • Mouse Transcriptomic BodyMap compendium. Gene expression profiles from the study "A Comprehensive Mouse Transcriptomic BodyMap across 17 Tissues by RNA-seq" performed by Li et al. (https://www.nature.com/articles/s41598-017-04520-z). FPKM quantifications were extracted from Supplementary Table S6: gene expression profile (FPKM) of mouse across 17 tissues. No additional preprocessing was done other than collapsing Ensembl IDs to gene symbols.
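The GTEx entry above describes profiles built from the median TPM per tissue. As a rough illustration of that kind of summary, the sketch below collapses per-sample values for a single gene into per-tissue medians. The function name `median_by_tissue`, the sample names, and the values are all hypothetical; this is not the GTEx pipeline itself.

```python
from collections import defaultdict
from statistics import median


def median_by_tissue(sample_tpm, sample_tissue):
    """Collapse per-sample TPM values for one gene to per-tissue medians.

    sample_tpm:    dict mapping sample ID -> TPM value for the gene
    sample_tissue: dict mapping sample ID -> tissue label
    """
    grouped = defaultdict(list)
    for sample, tpm in sample_tpm.items():
        grouped[sample_tissue[sample]].append(tpm)
    return {tissue: median(values) for tissue, values in grouped.items()}


# Made-up values for illustration only.
tpm = {"s1": 1.0, "s2": 3.0, "s3": 10.0, "s4": 2.0, "s5": 4.0}
tissue = {"s1": "Liver", "s2": "Liver", "s3": "Lung", "s4": "Liver", "s5": "Lung"}
print(median_by_tissue(tpm, tissue))
```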

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Choose one of the available compendia and click on "display expression profile". Heat maps are presented using gene symbols; any required conversion is done automatically by the tool. The resulting heat map includes dendrograms that cluster the expression data by gene and by sample. Genes are identified by probe identifier, gene symbol, description, and gene family.

Gene Families

Note that gene family information is only available for human genes in the human component of MSigDB.

A gene family describes any collection of proteins that share a common feature such as homology or biochemical activity. Available categories and links to the relevant source publications in PubMed:

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Click on "show gene families" to categorize the input genes by gene families.

Filtered By Similarity

Beginning in MSigDB 7.3, gene sets in C5 and C2:CP:Reactome that have undergone redundancy filtering for inclusion in MSigDB have an additional field on the gene set page, "Filtered by similarity". This field contains the source database IDs of other candidate gene sets that clustered with the selected set by Jaccard similarity coefficient and exhibited Jaccard coefficients >0.85 with the selected set, but were filtered out of the collection on the basis of tree distance or set size. These database IDs link to the source resource's page for that identifier, as in the EXTERNAL_DETAILS_URL field.

This redundancy filtering procedure also applies to the corresponding collections in the Mouse MSigDB collections (M5:GO and M2:CP:Reactome).
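For readers unfamiliar with the Jaccard similarity coefficient mentioned above, it is the size of the intersection of two sets divided by the size of their union. The sketch below shows only that coefficient and the >0.85 threshold; it does not reproduce the clustering, tree distance, or set size criteria of the actual filtering procedure. The names `jaccard` and `similar_sets` and the example gene sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def similar_sets(selected, candidates, threshold=0.85):
    """Names of candidate sets whose Jaccard coefficient exceeds the threshold."""
    return sorted(
        name
        for name, genes in candidates.items()
        if jaccard(selected, genes) > threshold
    )


# Made-up gene sets for illustration only.
selected = {f"G{i}" for i in range(20)}
candidates = {
    "SET_A": {f"G{i}" for i in range(19)},   # 19 shared of 20 in union: 0.95
    "SET_B": {f"G{i}" for i in range(10)},   # 10 shared of 20 in union: 0.50
    "SET_C": {f"G{i}" for i in range(24)},   # 20 shared of 24 in union: ~0.83
}
print(similar_sets(selected, candidates))
```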

NDEx Biological Network Repository

You can further investigate the genes in your gene list through a query to NDEx, the Network Data Exchange (www.ndexbio.org), an online biological networks repository that is also integrated with Cytoscape, the network analysis and visualization environment (cytoscape.org). Networks are a powerful tool for expressing biological knowledge, including molecular interactions, biological relationships curated from literature, and outputs from analysis of big data.

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Click on "query NDEx" to send the list of genes to the NDEx IQuery tool (www.ndexbio.org/iquery), which finds pathways enriched for the query genes, networks representing the interactions between those genes and other proteins, and networks representing the associations between those genes and other biological or chemical entities. The NDEx query results page will allow you to:

  • Browse the networks from pathway enrichment, protein association, or gene association searches
  • View query genes that are present in each network
  • Zoom, pan, and inspect networks, and get more information about their nodes and edges by clicking on them
  • Save networks to NDEx, or open them in Cytoscape
  • Perform new searches with other genes

See the IQuery help documentation for more details on using the NDEx query results page. See the NDEx home page for information on how to cite your use of NDEx.