Help with Investigating Gene Sets

Input Gene Identifiers

The tools on the Investigate Gene Sets page all take a list of genes as input. Enter a list of gene identifiers in the box provided and specify the appropriate species; human, mouse, and rat are supported. Ensembl Gene IDs and NCBI (Entrez) Gene IDs are accepted, as are HGNC (HUGO) IDs and Symbols, MGI IDs and Symbols, and RGD Symbols. Identifiers are case sensitive (e.g. egfr is not the same as EGFR). For Ensembl identifiers, remove any version suffix (e.g. use ENSG00000141510 instead of ENSG00000141510.17); transcript-level IDs are not accepted. For identifiers from other species, we recommend using BioMart to convert them into HGNC Gene Symbols or Human/Mouse/Rat NCBI Gene IDs.
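If you prepare gene lists programmatically, the clean-up rules above can be applied before pasting the list into the box. The following is a minimal sketch, not the website's own parser: the function name clean_identifiers and the regular expressions are assumptions made for illustration, and the patterns only cover the common Ensembl stable ID layout (for example ENSG, ENSMUSG, ENSRNOG).

```python
import re

# Assumed patterns for Ensembl stable IDs with an optional version suffix,
# e.g. ENSG00000141510.17 (gene) or ENST00000269305.9 (transcript).
_ENSEMBL_GENE = re.compile(r"^(ENS[A-Z]*G\d{11})(\.\d+)?$")
_ENSEMBL_TRANSCRIPT = re.compile(r"^ENS[A-Z]*T\d{11}(\.\d+)?$")


def clean_identifiers(raw_ids):
    """Return (accepted, rejected) identifier lists.

    - surrounding whitespace and empty entries are dropped
    - Ensembl gene IDs lose their version suffix
    - Ensembl transcript IDs are rejected
    - case is preserved, because identifiers are case sensitive
    - duplicates are removed while keeping first-seen order
    """
    accepted, rejected, seen = [], [], set()
    for raw in raw_ids:
        ident = raw.strip()
        if not ident:
            continue
        if _ENSEMBL_TRANSCRIPT.match(ident):
            rejected.append(ident)
            continue
        match = _ENSEMBL_GENE.match(ident)
        if match:
            ident = match.group(1)  # keep the ID, drop ".17"
        if ident not in seen:
            seen.add(ident)
            accepted.append(ident)
    return accepted, rejected


accepted, rejected = clean_identifiers(
    ["ENSG00000141510.17", " EGFR ", "egfr", "ENST00000269305.9"]
)
```

Here egfr and EGFR are kept as two distinct entries, which mirrors the case-sensitivity rule; whether the lowercase form is recognized depends on the species and identifier type you select.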

Beginning in MSigDB 7.0, we are using Ensembl as the platform annotation authority. Identifiers for genes are mapped to approved gene symbols and NCBI Gene IDs through annotations extracted from Ensembl's BioMart data service, and will be updated at each MSigDB release with the latest available version of Ensembl. See the Release Notes for the current MSigDB release for a link to the version of Ensembl BioMart used in mapping.

Compute Overlaps

When gene sets share genes, examining how they overlap can highlight common processes, pathways, and underlying biological themes. This tool computes the overlap between a user-provided gene set and the gene sets in as many MSigDB collections as you choose, along with an estimate of the statistical significance of each overlap. Note: this simple overlap evaluation is not the same as the full gene set enrichment analysis provided by the GSEA desktop application.

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Overlap results are presented using the gene symbols and NCBI Gene IDs specific to the species of the target gene set database; any required conversion (i.e. orthology) is done automatically by the tool. Because of the characteristics of the hypergeometric distribution, there are limits to how large the user-provided gene set can be while still producing meaningful significance estimates. At most 500 genes are allowed; larger lists are rejected.

Click on the "compute overlaps" button to display the results, including:

  • Statistics:
    • # overlaps shown lists the number of overlapping gene sets displayed in the report
      By default, the report displays the 10 gene sets in the collection that best overlap with your gene set. If you compute overlaps from the Investigate Gene Sets page, you can choose the number of overlapping gene sets to display in the report.
      You can choose to filter out gene sets with a number of members above or below given size thresholds. By default, no thresholds are used.
    • # gene sets in collection lists the total number of gene sets being analyzed
    • # genes in comparison lists the number of genes in your gene set
    • # genes in collection lists the number of unique genes in the gene sets being analyzed
  • Descriptions of the overlapping gene sets, including
    • Link to the gene set page
    • Number of genes in the gene set
    • Description of the gene set
    • Number of genes in the overlap between this gene set and your gene set
    • P value from the hypergeometric distribution for (k-1, K, N - K, n) where
      k is the number of genes in the intersection of the query set with a set from MSigDB
      K is the number of genes in the set from MSigDB
      N is the total number of genes in the gene universe (all known human gene symbols, or all known mouse gene symbols)
      n is the number of genes in the query set
      You can read the Wikipedia article on the hypergeometric distribution for more information on how p-values are determined.
    • FDR q-value. This is the false discovery rate analog of the hypergeometric p-value after correction for multiple hypothesis testing according to Benjamini and Hochberg.
      You can read the Wikipedia article on the false discovery rate for more information on how q-values are determined.
    • Color bar shading from light green to black, where lighter colors indicate more significant q-values (< 0.05) and black indicates less significant q-values (≥ 0.05).
  • Overlap matrix showing the genes in the overlapping gene sets
    • Rows list the genes in your gene set, with gene descriptions and links to gene annotations
    • Columns list the overlapping gene sets, with links to the gene set pages
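
The P value and FDR q-value described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the tool's actual code: it reads the parameter tuple (k-1, K, N - K, n) as the upper tail of the hypergeometric distribution, i.e. the probability of seeing k or more overlapping genes, and it applies the Benjamini and Hochberg correction across whatever list of p-values is passed in. The function names are made up for this example.

```python
from math import comb


def overlap_p_value(k, K, N, n):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the intersection of the query set and the MSigDB set
    K: genes in the MSigDB set
    N: genes in the gene universe
    n: genes in the query set
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical example: three gene sets tested against one query set
# of n = 50 genes in a universe of N = 20000 genes.
p = [
    overlap_p_value(k=8, K=200, N=20000, n=50),
    overlap_p_value(k=2, K=150, N=20000, n=50),
    overlap_p_value(k=0, K=100, N=20000, n=50),
]
q = benjamini_hochberg(p)
```

An overlap of zero genes gives a p-value of 1, since every draw has at least zero overlapping genes. The number of tests used for the correction here is simply the length of the input list; how the website counts tests is not specified in this help text.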

Our thanks to GATHER: Gene Annotation Tool to Help Explain Relationships (Chang & Nevins, Bioinformatics, 2006; https://changlab.uth.tmc.edu/gather/) for inspiring the output format of our gene set overlap tool.

Compendia Expression Profiles

There are tools for producing heatmaps from an MSigDB gene set or user-provided gene list against the samples of several compendia of expression data. You can create a heatmap as a static image or, as of July 2023, in an interactive form.

Our interactive Compendia Expression Profiles tool uses Next-Generation Clustered Heat Maps (NG-CHM) from the Department of Bioinformatics and Computational Biology at The University of Texas MD Anderson Cancer Center to allow ad-hoc exploration of the expression profile. As they describe it on their home page:

The NG-CHM Heat Map Viewer is a dynamic, graphical environment for exploration of clustered or non-clustered heat map data in a web browser. It supports zooming, panning, searching, covariate bars, and link-outs that enable deep exploration of patterns and associations in heat maps.

Full instructions on the navigation and use of the NG-CHM viewer can be found on the project's website, along with links to video tutorials, citing information, and more.

You can display a heatmap of the expression levels of the genes in your gene list in the samples of any one of these compendia of expression data:


For data in the human gene space:

  • GTEx v8 (GTEx). Median gene-level TPM by tissue profiles from the GTEx Portal. Per GTEx analysis pipeline: median expression was calculated from the file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz. Full data is available from https://gtexportal.org/home/datasets.
  • Human tissue compendium (Novartis). Gene expression profiles from the Novartis normal tissue compendium, as published in Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) Proc. Natl. Acad. Sci. USA 101, 6062-6067.
  • Global Cancer Map (Broad Institute). Gene expression profiles from the global cancer map, as published in Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 15149-15154.
  • NCI-60 cell lines (National Cancer Institute). Gene expression profiles from the NCI 60 data set downloaded from the Developmental Therapeutics Program web site (https://wiki.nci.nih.gov/display/NCIDTPdata/Molecular+Target+Data). No preprocessing was done other than collapsing probe IDs to gene symbols.

For data in the mouse gene space:

  • Mouse Transcriptomic BodyMap compendium. Gene expression profiles from the study "A Comprehensive Mouse Transcriptomic BodyMap across 17 Tissues by RNA-seq" performed by Li et al. (https://www.nature.com/articles/s41598-017-04520-z). FPKM quantifications were extracted from Supplementary Table S6: gene expression profile (FPKM) of mouse across 17 tissues. No additional preprocessing was done other than collapsing Ensembl IDs to gene symbols.

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Choose one of the available compendia and click on "launch expression profile". Heat maps are presented using gene symbols; any required conversion is done automatically by the website. The resulting heat map includes dendrograms clustering gene expression by gene and by sample.

  • Genes are identified by gene symbol. Details like probe identifier and description are available in a pop-up by clicking on an individual data point.
  • You can select rows and columns in the matrix with the usual mouse-click operations.
  • Context-sensitive link-outs are available as menu items on the gene labels (single or multi-selected) with the usual mouse context-click operation (Windows right-click or Mac control-click). These allow you to query a number of other useful sites on the Internet and to look up the associated row-level (gene) metadata. Certain link-outs are valid when multiple rows are selected, but the majority use only the single row that received the context-click.

Gene Families

Note that gene family information is only available for human genes in the human component of MSigDB.

A gene family describes any collection of proteins that share a common feature such as homology or biochemical activity. Available categories and links to the relevant source publications in PubMed:

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Click on "show gene families" to categorize the input genes by gene families.

Filtered By Similarity

Beginning in MSigDB 7.3, gene sets in C5 and C2:CP:Reactome that have undergone redundancy filtering for inclusion in MSigDB have an additional field on the gene set page, "Filtered by similarity". This field contains the source database IDs of other candidate gene sets that clustered with the selected set by Jaccard similarity coefficient and exhibited Jaccard coefficients > 0.85 with the selected set, but were filtered out of the collection on the basis of tree distance or set size. These database IDs link to the source resource's page for that identifier, as in the EXTERNAL_DETAILS_URL field.

This redundancy filtering procedure also applies to the corresponding collections in the Mouse MSigDB collections (M5:GO and M2:CP:Reactome).
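
For readers unfamiliar with the Jaccard similarity coefficient, it is the size of the intersection of two gene sets divided by the size of their union. The sketch below shows only that coefficient and the > 0.85 threshold; it does not reproduce the clustering, tree-distance, or set-size steps of the MSigDB filtering procedure, and the function names and set IDs are hypothetical.

```python
def jaccard(genes_a, genes_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def similar_candidates(selected_genes, candidates, threshold=0.85):
    """Return IDs of candidate sets whose coefficient exceeds threshold.

    candidates maps a (hypothetical) source database ID to its genes.
    """
    return [
        set_id
        for set_id, genes in candidates.items()
        if jaccard(selected_genes, genes) > threshold
    ]


# Hypothetical gene sets built from placeholder gene names.
selected = {f"GENE{i}" for i in range(20)}
candidates = {
    "SET_A": {f"GENE{i}" for i in range(19)},  # 19 shared of 20 in union
    "SET_B": {f"GENE{i}" for i in range(10)},  # 10 shared of 20 in union
}
similar = similar_candidates(selected, candidates)
```

In this example only SET_A passes, with a coefficient of 19/20 = 0.95; SET_B has a coefficient of 0.5.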

NDEx Biological Network Repository

You can further investigate the genes in your gene list through a query to NDEx, the Network Data Exchange (www.ndexbio.org), an online biological networks repository that is also integrated with Cytoscape, the network analysis and visualization environment (cytoscape.org). Networks are a powerful tool for expressing biological knowledge, including molecular interactions, biological relationships curated from literature, and outputs from analysis of big data.

Enter a list of gene identifiers in the box provided and specify the appropriate species as described in Input Gene Identifiers above. Click on "query NDEx" to send the list of genes to the NDEx IQuery tool (www.ndexbio.org/iquery), which finds pathways enriched for the query genes, networks representing the interactions between those genes and other proteins, and networks representing the associations between those genes and other biological or chemical entities. The NDEx query results page will allow you to:

  • Browse the networks from pathway enrichment, protein association, or gene association searches
  • View query genes that are present in each network
  • Zoom, pan, and inspect networks, and get more information about their nodes and edges by clicking on them
  • Save networks to NDEx, or open them in Cytoscape
  • Perform new searches with other genes

See the IQuery help documentation for more details on using the NDEx query results page. See the NDEx home page for information on how to cite your use of NDEx.