Tutorials for msea

Tutorial 1: Basic walk-through

This tutorial demonstrates the basic functionalities in the MSEA package.

We start by importing the module and relevant data structure:

from pprint import pprint
import msea
from msea import SetLibrary

Next, load a reference microbe-set library from a GMT file. The msea package provides a read_gmt() function to parse a GMT file into a dictionary of sets:

gmt_filepath = \
    'https://bitbucket.org/wangz10/msea/raw/aee6dd184e9bde152b4d7c2f3c7245efc1b80d23/msea/data/human_genes_associated_microbes/set_library.gmt'

d_gmt = msea.read_gmt(gmt_filepath)
print('Number of microbe-sets:', len(d_gmt))
# Look at a couple of reference sets in the library
pprint(list(d_gmt.items())[:3])

Step 1: perform MSEA for a input microbe-set against the library of reference sets:

microbe_set_input = set(['Colwellia',
                        'Deinococcus',
                        'Idiomarina',
                        'Neisseria',
                        'Pseudidiomarina',
                        'Pseudoalteromonas'])
# this can be done using the `msea.enrich` function
msea_result = msea.enrich(microbe_set_input, d_gmt=d_gmt, universe=1000)
# check the top enriched reference microbe-sets
print(msea_result.head())

Step 2: perform MSEA with adjustment of expected ranks for reference sets. Sometimes certain reference microbe-sets in a library are more likely to be enriched by chance. We can adjust this by empirically estimating the null distributions of the ranks of the reference sets, then uses z-score to quantify if observed ranks are significantly different from the expected ranks.

To do that, it is easier through the SetLibrary class:

set_library = SetLibrary.load(gmt_filepath)

The SetLibrary.get_empirical_ranks() method helps compute the expected ranks and store the means and standard deviations of the ranks from the null distribution:

set_library.get_empirical_ranks(n=20)
print(set_library.rank_means.shape, set_library.rank_stds.shape)

After that, we can perform MSEA with this adjustment:

msea_result_adj = set_library.enrich(
    microbe_set_input, adjust=True, universe=1000)
print(msea_result_adj.head())

Tutorial 2: Uses MSEA as part of a microbiomic data processing pipeline

This tutorial demonstrates a case study demonstrating how MSEA can be integrated into a computational pipeline analyzing a microbiome profiling dataset.

This tutorial requires some additional Python packages:

import numpy as np
import pandas as pd
from biom import load_table  # reqired for parsing BIOM formated dataset
# pip install biom-format
from skbio.stats.composition import ancom
# pip install scikit-bio
import msea
from msea import SetLibrary

We started with loading and preparing 16S dataset, which can be downloaded from Qiita:

table = load_table('../msea/data/study_10483/otu_table.biom')
print(table.shape)  # OTUs x samples
# Optionally normalize data on the sample axis -> relative abundance
# table.norm(axis='sample', inplace=True)

# Load metadata
meta_df = pd.read_csv(
    '../msea/data/study_10483/10483_prep_2122_qiime_20180418-105538.txt',
    sep='\t',
    index_col='#SampleID')
meta_df = meta_df.loc[table.ids()]


def collapse_f(id_, md):
    # collapse OTUs to genus-level
    genus_idx = 5
    return '; '.join(md['taxonomy'][:genus_idx + 1])


table_g = table.collapse(collapse_f, axis='observation')
print(table_g.shape)

# check the microbial species:
print(table_g.ids(axis='observation')[:5])
# convert to a pd.DataFrame
df_g = pd.DataFrame(
    table_g.matrix_data.toarray().T,
    columns=table_g.ids(axis='observation'),
    index=table_g.ids()
)
# Select a subset of samples from genotype 'Thy1-aSyn' for further analysis
meta_df_sub = meta_df.loc[meta_df['genotype'] == 'Thy1-aSyn']
df_g_sub = df_g.loc[meta_df['genotype'] == 'Thy1-aSyn']
print(meta_df_sub['donor_status'].value_counts())

Next, we perform DA analysis using ANCOM to identify DA microbes:

ancom_df, percentile_df = ancom(df_g_sub + 1,  # adding pseudocounts
                                meta_df_sub['donor_status'],
                                alpha=0.1,
                                multiple_comparisons_correction='holm-bonferroni')
microbes_DA = ancom_df.loc[ancom_df['Reject null hypothesis']].index
print(microbes_DA)

# retain genus names
microbes_DA = filter(None, [s.split('; ')[-1].lstrip('g__')
                            for s in microbes_DA])
# convert to set
microbes_DA = set(microbes_DA)
print(microbes_DA)

Finally, we can perform MSEA for DA microbes we just identified:

set_lib = SetLibrary.load(
    '../msea/data/human_genes_associated_microbes/set_library.gmt',
    rank_means_file='../msea/data/human_genes_associated_microbes/rank_means.npy',
    rank_stds_file='../msea/data/human_genes_associated_microbes/rank_stds.npy')
msea_result = set_lib.enrich(microbes_DA, adjust=True, universe=1000)
print(msea_result.head())