Tutorials for msea¶
Tutorial 1: Basic walk-through¶
This tutorial demonstrates the basic functionalities in the MSEA package.
We start by importing the module and relevant data structure:
from pprint import pprint
import msea
from msea import SetLibrary
Next, load a reference microbe-set library from a GMT file. The msea package provides a read_gmt() function to parse a GMT file into a dictionary of sets:
gmt_filepath = \
'https://bitbucket.org/wangz10/msea/raw/aee6dd184e9bde152b4d7c2f3c7245efc1b80d23/msea/data/human_genes_associated_microbes/set_library.gmt'
d_gmt = msea.read_gmt(gmt_filepath)
print('Number of microbe-sets:', len(d_gmt))
# Look at a couple of reference sets in the library
pprint(list(d_gmt.items())[:3])
Step 1: perform MSEA for a input microbe-set against the library of reference sets:
microbe_set_input = set(['Colwellia',
'Deinococcus',
'Idiomarina',
'Neisseria',
'Pseudidiomarina',
'Pseudoalteromonas'])
# this can be done using the `msea.enrich` function
msea_result = msea.enrich(microbe_set_input, d_gmt=d_gmt, universe=1000)
# check the top enriched reference microbe-sets
print(msea_result.head())
Step 2: perform MSEA with adjustment of expected ranks for reference sets. Sometimes certain reference microbe-sets in a library are more likely to be enriched by chance. We can adjust this by empirically estimating the null distributions of the ranks of the reference sets, then uses z-score to quantify if observed ranks are significantly different from the expected ranks.
To do that, it is easier through the SetLibrary class:
set_library = SetLibrary.load(gmt_filepath)
The SetLibrary.get_empirical_ranks() method helps compute the expected ranks and store the means and standard deviations of the ranks from the null distribution:
set_library.get_empirical_ranks(n=20)
print(set_library.rank_means.shape, set_library.rank_stds.shape)
After that, we can perform MSEA with this adjustment:
msea_result_adj = set_library.enrich(
microbe_set_input, adjust=True, universe=1000)
print(msea_result_adj.head())
Tutorial 2: Uses MSEA as part of a microbiomic data processing pipeline¶
This tutorial demonstrates a case study demonstrating how MSEA can be integrated into a computational pipeline analyzing a microbiome profiling dataset.
This tutorial requires some additional Python packages:
import numpy as np
import pandas as pd
from biom import load_table # reqired for parsing BIOM formated dataset
# pip install biom-format
from skbio.stats.composition import ancom
# pip install scikit-bio
import msea
from msea import SetLibrary
We started with loading and preparing 16S dataset, which can be downloaded from Qiita:
table = load_table('../msea/data/study_10483/otu_table.biom')
print(table.shape) # OTUs x samples
# Optionally normalize data on the sample axis -> relative abundance
# table.norm(axis='sample', inplace=True)
# Load metadata
meta_df = pd.read_csv(
'../msea/data/study_10483/10483_prep_2122_qiime_20180418-105538.txt',
sep='\t',
index_col='#SampleID')
meta_df = meta_df.loc[table.ids()]
def collapse_f(id_, md):
# collapse OTUs to genus-level
genus_idx = 5
return '; '.join(md['taxonomy'][:genus_idx + 1])
table_g = table.collapse(collapse_f, axis='observation')
print(table_g.shape)
# check the microbial species:
print(table_g.ids(axis='observation')[:5])
# convert to a pd.DataFrame
df_g = pd.DataFrame(
table_g.matrix_data.toarray().T,
columns=table_g.ids(axis='observation'),
index=table_g.ids()
)
# Select a subset of samples from genotype 'Thy1-aSyn' for further analysis
meta_df_sub = meta_df.loc[meta_df['genotype'] == 'Thy1-aSyn']
df_g_sub = df_g.loc[meta_df['genotype'] == 'Thy1-aSyn']
print(meta_df_sub['donor_status'].value_counts())
Next, we perform DA analysis using ANCOM to identify DA microbes:
ancom_df, percentile_df = ancom(df_g_sub + 1, # adding pseudocounts
meta_df_sub['donor_status'],
alpha=0.1,
multiple_comparisons_correction='holm-bonferroni')
microbes_DA = ancom_df.loc[ancom_df['Reject null hypothesis']].index
print(microbes_DA)
# retain genus names
microbes_DA = filter(None, [s.split('; ')[-1].lstrip('g__')
for s in microbes_DA])
# convert to set
microbes_DA = set(microbes_DA)
print(microbes_DA)
Finally, we can perform MSEA for DA microbes we just identified:
set_lib = SetLibrary.load(
'../msea/data/human_genes_associated_microbes/set_library.gmt',
rank_means_file='../msea/data/human_genes_associated_microbes/rank_means.npy',
rank_stds_file='../msea/data/human_genes_associated_microbes/rank_stds.npy')
msea_result = set_lib.enrich(microbes_DA, adjust=True, universe=1000)
print(msea_result.head())