A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes 

This is a brief introduction to the R code that implements the proposed method for detecting features that are significantly differentially abundant between metagenomic communities or conditions.

Input feature count matrix and phenotype matrix:
The feature count matrix contains the number of observations of each feature within each subject. The element in the ith row and the jth column is the total number of reads (or the relative abundance) of feature i in sample j. The entire matrix is tab-delimited and contains labels for each feature and each subject in the following format:

\tsubject1\tsubject2\tsubject3\t ....\tsubjectN\n
feature1\t391\t729\t...
feature2\t668\t1978\t...
feature3\t174\t12\t...
feature4\t0\t58\t...
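
As an illustration of this layout, the following is a minimal sketch that writes a small count matrix in the format shown above. The third subject's values, the object names, and the temporary output file are hypothetical and serve only as placeholders.

```r
# Hypothetical toy counts: 4 features (rows) x 3 subjects (columns).
# The third column is made up for illustration.
counts <- matrix(
  c(391,  729, 105,
    668, 1978, 812,
    174,   12,  40,
      0,   58,   7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("feature", 1:4), paste0("subject", 1:3))
)

# col.names = NA makes write.table emit an empty first header cell,
# which produces the leading tab in the first row.
out_file <- tempfile(fileext = ".txt")
write.table(counts, file = out_file, sep = "\t", quote = FALSE, col.names = NA)

# Show the file as plain text to compare with the layout above.
first_lines <- readLines(out_file)
cat(first_lines, sep = "\n")
```

The header row has one empty cell followed by the subject labels, and every later row starts with a feature label.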

The phenotype matrix is also tab-delimited and contains the phenotype condition of each subject in the following format:

\tSample\Phenotype \n
Subject1\tDiseased
Subject2\tDiseased
...
Subject11\tNormal

Note the tab at the beginning of the first row. Sample feature count and phenotype matrices (abundance.csv and phenotype.csv) are provided at http://cals.arizona.edu/~anling/software/software.htm.
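
Before running the analysis it can help to confirm that the two matrices describe the same subjects. The sketch below uses hypothetical toy objects and assumes subject labels are held as column names of the count matrix and as row names of the phenotype table. Whether TwoStage_Package matches subjects by label or by position is not stated here, so treat this only as a sanity check. Note that label comparison in R is case-sensitive ("subject1" and "Subject1" differ).

```r
# Hypothetical toy inputs standing in for the real matrices.
counts <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("feature", 1:3), paste0("subject", 1:4))
)
phenotype <- data.frame(
  Phenotype = c("Normal", "Diseased", "Normal", "Diseased"),
  row.names = c("subject3", "subject1", "subject4", "subject2")
)

# Subjects present in one table but not in the other.
missing_in_pheno  <- setdiff(colnames(counts), rownames(phenotype))
missing_in_counts <- setdiff(rownames(phenotype), colnames(counts))
stopifnot(length(missing_in_pheno) == 0, length(missing_in_counts) == 0)

# Reorder phenotype rows so they follow the count matrix columns.
phenotype <- phenotype[colnames(counts), , drop = FALSE]

# Number of subjects per condition.
group_sizes <- table(phenotype$Phenotype)
print(group_sizes)
```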

R commands
Once you have R up and running:
1. Load the source file TwoStage_Package.r:
> source("D:/dataset/TwoStage_Package.r")

2. Load a feature count matrix and phenotype matrix:
> count <- read.csv(file = "abundance.csv")
> phenotype  <-  read.csv(file = "phenotype.csv")
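
The input format is described above as tab-delimited, while read.csv splits on commas by default. Which delimiter the distributed sample files actually use is not stated here, so it is worth checking that a file parsed as intended before analysis. The following sketch uses a hypothetical temporary file to show how the two cases differ. The row.names = 1 argument is also an assumption, and the shape TwoStage_Package expects should be confirmed against the sample files.

```r
# Hypothetical tab-delimited file in the layout described above.
tab_file <- tempfile(fileext = ".csv")
writeLines(c("\tsubject1\tsubject2",
             "feature1\t391\t729",
             "feature2\t668\t1978"),
           tab_file)

# Default read.csv splits on commas, so each line becomes a single field.
as_comma <- read.csv(file = tab_file)

# Telling read.csv to split on tabs recovers one column per subject;
# row.names = 1 uses the first column (feature labels) as row names.
as_tab <- read.csv(file = tab_file, sep = "\t", row.names = 1)

# A quick look at the dimensions reveals a wrong delimiter immediately.
print(dim(as_comma))
print(dim(as_tab))
```

If the first call yields a single column, the file is tab-delimited and the separator needs to be given explicitly.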

3. Analyze the loaded matrices:
> TwoStage_Package(count, phenotype, "sig.csv")

In this example, "sig.csv" is the name of the output file. Each row of that file describes one significantly differentially abundant feature together with its corresponding statistics.

Output format:
The output file is a tab-delimited text file containing 7 columns in the following order:
1.  Annotation (name of feature)
2.  mean_group1 (the average feature abundance of population1)
3.  sd_group1 (standard deviation of feature abundance of population1)
4.  mean_group2 (the average feature abundance of population2)
5.  sd_group2 (standard deviation of feature abundance of population2)
6.  p.val (p-values)
7.  p.adj (adjusted p-values)
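
The output can be read back into R for further inspection. The sketch below is only an illustration: it builds a hypothetical stand-in for the output file with the column names listed above, then sorts the features by adjusted p-value. The toy values, the stricter cutoff alpha, and the helper column higher_in are assumptions and are not produced by TwoStage_Package. Columns are referred to by name rather than by position.

```r
# Hypothetical stand-in for the file written by TwoStage_Package.
sig_file <- tempfile(fileext = ".csv")
toy <- data.frame(
  Annotation  = c("featureA", "featureB", "featureC"),
  mean_group1 = c(12.5, 3.0, 40.2),
  sd_group1   = c(2.1, 0.8, 5.5),
  mean_group2 = c(30.1, 6.4, 10.9),
  sd_group2   = c(4.0, 1.1, 3.2),
  p.val       = c(0.0004, 0.02, 0.003),
  p.adj       = c(0.0012, 0.03, 0.0045)
)
write.table(toy, sig_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-delimited output.
sig <- read.delim(sig_file, stringsAsFactors = FALSE)

# Keep features below a hypothetical stricter cutoff, most significant first.
alpha <- 0.01
hits <- sig[sig$p.adj < alpha, ]
hits <- hits[order(hits$p.adj), ]

# Record which group shows the higher mean abundance.
hits$higher_in <- ifelse(hits$mean_group1 > hits$mean_group2, "group1", "group2")
print(hits[, c("Annotation", "p.adj", "higher_in")])
```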