README for WSS method ===================== This file serves as an instruction for using provided R codes to carry out the proposed normalization method on metagenomic data, and to simulate metagenomic data. cite: An effective normalization method for taxonomic analysis of metagenomic sequence count data contact: anling@email.arizona.edu Metagenomic data normalization: =============================== 1. A metagenomic data should be prepared as all samples merged over conditions. In the data, each row stands for a detected taxonomic feature, each column indicates a sample. Specifically, the data should be contained in a R dataframe object, with the names of features as row names, and the ids of samples as column names. For example, a part of a dataset would be like: s1_c1 s2_c1 s3_c1 s4_c1 s5_c1 s1_c2 s2_c2 s3_c2 s4_c2 s5_c2 Lactobacillus 2755 102 204 1770 1691 90 14 90 83 104 Flavobacterium 1393 101 8754 2954 1238 178 72 1324 73 257 Fusobacterium 1355 603 2762 2089 1016 21 3 8 19 5 Kineococcus 187 316 915 813 507 909 103 1125 629 185 Heliobacterium 1693 12 1029 981 114 79 181 761 521 89 2. A R vector, indicating from which condition each column (i.e. sample) of the data is, should also be provided. For example, for above partial data, five samples come from each of the two conditions: 1 1 1 1 1 2 2 2 2 2 3. Here, we provide an example of normalization by using the data from SimuCount.csv, and the condition indices of samples obtained from SimuCond.csv. Both files can be downloaded from our webpage. - Download "WSS.r" from our webpage. Get Rgui or some other R progromming environment running. - example R code: source("WSS.r") #include all the functions defined in WSS.r into current environment SimuCount <- read.csv("SimuCount.csv",row.names=1) #input data SimuCond <- read.csv("SimuCond.csv") #obtain condition indices of samples in SimuCount SimuCond <- factor(SimuCond[,1]) #make the indices as factor value SimuCount.WSS <- WSS(SimuCount, SimuCond) #the key function, WSS, to carry out the normalization #User can then get the estimate of sample scales and the scaling factors SimuCount.WSS$scalEsts SimuCount.WSS$scalFactors Metagenomic data simulation: ============================ 1. Abundance proportions of features for a condition can be obtained from some real dataset, such as the 16S rDNA mouse data described in the main text. The proportions need be saved in a R vector, with the names of features as its names, and the sum of its values should equal to one. 2. A User needs also set parameter values for simulation, such as, number of samples, lower and upper bounds of sample scales, and etc. Detailed meanings of the parameters can be found in the comments within the code. The below example, using abundance proportions estimated from low fat diet condition of the 16S rDNA mouse data, shows how to generate metagenomic data for one condition. User can then write own code to merge datasets simulated from different conditions separately. - Download "Simulation.r". Get Rgui or some other R progromming environment running. - example R code: library('MASS') #the R package is needed for functions in WSS.r to be able to work source("Simulation.r") #include all the functions defined in Simulation.r into current environment ########### estimate of abundance proportions from a real dataset library(metagenomeSeq) #the Bioconductor package which contains the mouse data data(mouseData) mat_mouse = as.data.frame(MRcounts(mouseData)) #save the mouse data into a R data frame mat_mouse<-mat_mouse[which(rowSums(mat_mouse)>=10),] #use the features with sum of their counts being equal to greater than 10 mouse_diet<-pData(mouseData)$diet #diet information to each sample dat.LF<-mat_mouse[,which(mouse_diet=="BK")] #samples from the condition of low fat diet dat.LF<-dat.LF[order(-rowSums(dat.LF)),] #sort the samples according to descending order of sum of counts from each row prop.LF<-rowSums(dat.LF)/sum(rowSums(dat.LF)) #estimated abundance proportions for that condition ########### simulation of data for one condition simuC1 <- SimuOneCond(prop = prop.LF, numspl = 20, lss = 754, uss = 5675, byss = 10, lvlm.coef = c(1.5,1.5), lvlm.sd=1.3, lvlm.range = c(0.5,3.5), seed1 = F, seed2 = F, cond = NULL) #the output of function SimuOneCond is a two-element list object head(Simu.c1$SimuCount) #the head part of simulated counts head(Simu.c1$SimuExp) #the head part of expected counts for simulation Created May 4, 2016