README for MetaG method
=======================

This file explains how to use the provided R code to apply the proposed
normalization method to metagenomic data, and to generate simulated
metagenomic data.

cite: A Generic Normalization Method for Metagenomic Sequencing Data
contact: anling@email.arizona.edu


Metagenomic data normalization:
===============================

1. Prepare the metagenomic data as a single table in which the samples from
   all conditions are merged. Each row is a detected feature and each column
   is a sample. The data should be stored in an R data frame, with the
   feature names as row names and the sample ids as column names. For
   example, a corner of such a data set looks like:

                  s1_c1 s2_c1 s3_c1 s4_c1 s5_c1 s1_c2 s2_c2 s3_c2 s4_c2 s5_c2
   Lactobacillus   2755   102   204  1770  1691    90    14    90    83   104
   Flavobacterium  1393   101  8754  2954  1238   178    72  1324    73   257
   Fusobacterium   1355   603  2762  2089  1016    21     3     8    19     5
   Kineococcus      187   316   915   813   507   909   103  1125   629   185
   Heliobacterium  1693    12  1029   981   114    79   181   761   521    89

2. Also provide an R vector that gives the condition of each column (i.e.
   sample) of the data. For the partial data above, it is:

   [1] 1 1 1 1 1 2 2 2 2 2

3. The example below normalizes the data in SimuCount.csv, using the
   condition indices of the samples in SimuCond.csv (the file names read by
   the example code). Both files can be downloaded from our webpage.

   - Download "MetaG.r" from our webpage. Start Rgui or another R
     programming environment.
   - example R code:

     require(sROC)      # R package needed by the functions in MetaG.r
     source("MetaG.r")  # load all functions defined in MetaG.r into the current environment

     SimuCount <- read.csv("SimuCount.csv", row.names = 1)                # input data
     SimuCond  <- read.csv("SimuCond.csv", row.names = 1, header = FALSE) # condition indices of samples
     SimuCond  <- factor(SimuCond[, 1])                                   # convert the indices to a factor
     scale <- 20000                                                       # normalization scale, to be chosen by the user
     SimuMetaG <- MetaG(SimuCount, SimuCond, scale)  # the key function, MetaG, carries out the normalization

     # The scaling factors of the samples in each condition can also be
     # calculated with the function scalFactors.
     scalFactors(SimuCount[, which(SimuCond == 1)])
     scalFactors(SimuCount[, which(SimuCond == 2)])


Metagenomic data simulation:
============================

1. Obtain the abundance proportions of the features for a condition before
   simulation. Save the proportions in an R vector, with the feature names
   as its names and with values that sum to one. For example, partial
   proportions look like:

               Reinekea        Lactobacillus       Flavobacterium
             0.17733302           0.12417724           0.09997111
          Fusobacterium          Kineococcus       Heliobacterium
             0.04178289           0.03328868           0.03123473
      Propionibacterium          Bacteroides Pseudoflavonifractor
             0.03037323           0.02957476           0.02802511
                Blautia
             0.02607622

2. The user also needs to set parameter values for the simulation, such as
   the number of samples and the lower and upper bounds of the sample scale.
   The meaning of each parameter and the settings for simulation are
   described in the comments in the code. The example below uses the
   abundance proportions from abunProp_c1.csv and abunProp_c2.csv, available
   on our webpage, to show how to generate metagenomic data for one
   condition or for two conditions. Users can then easily write their own
   code to simulate more than two conditions.

   - Download "Simulation.r". Start Rgui or another R programming
     environment.
   - example R code:

     require(MASS)           # R package needed by the functions in Simulation.r
     source("Simulation.r")  # load all functions defined in Simulation.r

     # read the proportions of condition 1 and put them in a named vector
     input.c1 <- read.csv("abunProp_c1.csv", header = FALSE)
     prop.c1 <- input.c1[, 2]
     names(prop.c1) <- as.character(input.c1[, 1])

     # read the proportions of condition 2 and put them in a named vector
     input.c2 <- read.csv("abunProp_c2.csv", header = FALSE)
     prop.c2 <- input.c2[, 2]
     names(prop.c2) <- as.character(input.c2[, 1])

     # parameters set for simulation for condition 1
     numspl.c1 <- 10; lss.c1 <- 10000; uss.c1 <- 30000; byss.c1 <- 100

     # parameters set for simulation for condition 2
     numspl.c2 <- 24; lss.c2 <- 20000; uss.c2 <- 40000; byss.c2 <- 100

     ########### simulation of data for one condition ######################
     Simu.c1 <- SimuOneCond(prop.c1, numspl.c1, lss.c1, uss.c1, byss.c1,
                            seed1 = 2014, seed2 = 2014, cond = "c1")
     # the output of function SimuOneCond is a two-element list object
     head(Simu.c1$SimuCount)  # the head part of simulated counts
     head(Simu.c1$SimuExp)    # the head part of expected counts for simulation

     ########### simulation of data for two conditions ######################
     Simu <- SimuTwoConds(
       args.c1 = list(prop.c1, numspl.c1, lss.c1, uss.c1, byss.c1,
                      seed1 = 2014, seed2 = 2014, cond = "c1"),
       args.c2 = list(prop.c2, numspl.c2, lss.c2, uss.c2, byss.c2,
                      seed1 = 2014, seed2 = 2014, cond = "c2"))
     # the output
     head(Simu$SimuCount)  # the head part of simulated counts
     head(Simu$SimuExp)    # the head part of expected counts for simulation


Created July 10, 2014
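
Note on checking inputs: the input formats described in this file (a count
data frame with one column per sample, a condition vector with one entry per
column, and a named proportion vector whose values sum to one) can be
verified before calling MetaG or the simulation functions. The following is a
minimal sketch in base R only. It is not part of MetaG.r or Simulation.r; the
helper names check_metag_input and check_proportions are hypothetical, and
the toy values are copied from the examples in this file. The partial
proportions shown in this file do not sum to one on their own, so the sketch
rescales them for illustration only.

```r
# Hypothetical helpers for illustration; NOT part of MetaG.r or Simulation.r.

# Check a count table and its condition vector before normalization.
check_metag_input <- function(counts, cond) {
  stopifnot(is.data.frame(counts))                          # features x samples
  stopifnot(all(vapply(counts, is.numeric, logical(1))))    # every column numeric
  stopifnot(all(as.matrix(counts) >= 0))                    # no negatives, no NA
  stopifnot(length(cond) == ncol(counts))                   # one condition per sample
  stopifnot(!anyNA(cond))
  invisible(TRUE)
}

# Check an abundance proportion vector before simulation.
check_proportions <- function(prop, tol = 1e-6) {
  stopifnot(is.numeric(prop), !is.null(names(prop)))        # named numeric vector
  stopifnot(all(prop >= 0))
  stopifnot(abs(sum(prop) - 1) < tol)                       # sums to one
  invisible(TRUE)
}

# Toy count table: three features, two samples from each of two conditions.
counts <- data.frame(
  s1_c1 = c(2755, 1393, 1355),
  s2_c1 = c(102, 101, 603),
  s1_c2 = c(90, 178, 21),
  s2_c2 = c(14, 72, 3),
  row.names = c("Lactobacillus", "Flavobacterium", "Fusobacterium")
)
cond <- factor(c(1, 1, 2, 2))
check_metag_input(counts, cond)

# Toy proportions: a partial vector, rescaled here only so that it sums to one.
raw  <- c(Reinekea = 0.17733302, Lactobacillus = 0.12417724,
          Flavobacterium = 0.09997111)
prop <- raw / sum(raw)
check_proportions(prop)
```

Both helpers stop with an error at the first failed condition, so a mismatch
between the number of samples and the length of the condition vector is
reported before any normalization or simulation is attempted.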