Tuesday, 3 July 2012

One of the most widely used programs for Linkage Disequilibrium (LD) analysis and for the related graphical visualization is Haploview. LD analysis in Haploview requires two different input files. The first one is a ped file. Depending on the type of analysis you are doing, this file can be created in one of several formats considered standard in the scientific community (linkage, HapMap, etc.). The second input file contains information on the genetic markers (name and position) that are analyzed in the study and included in the first file.

The LINKAGE format is the most commonly used ped file format. It includes the following information: pedigree name, individual ID, father's ID, mother's ID, sex, affection status, and marker genotypes (two columns for each marker). The generation of LINKAGE format files is very cumbersome and requires a significant amount of manual file manipulation, especially when high-dimensional datasets must be processed. To facilitate and automate this tedious process, people who use R for the analysis of genetic data can now take advantage of the makeHaploviewInputFile function from the HapEstXXR package. This function takes as input parameters: (i) the information required to create the ped file (pedigree name, individual ID, father's ID, mother's ID, sex, affection status, marker genotypes); (ii) the information required to create the marker file (name and position); and (iii) the names of the ped and marker information files, required by Haploview, that the function writes.

The next example shows an application of this function using two simulated datasets, one for the ped file and the other for the marker file. We suppose that these two datasets are both in CSV format. A copy of the ped dataset is shown below:
This dataset contains data on 10 individuals with genotype information at 4 loci (SNP1-SNP4) in base-call format (SNPs in columns, samples in rows). As for the marker file, a copy of this dataset is shown below:
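Since the two files are simulated, they can also be generated directly in R. The following sketch is one way to do this; the column names (Family.ID, ID, Father.ID, Mother.ID, Sex, AffectionStatus, SNP1-SNP4), the allele calls and the coding of sex and affection status are assumptions chosen to match the code used later in this post:

```r
set.seed(1)

# Simulated ped information: 10 unrelated individuals (founders, so parent IDs = 0)
# Column names are assumptions matching the accessors used later in the post
myPedfile <- data.frame(
  Family.ID       = paste0("FAM", 1:10),
  ID              = 1:10,
  Father.ID       = 0,
  Mother.ID       = 0,
  Sex             = sample(1:2, 10, replace = TRUE),   # 1 = male, 2 = female
  AffectionStatus = sample(1:2, 10, replace = TRUE))   # 1 = unaffected, 2 = affected

# Genotypes at 4 loci in base-call format (SNPs in columns, samples in rows)
for (snp in paste0("SNP", 1:4))
  myPedfile[[snp]] <- sample(c("AA", "AG", "GG"), 10, replace = TRUE)

# Marker file: SNP name and chromosomal position (positions are made up)
myMarkers <- data.frame(Name     = paste0("SNP", 1:4),
                        Position = sort(sample(1e5:2e5, 4)))

write.csv(myPedfile, "myPedFile.csv", row.names = FALSE)
write.csv(myMarkers, "myMarkersFile.csv", row.names = FALSE)
```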


The first thing to do is to import these datasets into R:

myPedfile <- read.csv("myPedFile.csv", header = TRUE)
myMarkers <- read.csv("myMarkersFile.csv", header = TRUE)


Then, we convert the genotype data included in the first dataset into "snp" objects by using the setupSNP function of the SNPassoc package:

myGeno <- setupSNP(myPedfile, 7:10, sep = "")

Before using the makeHaploviewInputFile function we need to convert the genotypes into gene contents (the number of copies of a particular allele in a genotype is referred to as the gene content). In fact, the genotype matrix used by the makeHaploviewInputFile function is in a numeric format. The additive function of the SNPassoc package takes a "snp" object from myGeno as an input argument and returns a numeric variable (0, 1 or 2) based on the number of copies of the minor allele. By executing this command on each locus (column), we obtain the numeric format of myGeno. An example of R code to obtain this numeric dataset is shown below:

snpCols <- 7:10  # columns of myGeno that contain the SNP genotypes

# Additive-coded (0/1/2 copies of the minor allele) version of the genotypes
numericDataset <- data.frame(matrix(NA, nrow(myGeno), length(snpCols)))
for (i in seq_along(snpCols)) {
  numericDataset[, i] <- additive(myGeno[, snpCols[i]])
}
names(numericDataset) <- names(myGeno)[snpCols]

Since in the genotype matrix used by the makeHaploviewInputFile function subjects homozygous for the major (most frequent) allele are coded as 1, those homozygous for the minor (less frequent) allele are coded as 2 and the heterozygous subjects as 3, we convert this data frame to the required coding scheme:


geno <- data.frame(matrix(NA, nrow(numericDataset), ncol(numericDataset)))
for (i in 1:ncol(numericDataset)) {
  geno[which(numericDataset[, i] == 0), i] <- 1  # homozygous for the major allele
  geno[which(numericDataset[, i] == 1), i] <- 3  # heterozygous
  geno[which(numericDataset[, i] == 2), i] <- 2  # homozygous for the minor allele
}
names(geno) <- names(numericDataset)
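As an aside, the same recoding can be done without a loop, using a lookup vector indexed by the additive code. The toy numericDataset below stands in for the one built above, and the mapping follows the coding scheme described in the text (major-allele homozygote → 1, minor-allele homozygote → 2, heterozygote → 3):

```r
# Toy additive-coded genotypes (0/1/2 copies of the minor allele),
# standing in for the numericDataset built in the text
numericDataset <- data.frame(SNP1 = c(0, 1, 2, 1),
                             SNP2 = c(2, 2, 0, 1))

# Lookup vector: element i+1 gives the target code for additive code i
#   0 (major-allele homozygote) -> 1
#   1 (heterozygote)            -> 3
#   2 (minor-allele homozygote) -> 2
recode <- c(1, 3, 2)

geno <- as.data.frame(lapply(numericDataset, function(x) recode[x + 1]))
```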


We can now define the input arguments for the makeHaploviewInputFile function:


PN  <- myPedfile$Family.ID
ID  <- myPedfile$ID
FID <- myPedfile$Father.ID
MID <- myPedfile$Mother.ID
Sex <- myPedfile$Sex
AS  <- myPedfile$AffectionStatus
SNPID      <- myMarkers[, 1]
Coordinate <- myMarkers[, 2]


Finally, using the makeHaploviewInputFile function we obtain the two required input files:

makeHaploviewInputFile(PN, ID, FID, MID, Sex, AS, geno, SNPID, Coordinate, "pedfile", "markerfile")

Tuesday, 29 May 2012

On the use of the chi-square test for checking Hardy-Weinberg equilibrium

Among the different methods for testing deviations from Hardy-Weinberg equilibrium (HWE), the chi-square goodness-of-fit test is the most popular. In the case of a Single Nucleotide Polymorphism (SNP), this test is usually applied to a table of observed and expected genotype counts, like that shown below:


           CC     CT     TT
Observed   200    100     25
Expected   192    115     17


Under the null hypothesis of no departure from HWE, this test statistic is asymptotically distributed as a central χ². For this reason, determining the significance of a departure should be relatively easy. However, in the case of HWE testing, although the chi-square value is relatively easy to calculate, the determination of its degrees of freedom is not so clear. As pointed out in the book by Andreas Ziegler and Inke König published in 2010, two different scenarios can be distinguished in the application of this test: if the allele frequencies of the analyzed polymorphism are estimated from the current data, the number of degrees of freedom to adopt is 1 (number of expected genotype classes - 1 - 1, because one allele frequency is estimated); if the allele frequencies are specified in advance, the number of degrees of freedom to adopt is 2 (number of expected genotype classes - 1).
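The corresponding 5% critical values are easily obtained in R with qchisq():

```r
# Chi-square critical values at the 0.95 level for the two scenarios
qchisq(0.95, df = 1)  # ~3.84: allele frequencies estimated from the data
qchisq(0.95, df = 2)  # ~5.99: allele frequencies specified in advance
```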

Suppose we are interested in testing HWE for a generic SNP for which the observed genotype counts are 200 for genotype CC, 100 for genotype CT and 25 for genotype TT. Consider two situations: (a) the allele frequencies of the analyzed polymorphism are estimated from the current data; (b) the allele frequencies are specified in advance.



Case A
The frequencies of the C and T allele are given by:


Fr_C = (2x200 + 100)/[2x(200+100+25)] = 0.77


Fr_T = 1-Fr_C = (2x25 + 100)/[2x(200+100+25)] = 0.23


The expected counts under HWE are 192 (0.77x0.77x325), 115 (2x0.77x0.23x325) and 17 (0.23x0.23x325) for the CC, CT and TT genotypes, respectively. The 0.95 quantile of the chi-square distribution with a single degree of freedom is 3.84. Using these data we obtain a test statistic of 5.78. Since this value is higher than the critical value, we reject the null hypothesis of HWE for the analysed SNP.


Case B
Suppose that we know in advance (i.e. from literature data) that the allele frequencies of the previously analyzed SNP are 0.77 for the C allele and 0.23 for the T allele. As in the previous case, on the basis of these data the expected counts under HWE are 192, 115 and 17 for the CC, CT and TT genotypes, respectively. The 0.95 quantile with two degrees of freedom is in this case 5.99. Using the same genotypic data of the previous example we obtain a test statistic of about 5.8. Since this value is below the critical value, we do not reject the null hypothesis of HWE for the analysed SNP.
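Both cases can be reproduced in a few lines of base R; the genotype counts are those used above:

```r
obs <- c(CC = 200, CT = 100, TT = 25)
N   <- sum(obs)

## Case A: allele frequencies estimated from the data (df = 1)
p <- unname((2 * obs["CC"] + obs["CT"]) / (2 * N))  # frequency of the C allele
q <- 1 - p
expA <- N * c(p^2, 2 * p * q, q^2)    # expected counts under HWE
chiA <- sum((obs - expA)^2 / expA)    # approx. 5.78
chiA > qchisq(0.95, df = 1)           # TRUE  -> reject HWE

## Case B: allele frequencies specified in advance (df = 2)
p0 <- 0.77
q0 <- 0.23
expB <- N * c(p0^2, 2 * p0 * q0, q0^2)
chiB <- sum((obs - expB)^2 / expB)    # approx. 5.8
chiB > qchisq(0.95, df = 2)           # FALSE -> do not reject HWE
```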

Thursday, 24 May 2012

Robust association tests

In order to determine a potential association between the variability of a set of genetic markers and a given complex trait, several statistical tests are commonly employed. Among these, Fisher's exact test, the chi-square test and the Cochran-Armitage trend test are the most popular and frequently used in case-control genetic association studies. All these tests and their characteristics are presented in the recent book by Andreas Ziegler and Inke König, "A Statistical Approach to Genetic Epidemiology", published in 2010. However, in addition to several other important factors (power, assumptions, etc.), a critical aspect in the application of such tests is the coding scheme adopted in the association analysis. Indeed, the effectiveness (power) of these tests also depends on the coding scheme adopted for handling the available genetic information (categorical data). This is especially true for complex traits for which the mode of action of the analyzed genetic variants (dominant, recessive, additive, etc.) is generally unknown. As a result, to take the genetic model uncertainty into account, some authors have adopted the strategy of using not a single but several coding schemes in one association analysis. Although this kind of approach increases the power to detect possible associations, it also dramatically increases the number of false-positive results. To address both the problem of genetic model uncertainty and the problem of multiple comparisons, several "robust" tests have been developed. Among these, the most popular is the MAX test (or MAX3), originally proposed by Freidlin and co-workers (2002) and subsequently modified and improved by Zang and co-workers. This last version was implemented in the R packages SNPassoc and Rassoc (Zang et al., 2010). The most recent versions of these tests, RobustSNP and the robust Mantel-Haenszel test, also allow adjustment for covariate effects (So and Sham, 2011; Zang and Fung, 2011).
Owing both to their robustness with respect to the adopted genetic model (dominant, additive or recessive) and to their ability to handle genome-wide association (GWA) studies, these tests could become the standard methodology in future case-control genetic association studies.
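As an illustration of the idea behind MAX3, the following base-R sketch computes the maximum of three Cochran-Armitage trend statistics (recessive, additive and dominant scores) on made-up genotype counts, using prop.trend.test() from the stats package. Note that the p-value of MAX3 cannot be read from a one-degree-of-freedom chi-square distribution, because the three statistics are correlated; packages such as Rassoc handle this properly (e.g. by resampling or asymptotic methods).

```r
# Illustrative (made-up) case-control genotype counts:
# 0, 1 and 2 copies of the putative risk allele
cases    <- c(240, 120, 40)
controls <- c(300, 90, 10)
totals   <- cases + controls

# Score vectors corresponding to the three genetic models
scores <- list(recessive = c(0, 0, 1),
               additive  = c(0, 1, 2),
               dominant  = c(0, 1, 1))

# Cochran-Armitage trend statistic (chi-square scale) for each model
z2 <- sapply(scores, function(s)
  unname(prop.trend.test(cases, totals, score = s)$statistic))

# MAX3: the maximum of the three trend statistics
max3 <- max(z2)
```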