Identification of major animal genes in field collected data by use of statistical methods. A review

The purpose of this paper is to present some of the methods currently available for the detection of major genes based on population data. These methods are concerned primarily with quantitative traits, because their application to discrete and threshold traits is not always possible (e.g. tests of deviations from normality). Only some methods for the identification of genes with large effects are presented, together with the directions of methodological investigations and their effects. In the first part of this paper, existing statistical methods (adapted for major gene identification) are described. In the second part, methods of human genetics modified for animal genetics (Major Gene Index - MGI) are presented. Finally, the method currently accepted as the most accurate (segregation analysis) is characterized. Some disadvantages connected with the application of this method are also described.


INTRODUCTION
The classical animal breeding theory for quantitative performance traits is based on a polygenic model of inheritance. However, in recent years several genes considerably affecting these traits have been identified. A major locus is defined as one having an effect of at least one standard deviation of the metric trait, as measured by the difference between the two alternative homozygotes (Roberts and Smith, 1982). The Booroola gene affecting ovulation rate in sheep, the double muscling gene of cattle, the dwarf gene in poultry, the halothane sensitivity gene in pigs and the rapid postweaning growth gene in mice are notable examples. In Hanset's (1982) opinion several traits (e.g. twinning, calving ease, size, and resistance to infections and parasitic diseases) are candidates for a mixed major gene and polygenic inheritance.
ISSN 1230-1388 © Institute of Animal Physiology and Nutrition
This situation requires modifications of breeding programmes, since mixed inheritance (genes with a large effect together with genes with small effects) increases genetic variability compared with purely polygenic inheritance of the traits. Hence, the effects of a major locus often lead to an increase in heritability and genetic correlation. The effect of a major gene is usually masked by a large number of environmental factors, as well as by additive and non-additive polygenic effects. Moreover, the ability to detect a major gene is a function of the magnitude of its effect. To make the best use of the variability induced by these major genes, it is necessary to identify the carrier genotypes and their frequencies, as well as the gene effects and activity in breeding stock. Above all, detection of a major locus is important for gene mapping and for creating transgenic individuals. Different statistical techniques have been suggested for the identification of human, animal and plant genes with large effects (Karlin et al., 1979; Hoeschele, 1988; Elsen et al., 1988; Elsen and Le Roy, 1989; Thomas et al., 1991).
The aim of this study is to present some methods which may be applied to animal breeding field-collected data, mainly for continuous traits.

General analysis of trait distribution
The polygenic inheritance of metric traits is connected with a normal distribution of their phenotypic values (e.g. Knott et al., 1990). Therefore, it is assumed that the effects of major genes lead to deviations from normality. In the most spectacular case (indirect inheritance) the frequency distribution of phenotypic values is multimodal. However, tests of multimodality have not found wide application in detecting a major locus, because mixed inheritance (polygenic + major genes) is relatively rarely combined with more than one mode. It is also not clear whether mixed inheritance is connected with non-normality of the phenotypic distribution. Two parameters have generally been used as measures of deviation from normality: skewness ($\gamma_1$) and kurtosis ($\gamma_2$) (Hammond and James, 1970; Hanset and Michaux, 1985). Testing of the hypothesis of no skewness is based on the following test function:

$$t = \frac{g_1}{s_{g_1}}$$

where:

$$s_{g_1} = \sqrt{\frac{6}{n}} \quad \text{(for large samples only)}, \qquad g_1 = \frac{k_3}{k_2^{3/2}},$$

$k_3$ is the central moment of the third order, $k_2$ is the central moment of the second order, and n is the number of observations. Rejection of the null hypothesis ($H_0: \gamma_1 = 0$) indicates the possible participation of a major locus in inheritance.
The statistic used for testing the significance of the kurtosis takes the following form:

$$t = \frac{g_2}{s_{g_2}}$$

where $s_{g_2}$ for large samples is approximately $\sqrt{24/n}$,

$$g_2 = \frac{k_4}{k_2^2} - 3,$$

and $k_2$, $k_4$ are the central moments of the second and fourth order, respectively.
As in the previous case, rejection of the null hypothesis ($H_0: \gamma_2 = 0$) suggests mixed inheritance of the trait.
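As a minimal sketch of the two large-sample tests described above, the following Python functions compute the skewness and kurtosis coefficients from the central moments and divide each by its approximate standard error, sqrt(6/n) and sqrt(24/n). The function names are hypothetical, and the two-sided p-values taken from the standard normal distribution are an assumption of this sketch, not something stated in the text.

```python
import math
from statistics import NormalDist


def central_moment(values, order):
    """Sample central moment of the given order (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness_kurtosis_tests(values):
    """Large-sample tests of H0: gamma_1 = 0 and H0: gamma_2 = 0.

    Returns (g1, z1, p1, g2, z2, p2). The p-values are two-sided and rely
    on the normal approximation (an assumption of this sketch), so they
    are only meaningful for large n.
    """
    n = len(values)
    k2 = central_moment(values, 2)
    k3 = central_moment(values, 3)
    k4 = central_moment(values, 4)
    g1 = k3 / k2 ** 1.5            # skewness coefficient
    g2 = k4 / k2 ** 2 - 3.0        # kurtosis coefficient (zero under normality)
    z1 = g1 / math.sqrt(6.0 / n)   # g1 over its approximate standard error
    z2 = g2 / math.sqrt(24.0 / n)  # g2 over its approximate standard error
    normal = NormalDist()
    p1 = 2.0 * (1.0 - normal.cdf(abs(z1)))
    p2 = 2.0 * (1.0 - normal.cdf(abs(z2)))
    return g1, z1, p1, g2, z2, p2
```

A small p-value for either coefficient only signals a deviation from normality; it does not by itself demonstrate a major gene.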
From the point of view of application these methods are relatively simple since, unlike some others, they do not require any pedigree information. However, it should be remembered that a deviation from normality is not equivalent to the existence of mixed inheritance. Hence, these tests are of limited usefulness in the detection of a major locus (Le Roy and Elsen, 1992).

Merat's method
The inclusion of individual pedigree information in the evaluation of skewness and kurtosis was investigated by Merat (1968). The method is based on an analysis of the heterogeneity of kurtosis in fullsib (or halfsib) groups, with the basic assumption that if at least one of the parents is heterozygous (e.g. Aa) the kurtosis coefficient will be negative. This raises the problem of the size of fullsib or halfsib families (a small number of individuals per family is connected with a non-normal distribution within the groups). Hence, Merat (1968) proposed a modification of the method, which consists of dividing all the analysed families into two groups according to the magnitude of their variance. The hypothesis concerning the negativity of kurtosis is then tested separately for each group (see also Le Roy and Elsen, 1992). It should be stressed that the division into two groups according to the magnitude of the family (or halfsib group) variances is rather arbitrary. Population specificity is not taken into account here and thus the efficiency of statistical inference may differ depending on the population considered.

The method of parent-offspring regression
This method is based on the assumed linearity of the parent-offspring regression when only genes with additive action are present (Gimelfarb, 1986). A deviation from linearity may be interpreted as a result of polygenic interactions (for instance dominance and epistasis) and, above all, of the action of genes with large effects. A very large population size is frequently associated with numerical barriers. However, a simplified procedure may be used in which, instead of parent (dam or sire)-progeny pairs, pairs consisting of a parent and the fraction of its progeny having the trait level of the parent are included (Snedecor and Cochran, 1980). The method was applied by Hanset and Michaux (1985) in an investigation of the genetic determination of muscular hypertrophy in Belgian White and Blue cattle. It was assumed that the total variation among the R proportions ($p_i$) is measured by a $\chi^2$ with (R-1) degrees of freedom. As there is a score ($X_i$) for each of these proportions, a weighted regression coefficient of $p_i$ on $X_i$ is calculated. The difference $\chi^2 - \chi^2_1$, where $\chi^2_1$ is the component due to the linear regression (1 degree of freedom), is a $\chi^2$ with (R-2) degrees of freedom for testing the deviations of the $p_i$ from linear regression on the $X_i$.
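The partition of the chi-square among proportions described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact computation of Hanset and Michaux (1985): the function name is hypothetical, each proportion is weighted by its group size, and the binomial variance is evaluated at the overall proportion, following the usual test for a linear trend in proportions.

```python
def trend_in_proportions(scores, affected, totals):
    """Split the chi-square among R proportions into regression and deviations.

    scores   -- score X_i attached to each of the R groups
    affected -- number of progeny showing the trait level in each group
    totals   -- number of progeny recorded in each group
    Returns (chi2_total, chi2_regression, chi2_deviation, df_deviation).
    """
    n = sum(totals)
    p_bar = sum(affected) / n              # overall proportion
    pq = p_bar * (1.0 - p_bar)             # binomial variance at p_bar
    props = [a / t for a, t in zip(affected, totals)]

    # Total variation among the R proportions, (R - 1) degrees of freedom.
    chi2_total = sum(t * (p - p_bar) ** 2 for p, t in zip(props, totals)) / pq

    # Weighted regression of p_i on X_i, 1 degree of freedom.
    x_bar = sum(t * x for x, t in zip(scores, totals)) / n
    sxp = sum(t * (p - p_bar) * (x - x_bar)
              for p, x, t in zip(props, scores, totals))
    sxx = sum(t * (x - x_bar) ** 2 for x, t in zip(scores, totals))
    chi2_regression = sxp ** 2 / (sxx * pq)

    # Deviations from linearity, (R - 2) degrees of freedom.
    chi2_deviation = chi2_total - chi2_regression
    return chi2_total, chi2_regression, chi2_deviation, len(scores) - 2
```

A large deviation component relative to a chi-square with (R-2) degrees of freedom points to non-linearity of the regression, which may, but need not, reflect a gene with a large effect.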

Evaluation of variance homogeneity within a family
Many authors (Bishop et al., 1988; Elsen and Le Roy, 1990; Hill and Knott, 1990) have indicated the possibility of applying tests of within-family variance homogeneity for detecting genes with large effects. Within-family variance heterogeneity (the alternative to variance homogeneity) suggests mixed inheritance. In this case the test of Bartlett (1937) is the most frequently applied. It is based on the following $\chi^2$ statistic:

$$\chi^2 = \frac{(n - t)\ln s^2 - \sum_{i=1}^{t}(n_i - 1)\ln s_i^2}{1 + \frac{1}{3(t-1)}\left(\sum_{i=1}^{t}\frac{1}{n_i - 1} - \frac{1}{n - t}\right)}$$

where: $n_i$ is the size of the i-th family, n is the total number of individuals, t is the number of families, $s_i^2$ is the i-th family variance, and $s^2$ is the general (pooled within-family) variance.
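A minimal sketch of the Bartlett statistic for a list of families is given below. The function name is hypothetical; the sketch assumes the standard form of the test, in which the general variance is the pooled within-family variance, the usual small-sample correction factor is applied, and the statistic is referred to a chi-square distribution with t - 1 degrees of freedom.

```python
import math
from statistics import variance


def bartlett_statistic(families):
    """Bartlett chi-square for homogeneity of within-family variances.

    families -- list of lists, one list of phenotypic values per family
                (each family needs at least two records).
    Returns (chi2, degrees_of_freedom).
    """
    t = len(families)
    sizes = [len(f) for f in families]
    n = sum(sizes)
    fam_var = [variance(f) for f in families]   # unbiased s_i^2

    # General variance: within-family variances pooled over families.
    pooled = sum((ni - 1) * vi for ni, vi in zip(sizes, fam_var)) / (n - t)

    numerator = (n - t) * math.log(pooled) - sum(
        (ni - 1) * math.log(vi) for ni, vi in zip(sizes, fam_var))
    correction = 1.0 + (sum(1.0 / (ni - 1) for ni in sizes)
                        - 1.0 / (n - t)) / (3.0 * (t - 1))
    return numerator / correction, t - 1
```

The Bartlett test is known to be sensitive to non-normality, so a significant result should be read together with the distributional checks of the previous section.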
However, in some situations variance homogeneity need not be equivalent to polygenic inheritance, and it is necessary to pay attention to some regularities connected with major gene segregation. Fullsib groups whose trait mean is similar to the population mean usually show a relatively large variance. In contrast, a small variance of fullsib groups is usually combined with an extreme trait mean, when the families consist mainly of individuals of one of the alternative homozygous genotypes, for instance AA or aa. In the latter case, a variance homogeneity test is therefore not an adequate method for the detection of a major locus.

Within-family mean-variance regression method
The above imperfection may be eliminated by relating the variance of each family to its mean, using the curvilinear regression of the logarithm of the family variance on the family mean (Fain, 1978), in which $\log \sigma_i^2$ is the logarithm of the i-th family variance, $\alpha$ is the regression constant, $\mu_i$ is the i-th family mean, and $\beta_2$, $\beta_3$ are regression coefficients.
Verification of the hypothesis concerning mixed inheritance is based on known statistical procedures: the effects in the curvilinear regression are tested with the Fisher-Snedecor F statistic (Mayo et al., 1980). Rejecting this hypothesis is equivalent to non-significance of the components of this model.
It should be stressed that this method is usually applied in human genetics. However, investigations carried out by Le Roy and Elsen (1992) have suggested a wider applicability of this method for the detection of a major locus in animals using both fullsib and halfsib families.

MAJOR GENE INDEX (MGI)
This method is based on the intuitive assumption that under polygenic inheritance the deviation of progeny from the midparent average is smaller than the deviation from either parent (Karlin et al., 1979). On the basis of this assumption, the major gene index can be presented as:

$$MGI(k) = \frac{\sum_{i=1}^{n}\left|O_i - \frac{S_i + D_i}{2}\right|^{k}}{\sum_{i=1}^{n}\left|O_i - S_i\right|^{k/2}\left|O_i - D_i\right|^{k/2}} \qquad [1]$$

where: n is the number of sire (dam)-offspring pairs, $O_i$, $S_i$, $D_i$ are the observations of the i-th offspring, sire and dam, respectively, and k is a known parameter.
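The index can be illustrated with a short Python sketch. It assumes the ratio form usually attributed to Karlin et al. (1979): the sum of the k-th powers of the absolute offspring deviations from the midparent value, divided by the sum of the products of the absolute deviations from sire and from dam, each raised to the power k/2. The function names are hypothetical.

```python
def major_gene_index(offspring, sires, dams, k=1.0):
    """Major gene index MGI(k) for offspring-sire-dam triplets.

    Assumed form (see the lead-in): numerator uses deviations from the
    midparent value, denominator uses deviations from each parent.
    Raises ZeroDivisionError if every offspring equals one of its parents.
    """
    numerator = sum(abs(o - (s + d) / 2.0) ** k
                    for o, s, d in zip(offspring, sires, dams))
    denominator = sum(abs(o - s) ** (k / 2.0) * abs(o - d) ** (k / 2.0)
                      for o, s, d in zip(offspring, sires, dams))
    return numerator / denominator


def mgi_profile(offspring, sires, dams, levels=(0.5, 1.0, 2.0)):
    """Evaluate the index at several values of k, as a dict {k: MGI(k)}."""
    return {k: major_gene_index(offspring, sires, dams, k) for k in levels}
```

Offspring that sit at the midparent value pull the index towards 0, whereas offspring that resemble one parent shrink the denominator and push the index above 1.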
The choice of the value of k is somewhat arbitrary. However, Karlin et al. (1979) recommend evaluating the index at three levels of k (k = 0.5, 1, 2). Calculated values of the index greater than 1 are indicative of major gene inheritance, whereas the opposite case (MGI(k) < 1) indicates polygenic inheritance. An index exceeding 1 for the different values of k, as well as an index increasing with k, also suggests major gene inheritance. When three generations are known, formula [1] may be used after some modification (Le Roy and Elsen, 1992). There are also two other criteria for the identification of a major locus within the so-called Structured Exploratory Data Analysis (SEDA) (Karlin and Williams, 1981): offspring between parents regression (OBP) and the pairwise midparental correlation coefficient (MPCC). The general principle of using these criteria for the detection of major genes is similar to that of the MGI method.

The application of the MGI method to detecting major genes usually requires adjustment of the phenotypes (of parents and progeny) for such influences as sex, age or production season. The classical adjustment procedures based on linear or nonlinear regression (Zuk et al., 1980) or proportional adjustment (Miller et al., 1966) are not sufficiently accurate. Hence, Famula (1986) suggested constructing the major gene index for quantitative traits using Best Linear Unbiased Prediction (BLUP) based on the mixed model:

$$y = Xb + Zg_1 + Zg_2 + e$$

where: y is the m x 1 vector of observations, b is the p x 1 vector of unknown fixed parameters, $g_1$ is the q x 1 vector of unknown random additive genetic effects, $g_2$ is the q x 1 vector of unknown random dominance genetic effects, e is the m x 1 vector of random errors, and X, Z are the m x p and m x q known incidence matrices, respectively. The assumed (co)variance structure is:

$$\mathrm{Var}\begin{bmatrix} g_1 \\ g_2 \\ e \end{bmatrix} = \begin{bmatrix} A\sigma_a^2 & 0 & 0 \\ 0 & D\sigma_d^2 & 0 \\ 0 & 0 & I_m\sigma_e^2 \end{bmatrix}$$

where: A is the m x m additive relationship matrix for m individuals (parents and progeny), D is the m x m dominance relationship matrix for m individuals (parents and progeny), $I_m$ is the identity matrix, and $\sigma_a^2$, $\sigma_d^2$, $\sigma_e^2$ are the components of additive, dominance and residual variance, respectively. More details of the linear model and of prediction with the BLUP method, as well as the rules of construction of these relationship matrices, are given by Kennedy (1989). Best linear unbiased predictors (the so-called animal model) of $g_1$ and $g_2$ are computed from the following mixed model equations (Henderson, 1985):

$$\begin{bmatrix} X'X & X'Z & X'Z \\ Z'X & Z'Z + A^{-1}\sigma_e^2/\sigma_a^2 & Z'Z \\ Z'X & Z'Z & Z'Z + D^{-1}\sigma_e^2/\sigma_d^2 \end{bmatrix} \begin{bmatrix} \hat{b} \\ \hat{g}_1 \\ \hat{g}_2 \end{bmatrix} = \begin{bmatrix} X'y \\ Z'y \\ Z'y \end{bmatrix}$$

The predicted additive genetic values for each individual (dam, sire and progeny) are substituted into the MGI formula [1]. Simulation studies (Famula, 1986) indicated potentially useful applications of the index in some situations, for example when there is complete dominance and a low frequency of the major gene (in the absence of multimodality of the phenotypic distribution). Moreover, the index is more sensitive in detecting single genes of large effect when predicted genetic merit is used instead of phenotypic values.
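To make the prediction step concrete, the sketch below solves the mixed model equations for a deliberately reduced case. It is an illustration under strong assumptions and not Famula's (1986) procedure: only the overall mean is fitted as a fixed effect, each animal has exactly one record (so Z is an identity matrix), the dominance effects are omitted, and the variance ratio is taken as known. All names are hypothetical; the inverse relationship matrix shown is that of an unrelated, non-inbred sire and dam and their single offspring.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def additive_animal_model(y, a_inverse, variance_ratio):
    """Mean estimate and BLUP of additive values for one record per animal.

    variance_ratio -- residual variance divided by additive variance.
    The equations are [[m, 1'], [1, I + A^-1 * ratio]] [mu, g]' = [sum(y), y]'.
    """
    m = len(y)
    lhs = [[float(m)] + [1.0] * m]
    for i in range(m):
        row = [1.0]
        for j in range(m):
            identity = 1.0 if i == j else 0.0
            row.append(identity + variance_ratio * a_inverse[i][j])
        lhs.append(row)
    solution = solve_linear_system(lhs, [sum(y)] + list(y))
    return solution[0], solution[1:]


# Inverse additive relationship matrix for (sire, dam, offspring).
A_INVERSE_TRIO = [[1.5, 0.5, -1.0],
                  [0.5, 1.5, -1.0],
                  [-1.0, -1.0, 2.0]]
```

The predicted values returned for sire, dam and offspring would then replace the phenotypes in the index; a realistic application needs the full pedigree, all relevant fixed effects and estimated variance components.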
The index based on mixed model methodology may be particularly useful in the following two cases: for a sex-limited trait (sire not being measured) or for the progeny-limited trait (a trait measured after slaughter).
Unfortunately, the MGI is not a statistical test, because the error rate cannot be controlled at a given significance level. On the other hand, the method may be widely used as a preliminary indicator of major gene segregation.

SEGREGATION ANALYSIS (SA)
This method was introduced by Elston and Stewart (1971), who used it to test the agreement of the probabilities of single-locus transmission with Mendelian expectations. Next, Morton and MacLean (1974) proposed a new concept of segregation analysis, which included major locus effects as well as polygenic components and random environmental effects. Descriptions and some practical applications of these methods were discussed by Hill and Knott (1990) and Lalouel et al. (1983). Recently, the approaches of both Elston and Stewart (1971) and Morton and MacLean (1974) have been modified for animal genetic problems.
Segregation analysis is a group of methods for major locus identification based on pedigree information: mainly on halfsib groups (Knott et al., 1990, 1992a, 1992b), on fullsib groups, and on all available pedigree information in the so-called complex segregation analysis (Elston and Rao, 1978; Elston, 1980, 1992). Recently, multivariate segregation analysis, which allows explicit tests of hypotheses of major gene pleiotropy, has been developed (e.g. Blangero and Konigsberg, 1991). The general principle of detecting a major gene by segregation analysis is to compare different types of models (polygenic and mixed inheritance) which include both genetic and environmental effects. All pedigree information is summarized in a likelihood function which depends on different parameters (mean and residual variance within genotype groups, heritability, genotype frequencies etc.) and denotes the probability of the observations given a particular transmission hypothesis. The ratio of these likelihood functions provides an answer to the final question concerning mixed inheritance. In this paper, the method is presented on the basis of a hierarchical experimental design typical of many domestic species, mainly poultry and pigs. It is assumed that each of n sires is mated to $m_i$ dams (additionally, sires and dams are unrelated, and environmental effects are not significant) and that each family contains $l_{ij}$ recorded offspring. Two different transmission models are compared.
For polygenic transmission the linear model is as follows:

$$y_{ijk} = \mu + u_i + v_{ij} + e_{ijk}$$

where: $y_{ijk}$ is the observation of the k-th individual, $\mu$ is the overall mean of the trait, $u_i$ is the random effect of the i-th sire, a variable normally distributed with mean 0 and variance $\sigma_u^2$, $v_{ij}$ is the random effect of the j-th dam mated to the i-th sire, a variable normally distributed with mean 0 and variance $\sigma_v^2$, and $e_{ijk}$ is the random error effect, normally distributed with mean 0 and variance $\sigma_e^2$.

Thus, this model depends on the following four parameters: the general mean ($\mu$) and the respective variance components ($\sigma_u^2$, $\sigma_v^2$, $\sigma_e^2$). This model corresponds to the null hypothesis ($H_0$). In the alternative model, a major locus effect is included in the polygenic model:

$$y_{ijk} = \mu_r + u_i + v_{ij} + e_{ijk}$$

where: $\mu_r$ is the mean of progeny with genotype r; $y_{ijk}$, $u_i$, $v_{ij}$, $e_{ijk}$ are as above.
It should be stressed that if other fixed effects are included in these models, the number of estimated parameters will increase. As mentioned, the test statistic is the likelihood ratio of the null ("polygenic inheritance", $H_0$, with likelihood $M_0$) and the general ("mixed inheritance", $H_1$, with likelihood $M_1$) hypotheses:

$$l = -2\ln\frac{M_0}{M_1}$$

where, under the general hypothesis,

$$M_1 = \prod_{i=1}^{n}\sum_{s_i}P(S_i = s_i)\int f(u_i)\prod_{j=1}^{m_i}\sum_{t_{ij}}P(T_{ij} = t_{ij})\int g(v_{ij})\prod_{k=1}^{l_{ij}}\sum_{r_{ijk}}P(R_{ijk} = r_{ijk} \mid s_i, t_{ij})\,h_{r_{ijk}}(y_{ijk} \mid u_i, v_{ij})\,dv_{ij}\,du_i$$

and $M_0$ is the corresponding likelihood with a single mean $\mu$ common to all genotypes. Moreover, $S_i$ is the genotype of the i-th sire and $s_i$ its realization, $T_{ij}$ is the genotype of the j-th dam of sire i and $t_{ij}$ its realization, and $R_{ijk}$ is the genotype of the k-th progeny of dam ij and $r_{ijk}$ its realization; $P(R_{ijk} = r_{ijk} \mid s_i, t_{ij})$ is the probability of $r_{ijk}$ given the genotypes $s_i$ and $t_{ij}$ of sire i and dam ij; f is the distribution of the sire effect $u_i$:

$$f(u_i) = \frac{1}{\sqrt{2\pi}\,\sigma_u}\exp\left(-\frac{u_i^2}{2\sigma_u^2}\right)$$

g is the distribution of the dam effect $v_{ij}$, and h is the distribution of the dependent variable, given $u_i$ and $v_{ij}$:

$$h_{r_{ijk}}(y_{ijk} \mid u_i, v_{ij}) = \frac{1}{\sqrt{2\pi}\,\sigma_e}\exp\left(-\frac{(y_{ijk} - \mu_{r_{ijk}} - u_i - v_{ij})^2}{2\sigma_e^2}\right)$$

Segregation analysis is a more powerful statistical test than the other methods mentioned above. The SA method also provides additional results on the identified major gene, for instance genotypic effects as well as gene and genotype frequencies. Unfortunately, SA is a very difficult method from the numerical point of view. The likelihood presented above is a complicated function, which involves a summation over every combination of major genotypes in a pedigree. These calculations are impracticable even for a small number of offspring per sire (or per sire and dam in fullsib groups), since they require an extremely large computer memory. In order to simplify the calculations, several approximations to this likelihood function have been proposed. Knott et al. (1992a), comparing three approximations in a balanced halfsib experimental design, suggested using so-called Hermite integration to replace the integration in the combined model likelihood (more details of this kind of integration are given by Knott et al. (1992a, b)). Other simplifications of segregation analysis for a one-way classification experimental design have also been studied. It should be stressed that the numerical complications increase if the number of observed individuals is large. In such cases, additional relationships among individuals, overlapping generations and large complex pedigrees with loops become more likely. Thus, the required calculations are often almost impossible to complete.
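The innermost part of the likelihood, the sum over progeny genotypes of the transmission probability times the conditional density h, can be sketched as follows. The sketch assumes a single biallelic autosomal locus with Mendelian segregation, with genotypes coded by the number of copies of allele A; the function names and this coding are hypothetical and serve only as an illustration.

```python
import math

GENOTYPES = (0, 1, 2)  # copies of allele A: aa, Aa, AA (assumed coding)


def transmission_probabilities(sire, dam):
    """P(R = r | s, t) for r = 0, 1, 2 under Mendelian segregation."""
    p_s = sire / 2.0   # probability that the sire transmits allele A
    p_t = dam / 2.0    # probability that the dam transmits allele A
    return [
        (1.0 - p_s) * (1.0 - p_t),
        p_s * (1.0 - p_t) + (1.0 - p_s) * p_t,
        p_s * p_t,
    ]


def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def progeny_density(y, sire, dam, u, v, genotype_means, sigma_e):
    """Sum over r of P(r | s, t) * h_r(y | u, v) for one progeny record.

    genotype_means -- means of progeny with genotypes aa, Aa, AA.
    u, v           -- sire and dam polygenic effects being conditioned on.
    """
    probs = transmission_probabilities(sire, dam)
    return sum(p * normal_density(y, genotype_means[r] + u + v, sigma_e)
               for r, p in zip(GENOTYPES, probs))
```

The full likelihood multiplies such terms over progeny, integrates over the sire and dam effects and sums over parental genotypes, which is where the numerical burden discussed above arises.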
Segregation analysis is also limited from the statistical point of view. Elsen and Le Roy (1989) concluded that the distribution of the likelihood ratio has no regular asymptotic convergence properties in the case of the SA method. The likelihood ratio distribution depends on sample size and on the structure of the experimental data (nuclear families, halfsib or fullsib families). Hence, the main statistical test ($-2\ln M_0/M_1$) is not valid in every case. Another problem in detecting major genes by segregation analysis may be connected with the distribution of the analysed trait. As mentioned in part 2 of this paper, skewness is expected when genes with large effects are segregating. Paradoxically, skewness may lead to the false inference of a major gene under this method (Demenais et al., 1986). Elsen and Le Roy (1989) showed that when the skewness coefficient is larger than 0.2 the inference is largely false. Skewness can be reduced by data transformations, for instance the logarithmic or Box-Cox transformation (Gianola et al., 1990). However, these transformations may lead to a large loss of power (Demenais et al., 1986). Investigations carried out on both transformed and untransformed pig data indicated similar estimates of the parameters.
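The effect of such transformations on skewness can be illustrated with a short sketch. It uses the standard one-parameter Box-Cox form, (y^lambda - 1)/lambda with the logarithm as the limiting case lambda = 0, for strictly positive data; the function names are hypothetical and the choice of lambda is left to the user.

```python
import math


def box_cox(values, lam):
    """One-parameter Box-Cox transformation of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]   # limiting case: logarithm
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Skewness coefficient: third central moment over (second)^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    k2 = sum((v - mean) ** 2 for v in values) / n
    k3 = sum((v - mean) ** 3 for v in values) / n
    return k3 / k2 ** 1.5
```

Comparing `skewness(values)` with `skewness(box_cox(values, lam))` for a few values of `lam` shows how strongly a transformation changes the shape of the distribution, which is also why it can remove part of the signal of a genuine major gene.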