Single nucleotide polymorphism database of candidate genes associated with cow milk protein biosynthesis

A growing number of mutations within milk protein genes and genes associated with milk protein biosynthesis are not classified and described in ways that facilitate the design and interpretation of experiments using multiplex PCR or other high-throughput screening techniques. The aim of the study was to process and catalog all available information on single nucleotide polymorphisms (SNPs) located within genes directly, indirectly or potentially associated with bovine milk protein biosynthesis. All records were divided into 3 groups of polymorphic sequences: milk protein genes, genes associated with milk protein gene regulation and genes potentially associated with milk protein biosynthesis. A database was constructed containing 339 SNPs within 49 genes. Among the 339 SNPs, 316 single nucleotide substitutions, 8 deletions, 5 repeats, 7 indels and 3 insertions were identified. All collected SNPs were described in such a way as to enable the automatic downloading of GenBank records to specialized software and the simultaneous design of PCR primers and allele-specific probes used in microarray technology. It is believed that the collection of SNPs presented in this study will serve as a reliable resource for studies on the genetic determination of milk protein biosynthesis variation and, after wide population screening, also for paternity testing and evolutionary studies in dairy cattle.


INTRODUCTION
Genetic determinants of protein content in ruminant milk have been the subject of many studies for almost 50 years. They were initiated by the discovery of bovine beta-lactoglobulin polymorphism by Aschaffenburg and Drewry (1955). Over the next 30 years most genetic variants of milk protein were characterized. They were classified into two groups: caseins (alpha-S1, CSN1S1; alpha-S2, CSN1S2; beta, CSN2; and kappa, CSN3) and whey proteins (beta-lactoglobulin, LGB, and alpha-lactalbumin, LALBA) (reviewed by Eigel et al., 1984). For many years, this polymorphism was identified at the protein level by observing different electrophoretic mobilities of milk proteins in starch, agarose or polyacrylamide gels. Different populations of dairy cattle were screened to determine the significance of milk protein variants for milk content and yield. Research conducted by numerous groups concluded that polymorphism of kappa-casein and beta-lactoglobulin is strongly associated with the chemical content and technological properties of milk (Jakob and Puhan, 1992; Mao et al., 1992; Walawski et al., 1994). The sequencing of milk protein genes initiated by Gorodetsky et al. (1983) and Steward et al. (1984) enabled the development of methods for genotyping bulls (Leveziel et al., 1988; Rando et al., 1988; Lien et al., 1990). Genotyping of the CSN3 locus was even introduced into breeding programs for A.I. bulls by several commercial companies in the early 1990s.
Today, milk protein genes are among the best-studied genes in livestock. Moreover, the number of other SNPs related to milk protein biosynthesis is constantly growing. The most promising ones were found within the following genes: prolactin, PRL (Sasavage et al., 1982; Hart et al., 1993); the signal transducer and activator of transcription, STAT5 (Antoniou et al., 1998; Flisikowski and Zwierzchowski, 2002); growth hormone, GH (Lageziel et al., 1996); growth hormone receptor, GHR (Falaki et al., 1996; Blott et al., 2003); and the ornithine decarboxylase gene (Yao et al., 1998). Anonymous new loci associated with milk protein content were proposed in QTL experiments (Georges et al., 1995; Ashwell et al., 1997; Vilkki et al., 1997; Mosig et al., 2001; Boichard et al., 2003). All these reports suggest that milk protein content is a polygenic trait determined by variants located not only within milk protein genes and their promoters but also within other genes involved in milk protein biosynthesis. It is thought that the simultaneous genotyping of as many informative SNPs as possible will lead to a better understanding of the genetic background of milk protein content. Currently the best method for typing SNPs that determine complex traits is the DNA microarray (reviewed by Syvänen, 2001, and Kamiński, 2002). This technology, however, requires precise DNA sequence information, mainly on the type and location of each SNP.
The general aim of this work was to construct a database of all available polymorphic sequences directly, indirectly or potentially associated with cow milk protein biosynthesis.

SNP definition
Single nucleotide polymorphisms (SNPs) are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater (Brookes, 1999). In practice, the term SNP is typically used more loosely and encompasses many different types of subtle sequence variations (including small deletions and insertions) with the frequency of the rare allele being less than 1%. To maintain the clarity of this work, the latter SNP definition has been employed.

Database structure
All records of the database were organized in a table (Table 1) consisting of the following columns: position in the cytogenetic map (cattle chromosome), locus symbol, bovine gene name, sequence description (length, type (DNA or RNA), GenBank acc. no.) and SNP description (position, location within gene structure, functional significance, and reference).
The mapping positions and locus symbols were based on the ARK database (www.thearkdb.org) and on Band et al. (2000).
All records were divided into 3 groups of polymorphic sequences: milk protein genes, genes associated with milk protein genes regulation, genes potentially associated with milk protein biosynthesis.
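The column layout described above can be pictured as a simple record type. The following Python sketch is illustrative only: the class and field names are assumptions introduced here, not the schema of Table 1, and the example record uses a hypothetical accession number and SNP rather than an entry from the database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class LocusGroup(Enum):
    """The three groups of polymorphic sequences named in the text."""
    MILK_PROTEIN_GENE = 1
    MILK_PROTEIN_GENE_REGULATION = 2
    POTENTIALLY_ASSOCIATED = 3


@dataclass(frozen=True)
class SnpRecord:
    # Field names are hypothetical; they mirror the columns listed above.
    genbank_acc: str                      # GenBank accession number
    snp: str                              # SNP description, e.g. "A100T"
    group: LocusGroup
    chromosome: Optional[str] = None      # position in the cytogenetic map
    locus_symbol: Optional[str] = None
    gene_name: Optional[str] = None
    seq_length: Optional[int] = None      # length of the reference sequence
    seq_type: Optional[str] = None        # "DNA" or "RNA"
    location: Optional[str] = None        # exon, intron, flanking region, ...
    significance: Optional[str] = None    # e.g. missense, silent
    references: Tuple[str, ...] = ()

    def __post_init__(self):
        # Only the two sequence types mentioned in the text are accepted.
        if self.seq_type is not None and self.seq_type not in ("DNA", "RNA"):
            raise ValueError(f"unexpected sequence type: {self.seq_type!r}")


# Hypothetical record, for illustration only (not taken from Table 1).
example = SnpRecord(
    genbank_acc="XX000000",
    snp="A100T",
    group=LocusGroup.POTENTIALLY_ASSOCIATED,
    seq_length=500,
    seq_type="DNA",
)
```

Fields that are often missing in the source data (for example, the location within the gene structure) are optional here, so that incomplete records can still be represented without inventing values.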

Sources of sequence information
The primary source of records was the GenBank database (NCBI, www.ncbi.nlm.nih.gov), in which 499 records (gene or nucleotide sequence) were found by searching for "Bos AND taurus AND variation". Records named "genomic sequence containing highly polymorphic single nucleotide sites" (specific for beef cattle) and "Bos taurus genomic sequence" (unknown function) were rejected from further data mining. Additional resources were also used: bovine genome mapping databases (www.thearkdb.org, http://locus.jouy.inra.fr), the SNPZoo database (http://snpzoo.de), patent databases (http://www.epo.co.at, http://www1.uspto.gov/), databases of genes and ESTs expressed in the bovine mammary gland (Looft et al., 2001; Malewski and Zwierzchowski, 2002) and human-cattle comparative mapping (Band et al., 2000). Another source of information was the world-wide bibliographic databases (Life Science, CAB, Medline) processed with Reference Manager software (ISI Research Soft, 1999). The "References" column mostly cites the documented functional effects of SNPs as well as allele frequency data (marked by FD). All of these resources were first previewed and evaluated to ensure they contained at least three elements: GenBank acc. no., position of the SNP and minimum length of sequence (250-500 bp of DNA or RNA). For some records, individual searching was conducted. For example, if only a variant at the protein level was known, the appropriate DNA (RNA) sequence was found in the GenBank database and the SNP was marked. Conversely, in some instances, SNPs marked in the GenBank sequence were translated to the protein level or annotated with additional information gained from papers.
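The three-element preview described above can be expressed as a simple check. The sketch below is a minimal illustration under stated assumptions: candidate records are represented as dictionaries with hypothetical keys (`genbank_acc`, `snp_position`, `seq_length`), and 250 bp, the lower bound of the range quoted above, is taken as the minimum sequence length.

```python
def passes_preview(candidate, min_length=250):
    """Return True if a candidate record carries the three required elements.

    The dictionary keys are assumptions made for this sketch:
      genbank_acc  - GenBank accession number (non-empty string)
      snp_position - position of the SNP (any non-empty value)
      seq_length   - length of the DNA or RNA sequence in bp
    """
    if not candidate.get("genbank_acc"):
        return False
    if not candidate.get("snp_position"):
        return False
    length = candidate.get("seq_length")
    if length is None or length < min_length:
        return False
    return True


# Hypothetical candidates, for illustration only.
candidates = [
    {"genbank_acc": "XX000001", "snp_position": 120, "seq_length": 480},
    {"genbank_acc": "XX000002", "snp_position": 35, "seq_length": 180},    # too short
    {"genbank_acc": "", "snp_position": 77, "seq_length": 900},            # no accession
    {"genbank_acc": "XX000004", "snp_position": None, "seq_length": 900},  # no position
]
accepted = [c["genbank_acc"] for c in candidates if passes_preview(c)]
```

Records failing the check would, as described in the text, be candidates for individual searching rather than automatic rejection.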

RESULTS
A database was constructed containing 339 SNPs within 49 genes (Table 1). Among the 339 SNPs, 316 single nucleotide substitutions, 8 deletions, 5 repeats, 7 indels and 3 insertions were collected. Most SNPs were located in non-coding regions of the genome (mainly within 5' flanking regions) and had no direct known impact on the phenotype of an individual. These SNPs may influence the level of gene expression and can also be used as markers for unknown adjacent genomic regions.
The most important feature of the SNP database is the precise information on the nature and location of each SNP. Each SNP is described in the same way. For example, the first SNP in Table 1 (A11115T) means that in the sequence recorded under GenBank acc. no. X59856, at position 11115, A is replaced by T. Sometimes the nature of the SNP is more complicated and had to be written in a more descriptive way. For example, 2561..2624 (GT)n means that between positions 2561 and 2624 there is a GT repeat polymorphism with a variable number of GT repeats. This uniform method of SNP description enables the automatic downloading of GenBank records to specialized software and the simultaneous design of PCR primers and allele-specific probes used in microarray technology.
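To illustrate how such a uniform notation lends itself to automatic processing, the following Python sketch parses the two forms quoted above. It is an assumption-laden illustration, not the software used in this study: the function names and the returned dictionary layout are hypothetical, and the notation for deletions, insertions and indels is not shown in the text, so those forms are deliberately left unsupported.

```python
import re

# Single nucleotide substitution, e.g. "A11115T".
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")
# Repeat polymorphism, e.g. "2561..2624 (GT)n".
REPEAT = re.compile(r"^(\d+)\.\.(\d+)\s*\(([ACGT]+)\)n$")


def parse_snp(description):
    """Parse a SNP description into a dictionary (hypothetical layout)."""
    text = description.strip()
    match = SUBSTITUTION.match(text)
    if match:
        ref, position, alt = match.groups()
        return {"type": "substitution", "position": int(position),
                "ref": ref, "alt": alt}
    match = REPEAT.match(text)
    if match:
        start, end, motif = match.groups()
        return {"type": "repeat", "start": int(start),
                "end": int(end), "motif": motif}
    raise ValueError(f"unsupported SNP description: {description!r}")


def reference_matches(sequence, parsed):
    """Check that a substitution's reference base occurs at its 1-based position."""
    position = parsed["position"]
    if not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1].upper() == parsed["ref"]


first_snp = parse_snp("A11115T")
gt_repeat = parse_snp("2561..2624 (GT)n")
```

A check such as `reference_matches` is the kind of consistency test that becomes possible once positions refer unambiguously to a GenBank sequence; it is shown here on the assumption that positions are 1-based, as in GenBank records.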
Standardization of different data in the same way revealed that numerous papers or GenBank records contain insufficient, conflicting or even erroneous sequence information. These were first clarified by comparison with the original data published in the paper or by consultation with the authors and were eventually either included in or eliminated from the database. SNPs were also annotated with important information on the function or significance of each SNP. Most of these annotations indicate the type of mutation (missense or silent) and the SNP location in the gene structure (intron, exon, 5'- or 3'-flanking regions). For many SNPs there is no information on the part of the gene structure in which they are located. Although this location could theoretically be elucidated, it is preferable to retain the original data. Some SNPs were located within putative (computational) or experimentally confirmed binding sites of transcription factors. Several other SNPs are located in immunoglobulin epitopes, suggesting their potential significance in the immune response, especially in milk allergy.
Information on allele frequency is also very useful in planning population experiments. If an allele is very rare or specific to an uncommon breed, it should probably be eliminated because of the low probability of finding a genotype group of animals for association studies. Therefore, alleles occurring in rare or endangered breeds of cattle were excluded from the database, and SNPs were cataloged only for major dairy cattle breeds (e.g., Holstein, Jersey) because of their economic importance. The SNP database shows that, except for the SNPs of major milk protein variants, the population data for most of the SNPs are very poor (references marked by FD; Table 1).

DISCUSSION
The reason for the current vital interest in SNPs is the hope that they could be used as markers to identify genes associated with multifactorial disorders or quantitative trait loci (QTLs) (Coronini et al., 2003). It is assumed that the SNP alleles are inherited together with the QTLs over generations because they are physically close to each other. In contrast to microsatellite markers, SNPs are frequent and dispersed throughout the genome and can therefore be used for QTL fine mapping. The rationale would be to genotype a collection of SNPs that occur at regular intervals and cover the whole genome to detect genomic regions in which the frequencies of the SNP alleles differ between experimental populations. Genome-wide SNP genotyping is theoretically possible for the human genome, for which almost 2 million SNPs are available in the public database (SNP Consortium, www.snp.schl.org). Celera Genomics also offers commercial SNP databases for the human and mouse genomes (www.celera.com). However, the throughput required for genotyping even some of the thousands of SNPs and the current cost of genotyping make such projects impractical. A more feasible alternative to random whole-genome SNP mapping is to use SNP markers in candidate genes that are thought to be associated with a certain QTL. This is the only choice for genomes for which no SNP database has been published but for which numerous detected SNPs are dispersed across many publicly available sources. In the cattle genome, good candidates for such an approach are SNPs within genes associated directly, indirectly or potentially with milk protein biosynthesis. To our knowledge, the database presented in this paper is the first publicly available SNP database based on the dairy cattle genome that has been processed and described to enable automatic and high-throughput SNP genotyping.

Database specificity and limitations
It seems that the growing number of mutations within bovine milk protein genes and genes associated with their expression has to be ordered and classified to better design and interpret future experiments using high-throughput screening techniques. There is an evident lack of uniform information on the topic. In papers, SNPs are described mostly at the protein level as an amino-acid change, with or without relevant nucleic acid sequence information. In contrast, in the GenBank database, sequences are not annotated sufficiently (location and type of SNP) or are dispersed across different records. Attempts to use these sequences for multi-locus genotyping are very limited or even impossible. Therefore, in this paper all available sequence and research information has been gathered to create a well-organized database of SNPs described in the same format.
Dividing all loci into three groups helps to better understand their role in milk protein biosynthesis. For the first and second groups (milk protein genes, genes associated with milk protein gene regulation), the association with milk protein content is obvious and documented in numerous papers (reviewed by Jakob and Puhan, 1992, and Martin et al., 2002). The third group (genes potentially associated with milk protein biosynthesis) contains different genes which are believed to be indirectly or potentially associated with protein content in milk. For some of them, these associations are experimentally confirmed, but for others they are not. The latter were included in the SNP database because they are involved in basic biochemical processes in the mammary gland or play a fundamental role in the functioning of the whole organism.
It is problematic whether all known SNPs within one gene should be included in the database. On the one hand, the more SNPs there are within a locus, the more choices there are for designing effective primers or probes. On the other hand, too many synonymous SNPs or repeats within one locus that are only indirectly or potentially associated with a phenotype seem to be useless and, in this author's opinion, should be ignored. Although SNPs located within the same gene (or within 20-200 kb) are strongly linked and most of them can be omitted, reducing the number of SNPs may lead to missing an interesting genetic phenomenon: interacting phenotypic effects of co-existing variants located within the 5'- and 3'-flanking regions of a single gene (Schwerin et al., 2002). Therefore, in the first and second groups all published SNPs were included. In the third group, however, a kind of pre-selection was made: from loci containing more than 10 SNPs (e.g., PRP, NOS2, CPN1), repeat polymorphisms and synonymous SNPs located very close to each other were excluded, leaving only those located in exons and at maximum distance from each other.
Because very short stretches of DNA are inconvenient or even useless in primer design, all sequences shorter than 250 bp were excluded from the database.
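The inclusion rules described above can be summarized as a small filter. The sketch below is a simplified illustration, not the procedure actually applied: records are hypothetical dictionaries, the field names are assumptions, and the qualitative criteria "very close to each other" and "maximum distance" are not quantified in the text and are therefore not modelled. The sketch simply drops repeats, synonymous SNPs and non-exonic SNPs from third-group loci with more than 10 SNPs, after removing sequences shorter than 250 bp.

```python
from collections import defaultdict


def preselect(records, max_snps_per_locus=10, min_length=250):
    """Apply a simplified version of the inclusion rules (illustrative only).

    Each record is a dict with hypothetical keys: locus, group (1, 2 or 3),
    kind ("substitution", "repeat", ...), synonymous (bool),
    location ("exon", "intron", ...) and seq_length (bp).
    """
    # Sequences shorter than the minimum length are excluded outright.
    long_enough = [r for r in records if r["seq_length"] >= min_length]

    by_locus = defaultdict(list)
    for record in long_enough:
        by_locus[record["locus"]].append(record)

    kept = []
    for locus, snps in by_locus.items():
        crowded_third_group = (
            snps[0]["group"] == 3 and len(snps) > max_snps_per_locus
        )
        if crowded_third_group:
            # Keep only exonic, non-synonymous, non-repeat variants.
            snps = [
                s for s in snps
                if s["kind"] != "repeat"
                and not s["synonymous"]
                and s["location"] == "exon"
            ]
        kept.extend(snps)
    return kept


# Hypothetical data: 11 SNPs in a third-group locus, 6 of them synonymous.
records = [
    {"locus": "LOCUS_X", "group": 3, "kind": "substitution",
     "synonymous": i % 2 == 0, "location": "exon", "seq_length": 300}
    for i in range(11)
]
records.append({"locus": "LOCUS_Y", "group": 1, "kind": "substitution",
                "synonymous": True, "location": "intron", "seq_length": 400})
records.append({"locus": "LOCUS_Z", "group": 1, "kind": "substitution",
                "synonymous": False, "location": "exon", "seq_length": 100})
selected = preselect(records)
```

In this hypothetical example the first-group SNP is retained regardless of its annotation, while the crowded third-group locus is reduced to its non-synonymous exonic variants.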
The database also contains SNPs determining two genetic diseases (BLAD and DUMPS). The carriers of these disorders are obligatorily eliminated from breeding schemes in many countries to avoid losses in health and reproduction.
A separate group of polymorphisms associated with milk protein biosynthesis are microsatellites (Mosig et al., 2001; Boichard et al., 2003). These QTL microsatellite markers were excluded from the database for three reasons: (1) the nature of the polymorphism is often unclear (the type of repetitive motif, its location and number of repeats); (2) repetitive sequences are difficult to genotype by the primer extension reaction, the method most often used in high-throughput genotype screening on a chip; (3) each QTL microsatellite marker has approximately 10 alleles, which increases the cost of genotyping.
One way to represent these genomic regions on the chip is to sequence regions located around a QTL microsatellite marker and then compare these sequences across a population of animals to find new biallelic SNPs. The probability of finding a SNP could be lower than in humans (1 per 1250 bp) because of the higher homogeneity of cattle. Such SNPs may substitute for QTL microsatellite markers and enable their implementation in high-throughput genotyping.
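As a rough planning aid, the density quoted above can be turned into an expected number of SNPs for a sequenced region. The sketch below is illustrative only: it uses the human figure of one SNP per 1250 bp as an upper bound for cattle, and the probability of finding at least one SNP assumes that SNPs occur independently along the sequence (a Poisson model), which is an assumption introduced here and not a claim of this study.

```python
import math

HUMAN_SNP_DENSITY = 1 / 1250  # SNPs per bp, the human figure quoted in the text


def expected_snps(region_length_bp, density=HUMAN_SNP_DENSITY):
    """Expected number of SNPs in a region of the given length."""
    return region_length_bp * density


def probability_of_any_snp(region_length_bp, density=HUMAN_SNP_DENSITY):
    """Probability of at least one SNP, assuming a Poisson model."""
    return 1.0 - math.exp(-expected_snps(region_length_bp, density))


# Example: a 2500 bp region flanking a microsatellite marker.
mean_count = expected_snps(2500)
chance = probability_of_any_snp(2500)
```

If the SNP density in cattle is lower than in humans, as suggested above, a smaller `density` value can be passed to obtain a more conservative estimate of the region length that needs to be sequenced.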
The only publicly available livestock SNP database (SNPZoo, www.snpzoo.de) is maintained for the development of paternity control. Because this SNP database contains only anonymous SNPs (randomly dispersed in the genome), its records were not included in our SNP database.
The database should be continuously updated with new data, and a potential source of new SNPs is bovine mammary gland expressed sequence tags (ESTs). New SNPs can be found by the use of bovine EST data and human genomic sequences (Band et al., 2000; Stone et al., 2002) or by in silico mapping of DNA sequences to the cattle genome (Farber and Medrano, 2003). Cheung and Spielman (2002) suggest that expression profiling with the use of microarrays may reveal variation in gene expression that points to genes containing causative SNPs. Picoult-Newberg et al. (1999) published a method for mining SNPs from EST databases. Unfortunately, publicly available bovine mammary gland EST databases are dispersed across different resources (GenBank, TIGR: www.tigr.org/tdb/tg/btgi; Looft et al., 2001) and therefore are not suitable for such experiments.

Database applications
The primary application of this SNP database is the design of a chip for the simultaneous genotyping of hundreds of SNPs to reveal the genetic background of milk protein biosynthesis. Protein content in cow milk is one of the most important criteria in bull selection and also in cow milk pricing. This trait has been improved in recent decades and is still the most desirable milk performance trait. It is believed to be possible to find a combination of SNPs that acts as a very effective set of genetic markers in selection for milk protein content.
In several genes, many mutations were cataloged that create intragenic haplotypes (many SNPs within one gene or within very strongly linked genes). These kinds of haplotypes were first described for the bovine LGB and CSN3 genes by Wagner et al. (1994), Ehrmann et al. (1997) and Kamiński (2000). By using the database it is possible to find SNPs located in different but functionally associated genes. Good examples are SNPs within PRL, RPRL and STAT5, and SNPs identified within STAT5 binding sites located within milk protein promoters. A combination of these intergenic SNPs can give new insight into the relationships between the genes responsible for the major signal transduction pathway regulating milk protein gene expression.
The collected SNPs represent most of the 29 cattle chromosomes. Eight chromosomes, namely 2, 3, 7, 9, 12, 17, 25 and 26, are not represented in the SNP database. Most of the collected SNPs may serve as markers of a certain chromosome region, while others should be treated as causative mutations. All these sequence variants are located within functional genes. Because functional genes are sometimes organized in groups and located close together because of their function, the SNPs described in the catalog may turn out to be more efficient genetic markers and may shorten the path to finding causative mutations influencing milk protein content.
Another possible application of the SNP database is dairy cattle identification and paternity analysis. Compared with the most popular DNA markers (microsatellites), SNPs are attractive because they are abundant, genetically stable and amenable to high-throughput automated technology (Vignal, 2002). They are considered a realistic alternative in livestock identification and kinship analysis (Fries and Durstewitz, 2001; Heaton et al., 2002). Before that, however, a wide population screening must be conducted to validate the frequency of SNPs in major dairy cattle breeds.
The SNP database can also be used for evolutionary studies, the evaluation of genetic distances between wild and domestic cattle breeds, and studies of the domestication history of bovine species.
Although the SNP database does not contain all existing variations associated with milk protein content, its originality and its current and future applicability make it a valuable resource for designing different experiments, especially those using microarray technology.