Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds

Tropical Animal Health and Production (2021)

Abstract
Background Assigning animals to their breeds using breed-informative single-nucleotide polymorphisms (SNPs) is required in many fields, for instance in the traceability and authentication of meat and other livestock products. SNP information for several pig breeds is now accessible thanks to dense SNP chips, which cover a large number of molecular markers distributed across the entire genome. To identify the pig breed from a sample of industrial meat, one must analyze a large panel of genetic markers, whose size depends on the SNP chip used. Analyzing such large datasets is labor-intensive, which motivates the creation of less dense chips built from a reduced number of breed-informative SNPs. Data from these reduced chips would require less time and effort to analyze.

Aim The objective of this study is to find the most informative SNPs for discriminating between four pig breeds, namely Duroc, Landrace, Large White, and Pietrain.

Method The Illumina Porcine 60k SNP chip was used to genotype SNPs distributed across the individuals' genomes. First, we applied three statistical approaches for feature selection: (i) principal component analysis (PCA), (ii) least absolute shrinkage and selection operator (LASSO), and (iii) random forest (RF). Each approach identified its own set of SNPs. We then combined the results of the three methods into a final panel containing only the SNPs that appear in all three sets.

Results Separately, each method produced a panel of its most discriminating SNPs. PCA, LASSO, and random forest with the Boruta algorithm highlighted 28,816, 50, and 286 SNPs, respectively. The number of SNPs selected by PCA is high compared to Boruta and LASSO because PCA chooses variables while preserving as much information about the data as possible. A downside of LASSO regression is that, among a group of correlated variables, it tends to select only one and ignore the others regardless of their importance. In contrast to LASSO, the Boruta algorithm considers the interdependence between SNPs and selects informative variables even if they are correlated and have the same effect. The three panels shared 23 SNPs; the distribution of the individuals according to these SNPs grouped the individuals of each breed into well-defined clusters without any overlap.

Conclusions The biological pathways represented by the 23 breed-informative SNPs resulting from the combination of PCA, LASSO, and Boruta should be explored in further analysis. The results of our study are promising for further applications of this method in other livestock animals.
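The three-way selection followed by an intersection can be illustrated with a minimal sketch. This is not the authors' pipeline: the genotypes are synthetic, the use of scikit-learn is an assumption, and each selector is a simplified stand-in. PCA selection is approximated by ranking SNPs on variance-weighted absolute loadings, LASSO by an L1-penalized multinomial logistic regression, and Boruta by a single-pass comparison of random forest importances against shuffled "shadow" copies of the SNPs (the full Boruta algorithm repeats this with a statistical test). All names, thresholds, and parameter values (`top_k`, `C`, `n_estimators`) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for chip data: 4 breeds x 30 animals, 200 SNPs coded 0/1/2.
# Only the first 10 SNPs have breed-specific allele frequencies (assumption).
n_per_breed, n_snps, n_informative = 30, 200, 10
breeds = np.repeat(np.arange(4), n_per_breed)
freqs = np.full((4, n_snps), 0.5)
freqs[:, :n_informative] = rng.uniform(0.05, 0.95, size=(4, n_informative))
X = rng.binomial(2, freqs[breeds]).astype(float)


def pca_panel(X, n_components=3, top_k=50):
    """Rank SNPs by variance-weighted absolute loadings; keep the top_k."""
    pca = PCA(n_components=n_components).fit(X)
    weights = pca.explained_variance_ratio_[:, None]
    score = (np.abs(pca.components_) * weights).sum(axis=0)
    return set(np.argsort(score)[::-1][:top_k].tolist())


def lasso_panel(X, y, C=0.1):
    """Keep SNPs with a nonzero coefficient for at least one breed."""
    model = LogisticRegression(penalty="l1", solver="saga", C=C,
                               max_iter=5000, random_state=0)
    model.fit(X, y)
    return set(np.flatnonzero(np.any(model.coef_ != 0, axis=0)).tolist())


def shadow_rf_panel(X, y, seed=0):
    """Single-pass, Boruta-like filter: keep SNPs whose importance
    exceeds that of the best shuffled (shadow) SNP."""
    shuffler = np.random.default_rng(seed)
    shadows = shuffler.permuted(X, axis=0)  # shuffle each column independently
    forest = RandomForestClassifier(n_estimators=300, random_state=seed)
    forest.fit(np.hstack([X, shadows]), y)
    importances = forest.feature_importances_
    real, shadow = importances[:X.shape[1]], importances[X.shape[1]:]
    return set(np.flatnonzero(real > shadow.max()).tolist())


pca_set = pca_panel(X)
lasso_set = lasso_panel(X, breeds)
rf_set = shadow_rf_panel(X, breeds)

# Final panel: SNPs selected by all three approaches.
consensus = pca_set & lasso_set & rf_set
print(len(pca_set), len(lasso_set), len(rf_set), sorted(consensus))
```

Taking the intersection makes the final panel conservative: a SNP survives only if a variance-based, a sparsity-based, and a tree-based criterion all retain it, which is one way to read why the shared panel in the study is much smaller than any individual panel.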
Keywords
Single-nucleotide polymorphism, Principal component analysis, Least absolute shrinkage and selection operator, Random forest, Boruta, Pig breeds