Acta Veterinaria et Zootechnica Sinica ›› 2024, Vol. 55 ›› Issue (6): 2431-2440. doi: 10.11843/j.issn.0366-6964.2024.06.015
Received: 2023-11-08
Online: 2024-06-23
Published: 2024-06-28
Contact: Zhiqiang DU
E-mail: 2021710855@yangtzeu.edu.cn; zhqdu@yangtzeu.edu.cn
About the author: Huaxuan WU (1998-), male, from Shangrao, Jiangxi; master's student working mainly on animal genetics and breeding. E-mail: 2021710855@yangtzeu.edu.cn
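Abstract:
This study aimed to explore and evaluate six methods for extracting features from single nucleotide polymorphism (SNP) genotypes. Six methods were compared: principal component analysis (PCA), gene-principal component analysis (gene-PCA), the Pearson correlation coefficient between SNP loci (SNP-PCC), linkage disequilibrium (LD), genome-wide association study (GWAS), and random sampling (RS). They were assessed by the prediction accuracy of genomic estimated breeding values (GEBV) on two datasets (Pekin duck, 542 samples, 39 932 SNPs; Duroc pig, 2 549 samples, 230 884 SNPs) and three phenotypes (body length in Pekin duck; backfat thickness and teat number in Duroc pig). SNP-PCC combined with five genomic selection (GS) methods (GBLUP, BayesA, BayesB, BayesC, Bayesian Lasso) gave relatively reliable prediction accuracy on the Pekin duck data, achieved the highest mean prediction accuracy for pig backfat thickness and teat number (an improvement of 5%, reaching 32.3%), and markedly improved computational efficiency (5- to 7-fold on average). In conclusion, choosing a suitable feature extraction method can effectively improve both the prediction accuracy and the computational efficiency of GS. The study lays a foundation for further work on how different feature extraction methods affect GS prediction accuracy and provides a reference for their application in breeding practice.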
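The abstract names SNP-PCC (Pearson correlation between SNP loci) as a feature extraction method but does not spell out the procedure. The following is a minimal sketch of one plausible reading, greedy correlation pruning, and is not the authors' implementation. The function name `prune_by_pcc`, the 0.9 threshold, and the 50-SNP look-back window are all assumptions chosen for illustration.

```python
import numpy as np


def prune_by_pcc(genotypes: np.ndarray, threshold: float = 0.9, window: int = 50) -> np.ndarray:
    """Return column indices of SNPs kept after greedy correlation pruning.

    genotypes: (n_samples, n_snps) matrix coded 0/1/2.
    A SNP is dropped when |Pearson r| with any of the last `window`
    kept SNPs exceeds `threshold` (hypothetical defaults).
    """
    n_samples, n_snps = genotypes.shape
    centered = genotypes - genotypes.mean(axis=0)
    std = centered.std(axis=0)  # population SD, matches the 1/n covariance below
    kept = []
    for j in range(n_snps):
        if std[j] == 0:
            # Monomorphic SNP: no variation, so it carries no signal.
            continue
        recent = kept[-window:]
        if recent:
            cov = centered[:, recent].T @ centered[:, j] / n_samples
            r = cov / (std[recent] * std[j])
            if np.max(np.abs(r)) > threshold:
                continue
        kept.append(j)
    return np.array(kept, dtype=int)


# Toy data: column 1 duplicates column 0, the last column is monomorphic.
rng = np.random.default_rng(0)
base = rng.integers(0, 3, size=(100, 5)).astype(float)
X = np.column_stack([base[:, 0], base[:, 0], base[:, 1:], np.zeros(100)])
kept = prune_by_pcc(X, threshold=0.9)
X_reduced = X[:, kept]  # reduced marker matrix passed on to the GS model
```

The reduced matrix would then replace the full genotype matrix as input to GBLUP or a Bayesian model. Restricting comparisons to a window of recent SNPs keeps the cost linear in the number of markers, at the price of missing long-range correlations.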
Cite this article:
Huaxuan WU, Zhiqiang DU. Methods of Genotype Feature Extraction Affecting the Prediction Accuracy of Genomic Selection[J]. Acta Veterinaria et Zootechnica Sinica, 2024, 55(6): 2431-2440.
Table 3
Mean prediction accuracy of the different feature extraction methods
Dataset | Method | Mean PCC | PCC STD | Mean MSE | MSE STD |
Pekin duck, body length (cm) | GWAS (P < 0.05) | 0.484 | 0.081 | 8.701 | 2.835 |
Pekin duck, body length (cm) | PCA | 0.243 | 0.043 | 5.741 | 0.244 |
Pekin duck, body length (cm) | Gene-PCA | 0.132 | 0.026 | 6.094 | 0.080 |
Pekin duck, body length (cm) | LD | 0.277 | 0.052 | 5.803 | 0.343 |
Pekin duck, body length (cm) | SNP-PCC | 0.290 | 0.058 | 5.916 | 0.556 |
Pekin duck, body length (cm) | Random | 0.265 | 0.036 | 6.002 | 0.375 |
Pekin duck, body length (cm) | Origin | 0.217 | 0.020 | 6.525 | 0.888 |
Pig, backfat thickness (mm) | GWAS (P < 0.05) | 0.366 | 0.037 | 4.927 | 0.595 |
Pig, backfat thickness (mm) | PCA | 0.341 | 0.034 | 4.102 | 0.105 |
Pig, backfat thickness (mm) | Gene-PCA | 0.186 | 0.025 | 4.560 | 0.118 |
Pig, backfat thickness (mm) | LD | 0.358 | 0.020 | 4.259 | 0.211 |
Pig, backfat thickness (mm) | SNP-PCC | 0.367 | 0.023 | 4.164 | 0.181 |
Pig, backfat thickness (mm) | Random | 0.338 | 0.021 | 4.286 | 0.158 |
Pig, backfat thickness (mm) | Origin | 0.336 | 0.026 | 4.612 | 0.552 |
Pig, teat number (count) | GWAS (P < 0.05) | 0.240 | 0.042 | 1.247 | 0.105 |
Pig, teat number (count) | PCA | 0.313 | 0.030 | 1.024 | 0.029 |
Pig, teat number (count) | Gene-PCA | 0.123 | 0.011 | 1.148 | 0.023 |
Pig, teat number (count) | LD | 0.301 | 0.022 | 1.111 | 0.068 |
Pig, teat number (count) | SNP-PCC | 0.312 | 0.018 | 1.090 | 0.043 |
Pig, teat number (count) | Random | 0.299 | 0.020 | 1.088 | 0.053 |
Pig, teat number (count) | Origin | 0.280 | 0.025 | 1.207 | 0.158 |
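Table 3 reports a mean and a standard deviation for both PCC (Pearson correlation between predicted and observed phenotypes) and MSE. A rough sketch of how such a summary could be computed is given below, assuming the spread is taken over repeated validation folds and uses the sample standard deviation; the page does not state either choice. The helper names `pcc`, `mse`, and `summarize` are hypothetical.

```python
import numpy as np


def pcc(y_true, y_pred) -> float:
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(np.asarray(y_true, float), np.asarray(y_pred, float))[0, 1])


def mse(y_true, y_pred) -> float:
    """Mean squared error between observed and predicted values."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(diff ** 2))


def summarize(folds) -> dict:
    """folds: list of (y_true, y_pred) pairs, one per validation fold."""
    pccs = np.array([pcc(t, p) for t, p in folds])
    mses = np.array([mse(t, p) for t, p in folds])
    return {
        "mean_pcc": float(pccs.mean()),
        "pcc_std": float(pccs.std(ddof=1)),  # sample SD: an assumption
        "mean_mse": float(mses.mean()),
        "mse_std": float(mses.std(ddof=1)),
    }


# Two toy folds: predictions in the second fold are shifted by a constant.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
    ([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]),
]
summary = summarize(folds)
```

In the toy data the shifted fold still correlates perfectly with the observations while its squared error is nonzero. PCC ignores shifts and rescaling of the predictions, which is one way a method can rank first on PCC and last on MSE, as the GWAS row for body length does.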
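Table 4
Comparison of the computing time required by the different feature extraction methods
Dataset | Method | Number of markers | Computing time |
Body length | GWAS (P < 0.05) | 4 282 | 2 s |
Body length | PCA | 359 | 1 s |
Body length | Gene-PCA | 542 | 1 s |
Body length | LD | 31 002 | 10 s |
Body length | SNP-PCC | 26 495 | 6 s |
Body length | Random | 3 993 | 2 s |
Body length | Origin | 39 932 | 15 s |
Backfat thickness, teat number | GWAS (P < 0.05) | 21 760 | 12 s |
Backfat thickness, teat number | PCA | 716 | 1 s |
Backfat thickness, teat number | Gene-PCA | 2 500 | 1.6 s |
Backfat thickness, teat number | LD | 33 101 | 18 s |
Backfat thickness, teat number | SNP-PCC | 20 731 | 11 s |
Backfat thickness, teat number | Random | 23 088 | 12 s |
Backfat thickness, teat number | Origin | 230 884 | 2 min 20 s |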
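The timings and marker counts in Table 4 can be turned into ratios against the unreduced marker set (the "Origin" row). The short sketch below only does that arithmetic on the values transcribed from the table, with 2 min 20 s written as 140 s. The names `timings`, `markers`, `speedup`, and `retained_fraction` are illustrative, and the page does not say which computation the timings cover.

```python
# Values transcribed from Table 4; times in seconds (2 min 20 s = 140 s).
timings = {
    "Body length": {
        "GWAS": 2, "PCA": 1, "Gene-PCA": 1, "LD": 10,
        "SNP-PCC": 6, "Random": 2, "Origin": 15,
    },
    "Backfat thickness, teat number": {
        "GWAS": 12, "PCA": 1, "Gene-PCA": 1.6, "LD": 18,
        "SNP-PCC": 11, "Random": 12, "Origin": 140,
    },
}
markers = {
    "Body length": {
        "GWAS": 4282, "PCA": 359, "Gene-PCA": 542, "LD": 31002,
        "SNP-PCC": 26495, "Random": 3993, "Origin": 39932,
    },
    "Backfat thickness, teat number": {
        "GWAS": 21760, "PCA": 716, "Gene-PCA": 2500, "LD": 33101,
        "SNP-PCC": 20731, "Random": 23088, "Origin": 230884,
    },
}


def speedup(dataset: str) -> dict:
    """Time on the full marker set divided by time on each reduced set."""
    base = timings[dataset]["Origin"]
    return {m: base / t for m, t in timings[dataset].items() if m != "Origin"}


def retained_fraction(dataset: str) -> dict:
    """Share of the original markers (or derived features) each method keeps."""
    base = markers[dataset]["Origin"]
    return {m: n / base for m, n in markers[dataset].items() if m != "Origin"}


for name in timings:
    for method, ratio in speedup(name).items():
        kept = retained_fraction(name)[method]
        print(f"{name:32s} {method:9s} speedup {ratio:6.2f}x  retained {kept:6.1%}")
```

For PCA and gene-PCA the "number of markers" is a count of derived components rather than of original SNPs, so the retained fraction is best read as a dimensionality ratio for those two rows.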