Acta Veterinaria et Zootechnica Sinica ›› 2025, Vol. 56 ›› Issue (5): 2157-2167. doi: 10.11843/j.issn.0366-6964.2025.05.016
QIAO Liying1,2, WANG Wannian1, ZHANG Li1, PANG Zhixu1, ZHANG Siying1, LI Yifan1, LIU Wenzhong1,2,*
Received:
2024-10-24
Online:
2025-05-23
Published:
2025-05-27
Contact:
LIU Wenzhong
E-mail: liyingqiao1970@163.com; tglwzyc@163.com
About the author:
QIAO Liying (1970-), female, born in Dingxiang, Shanxi; senior laboratory technician, mainly engaged in research on animal quantitative genetics. E-mail: liyingqiao1970@163.com
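Abstract:
This study aimed to evaluate the effectiveness of genomic-marker-based machine learning (ML) algorithms for breed classification and to compare how different ML algorithms perform in classifying sheep breeds. Single nucleotide polymorphisms (SNPs) were selected for 10 sheep breeds in two ways: the first used the between-population fixation index (FST) alone, and the second applied the Boruta feature selection algorithm to further filter the SNPs already selected by FST. Eight ML algorithms of different types, including K-nearest neighbors, support vector machines (SVM), and adaptive boosting (AdaBoost), were used to classify the breeds. Classification accuracy was used to compare the SNP selection strategies and the ML algorithms and to identify the best combination for sheep breed classification. The data set contained both genetically distant and genetically similar breeds, which supports the reliability of the subsequent analyses. With a top 1% threshold, the FST analysis selected 5,361 SNPs in each run, and the Boruta algorithm finally retained (328±11.7) SNPs for ML-based breed classification. After multiple iterations, the scores of SNPs labelled "confirmed" were consistently higher than those of the shadow features and of SNPs in the other two categories. Boruta retained far fewer SNPs than the FST analysis. Most ML models classified breeds with an accuracy above 0.9. The SVM model applied to Boruta-selected SNPs was the most accurate (0.953), AdaBoost performed almost as well (0.947), and the naive Bayes (NB) model applied to SNPs selected by FST alone performed worst (0.601). Except for NB, the area under the receiver operating characteristic curve was close to 1 for all models. Both SNP selection strategies gave strong discriminating power, with slightly better results after Boruta. These results indicate that ML methods effectively improved the accuracy of breed classification and have good potential for sheep breed identification.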
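The FST-based selection step described in the abstract can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 allele counts and uses a simple Wright-style estimator, FST = (HT - HS)/HT, which is not necessarily the estimator or software the authors used. The function names (`fst_per_snp`, `top_fraction`) and the simulated data are hypothetical.

```python
import random

def fst_per_snp(genotypes, labels):
    """Wright-style FST = (H_T - H_S) / H_T for each SNP.

    genotypes: list of individuals, each a list of 0/1/2 allele counts.
    labels: breed label for each individual (same order as genotypes).
    """
    breeds = sorted(set(labels))
    scores = []
    for j in range(len(genotypes[0])):
        freqs, sizes = [], []
        for b in breeds:
            counts = [g[j] for g, lab in zip(genotypes, labels) if lab == b]
            freqs.append(sum(counts) / (2 * len(counts)))  # allele frequency in breed b
            sizes.append(len(counts))
        total = sum(sizes)
        p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
        h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled sample
        h_s = sum(2 * p * (1 - p) * n for p, n in zip(freqs, sizes)) / total
        scores.append(0.0 if h_t == 0 else (h_t - h_s) / h_t)
    return scores

def top_fraction(scores, fraction=0.01):
    """Indices of the highest-scoring SNPs (at least one is returned)."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

# Toy data (assumption): 2 breeds x 20 individuals, 200 SNPs.
# SNPs 0 and 1 are fixed for different alleles in the two breeds;
# all other SNPs share the same allele frequency (0.5) in both breeds.
rng = random.Random(0)
labels = ["A"] * 20 + ["B"] * 20
genotypes = []
for lab in labels:
    planted = [0, 0] if lab == "A" else [2, 2]
    neutral = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(198)]
    genotypes.append(planted + neutral)

scores = fst_per_snp(genotypes, labels)
selected = top_fraction(scores, fraction=0.01)  # top 1% of 200 SNPs = 2 SNPs
print(selected)
```

In this toy setting the top 1% consists of the two planted SNPs. On real data the selected subset would then be passed to a feature selector such as Boruta and to the classifiers, which this sketch does not cover.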
QIAO Liying, WANG Wannian, ZHANG Li, PANG Zhixu, ZHANG Siying, LI Yifan, LIU Wenzhong. Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers[J]. Acta Veterinaria et Zootechnica Sinica, 2025, 56(5): 2157-2167.
Table 1
Breeds and number of individuals retained after quality control
Breed | Number of individuals |
Finnish sheep (Finn) | 54 |
Icelandic sheep (Icelandic) | 54 |
Romanov sheep (Romanov) | 79 |
Texel sheep (Texel) | 59 |
Duolang sheep (DL) | 119 |
Hu sheep (Hu) | 112 |
Large Tail Han sheep (LTH) | 106 |
Sishui Fur sheep (SSF) | 58 |
Small Tail Han sheep (STH) | 102 |
Wadi sheep (WD) | 146 |
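As a point of reference for the accuracy values reported in the abstract, the class sizes in Table 1 imply a majority-class baseline. The short calculation below is an illustrative addition, not part of the authors' analysis, and the variable names are hypothetical.

```python
# Individuals per breed after quality control, copied from Table 1.
breed_counts = {
    "Finn": 54, "Icelandic": 54, "Romanov": 79, "Texel": 59, "DL": 119,
    "Hu": 112, "LTH": 106, "SSF": 58, "STH": 102, "WD": 146,
}

total = sum(breed_counts.values())
largest_breed = max(breed_counts, key=breed_counts.get)

# Accuracy of a classifier that always predicts the largest breed.
majority_baseline = breed_counts[largest_breed] / total

print(total, largest_breed, round(majority_baseline, 3))  # prints: 889 WD 0.164
```

Because the breeds are moderately imbalanced, always predicting the largest breed would be correct for only about 16% of the 889 individuals, which gives some context for the reported accuracies.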