Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers

doi:10.11843/j.issn.0366-6964.2025.05.016

Abstract:

This study evaluated the effectiveness of machine learning (ML) algorithms based on genomic markers for breed classification and compared the performance of different ML algorithms in classifying sheep breeds. Two approaches were used to select single nucleotide polymorphism (SNP) loci across 10 sheep breeds: the first selected loci by the fixation index (F_ST), and the second applied the Boruta feature selection algorithm to the F_ST-selected loci for further screening. Eight ML algorithms of different types, including K-nearest neighbor (KNN), support vector machine (SVM), and adaptive boosting (AdaBoost), were used to classify the breeds. Accuracy was used to compare the SNP selection approaches and the ML algorithms and to identify the best combination for sheep breed classification. The data set included both genetically distant and genetically similar breeds, which supports the reliability of the subsequent analyses. Using a top 1% threshold, the F_ST analysis selected 5,361 SNP loci in each run, and the Boruta algorithm ultimately retained 328 ± 11.7 SNP loci for ML-based breed classification, far fewer than the F_ST analysis alone. After multiple iterations, the scores of loci marked "confirmed" by Boruta were consistently higher than those of the shadow features and of the loci in the other two categories. Most models achieved a classification accuracy above 0.9. The SVM model trained on Boruta-selected loci achieved the highest accuracy (0.953), followed closely by AdaBoost (0.947), whereas the naive Bayes (NB) model trained on loci selected by F_ST alone had the lowest accuracy (0.601). Except for NB, the area under the receiver operating characteristic curve (AUC) of every model was close to 1. Both SNP selection approaches showed strong discriminatory power, with slightly better performance after applying Boruta. These results indicate that ML methods effectively improve the accuracy of breed classification and have good potential for application in sheep breed identification.

Keywords: machine learning, breed classification, genomic markers, sheep, accuracy
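
The F_ST-based preselection summarized in the abstract (rank SNPs by between-breed differentiation, keep the top 1%) can be illustrated with a small sketch. The abstract does not state which F_ST estimator or software the authors used, so the simple Wright-style estimator below, the function names `fst_per_snp` and `top_fraction`, and the simulated genotypes are all assumptions made for illustration only.

```python
import random


def allele_freq(genotypes):
    """Frequency of the counted allele from 0/1/2 genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(geno_by_breed):
    """Wright-style F_ST = (H_T - H_S) / H_T for a single SNP.

    geno_by_breed: one list of 0/1/2 genotype codes per breed.
    This is an assumed, simplified estimator, not necessarily the one
    used in the study.
    """
    sizes = [len(g) for g in geno_by_breed]
    freqs = [allele_freq(g) for g in geno_by_breed]
    n_total = sum(sizes)
    # Size-weighted mean allele frequency across breeds.
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0  # monomorphic SNP carries no differentiation signal
    # Size-weighted mean of within-breed expected heterozygosity.
    h_s = sum(n * 2 * p * (1 - p) for n, p in zip(sizes, freqs)) / n_total
    return (h_t - h_s) / h_t


def top_fraction(scores, fraction=0.01):
    """Indices of the highest-scoring SNPs, keeping the given fraction."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy data (hypothetical): 200 SNPs, 2 breeds, 30 animals per breed.
random.seed(0)
n_snps, n_per_breed = 200, 30
snps = []
for j in range(n_snps):
    # The first two SNPs differ strongly between breeds; the rest do not.
    freqs = (0.05, 0.95) if j < 2 else (0.5, 0.5)
    snps.append([
        [sum(random.random() < p for _ in range(2)) for _ in range(n_per_breed)]
        for p in freqs
    ])

scores = [fst_per_snp(s) for s in snps]
selected = top_fraction(scores, fraction=0.01)
print("SNPs kept by the top 1% rule:", selected)
```

In this toy setting the two simulated breed-informative SNPs are the ones that survive the 1% cut. A second-stage selector such as Boruta would then be run only on the retained loci, which is the ordering the abstract describes.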

CLC number: S826.2

Citation: QIAO Liying, WANG Wannian, ZHANG Li, PANG Zhixu, ZHANG Siying, LI Yifan, LIU Wenzhong. Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers[J]. Acta Veterinaria et Zootechnica Sinica, 2025, 56(5): 2157-2167.



Table 1
Breed (abbreviation)	Number of individuals
Finnish sheep (Finn)	54
Icelandic sheep (Icelandic)	54
Romanov sheep (Romanov)	79
Texel sheep (Texel)	59
Duolang sheep (DL)	119
Hu sheep (Hu)	112
Large Tail Han sheep (LTH)	106
Sishui Fur sheep (SSF)	58
Small Tail Han sheep (STH)	102
Wadi sheep (WD)	146
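
As a rough reference point for the reported accuracies, the sample sizes in the table can be turned into class proportions and a majority-class baseline (the accuracy of always predicting the largest breed). This is a minimal sketch over the full data set; the abstract does not describe the train/test split, so the baseline on the authors' actual test sets may differ slightly. The function names are hypothetical.

```python
# Sample sizes copied from the table above (breed abbreviation -> individuals).
breed_counts = {
    "Finn": 54, "Icelandic": 54, "Romanov": 79, "Texel": 59, "DL": 119,
    "Hu": 112, "LTH": 106, "SSF": 58, "STH": 102, "WD": 146,
}


def class_proportions(counts):
    """Share of the whole data set contributed by each breed."""
    total = sum(counts.values())
    return {breed: n / total for breed, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the largest class."""
    return max(counts.values()) / sum(counts.values())


total = sum(breed_counts.values())
baseline = majority_baseline_accuracy(breed_counts)
print(f"total individuals: {total}")  # 889
print(f"majority-class baseline accuracy: {baseline:.3f}")  # 0.164
for breed, share in sorted(class_proportions(breed_counts).items()):
    print(f"{breed:>10s}: {share:.3f}")
```

With 889 animals and the largest breed (WD) contributing 146 of them, the baseline is about 0.164, so even the weakest reported model (NB, 0.601) is well above chance, and the moderate imbalance between breeds (54 to 146 individuals) is worth keeping in mind when reading accuracy alone.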
