Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers

doi:10.11843/j.issn.0366-6964.2025.05.016

Abstract

This study evaluated the effectiveness of machine learning (ML) algorithms that use genomic markers for breed classification and compared the performance of different ML algorithms in classifying sheep breeds. Two methods were used to select single nucleotide polymorphism (SNP) loci for 10 sheep breeds. The first selected SNPs by fixation index (F_ST); the second applied the Boruta feature selection algorithm to the F_ST-selected SNPs to screen them further. Eight ML algorithms, including K-nearest neighbor, support vector machine (SVM) and adaptive boosting (AdaBoost), were then used to classify the breeds. Accuracy was used to compare the SNP selection methods and the ML algorithms, and to identify the best combination for sheep breed classification. The data set included both genetically distant and genetically similar breeds, which supports the reliability of the subsequent analysis. Using a top 1% selection criterion, the F_ST analysis identified 5,361 SNP loci in each run, whereas the Boruta algorithm ultimately retained 328 ± 11.7 SNP loci for ML-based breed classification. After multiple iterations, SNP loci marked as "confirmed" by the Boruta algorithm consistently scored higher in importance than the shadow features and the SNP loci in the other two categories. The number of SNP loci retained by the Boruta algorithm was significantly lower than the number retained by the F_ST analysis. Most ML models achieved a classification accuracy above 0.9. The SVM model trained on SNP loci selected by the Boruta algorithm achieved the highest accuracy (0.953), followed closely by AdaBoost (0.947). In contrast, the naive Bayes (NB) model trained on SNP loci selected by F_ST analysis alone had the lowest accuracy (0.601). Except for NB, the area under the receiver operating characteristic curve (AUC) of every model was close to 1. Both SNP selection approaches showed strong discriminatory power, with slightly better performance after the Boruta algorithm was applied. These results indicate that ML methods effectively improve the accuracy of breed classification and have strong potential for application in sheep breed identification.
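The first selection step described in the abstract (rank SNPs by F_ST and keep the top 1%) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not state which F_ST estimator was used, so the code substitutes a simple heterozygosity-based form, F_ST = (H_T - H_S) / H_T; the function names, the 0/1/2 genotype coding, and the toy data are hypothetical. The Boruta step and the classifiers are not reproduced.

```python
import random


def allele_freq(codes):
    """Alternate-allele frequency from genotypes coded 0, 1 or 2."""
    return sum(codes) / (2 * len(codes))


def fst_per_snp(geno, breeds):
    """Heterozygosity-based F_ST = (H_T - H_S) / H_T for each SNP column.

    geno: one row per animal, each a list of 0/1/2 genotype codes.
    breeds: breed label of each animal, in the same order as geno.
    Assumption: this simple estimator stands in for the study's own.
    """
    groups = {}
    for row, breed in zip(geno, breeds):
        groups.setdefault(breed, []).append(row)
    n_total = len(geno)
    scores = []
    for j in range(len(geno[0])):
        p_total = allele_freq([row[j] for row in geno])
        h_t = 2 * p_total * (1 - p_total)  # expected heterozygosity, pooled
        h_s = 0.0  # size-weighted mean within-breed heterozygosity
        for rows in groups.values():
            p = allele_freq([row[j] for row in rows])
            h_s += len(rows) / n_total * 2 * p * (1 - p)
        scores.append((h_t - h_s) / h_t if h_t > 0 else 0.0)
    return scores


def top_fraction(scores, fraction=0.01):
    """Indices of the highest-scoring SNPs (top 1% by default)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): 2 breeds x 20 animals, 200 random SNPs.
random.seed(0)
breeds = ["A"] * 20 + ["B"] * 20
geno = [[random.choice([0, 1, 2]) for _ in range(200)] for _ in breeds]
# Make SNP 7 fully breed-specific so that it should rank first.
for row, breed in zip(geno, breeds):
    row[7] = 0 if breed == "A" else 2

scores = fst_per_snp(geno, breeds)
selected = top_fraction(scores)  # top 1% of 200 SNPs -> 2 indices
print(selected, round(scores[7], 3))
```

In a real analysis the selection should be repeated inside each training fold, so that test animals do not influence which SNPs are kept; the abstract's phrase "in each run" is consistent with such a design, but the details are not given there.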

Key words: machine learning, breed classification, genomic markers, sheep, accuracy

CLC Number: S826.2

QIAO Liying, WANG Wannian, ZHANG Li, PANG Zhixu, ZHANG Siying, LI Yifan, LIU Wenzhong. Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers[J]. Acta Veterinaria et Zootechnica Sinica, 2025, 56(5): 2157-2167.

Figures/Tables (5): Table 1; Fig. 1; Fig. 2; Fig. 3; Fig. 4

Table 1

Breed	Number of individuals
Finnish sheep (Finn)	54
Icelandic sheep (Icelandic)	54
Romanov sheep (Romanov)	79
Texel sheep (Texel)	59
Duolang sheep (DL)	119
Hu sheep (Hu)	112
Large Tail Han sheep (LTH)	106
Sishui Fur sheep (SSF)	58
Small Tail Han sheep (STH)	102
Wadi sheep (WD)	146
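
As a rough point of reference for the reported accuracies, the sketch below totals the sample sizes in the table above and computes the accuracy of a trivial classifier that always predicts the largest breed. This baseline is not part of the study; it assumes the listed counts describe the animals that were classified, and the variable names are hypothetical.

```python
# Sample sizes per breed, keyed by the abbreviations used in the table.
breed_counts = {
    "Finn": 54, "Icelandic": 54, "Romanov": 79, "Texel": 59, "DL": 119,
    "Hu": 112, "LTH": 106, "SSF": 58, "STH": 102, "WD": 146,
}

total = sum(breed_counts.values())
majority_breed = max(breed_counts, key=breed_counts.get)
# Accuracy obtained by always predicting the most frequent breed.
baseline_accuracy = breed_counts[majority_breed] / total

print(total, majority_breed, round(baseline_accuracy, 3))  # 889 WD 0.164
```

Against this baseline of about 0.16, the lowest accuracy in the abstract (0.601 for NB) is still well above chance, and the best (0.953 for SVM) is far above it.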
