Acta Veterinaria et Zootechnica Sinica ›› 2025, Vol. 56 ›› Issue (5): 2157-2167. doi: 10.11843/j.issn.0366-6964.2025.05.016

• Genetics and Breeding •

Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers

QIAO Liying1,2, WANG Wannian1, ZHANG Li1, PANG Zhixu1, ZHANG Siying1, LI Yifan1, LIU Wenzhong1,2,*

  1. College of Animal Science, Shanxi Agricultural University, Taigu 030801, China
  2. Shanxi Key Laboratory of Animal Genetics Resource Utilization and Breeding, Taigu 030801, China
  • Received: 2024-10-24 Online: 2025-05-23 Published: 2025-05-27
  • Contact: LIU Wenzhong E-mail: liyingqiao1970@163.com; tglwzyc@163.com
  • About the author: QIAO Liying (born 1970), female, from Dingxiang, Shanxi; senior experimentalist, mainly engaged in research on animal quantitative genetics. E-mail: liyingqiao1970@163.com
  • Funding:
    Shanxi Province "Yanyun White Sheep" Breeding Joint Research Project (NYGG23); Special Fund for the Construction of the Shanxi Modern Agricultural Industry Technology System (2024-14)

Abstract:

This study evaluated how effectively machine learning (ML) algorithms based on genomic markers classify breeds, and compared the performance of different ML algorithms in classifying sheep breeds. Single nucleotide polymorphism (SNP) loci were selected for 10 sheep breeds in 2 ways: the first selected loci by the fixation index (FST) between populations, and the second applied the Boruta feature selection algorithm to the FST-selected loci to screen them further. Eight ML algorithms of different types, including K-nearest neighbor, support vector machines (SVM), and adaptive boosting (AdaBoost), were used to classify the breeds. Accuracy was used to compare the SNP selection methods and the ML algorithms and to identify the best combination for sheep breed classification. The data included both genetically distant and genetically similar breeds, which supports the reliability of the subsequent analysis. With a top 1% selection criterion, the FST analysis selected 5 361 SNP loci in each run, and the Boruta algorithm ultimately retained 328 ± 11.7 SNP loci for ML-based breed classification, far fewer than the FST analysis. After multiple iterations, the SNP loci that Boruta marked as "confirmed" consistently scored higher than the shadow features and the loci in the other two categories. Most ML models achieved a classification accuracy above 0.9. The SVM model using Boruta-selected SNP loci achieved the highest accuracy (0.953), followed closely by AdaBoost (0.947), whereas the naive Bayes (NB) model using SNP loci selected by FST alone had the lowest accuracy (0.601).
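The FST-based "top 1%" filter described above can be illustrated with a small sketch. The abstract does not say which FST estimator or software was used, so the code below assumes a simple Wright-style estimate, FST = (H_T - H_S) / H_T, computed from genotypes coded as 0/1/2 alternate-allele counts. The function names (`allele_freq`, `fst_per_snp`, `top_fraction`) are hypothetical, and the Boruta step is not reproduced here.

```python
from collections import defaultdict


def allele_freq(genotypes):
    """Alternate-allele frequency from genotypes coded 0/1/2."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(genotypes, breeds):
    """Wright-style FST = (H_T - H_S) / H_T for a single SNP.

    genotypes: list of 0/1/2 alternate-allele counts, one per animal.
    breeds:    list of breed labels, same length as genotypes.
    This estimator is an assumption; the study may have used another one.
    """
    by_breed = defaultdict(list)
    for g, b in zip(genotypes, breeds):
        by_breed[b].append(g)

    n_total = len(genotypes)
    p_bar = allele_freq(genotypes)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled sample
    if h_t == 0:
        return 0.0  # monomorphic SNP carries no differentiation signal

    # Sample-size-weighted mean of within-breed expected heterozygosity.
    h_s = 0.0
    for gs in by_breed.values():
        p = allele_freq(gs)
        h_s += (len(gs) / n_total) * 2 * p * (1 - p)
    return (h_t - h_s) / h_t


def top_fraction(scores, fraction=0.01):
    """Indices of the highest-scoring SNPs, keeping the given fraction."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    breeds = ["A"] * 4 + ["B"] * 4
    snps = [
        [2, 2, 2, 2, 0, 0, 0, 0],  # fixed difference between breeds
        [1, 1, 1, 1, 1, 1, 1, 1],  # identical frequencies in both breeds
    ]
    scores = [fst_per_snp(s, breeds) for s in snps]
    print(scores, top_fraction(scores, fraction=0.5))
```

In a real analysis the scores would be computed for every SNP on the array and `top_fraction(scores, 0.01)` would give the candidate loci passed on to the second selection step.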
Except for NB, the area under the receiver operating characteristic curve (AUC) of every model was close to 1. Both SNP selection methods showed strong discriminatory power, with slightly better performance after applying the Boruta algorithm. These results indicate that ML methods effectively improve the accuracy of breed classification and have good potential for application in sheep breed identification.
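The two evaluation measures named in the abstract, accuracy and AUC, can be sketched as follows. The abstract does not state how AUC was extended to 10 breeds, so this example assumes a macro-averaged one-vs-rest AUC. The function names and the toy scores are hypothetical and are not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of animals whose predicted breed matches the true breed."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels: booleans (True = positive class); scores: classifier scores.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, class_scores):
    """Macro-averaged one-vs-rest AUC (an assumed multiclass extension).

    class_scores: dict mapping each breed to a list of per-animal scores.
    """
    aucs = [
        binary_auc([t == breed for t in y_true], scores)
        for breed, scores in class_scores.items()
    ]
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    y_true = ["A", "A", "B", "B"]
    y_pred = ["A", "B", "B", "B"]
    class_scores = {
        "A": [0.9, 0.4, 0.5, 0.1],
        "B": [0.1, 0.6, 0.5, 0.9],
    }
    print(accuracy(y_true, y_pred), macro_ovr_auc(y_true, class_scores))
```

Accuracy depends on the single predicted label per animal, while AUC uses the full ranking of scores, which is one reason a model can rank breeds well and still differ from others in accuracy.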

Key words: machine learning, breed classification, genomic markers, sheep, accuracy

CLC Number: