Acta Veterinaria et Zootechnica Sinica, 2025, Vol. 56, Issue (5): 2157-2167. doi: 10.11843/j.issn.0366-6964.2025.05.016

• Animal Genetics and Breeding •

Machine Learning Methods for Sheep Breed Classification Based on Genomic Markers

QIAO Liying1,2, WANG Wannian1, ZHANG Li1, PANG Zhixu1, ZHANG Siying1, LI Yifan1, LIU Wenzhong1,2,*

  1. College of Animal Science, Shanxi Agricultural University, Taigu 030801, China
    2. Shanxi Key Laboratory of Animal Genetics Resource Utilization and Breeding, Taigu 030801, China
  • Received: 2024-10-24 Online: 2025-05-23 Published: 2025-05-27
  • Contact: LIU Wenzhong E-mail:liyingqiao1970@163.com;tglwzyc@163.com

Abstract:

This study aimed to evaluate the effectiveness of machine learning (ML) algorithms based on genomic markers for breed classification and to compare the performance of different ML algorithms in classifying sheep breeds. Two methods were used to select single nucleotide polymorphism (SNP) loci for 10 sheep breeds: the first selected loci by the fixation index (FST), and the second applied the Boruta feature selection algorithm to the FST-selected loci to screen them further. Eight ML algorithms, including K-nearest neighbor, support vector machine (SVM), and adaptive boosting (AdaBoost), were used to classify the breeds. Accuracy was used to compare the SNP selection methods and the ML algorithms in breed identification, and to identify the best combination for sheep breed classification. The data included both genetically distant and genetically similar breeds, which supports the reliability of the subsequent analysis. Based on the top 1% selection criterion, the FST analysis identified 5 361 SNP loci in each run, while the Boruta algorithm ultimately retained 328 ± 11.7 SNP loci for ML-based breed classification. After multiple iterations, SNP loci marked as "confirmed" by the Boruta algorithm consistently scored higher than the shadow features and the other two categories of SNP loci. The number of SNP loci retained by the Boruta algorithm was significantly lower than the number retained by the FST analysis. When ML models were used for breed classification, most models achieved an accuracy above 0.9. The SVM model using SNP loci selected by the Boruta algorithm achieved the highest classification accuracy (0.953), followed closely by AdaBoost (0.947). In contrast, the naive Bayes (NB) model using SNP loci selected solely by the FST analysis showed the lowest classification accuracy (0.601). Except for NB, the area under the receiver operating characteristic curve (AUC) of the models was close to 1. Both SNP selection methods demonstrated strong discriminatory power, with slightly better performance after applying the Boruta algorithm. The results indicate that ML methods effectively improve the accuracy of breed classification and show strong potential for application in sheep breed identification.
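The FST-based selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which FST estimator or software was used, so the code below assumes a simple unweighted Nei-style estimator, FST = (HT - HS) / HT, computed from genotypes coded as 0/1/2 copies of the alternate allele. The function names (`allele_freq`, `fst_per_snp`, `top_fraction`) and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Sequence


def allele_freq(genotypes: Sequence[int]) -> float:
    """Alternate-allele frequency from genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(geno_by_breed: Dict[str, List[List[int]]]) -> List[float]:
    """Per-SNP FST = (HT - HS) / HT (assumed estimator, unweighted by breed size).

    Each breed maps to a list of individuals; each individual is a list of
    0/1/2 genotype codes, one per SNP.
    """
    breeds = list(geno_by_breed.values())
    n_snps = len(breeds[0][0])
    scores = []
    for j in range(n_snps):
        freqs = [allele_freq([ind[j] for ind in breed]) for breed in breeds]
        p_bar = sum(freqs) / len(freqs)
        h_t = 2 * p_bar * (1 - p_bar)                          # total heterozygosity
        h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-breed
        scores.append(0.0 if h_t == 0 else (h_t - h_s) / h_t)
    return scores


def top_fraction(scores: Sequence[float], fraction: float = 0.01) -> List[int]:
    """Indices of the highest-scoring SNPs, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): 2 breeds, 2 individuals each, 3 SNPs.
demo = {
    "breed_a": [[0, 1, 2], [0, 1, 2]],
    "breed_b": [[2, 1, 2], [2, 1, 2]],
}
demo_scores = fst_per_snp(demo)
demo_selected = top_fraction(demo_scores, fraction=0.5)
```

In this toy data only the first SNP differs between the two breeds, so it is the one that is kept. A second-stage filter such as Boruta and the downstream classifiers would then operate on the selected columns only.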

Key words: machine learning, breed classification, genomic markers, sheep, accuracy
