Acta Veterinaria et Zootechnica Sinica, 2025, Vol. 56, Issue (9): 4410-4421. doi: 10.11843/j.issn.0366-6964.2025.09.023

• Genetics and Breeding •

Improving Genomic Prediction Accuracy via Auto-encoder-based Compression of Transcriptome Data

QIAN Li1, LIANG Mang1, DENG Tianyu1,2, DU Lili1, LI Keanning1, QIU Shiyuan1, XUE Qingqing1,3, ZHANG Lupei1, GAO Xue1, XU Lingyang1, ZHENG Caihong1, LI Junya1, GAO Huijiang1*

  1. Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China;
  2. College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China;
  3. Heilongjiang Bayi Agricultural University, Daqing 163319, China
  • Received: 2025-02-26 Published: 2025-09-30
  • Corresponding author: GAO Huijiang, whose research focuses on beef cattle genetics and breeding, Tel: 010-62516065, E-mail: gaohuijiang@caas.cn
  • About the author: QIAN Li (born 2000), male, from Xi'an, Shaanxi; master's student working on deep-learning methods for genomic selection, E-mail: pbli0201@163.com
  • Funding: National Key Research and Development Program of China (2024YFF1000102-5); National Natural Science Foundation of China (32172693)

Abstract: Traditional linear regression models struggle to capture the complex relationships between genotype and phenotype. To address this limitation, this study used machine learning to integrate omics data and improve the accuracy of genomic prediction. Two datasets containing both genotype and transcriptome data were used: 1) a Huaxi cattle dataset covering 3 economically important traits: pre-slaughter live weight, carcass weight, and net meat weight; 2) a rice dataset covering 3 agronomic traits: yield per plant, number of grains per panicle, and thousand-grain weight (KGW). Five-fold cross-validation was employed, and the Pearson correlation coefficient was used to evaluate the accuracy of estimated breeding values. We first compared prediction performance with single-omics data as input, and then applied an autoencoder to perform dimensionality reduction and construct latent matrices that served as new relationship matrices for modeling. The results showed that using transcriptome data instead of genomic data as model input improved prediction performance, with accuracy increases of 44.2% and 27.4% in the rice and Huaxi cattle datasets, respectively. Furthermore, using the latent matrices extracted by the autoencoder for modeling improved prediction accuracy by 4.10% in rice and 6.81% in Huaxi cattle relative to the genomic relationship matrix. Correlation analysis revealed strong nonlinear relationships between the latent matrices and the original omics data. Using transcriptome data as model input, combined with relationship matrices constructed by an autoencoder, can effectively improve the accuracy of selection and provides a reference for the continued improvement of breeding programs.

Key words: multi-omics data, feature extraction, machine learning, genomic prediction
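
The workflow summarized in the abstract (compress omics features with an autoencoder, build a relationship matrix from the latent codes, and score predictions by five-fold cross-validation with the Pearson correlation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-hidden-layer tanh architecture, the kernel ridge predictor with a fixed parameter `lam` (standing in for an estimated variance ratio), and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def encode_with_autoencoder(X, n_latent=8, n_epochs=200, lr=0.01, seed=0):
    """Train a one-hidden-layer autoencoder (tanh encoder, linear decoder)
    by full-batch gradient descent and return the latent codes.
    Architecture and hyperparameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W_enc = rng.normal(0.0, 0.1, (p, n_latent))
    b_enc = np.zeros(n_latent)
    W_dec = rng.normal(0.0, 0.1, (n_latent, p))
    b_dec = np.zeros(p)
    for _ in range(n_epochs):
        Z = np.tanh(X @ W_enc + b_enc)       # encoder output (latent codes)
        X_hat = Z @ W_dec + b_dec            # linear reconstruction
        err = (X_hat - X) / n                # gradient of 1/(2n) * squared error
        dZ = (err @ W_dec.T) * (1.0 - Z**2)  # backpropagate through tanh
        W_dec -= lr * (Z.T @ err)
        b_dec -= lr * err.sum(axis=0)
        W_enc -= lr * (X.T @ dZ)
        b_enc -= lr * dZ.sum(axis=0)
    return np.tanh(X @ W_enc + b_enc)


def relationship_matrix(F):
    """Standardize feature columns and return the n x n matrix F_s F_s^T / m."""
    Fs = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-8)
    return Fs @ Fs.T / Fs.shape[1]


def five_fold_accuracy(K, y, lam=1.0, n_folds=5, seed=0):
    """Mean Pearson correlation between predicted and observed phenotypes
    over the folds, using a kernel ridge predictor on the matrix K."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(y))
    folds = np.array_split(rng.permutation(idx), n_folds)
    accs = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        mu = y[train].mean()
        A = K[np.ix_(train, train)] + lam * np.eye(len(train))
        alpha = np.linalg.solve(A, y[train] - mu)
        y_hat = mu + K[np.ix_(test, train)] @ alpha
        accs.append(np.corrcoef(y_hat, y[test])[0, 1])
    return float(np.mean(accs))


if __name__ == "__main__":
    # Simulated stand-in for a standardized expression matrix and one trait.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 50))
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=100)
    K_raw = relationship_matrix(X)
    K_latent = relationship_matrix(encode_with_autoencoder(X))
    print("raw features  :", five_fold_accuracy(K_raw, y))
    print("latent matrix :", five_fold_accuracy(K_latent, y))
```

In this sketch the autoencoder is fitted on the features of all individuals without using phenotypes, so the cross-validation folds only affect the prediction step. Whether the published analysis handled this the same way is not stated in the abstract.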

CLC number: