Acta Veterinaria et Zootechnica Sinica ›› 2020, Vol. 51 ›› Issue (9): 2068-2078. doi: 10.11843/j.issn.0366-6964.2020.09.004

• Genetics and Breeding •

Study on the Strategies of Genotype Imputation

DENG Tianyu, DU Lixin*, WANG Lixian, ZHAO Fuping*

  1. Key Laboratory of Animal Genetics, Breeding and Reproduction (Poultry) of Ministry of Agriculture, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
  • Received: 2020-03-12 Online: 2020-09-25 Published: 2020-09-25
  • Corresponding authors: WANG Lixian, mainly engaged in research on pig genetics and breeding, E-mail: iaswlx@263.net; ZHAO Fuping, mainly engaged in research on statistical genomics, E-mail: zhaofuping@caas.cn
  • About the author: DENG Tianyu (1993-), male, from Chaoyang, Liaoning, master's student, mainly engaged in research on animal genetics and breeding, E-mail: 970997375@qq.com
  • Supported by:
    National Natural Science Foundation of China (31572357); National Swine Industry Technology System (CARS-35)

Abstract: Genomic data are increasingly used in livestock and poultry breeding. Genotype imputation is an important tool for processing genomic data, and the quality of the imputed genotypes directly affects downstream analyses, so a sound imputation strategy is needed to obtain good results. Using simulated data, this study examined how reference population size, the genetic relationship (distance) between the target and reference populations, the number (proportion) of target sites, minor allele frequency (MAF), and the imputation algorithm affect genotype imputation. The number of target sites was significantly and positively correlated with imputation quality (P<0.05) and was the main factor affecting imputation accuracy. Reference population size was the main factor affecting the imputation error rate of Beagle5.1, whereas the number of target sites was the main factor affecting that of Minimac4. The genetic distance between the target and reference populations had a more pronounced effect on the imputation quality of Beagle5.1 than on that of Minimac4. In general, sites with a higher MAF had a higher imputation error rate. When the reference population was small and the number of target sites was large, Minimac4 ran faster than Beagle5.1, but this trend reversed as the reference population size increased. Provided that imputation quality was maintained, Beagle5.1 placed relatively lower demands on the factors examined in this study. Specifically, Beagle5.1 performed better when the number of target sites was low and the reference population was large, whereas Minimac4 was better suited to a small reference population with a larger number of target sites. This study formulated different strategies for different imputation purposes and provides a reference for genotype imputation standards.

Key words: genotype imputation, simulation data, reference population size, imputation method, error rate
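The abstract relates the imputation error rate to the MAF of each site. As a rough illustration of how these two quantities might be computed on simulated genotypes, the following Python sketch may help. It is not the authors' pipeline: the function names, the sample sizes, the Hardy-Weinberg sampling, and the mode-based imputer (a toy stand-in for Beagle5.1 or Minimac4) are all assumptions made for this example.

```python
import random


def minor_allele_frequency(genotypes):
    """MAF of one biallelic site; genotypes are alternate-allele counts (0, 1 or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def imputation_error_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotypes whose imputed value differs from the true value."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")
    mismatches = sum(t != i for t, i in zip(true_genotypes, imputed_genotypes))
    return mismatches / len(true_genotypes)


def simulate_site(n_individuals, alt_freq, rng):
    """Draw genotypes assuming Hardy-Weinberg proportions (an assumption of this sketch)."""
    return [
        int(rng.random() < alt_freq) + int(rng.random() < alt_freq)
        for _ in range(n_individuals)
    ]


def mode_impute(reference_genotypes, n_targets):
    """Toy imputer (hypothetical): give every target the reference's most common genotype."""
    most_common = max(set(reference_genotypes), key=reference_genotypes.count)
    return [most_common] * n_targets


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed so repeated runs give the same numbers
    for alt_freq in (0.05, 0.20, 0.50):
        reference = simulate_site(500, alt_freq, rng)  # reference population
        truth = simulate_site(200, alt_freq, rng)      # masked target genotypes
        imputed = mode_impute(reference, len(truth))
        maf = minor_allele_frequency(reference)
        error = imputation_error_rate(truth, imputed)
        print(f"alt_freq={alt_freq:.2f}  MAF={maf:.3f}  error_rate={error:.3f}")
```

In a real evaluation, the `mode_impute` call would be replaced by genotypes returned from the imputation software, and the error rate would be summarized within MAF bins across many sites.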

CLC Number: