CN117802091A


Info

Publication number: CN117802091A
Application number: CN202311854898.4A
Authority: CN
Inventors: 王健; 孟和; 赵洪昌; 周浩; 孙国波; 董飚; 朱文奇; 穆晓恵; 李晓鸣; 王军; 赵孟丽; 杨文豪; 张干生; 纪荣超
Original assignee: Taizhou Fengda Agriculture And Animal Husbandry Technology Co ltd; Jiangsu Agri Animal Husbandry Vocational College
Current assignee: Taizhou Fengda Agriculture And Animal Husbandry Technology Co ltd; Jiangsu Agri Animal Husbandry Vocational College
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-02

Abstract

Translated from Chinese

The present invention relates to the field of biological sciences and, more specifically, to a method for assembling a high-quality goose genome sequence using multiple sequencing technologies. The method combines Pacific Biosciences (PacBio) HiFi reads, Oxford Nanopore (ONT) ultra-long reads, Illumina short reads, and chromatin conformation capture (Hi-C) data to assemble a high-quality goose genome sequence. The resulting telomere-to-telomere (T2T), chromosome-level goose genome lays an important foundation for future genetic improvement of geese and for the analysis of the underlying genetic mechanisms.

Description


A T2T genome assembly method for a local goose breed

Technical field

The present invention relates to the technical field of molecular markers, and specifically to a T2T genome assembly method for a local goose breed.

Background

The domestic goose (Anser cygnoides domesticus) is an important agricultural poultry species, widely raised for meat, eggs, and ornamental purposes. Geese were domesticated more than 6,000 years ago, together with chickens and ducks, making them among the earliest poultry domesticated by humans. Geese grow rapidly, show strong disease resistance and a highly developed capacity for hepatic lipid storage, and are well adapted to roughage-based feeding. Compared with other terrestrial poultry such as chickens, geese have distinctive biological characteristics. For example, they have low susceptibility to certain avian viruses: although they may carry a virus, they rarely show symptoms of infection, and thus act as a natural reservoir of avian viruses. In addition, the goose liver accumulates large amounts of fat yet is not prone to fibrosis or necrosis, which suggests unique lipid storage and metabolic properties and provides an important reference for research on human lipid metabolism disorders. With the development of genomics, assembling and analyzing the domestic goose genome has become an important route to understanding its genetic characteristics and biological functions. Over the past few years, advances in goose genome sequencing have made it possible to explore its genome structure and function more fully.

Lu et al. first sequenced and analyzed the goose genome in 2015 (Lu et al., 2015). Using second-generation sequencing data assembled with SOAPdenovo (Li R et al., 2010), they obtained a 1.12 Gb draft goose genome comprising 1,049 scaffolds with a scaffold N50 of 5.2 Mb. Gao et al. subsequently published the genome sequence of a female Sichuan white goose in 2016; resequencing of the swan goose (Anser cygnoides), the ancestor of the domestic goose, indicated that the two diverged 3.4-6.3 million years ago (Gao et al., 2016). In 2020, a chromosome-level goose genome was released: a 1.11 Gb Tianfu goose genome with contig N50 and scaffold N50 values of 1.85 Mb and 33.12 Mb, respectively. That assembly contains 39 pseudochromosomes (2n = 78), accounting for approximately 88.36% of the total goose genome size (Li et al., 2020). In the past two years, high-quality chromosome-level reference genomes for the Xingguo gray goose (Ouyang et al., 2022) and the Lion-head goose (Zhao et al., 2023) have also been released, providing valuable genetic resources and data for goose breeding and biological research.

However, owing to earlier technical limitations, existing goose genomes still contain many missing regions, mainly centromeres, telomeres, and other highly repetitive regions that carry important genetic information. Telomeres are highly repetitive DNA sequences at chromosome ends that protect chromosomes from degradation (Shay et al., 2019). Centromeres are another distinctive chromosomal domain and serve as the assembly site for the kinetochore during chromosome segregation (Wu et al., 2011). Centromeric DNA usually consists of satellite DNA and is among the fastest-evolving sequence in eukaryotic genomes (Francesca, 2022). With advances in sequencing technology, ultra-long Oxford Nanopore Technologies (ONT) reads and Pacific Biosciences (PacBio) high-fidelity (HiFi) reads are widely used to fill gaps in animal and plant genomes. Integrating third-generation DNA sequencing with second-generation Hi-C data makes a complete, gap-free goose genome assembly achievable. Recently, a T2T genome of the domestic chicken was released, filling most of the gaps in the previous assembly and revealing the structural characteristics of chicken telomeres and centromeres (Huang et al., 2023). No gap-free goose reference genome, however, has yet been reported. In this study, we assembled a gap-free domestic goose genome for the first time, using multiple assembly strategies and high-coverage, accurate long-read data. The assembly reveals, for the first time, the structural characteristics of the highly repetitive regions (centromeres and telomeres) of the goose, providing a basis for better analysis of the structure and function of the goose genome.

Summary of the invention

The present invention aims to provide a T2T genome assembly method for a local goose breed. The assembled T2T, chromosome-level goose genome lays an important research foundation for future genetic improvement of geese and for the analysis of genetic mechanisms.

The invention provides a T2T genome assembly method for a local goose breed, comprising the following steps:

Step 1: Sample collection and sequencing

(1) Select one adult female Taihu goose from the Taihu goose conservation population and collect wing-vein blood, breast muscle, and tissue samples from six organs. Then extract DNA and RNA from the samples.

(2) DNA library construction and sequencing: for the DNA extracted from the blood sample in step (1), combine third-generation long-read sequencing with second-generation sequencing to obtain complete genome fragments.

(3) Hi-C library construction and sequencing: cross-link the breast muscle tissue from step (1) in formaldehyde solution for Hi-C library construction and sequencing.

(4) RNA library construction and sequencing: perform second-generation transcriptome sequencing on each of the six tissues from step (1); to improve the accuracy of gene annotation, also mix the six tissues in equal amounts and perform third-generation full-length transcriptome sequencing.

Step 2: Genome sequence map construction

(1) Estimate the Taihu goose genome size with the K-mer method from the second-generation short-read sequencing data.

(2) Assemble the genome using Hifiasm (v0.18.5) together with NextDenovo (v2.4.0).

(3) Fill gaps in the assembled scaffold sequences using quarTeT.

(4) Use BUSCO (v5.4.5), which calls MetaEuk (v6.a5d39d9) for gene structure prediction and HMMER (v3.3.2) to align the predicted gene sequences against the eukaryotic avian reference dataset. The alignment quality and coverage of the predicted genes relative to the reference sequences indicate the completeness of the Taihu goose genome assembly, that is, whether the genome contains these conserved genes.

(5) Annotate the repetitive sequences of the goose genome using RepeatMasker (v4.1.5).

(6) To identify telomeres and centromeres in the goose genome, use the animal-type repeat "TTAGGG" as the goose telomere recognition sequence and identify telomeres with the TeloExplorer function of quarTeT (v1.1.3).

(7) To study karyotype-level similarity between the goose and other poultry (duck and chicken), perform a collinearity (synteny) comparison of the assembled goose chromosomes against the duck and chicken chromosome-level genomes using NGenomeSyn (v1.39).

Preferably, the six organ tissue samples in step 1 are brain, heart, liver, spleen, lung, and kidney.

Preferably, the sample DNA in step 1 is extracted with the TIANGEN Blood/Cell/Tissue Genomic DNA Extraction Kit (TIANGEN® DP304), and total RNA is extracted from the sample tissues strictly according to the instructions of the TIANGEN TRNzol Universal Total RNA Extraction Kit (TIANGEN® DP424).

Preferably, the DNA library construction and sequencing in step 1 comprise library construction for third-generation ultra-long sequencing, HiFi sequencing, and second-generation short-read sequencing.

Preferably, the Hi-C library construction and sequencing in step 1 proceed as follows. The pellet is resuspended in lysis buffer and the cells are resuspended in NEB buffer. The nuclei are then solubilized with dilute SDS, the DNA is digested with the four-base cutter MboI, and the DNA ends are labeled with biotin-14-dCTP; after labeling, T4 DNA polymerase is used to remove biotin. Ligation is then performed with T4 DNA ligase. Finally, after DNA purification, paired-end 150 bp sequencing is performed on the Illumina HiSeq platform.

Preferably, for the RNA library construction and sequencing in step 1, total RNA is isolated from the organ tissues using the EasyPure RNA Kit (TransGen). Sequencing libraries are then prepared with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA), and paired-end (2×125 bp) sequencing is performed on the Illumina HiSeq X Ten platform. For the pooled sample, a full-length transcript library is constructed and sequenced on the PacBio Sequel system (Pacific Biosciences, CA, USA).

Preferably, for the genome size estimation in step 2, the paired-end sequencing library data are analyzed statistically: the K-mer distribution is obtained with Jellyfish, and GenomeScope (v2.0) is then used to model that distribution, giving a preliminary view of the characteristics of the Taihu goose genome.

Preferably, for the genome assembly in step 2, the genome is first assembled separately from three data combinations: HiFi data alone, HiFi + Hi-C data, and HiFi + ONT ultra-long reads + Hi-C data. In addition, the HiFi + ONT ultra-long reads + Hi-C data are assembled with NextDenovo. To further improve assembly quality, duplicate contigs are removed with run_purge_dups.py (v1.2.4). Finally, based on the N50 values, the HiFi + ONT + Hi-C assembly is selected for downstream analysis. Because ONT ultra-long reads have relatively low accuracy, the HiFi data are used to error-correct the ONT data.

Preferably, for the gap filling in step 2, the following parameters are used: "-GapFiller -g *fasta -t 30 -l 5000 -i 60", with reference to the genome data already assembled by the multiple methods.

The invention has the following beneficial effects:

(1) It fills the gaps on most chromosomes of the existing goose reference genome; 33 autosomes are completely gap-free, providing a more comprehensive genome reference for goose genetic research.

(2) A high-quality goose genome sequence covering both autosomes and sex chromosomes is assembled, providing an important basis for studying sex determination and reproductive mechanisms in geese.

(3) Genome annotation identifies a large number of genes and mRNAs, providing an important resource for studying the biological characteristics, growth and development, and disease resistance of geese.

Description of the drawings

Figure 1 shows the sample photograph and genome quality assessment for Example 1: (A) morphology of the Taihu goose; (B) genome size estimation with GenomeScope 2; (C) whole-genome Hi-C heat map of the Taihu goose.

Figure 2 is a Circos plot of the 41-chromosome Taihu goose genome assembly of Example 1. From outside to inside, the rings show (a) the Taihu goose chromosomes, (b) GC density, (c) exon density, (d) CDS density, (e) lncRNA density, (f) mRNA density, and (g) gene density; rings b-g use 100 kb windows. The innermost circle shows the collinearity of homologous genes on different chromosomes.

Figure 3 shows the distribution of centromeres, telomeres, and gaps in the assembled goose genome of Example 1; the chromosome heat map indicates gene density and the wavy line indicates the density of repetitive regions.

Figure 4 shows the whole-genome alignment of the goose chromosomes of Example 1 against the duck and chicken genomes.

Detailed description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The drawings form part of the invention and, together with the embodiments, serve to explain its principles; they are not intended to limit its scope.

A T2T genome assembly method for a local goose breed comprises the following steps:

Step 1: Sample collection and sequencing

One adult female Taihu goose was selected from the Taihu goose conservation population, and wing-vein blood, breast muscle, and brain, heart, liver, spleen, lung, and kidney tissue samples were collected. DNA and RNA were then extracted from the samples. Sample DNA was extracted with the TIANGEN Blood/Cell/Tissue Genomic DNA Extraction Kit (TIANGEN® DP304), and total RNA was extracted from the sample tissues strictly according to the instructions of the TIANGEN TRNzol Universal Total RNA Extraction Kit (TIANGEN® DP424).

DNA library construction and sequencing comprised library construction for third-generation ultra-long sequencing, HiFi sequencing, and second-generation short-read sequencing. For the DNA from the blood sample, third-generation long-read sequencing was combined with second-generation sequencing to obtain complete genome fragments.

The breast muscle tissue was cross-linked in formaldehyde solution for Hi-C library construction and sequencing.

For Hi-C library construction and sequencing, the pellet was resuspended in lysis buffer and the cells were resuspended in NEB buffer. The nuclei were then solubilized with dilute SDS, the DNA was digested with the four-base cutter MboI, and the DNA ends were labeled with biotin-14-dCTP; after labeling, T4 DNA polymerase was used to remove biotin. Ligation was then performed with T4 DNA ligase. Finally, after DNA purification, paired-end 150 bp sequencing was performed on the Illumina HiSeq platform.

Each of the six tissues was subjected to second-generation transcriptome sequencing; to improve the accuracy of gene annotation, the six tissues were also mixed in equal amounts for third-generation full-length transcriptome sequencing. Total RNA was isolated from the organ tissues using the EasyPure RNA Kit (TransGen). Sequencing libraries were then prepared with the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA), and paired-end (2×125 bp) sequencing was performed on the Illumina HiSeq X Ten platform. For the pooled sample, a full-length transcript library was constructed and sequenced on the PacBio Sequel system (Pacific Biosciences, CA, USA).

Step 2: Genome sequence map construction

The Taihu goose genome size was estimated with the K-mer method from the second-generation short-read sequencing data. The paired-end library data were analyzed statistically, and the K-mer distribution was obtained with Jellyfish. GenomeScope (v2.0) was then used to model the K-mer distribution, giving a preliminary view of the characteristics of the Taihu goose genome.
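
The K-mer approach to genome size can be illustrated with a minimal Python sketch. It is not the Jellyfish or GenomeScope implementation: it counts K-mers in memory, ignores reverse complements, and uses the simplest estimator (total K-mers divided by the depth of the main peak) rather than the mixture model that GenomeScope fits. The function names and the `min_depth` cutoff are assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {depth: number of distinct k-mers observed exactly `depth` times}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Depths below `min_depth` are discarded as probable sequencing errors
    (an illustrative cutoff, not a value taken from the source).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)          # most common depth
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    toy_histogram = Counter({1: 1000, 19: 10, 20: 50, 21: 10})
    print(estimate_genome_size(toy_histogram))
```

On real data the histogram would come from a dedicated K-mer counter, and heterozygosity and repeat content would be modeled rather than ignored.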

Genome assembly was performed using Hifiasm (v0.18.5) together with NextDenovo (v2.4.0). The genome was first assembled separately from three data combinations: HiFi data alone, HiFi + Hi-C data, and HiFi + ONT ultra-long reads + Hi-C data. In addition, the HiFi + ONT ultra-long reads + Hi-C data were assembled with NextDenovo. To further improve assembly quality, duplicate contigs were removed with run_purge_dups.py (v1.2.4). Finally, based on the N50 values, the HiFi + ONT + Hi-C assembly was selected for downstream analysis. Because ONT ultra-long reads have relatively low accuracy, the HiFi data were used to error-correct the ONT data.
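
Selecting among candidate assemblies by N50 can be sketched as follows. This is a minimal illustration of the N50 statistic and of choosing the assembly with the highest value; the assembly names and contig lengths in the usage example are hypothetical and are not results from this work.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


def pick_best_assembly(assemblies):
    """Given {name: [contig lengths]}, return the name with the highest N50."""
    return max(assemblies, key=lambda name: n50(assemblies[name]))


if __name__ == "__main__":
    # Hypothetical contig lengths for three candidate assemblies.
    candidates = {
        "hifi": [40, 30, 20, 10],
        "hifi_hic": [60, 25, 15],
        "hifi_ont_hic": [80, 20],
    }
    print(pick_best_assembly(candidates))
```

N50 alone does not measure correctness, so in practice it is usually read alongside completeness and misassembly checks.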

Gaps in the assembled scaffold sequences were filled using quarTeT. The following parameters were used during gap filling: "-GapFiller -g *fasta -t 30 -l 5000 -i 60", with reference to the genome data already assembled by the multiple methods.
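
Before and after gap filling it is useful to know where gaps remain. The sketch below only locates gaps, on the common convention that gaps in a scaffold are written as runs of `N`; it does not reproduce how quarTeT fills them. The function names and the toy sequences are assumptions for illustration.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return half-open (start, end) coordinates of every run of N/n."""
    return [m.span() for m in re.finditer(r"[Nn]+", sequence)
            if m.end() - m.start() >= min_len]


def gap_free_sequences(assembly):
    """Given {name: sequence}, return the names that contain no gaps."""
    return [name for name, seq in assembly.items() if not find_gaps(seq)]


if __name__ == "__main__":
    toy = {"chr1": "ACGTACGT", "chr2": "ACGTNNNNACGT"}
    print(find_gaps(toy["chr2"]))
    print(gap_free_sequences(toy))
```

Counting the names returned by `gap_free_sequences` is one way to arrive at a figure such as the number of gap-free chromosomes.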

BUSCO (v5.4.5) was used, calling MetaEuk (v6.a5d39d9) for gene structure prediction and HMMER (v3.3.2) to align the predicted gene sequences against the eukaryotic avian reference dataset. The completeness of the Taihu goose genome assembly, that is, whether the genome contains these conserved genes, was evaluated from the alignment quality and coverage of the predicted genes relative to the reference sequences.

The repetitive sequences in the goose genome were annotated using RepeatMasker (v4.1.5).

To identify telomeres and centromeres in the goose genome, the animal-type repeat "TTAGGG" was used as the goose telomere recognition sequence, and telomeres were identified with the TeloExplorer function of quarTeT (v1.1.3).
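
The idea of recognizing telomeres by the "TTAGGG" repeat can be sketched in a few lines of Python. This is not the TeloExplorer algorithm; it simply counts copies of the repeat unit, on either strand, near each end of a sequence. The window size and function names are illustrative assumptions.

```python
import re

TELOMERE_UNIT = "TTAGGG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def count_terminal_repeats(sequence, unit=TELOMERE_UNIT, window=10_000):
    """Count copies of `unit` (either strand) in the first and last `window` bases.

    Returns a (head_count, tail_count) tuple.
    """
    seq = sequence.upper()
    pattern = re.compile(f"{unit}|{reverse_complement(unit)}")
    head = seq[:window]
    tail = seq[-window:]
    return len(pattern.findall(head)), len(pattern.findall(tail))


if __name__ == "__main__":
    toy = "CCCTAA" * 5 + "ACGT" * 10 + "TTAGGG" * 3
    print(count_terminal_repeats(toy, window=30))
```

A chromosome end would then be called telomeric when its count exceeds a chosen threshold; the threshold is a tuning choice and is not specified in the source.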

To study karyotype-level similarity between the goose and other poultry (duck and chicken), a collinearity (synteny) comparison of the assembled goose chromosomes against the duck and chicken chromosome-level genomes was performed using NGenomeSyn (v1.39).

Example 1: A T2T genome assembly method for a local goose breed

1. Sample collection and sequencing

1.1 Collection and extraction of sample DNA and RNA

The study sample was an adult female Taihu goose from the Taihu goose conservation population of the National Waterfowl Gene Bank (Jiangsu) (Figure 1A). The strategy used for goose genome assembly is shown in Figure 1B. Before slaughter, blood was drawn from the wing vein into a 5 ml anticoagulant blood collection tube (BD Vacutainer® EDTA), and DNA was extracted from it for subsequent sequencing. To obtain complete genome fragments, third-generation long-read sequencing was combined with second-generation sequencing. In addition, breast muscle tissue was cut into small pieces and cross-linked in formaldehyde solution for Hi-C library construction and sequencing. Brain, heart, liver, spleen, lung, and kidney tissues were also collected, cut into small pieces, placed in 1.8 ml cryogenic tubes (Nunc CryoTube), snap-frozen in liquid nitrogen, and stored temporarily in a -80°C ultra-low-temperature freezer (Haier DW-86L728J) for second-generation transcriptome sequencing. To improve the accuracy of gene annotation, the six tissue samples were also mixed in equal amounts for third-generation full-length transcriptome sequencing. All sampling procedures complied with the regulations of the Animal Welfare Committee of Jiangsu Agri-animal Husbandry Vocational College (animal ethics approval number 22110313195050999).

Sample DNA was extracted strictly according to the instructions of the TIANGEN Blood/Cell/Tissue Genomic DNA Extraction Kit (TIANGEN® DP304). After extraction, DNA quality was checked on a NanoDrop 2000 spectrophotometer. The acceptance criteria were an OD260/280 ratio between 1.8 and 2.0 and a concentration greater than 100 ng/μl. Finally, the DNA was run on a 2% agarose gel, and samples with acceptable bands were stored in a -80°C freezer (Haier DW-86L728J). Total RNA was extracted from the sample tissues strictly according to the instructions of the TIANGEN TRNzol Universal Total RNA Extraction Kit (TIANGEN® DP424). After extraction, RNA concentration and purity were measured, and samples that passed were stored in a -80°C freezer (Haier DW-86L728J).
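
The DNA acceptance criteria stated above (OD260/280 between 1.8 and 2.0, concentration above 100 ng/μl) can be written as a small check. This is only a sketch; the function and parameter names are hypothetical, and treating the 1.8 and 2.0 bounds as inclusive is an assumption.

```python
def dna_passes_qc(od260_280, conc_ng_per_ul):
    """Return True if a DNA sample meets the stated spectrophotometer criteria.

    Assumes the 1.8-2.0 purity range is inclusive and that the
    concentration must be strictly greater than 100 ng/ul.
    """
    return 1.8 <= od260_280 <= 2.0 and conc_ng_per_ul > 100


if __name__ == "__main__":
    # Hypothetical readings, not measurements from the source.
    for ratio, conc in [(1.85, 150.0), (1.70, 150.0), (1.90, 80.0)]:
        print(ratio, conc, dna_passes_qc(ratio, conc))
```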

1.2 DNA library construction and sequencing

Third-generation ultra-long sequencing of the sample genome followed the standard protocol provided by Oxford Nanopore Technologies (ONT). Genomic DNA was first randomly sheared with a Megaruptor (Diagenode, USA). Adapters were then prepared and ligated with the Nanopore SQK-LSK109 kit (Oxford Nanopore Technologies, USA), and the ligated DNA library was checked again on a Qubit 3.0 Fluorometer. Finally, the sample was loaded onto Nanopore R9.4 flow cells and sequenced on the PromethION platform. The sequencing statistics are given in Table 1: 577,228 reads totaling 52,490,712,237 bp, with a mean read length of 90,935.9 bp, an N50 of 100,823 bp, and a GC content of 42.82%.

HiFi sequencing used the PacBio single-molecule real-time (SMRT) circular consensus sequencing (CCS) library preparation method. A total of 100 μg of high-quality genomic DNA was sheared with Covaris g-TUBEs (Covaris) to a target fragment size of about 20 kb. The size distribution of the sheared DNA was then checked on an Agilent 2100 Bioanalyzer with a DNA 12000 chip (Agilent Technologies) to confirm that it met requirements. A sequencing library was constructed with the PacBio DNA Template Prep Kit 2.0 (Pacific Biosciences of California, Inc., CA) for HiFi sequencing on the PacBio RS II instrument (Pacific Biosciences of California, Inc.), and the library was loaded onto one SMRT Cell for sequencing. In total, 4,261,430 reads were obtained (Table 1), totaling 71,413,769,333 bp, with a mean read length of 16,758 bp, an N50 of 16,838 bp, and a GC content of 42.61%.

The second-generation short-read genomic library was constructed as follows. High-quality genomic DNA was first randomly sheared with a Covaris sonicator (Covaris, USA). An Illumina sequencing library with a target insert size of 350 bp was then constructed with the TruSeq Nano DNA HT Library Prep Kit (Illumina, USA). Finally, the purified library was loaded onto the Illumina NovaSeq 6000 platform for sequencing. In total, 385,826,042 reads were obtained, amounting to 57,873,906,300 bp of sequencing data with a GC content of 43.51%.
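
The per-library statistics quoted above (read count, total bases, mean read length, N50, GC content) can be computed from a set of reads as in the following sketch. It operates on an in-memory list of sequences; parsing FASTQ files is left out, and the function name and returned keys are assumptions for illustration.

```python
def read_stats(reads):
    """Summarize a non-empty list of read sequences.

    Returns read count, total bases, mean length, N50, and GC percentage.
    """
    lengths = sorted((len(r) for r in reads), reverse=True)
    total = sum(lengths)
    gc = sum(r.upper().count("G") + r.upper().count("C") for r in reads)

    # N50: walk from the longest read until half of all bases are covered.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "reads": len(lengths),
        "total_bases": total,
        "mean_length": total / len(lengths),
        "n50": n50,
        "gc_percent": 100 * gc / total,
    }


if __name__ == "__main__":
    print(read_stats(["GGCC", "AT", "ACGTAC"]))
```

The same relationship gives a quick consistency check on reported figures: for the Illumina library, 385,826,042 reads at 150 bp each equals the stated 57,873,906,300 bp.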

1.3 Hi-C测序文库构建和测序1.3 Hi-C sequencing library construction and sequencing

The Hi-C sequencing library for the sample was constructed and sequenced following a standard protocol with some modifications. First, breast muscle tissue was cross-linked with 4% formaldehyde solution at room temperature. The pellet was then resuspended in 20 μl of lysis buffer, and the nuclei were resuspended in 100 μl of NEB buffer. Next, the nuclei were solubilized with dilute SDS. The DNA was digested with the four-base cutter MboI, and the DNA ends were labeled with biotin-14-dCTP; after labeling, T4 DNA polymerase was used to remove biotin. Ligation was then performed with T4 DNA ligase. Finally, after DNA purification, paired-end 150 bp sequencing was performed on the Illumina HiSeq platform. The sequencing results are shown in Table 1: a total of 1,075,285,592 reads were obtained, comprising 161,292,838,800 bp; the reported average read length was 90,935.80 bp, the N50 was 100,823 bp, and the average GC content was 42.82%.
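The restriction digestion step can be pictured with a small in silico sketch. MboI recognizes the four-base site GATC and cuts immediately 5' of the G; the function name `mboi_fragments` and the toy sequence are assumptions made here for illustration and are not part of the described protocol.

```python
from typing import List


def mboi_fragments(seq: str) -> List[str]:
    """Split a sequence at every MboI site (cut immediately 5' of GATC)."""
    site = "GATC"
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)  # the cut falls just before the G of GATC
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    # Skip empty fragments (for example when a site sits at position 0).
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy sequence (hypothetical) with two GATC sites.
toy_fragments = mboi_fragments("AAGATCTTGATCGG")
print(toy_fragments)  # ['AA', 'GATCTT', 'GATCGG']
```

Because a four-base site is expected roughly every 256 bp in random sequence, a cutter of this kind produces the dense fragment ends that Hi-C ligation relies on.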

1.4 RNA library construction and sequencing

For RNA sequencing of the six samples, total RNA was first isolated separately from brain, heart, liver, spleen, lung, and breast muscle tissue using the EasyPure RNA Kit (Transgen). Sequencing libraries were then prepared using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA). Finally, paired-end (2×125 bp) sequencing was performed on the Illumina HiSeq X Ten platform. Detailed sequencing results are given in Table 1. Heart tissue yielded the most reads (45,882,692), and spleen tissue the fewest (38,462,044). Across the six tissues, the average total data volume was 6,393,354,550 bp, and the GC content was 46.04%.

For the full-length transcript library of the pooled sample, full-length transcript sequencing was carried out on the PacBio Sequel system (Pacific Biosciences, CA, USA). Following the Isoform Sequencing (Iso-Seq) protocol, cDNA was first synthesized and amplified using the NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module. The samples were then processed with the PacBio SMRTbell Express Template Prep Kit 2.0, including adapter ligation and addition of SMRTbell sequences. Next, size-selective purification with the ProNex® Size-Selective Purification System removed low-quality and short fragments, completing Iso-Seq library preparation. Finally, full-length transcript sequencing was performed on the Sequel System (Pacific Biosciences) to obtain high-quality full-length transcript sequences. In total, 48,373,842 reads were obtained, comprising 84,725,302,734 bp, with a mean read length of 1,751.50 bp, an N50 of 2,447 bp, and a GC content of 46.05%.

2. Genome sequence map construction

2.1 Genome size assessment

The genome size of the Taihu goose was estimated with the K-mer method, based on the short-read sequencing data. The K-mer distribution was obtained with Jellyfish from the paired-end library data. GenomeScope (v 2.0) was then used to fit a model to the K-mer distribution, giving a preliminary picture of the characteristics of the Taihu goose genome. In Figure 1C, the blue line shows the observed K-mer distribution in the Taihu goose sequencing reads, and the brown line shows K-mers arising from sequencing errors; because sequencing errors are random, these K-mers usually occur at low frequency. From this model, GenomeScope estimated a genome size of about 1.12 Gb and a heterozygosity of about 0.5%. The de novo assembly results indicate that the Taihu goose has a highly heterozygous genome.
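The idea behind K-mer based size estimation can be sketched in a few lines of Python: the genome size is approximately the total number of K-mers in the reads divided by the depth at the main peak of the K-mer histogram. The sketch below is a simplified illustration under stated assumptions (error-free toy reads, forward strand only, a single homozygous peak); the function names and toy data are hypothetical, and it does not reproduce the statistical model that GenomeScope fits or the counting performed by Jellyfish.

```python
import random
from collections import Counter

K = 21  # K-mer length chosen for this toy example (an assumption)


def kmer_counts(reads, k=K):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Estimate genome size as total k-mers / depth of the histogram peak.

    k-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored.
    """
    histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(solid, key=lambda d: solid[d])
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers // peak_depth


# Toy circular genome of 2,000 bp, tiled with 100 bp reads every 5 bp.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
circular = genome + genome[:99]
reads = [circular[s:s + 100] for s in range(0, 2000, 5)]

estimate = estimate_genome_size(kmer_counts(reads))
print(estimate)  # should be close to the true length of 2,000
```

In real data the histogram also contains a heterozygous peak at roughly half the main depth, which is what allows tools such as GenomeScope to estimate heterozygosity in addition to size.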

2.2 Genome assembly

Hifiasm (v 0.18.5) and NextDenovo (v2.4.0) were used together for assembly. With Hifiasm, the genome was assembled separately from HiFi data alone, from HiFi + Hi-C data, and from HiFi + ONT ultra-long reads + Hi-C data. In addition, NextDenovo was used to assemble the HiFi + ONT ultra-long reads + Hi-C data. The assembly results are shown in Table 2; the NextDenovo assembly performed best, with 244 contigs and an N50 of 33,928,929 bp, and was selected for downstream analysis. To further improve assembly quality, run_purge_dups.py (v 1.2.4) was used to remove duplicated contigs. Ultimately, based on the N50 values, the HiFi + ONT + Hi-C assembly was selected as the data for subsequent analysis. Because ONT ultra-long third-generation reads have relatively low accuracy, the HiFi data were used to error-correct the ONT data. Specifically, meryl (v 1.4) was used to count k-mer occurrences, winnowmap (v 2.03) was used to realign the HiFi data to the assembled genome, and falconc (v 1.15.0) was used for secondary filtering and removal of chimeric alignments. Finally, three rounds of error correction were performed with racon (v1.5.0), yielding the HiFi-corrected genome assembly. Next, Chromap (v 0.2.5) and yahs (v 1.2a.1) were used together with the Hi-C data to scaffold the genome and obtain complete scaffold sequences. To identify and name the assembled scaffolds, they were aligned to the published Lion-head goose genome (GCA_025388735.1). The alignment established the correspondence between the scaffolds and each chromosome of the Lion-head goose genome, and the scaffolds were renamed according to the matching autosomes 1-38 and the Z chromosome.
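The final renaming step, in which each scaffold is matched to a reference chromosome, can be sketched as a best-hit assignment by total aligned length. This is a minimal illustration under assumptions: the record layout `(scaffold, reference chromosome, aligned bp)`, the function name `assign_scaffolds`, and the toy values are all hypothetical, and the text does not state which aligner or criterion was actually used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def assign_scaffolds(
    alignments: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Assign each scaffold to the reference chromosome that receives
    the largest total aligned length."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for scaffold, ref_chrom, aligned_bp in alignments:
        totals[scaffold][ref_chrom] += aligned_bp
    return {
        scaffold: max(per_ref, key=per_ref.get)
        for scaffold, per_ref in totals.items()
    }


# Toy alignment records (hypothetical values, not from the patent).
toy_alignments = [
    ("scaffold_1", "chr1", 500_000),
    ("scaffold_1", "chr2", 20_000),
    ("scaffold_1", "chr1", 300_000),
    ("scaffold_2", "chrZ", 150_000),
    ("scaffold_2", "chr1", 10_000),
]
assignment = assign_scaffolds(toy_alignments)
print(assignment)  # {'scaffold_1': 'chr1', 'scaffold_2': 'chrZ'}
```

Summing aligned length per chromosome, rather than taking the single longest hit, makes the assignment less sensitive to short spurious alignments in repetitive regions.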

To obtain more complete sequence information for the goose W chromosome, an additional assisted assembly of the W chromosome was carried out, because published goose genome versions lack W chromosome sequence information. Using the genome of the duck, a close relative of the goose, as a reference, the "scaffold" module of ragtag.py (v2.1.0) was used to join the scaffolds that had not yet been placed into the goose W chromosome. With this strategy, a W chromosome 17.35 Mb in length was assembled. The W chromosome consists of 18 scaffolds, of which scaffold_42 is the main component, accounting for 9.63% of the total length. In the end, 38 autosomes and the two sex chromosomes, W and Z, were assembled, making this currently the most complete goose genome (Figure 2). It should be emphasized that, because of the structural complexity of sex chromosomes, they are much more difficult to assemble than autosomes. The assisted assembly approach, drawing on information from the duck W chromosome, was therefore needed to obtain a relatively complete goose W chromosome sequence. This work has important academic value for further research on the sex determination mechanism and genetic characteristics of geese.

2.3 Gap filling

Gaps in the assembled scaffold sequences were filled using quarTeT (v 1.1.3). The following parameters were used during gap filling: "-GapFiller -g *fasta -t 30 -l 5000 -i 60", with reference to the genome data already assembled by the multiple methods. The tool uses tetrad alignment information to fill gaps and draws on other relevant known genome information to improve filling accuracy. After gap filling, apart from a small number of gaps remaining on the two sex chromosomes, 33 autosomes were completely closed. Figure 2 shows the distribution of gaps on each chromosome.
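Remaining gaps in a scaffold are conventionally represented as runs of N, so their number and positions can be checked with a short script. The sketch below assumes that convention; the function name `find_gaps` and the toy sequence are introduced here for illustration and are not part of quarTeT.

```python
import re
from typing import List, Tuple


def find_gaps(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) of every run of N at least min_len long.

    Coordinates are 0-based and end-exclusive.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy scaffold (hypothetical) with a 4 bp gap and a 2 bp gap.
toy_scaffold = "ACGTNNNNACGTNNAC"
gaps = find_gaps(toy_scaffold)
print(gaps)  # [(4, 8), (12, 14)]
print(len(find_gaps(toy_scaffold, min_len=3)))  # 1
```

A chromosome for which `find_gaps` returns an empty list contains no N runs, which is the sense in which a chromosome is described as completely closed.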

2.4 Genome completeness assessment

BUSCO (v 5.4.5) (Seppey et al., 2019) was used to call metaeuk (v 6.a5d39d9) for gene structure prediction, and HMMER (v3.3.2) was used to compare the predicted gene sequences against the eukaryotic avian reference dataset. The completeness of the Taihu goose genome assembly, that is, whether the genome contains these conserved genes, was evaluated from the alignment and coverage of the predicted gene sequences relative to the reference sequences. The comparison statistics were used to identify single-copy (S) and multi-copy (D) genes in the assembled genome: 96.5% were found complete as single-copy genes, and 0.4% were found complete as multi-copy genes. In addition, key genome metrics were evaluated using Quast (v 5.2.0). The Taihu goose genome is 1,197,991,206 bp in size, with a scaffold N50 of 81,007,908 bp. Compared with published chromosome-level goose genomes, this assembly has the fewest scaffolds, only 73. Notably, its scaffold N50 exceeds 80 Mb, which is clearly better than previous genome versions. Detailed comparison results are shown in Table 3.

2.5 Gene annotation

Repetitive sequences in the goose genome were annotated using RepeatMasker (v 4.1.5). According to the summary statistics (Table 4), interspersed repeats account for 8.92% of the goose genome, with a total length of about 106.89 Mb. Of these, about 77.17 Mb (6.44%) are retroelements and 3.66 Mb are DNA transposons. About 4.87% of the Taihu goose genome consists of long interspersed nuclear elements (LINEs), the most abundant class of repeats in the genome. Notably, the avian retrotransposon CR1 (chicken repeat 1) is the most abundant, accounting for almost 100% of all LINEs. In addition, 1.49% of the Taihu goose genome consists of long terminal repeats (LTRs), and 0.08% consists of short interspersed nuclear elements (SINEs). After repeat masking, Liftoff (v 1.6.3) was used to annotate coding genes and mRNAs in the Taihu goose genome, with reference to the NCBI goose genome (GCF_002166845.1), its annotation, and transcriptome datasets. In total, 34,898 genes and 62,248 mRNAs were annotated.
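The repeat percentages can be cross-checked against the assembly size reported in section 2.4 (1,197,991,206 bp). The short sketch below does this arithmetic; it assumes that 1 Mb means 10^6 bp, and the function name `percent_of_genome` is introduced here for illustration.

```python
GENOME_BP = 1_197_991_206  # assembly size reported in section 2.4


def percent_of_genome(length_mb: float, genome_bp: int = GENOME_BP) -> float:
    """Convert a length in Mb (assumed 10**6 bp) to a percentage of the genome."""
    return 100 * length_mb * 1_000_000 / genome_bp


interspersed = percent_of_genome(106.89)   # interspersed repeats
retroelements = percent_of_genome(77.17)   # retroelements
dna_transposons = percent_of_genome(3.66)  # DNA transposons

print(f"{interspersed:.2f}%")   # 8.92%
print(f"{retroelements:.2f}%")  # 6.44%
print(f"{dna_transposons:.2f}%")

# Average number of annotated mRNAs per annotated gene.
mrnas_per_gene = 62_248 / 34_898
print(f"{mrnas_per_gene:.2f}")
```

Both recomputed percentages agree with the values stated in the text, which suggests that the reported fractions are expressed relative to the full assembly length.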

2.6 Telomere and centromere identification

To identify telomeres and centromeres in the goose genome, the animal telomere repeat "TTAGGG" was used as the goose telomere recognition sequence, and telomeres were identified with the TeloExplorer function of quarTeT (v 1.1.3). The largest numbers of telomere repeats were found within the 10,000 bp windows at the two ends of chromosome 3, with 1,101 and 1,793 copies, respectively; the telomere distribution is shown schematically in Figure 3. Centromeres were identified with the centromics software (https://github.com/ShuaiNIEgithub/Centromics), using the ONT and HiFi datasets together with the Hi-C data on the assembled genome. Centromere positions on the chromosomes were determined from the peaks of the Hi-C and TR-CL2 (length sequencing captures the fixation of chromosome conformation) data in the results. The centromere positions are marked in the chromosome ideogram (Figure 3).
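Counting telomere repeats in the terminal windows of a chromosome can be sketched as follows. This is a simplified illustration, assuming exact matches to TTAGGG or its reverse complement CCCTAA; the function names and the toy sequence are hypothetical, and the actual TeloExplorer algorithm may differ.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_telomere_repeats(
    seq: str, window: int = 10_000, unit: str = "TTAGGG"
) -> Tuple[int, int]:
    """Count telomere repeat units in the first and last `window` bases.

    Both orientations of the unit are counted, because the repeat
    appears as CCCTAA on one end of the forward strand and as TTAGGG
    on the other.
    """
    seq = seq.upper()
    rc_unit = reverse_complement(unit)

    def count(region: str) -> int:
        return region.count(unit) + region.count(rc_unit)

    return count(seq[:window]), count(seq[-window:])


# Toy chromosome (hypothetical): 5 repeats at one end, 3 at the other.
toy_chromosome = "CCCTAA" * 5 + "ACGT" * 10 + "TTAGGG" * 3
toy_counts = count_telomere_repeats(toy_chromosome, window=30)
print(toy_counts)  # (5, 3)
```

With the default 10,000 bp window, applying the same function to each assembled chromosome would give per-end counts comparable in kind to the 1,101 and 1,793 reported for chromosome 3.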

2.7 Genome collinearity between species

To study the karyotype-level similarity between the goose and two other poultry species, duck and chicken, NGenomeSyn (v1.39) was used to perform collinearity alignments between the assembled goose chromosomes and the duck and chicken chromosome-level genomes. As shown in Figure 4, most of the long duck chromosomes (chromosomes 1-9) have a one-to-one counterpart in the goose genome, and the similarity is especially high for the Z and W chromosomes. This is consistent with ducks and geese, both waterfowl, sharing similar living habits and taxonomic placement. In contrast, only a small portion of the chicken genome was consistent with the goose genome in the collinearity alignment. Although chickens and geese are both poultry, they differ markedly in living habits and evolutionary relationships. These results indicate that ducks are more closely related to geese than chickens are.

The above is only a preferred specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.