Introduction
The snake pipefish
Entelurus aequoreus (Linnaeus 1758) is a member of the family Syngnathidae, which currently includes over 300 species of seahorses and pipefishes
[1].
E. aequoreus shares typical features with other pipefishes, such as the unique, elongated body plan and fused jaws
[2]. However, unlike most pipefishes, which are found in benthic habitats, the snake pipefish inhabits more open and deeper seagrass environments and occurs even in pelagic waters
[2]. They are ambush predators on small crustaceans and other invertebrates, thereby indirectly contributing to the overall biodiversity and stability of these fragile habitats
[3]. Adult snake pipefishes are poor swimmers equipped with small fins. They rely on their elongated, thin bodies for crypsis in eelgrass habitats
[4–6].
The snake pipefish historically ranged from the waters of the Azores northwards to the waters of Norway and Iceland and eastward to the Baltic Sea
[7]. However, since 2003, the species has expanded its distribution
[8] into the Arctic waters of Spitsbergen
[9], the Barents Sea, and the Greenland Sea
[10]. Simultaneously, population sizes seem to increase within its former range, as indicated by substantially increased catch rates
[11,12]. Several factors have been proposed to cause this expansion and population growth, including rising sea temperatures, an increased potential for long-distance dispersal of juveniles via ocean currents
[4,7], and an increased reproductive success facilitated by the dispersal of invasive seaweeds
[6,8–10,13]. The latter explanation has been confirmed by local field experiments in the northern Wadden Sea, suggesting a mutual co-occurrence of the invasive Japanese seaweed (
Sargassum muticum) and the snake pipefish
[5]. Studies based on mtDNA marker regions did not discern any population structure thus far and suggest a previous population expansion in the Pleistocene, around 50–100 thousand years ago (kya)
[6]. Yet, a comprehensive analysis of demographic events is better conducted using genomic data, thus requiring a high-quality reference genome, ideally of the same species or at least a closely related one.
Previously, genomes of Syngnathidae have been used to study the evolution of highly specialized morphologies and life-history traits unique to pipefishes and seahorses
[14–16]. For instance, the transition to male pregnancy was associated with major genomic restructuring events and parallel modifications of the adaptive immune system. There is a remarkable variability in genome sizes within the family, with estimates ranging from 350 Mbp to 1.8 Gbp
[14]. The major shifts in body shape are assumed to be related to gene-family loss and expansion events, along with higher rates of protein and nucleotide evolution
[16]. Genomic data obtained using a direct sequencing approach of ultra-conserved elements improved the understanding of the phylogeny of pipefishes
[15] and identified a likely radiation of the group in the waters of the modern Indo-Pacific Ocean. Nevertheless, high-quality genomes of Syngnathidae are only available for a few species. According to the NCBI genome database, only 7% of the known species diversity has genome sequences available.
A draft genome of the snake pipefish was previously assembled using a combination of paired-end and mate-pair sequencing techniques, yielding an assembly with low continuity (N50 3.5 kbp, BUSCO C: 21%) and a large difference between the estimated and assembled genome sizes (1.8 Gbp vs. 557 Mbp)
[14]. To obtain a higher quality, near chromosome-scale genome assembly for the snake pipefish, essential for future population, conservation, and evolutionary genomics studies of fish, we used long-read sequencing technologies. This allowed us to gain insights into the genetic properties of the species and to perform demographic analyses based on the Pairwise Sequentially Markovian Coalescent (PSMC) framework
[17]. The data generation and analyses presented here were conducted during a six-week master course in 2021 at the Goethe University, Frankfurt am Main, Germany. The concept of high-quality genome sequencing in a course setting has so far yielded three reference-quality fish genomes and has proven to be a successful approach to introducing the technology to a new generation of scientists
[18–21].
Material and methods
Sampling, DNA extraction, and sequencing
A single individual of
Entelurus aequoreus (Linnaeus 1758) (NCBI:
txid42861, marinespecies.org:taxname:
127379) was caught by trawling during an annual monitoring expedition to the Dogger Bank in the North Sea in July 2021 (trawl start coordinates 54.993633, 2.940833; end coordinates 55.0077, 2.929867) with the permission of the Maritime Policy Unit of the UK Foreign and Commonwealth Office. The study complied with the ‘Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from Their Utilization’. The sample was initially frozen at −20 °C and later stored at −80 °C.
High-molecular-weight genomic DNA was extracted from muscle tissue, following the protocol by Mayjonade
et al. [33], with the addition of Proteinase K. We evaluated the quantity and quality of the DNA with the Genomic DNA ScreenTape on the Agilent 2200 TapeStation system (Agilent Technologies), as well as with the Qubit
® dsDNA BR Assay Kit.
For long-read sequencing, a PacBio SMRT Bell CLR library was prepared using the SMRTbell Express Prep kit v3.0 kit (Pacific Biosciences – PacBio, Menlo Park, CA, USA) and sequenced on the PacBio Sequel IIe platform. A proximity-ligation library was compiled with muscle tissue following the Dovetail™ Omni-C protocol (Dovetail Genomics, Santa Cruz, California, USA). In addition, a standard whole-genome 150 bp paired-end Illumina library was prepared using the NEBNext Ultra II library preparation kit (New England Biolabs Inc., Ipswich, USA). Finally, the proximity ligation and the paired-end library were shipped to Novogene (UK) for sequencing on the Illumina NovaSeq 6000 platform (RRID:SCR_016387).
Pre-processing and genome size estimation
The PacBio subreads were converted from BAM into FASTQ format using the PacBio Secondary Analysis Tool BAM2fastx v.1.3.0
[34]. Quality control, trimming, and filtering of the Illumina reads were performed using fastp v0.23.1 (RRID:
SCR_016962)
[35] with the settings
“-g -3 -l 40 -y -Y 30 -q 15 -u 40 -c -p -j -h -R -w N”. To estimate the genome size of the snake pipefish, we performed
k-mer profiling using the standard short-read Illumina data. We first ran Jellyfish v2.3.0 (RRID:
SCR_005491)
[36] to generate a histogram of
k-mers with a length of 21 bp. Subsequently, we used this data to obtain a genome profile using GenomeScope v2.0 (RRID:
SCR_017014)
[37]. We further tested alternative
k-mer lengths between 17- and 25-mers. No significant differences in the estimated genome size were detected except for the 17-mer, which resulted in a smaller genome size estimation of ∼500 Mbp.
Genome assembly and polishing
We assembled the genome from the PacBio long-read data using WTDBG v.2.5 (RRID:
SCR_017225)
[38]. The resulting assembly was first polished using the PacBio data with Flye v.2.9 (RRID:
SCR_017016)
[39], using Minimap v.2.17
[40] for mapping. Afterwards, we conducted two rounds of short-read polishing by mapping reads onto the assembly with BWA-MEM v.0.7.17 (RRID:
SCR_010910)
[41], followed by error correction with Pilon v1.23 (RRID:
SCR_014731)
[42].
Assembly quality control and scaffolding
The polished assembly contigs were anchored into chromosome-scale scaffolds utilizing the generated proximity-ligation Omni-C data. First, the data were mapped and filtered to the assembly following the Arima Hi-C mapping pipeline used by the Vertebrate Genome Project
[43]. In brief, reads were mapped using BWA-MEM v.0.7.17
[41], the mapped reads were filtered with samtools v.1.14 (RRID:
SCR_002105)
[44], and the duplicated reads were removed with “MarkDuplicates” in Picard v.2.26.10 (RRID:
SCR_006525)
[45]. The filtered mapped reads were then used for proximity-ligation scaffolding in YaHs v.1.1
[46]. Gaps in the scaffolded assembly were closed with TGS-GapCloser v.1.1.1 (RRID:
SCR_017633)
[47] using a subset (25%) of the PacBio subreads due to computational constraints. To further improve the assembly’s contiguity, scaffolding and gap-closing were performed a second time using a different subset of PacBio reads for gap-closing. The PacBio read subsets were generated with Seqtk v.1.3 (RRID:
SCR_018927)
[48] using the random number generator seeds 11 and 18. Gene set completeness was analyzed with BUSCO v.5.4.7
[49] using the Actinopterygii set of core genes (
actinopterygii_odb10). Assembly continuity was evaluated using QUAST v5.0.2 (RRID:
SCR_001228)
[50], and mapping rates were assessed by QualiMap v2.2.1 (RRID:
SCR_001209)
[51]. Finally, BlobToolsKit v.4.0.6
[52] performed contamination screening.
Repeat landscape analysis and genome annotation
The TE annotation was done in three steps. First, we used RepeatMasker v4.1.5
[53] to annotate and hard-mask known Actinopterygii repeats from Repbase (RRID:
SCR_021169), which comprises a database of eukaryotic repetitive DNA element sequences
[54]. Secondly, a
de novo library of TE was created from the hard-masked genome assembly using RepeatModeler v2.0.4
[55], which includes RECON v1.08 (RRID:
SCR_021170)
[56], RepeatScout v1.0.6 (RRID:
SCR_014653)
[57], as well as LTRharvest and LTR_retriever (RRID:
SCR_018970 and RRID:
SCR_017623, respectively)
[58,59]. Finally, predicted repeats were annotated with a second run of RepeatMasker on the hard-masked assembly obtained in the first run. The results of both RepeatMasker runs were then combined. A summary of TEs and the relative abundance of repeat classes in the genome are shown in Table
2 and Figure
2.
Table 2
Repeat content of the genome assembly. Class: class of the repetitive regions. Count: number of occurrences of the repetitive region. bpMasked: number of base pairs masked; %Masked: percentage of base pairs masked. LINE: Long Interspersed Nuclear Elements (include retroposons); LTR: Long Terminal Repeat elements (including retroposons); SINE: Short Interspersed Nuclear Elements; RC: Rolling Circle.
| Class | Count | bpMasked | %masked |
|---|
| ARTEFACT | 4 | 84 | 0.00% |
| DNA | 2,765,297 | 372,407,739 | 22.40% |
| LINE | 850,222 | 167,337,419 | 10.06% |
| LTR | 177,214 | 55,439,687 | 3.33% |
| PLE | 1 | 0 | 0.00% |
| RC | 32,348 | 3,385,084 | 0.20% |
| SINE | 435,464 | 32,709,572 | 1.95% |
| Unknown | 3,628,328 | 534,216,084 | 32.14% |
| Low complexity | 127,733 | 3,095,322 | 0.19% |
| Satellite | 21,221 | 7,145,469 | 0.43% |
| Simple repeat | 1,437,090 | 61,077,339 | 3.67% |
| rRNA | 4,394 | 534,599 | 0.03% |
| scRNA | 5 | 504 | 0.00% |
| snRNA | 695 | 46,845 | 0.00% |
| tRNA | 6,029 | 533,812 | 0.03% |
| Total | 9,486,045 | 1,237,929,559 | 74.93% |
The genome was annotated using the BRAKER3 pipeline (RRID:
SCR_018964)
[60–65], combining a
de novo gene calling and a homology-based gene annotation. For protein references, we combined the vertebrate-specific protein collection from OrthoDB (RRID:
SCR_011980) and the protein collection of the greater pipefish (
Syngnathus acus) genome
[30] made by the NCBI (see:
GCF_901709675.1, last accessed 12th October 2023). To further filter genes based on the support of introns and using extrinsic homology evidence, we used TSEBRA
[66] with an “intron_support=0.1”. The resulting set of proteins was tested for completeness using BUSCO v.5.4.7
[49] in “protein mode” and run against the Actinopterygii-specific set of core genes. Finally, functional annotation was done using InterProScan v5 (RRID:
SCR_005829)
[67].
Variant calling and demographic inference
The preprocessed short reads were mapped to the final assembly using BWA-MEM v.0.7.17
[41], followed by the removal of duplicate reads with “MarkDuplicates” in Picard v.2.26.10
[45] and the evaluation of the mapping quality using Qualimap v2.2.1
[51]. Indels in the BAM files were first identified and then realigned with “RealignerTargetCreator” and “IndelRealigner” as part of the Genome Analysis Toolkit v3.8-1
[68]. Subsequently, samtools v.1.14
[44] was used to check and remove unmapped, secondary, QC-failed, duplicated, and supplementary reads, keeping only reads mapped in proper pairs in non-repetitive regions of the 28 chromosome-scale scaffolds.
Sambamba v 1.0.0 (RRID:
SCR_024328)
[69] was used to estimate site depth statistics. Minimum and maximum thresholds for the global site depth were set to d ± (5 × MAD), where d is the global site depth distribution median and MAD is the median absolute deviation. Variant calling was performed using the bcftools v1.17 (RRID:
SCR_005227)
[70] commands “mpileup” and “call” [-m]. Variants were then filtered with bcftools “filter” [-e “DP < d − (5 × MAD) ∥ DP > d + (5 × MAD) ∥ QUAL < 30”], thus removing sites with low quality and out of range depth. Finally, bcftools was used to estimate the genome-wide heterozygosity as the proportion of heterozygous sites using the “stats” command.
Long-term changes in the effective population size (
Ne) over time were estimated using the PSMC model
[17]. This analysis used the diploid consensus genome sequences generated by bcftools v1.17
[70] with the script “
vcfutils.pl” from the processed BAM files, as described above. Sites with read-depth up to a third of the average depth or above twice each sample’s median depth and with a consensus base quality < 30 were removed. PSMC was executed using 25 iterations, employing a maximum 2
N0-scaled coalescent time of 15, an initial θ∕ρ ratio of 5, and 64 atomic time intervals (4 + 25 × 2 + 4 + 6) to infer the scaled mutation rate, the recombination rate, and the free population size parameters, respectively. We performed 100 bootstrap replicates by randomly sampling with replacement 1 Mb blocks from the consensus sequence for all individuals. A mutation rate 𝜇 of 1.7 × 10
−9 per site per generation
[71] and a generation length of 2.5 years
[72] were employed for plotting.