LiaOb21/arabidopsis_pangenome

The reference-free Arabidopsis thaliana pangenome

This repository collects all the commands and scripts used for the reference-free Arabidopsis thaliana pangenome. This work is part of the PhD project carried out by Lia Obinu.

Table of contents

Building the reference-free Arabidopsis thaliana pangenome graph

Get the assemblies

We found 93 suitable assemblies in total, at chromosome or complete-genome level. They come from the following BioProjects: PRJEB55353, PRJEB55632, PRJEB50694, PRJEB30763, PRJEB31147, PRJEB37252, PRJEB37257, PRJEB37258, PRJEB37260, PRJEB37261, PRJEB40125, PRJEB51511, PRJNA777107, PRJNA779205, PRJNA828135, PRJNA834751, PRJNA10719, PRJNA915353, PRJNA311266, PRJCA005809, PRJCA007112.

Quality check and trimming of the assemblies

Short contigs should be excluded from the pangenome, but we first need to set a length threshold; to choose one, we inspect the quality of the assemblies and their length distributions. We can do this using QUAST:

quast accessions/*.fna.gz accessions/*.fasta.gz -o ./output_quast -r accessions/GCA_000001735.2_TAIR10.1_genomic.fna.gz -t 40 --eukaryote --large -k --plots-format png

Based on the QUAST results, we decided to trim contigs smaller than 5 kbp. According to QUAST, the following assemblies will lose contigs:

| Assembly | # contigs (>= 0 bp) | # contigs (>= 5000 bp) | Total length (>= 0 bp) | Total length (>= 5000 bp) |
| --- | --- | --- | --- | --- |
| GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic | 109 | 78 | 119626746 | 119530365 |
| GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic | 111 | 83 | 120129838 | 120059751 |
| GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic | 102 | 71 | 119749512 | 119657772 |
| GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic | 105 | 75 | 120338059 | 120246237 |
| GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic | 94 | 74 | 120289901 | 120224609 |
| GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic | 184 | 155 | 122202079 | 122112347 |
| GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic | 142 | 115 | 120795191 | 120728700 |

For the filtering we can use bioawk.

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna > GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna
# check:
grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna | wc -l
---
109
grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna | wc -l
---
78

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna > GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna
# check:
grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna | wc -l
---
111
grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna | wc -l
---
83

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna > GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna
# check:
grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna | wc -l
---
102
grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna | wc -l
---
71

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna > GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna
# check:
grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna | wc -l
---
105
grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna | wc -l
---
75

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna > GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna
# check:
grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna | wc -l
---
94
grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna | wc -l
---
74

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna > GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna
# check:
grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna | wc -l
---
184
grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna | wc -l
---
155

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna > GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna
# check:
grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna | wc -l
---
142
grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna | wc -l
---
115
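If bioawk is not available, the same length filter can be sketched with plain awk. This is a toy illustration under the assumption that each sequence sits on a single line (bioawk handles wrapped sequences, this sketch does not); the file names are hypothetical:

```shell
# Toy demonstration of length filtering with plain awk: drop FASTA records
# whose (single-line) sequence is shorter than the cutoff.
workdir=$(mktemp -d)
# one 6000 bp record and one 4 bp record
printf '>long\n%s\n>short\nACGT\n' "$(printf 'A%.0s' $(seq 1 6000))" > "$workdir/toy.fna"
awk -v min=5000 '/^>/ {h=$0; next} length($0) > min {print h ORS $0}' \
    "$workdir/toy.fna" > "$workdir/toy.trim.5k.fna"
grep -c '>' "$workdir/toy.trim.5k.fna"   # only the >long record survives
```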

Now we should gzip the .trim.5k.fna files:

for i in *trim.5k.fna ; do (echo "gzip $i"); done | bash

From now on, we are going to use the trimmed version of these assemblies.

Renaming the assemblies

We can follow the S. cerevisiae example:

ls *.gz | cut -f 1 -d '.' | uniq | while read f; do
    echo $f
    zcat $f.* > $f.fa
done
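On a toy pair of gzipped files the grouping logic behaves like this (hypothetical file names; here the glob is restricted to `.gz` so the freshly written fasta is never re-read):

```shell
# Toy run of the per-accession concatenation: files sharing the prefix
# before the first '.' are merged into one fasta per accession.
workdir=$(mktemp -d)
cd "$workdir"
printf '>c1\nAAAA\n' | gzip > ACC_001.part1.fa.gz
printf '>c2\nCCCC\n' | gzip > ACC_001.part2.fa.gz
ls *.gz | cut -f 1 -d '.' | uniq | while read f; do
    echo $f
    zcat $f.*.gz > $f.fa
done
grep -c '>' ACC_001.fa   # both records end up in ACC_001.fa
```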

To change the sequence names according to PanSN-spec, we can use fastix:

ls *.fa | while read f; do
    sample_name=$(echo $f | cut -f 1 -d '.');
    echo ${sample_name}
    fastix -p "${sample_name}#1#" $f >> Arabidopsis.pg.in.fasta
done

With "${sample_name}#1#" we set haplotype_id to 1 for all the assemblies, as they are all haploid. The scaffold names now follow the pattern:

[sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
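What fastix does to each header can be sketched with sed; this is an illustration only, a simplified stand-in for the fastix call above:

```shell
# Prefix each FASTA header with "sample#haplotype#" (PanSN-style naming).
sample_name=GCA_000001735
echo '>CP002684.1 Arabidopsis thaliana chromosome 1' \
  | sed "s/^>/>${sample_name}#1#/"
# -> >GCA_000001735#1#CP002684.1 Arabidopsis thaliana chromosome 1
```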

Format fasta headers and remove organelles from assemblies

We should remove everything after the first space or tab in the fasta headers, as this can lead to errors in subsequent steps:

cat Arabidopsis.pg.in.fasta | sed -E '/^>/s/( +|\t).*//' > Arabidopsis.adj.pg.in.fasta
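A quick toy check of this cleanup (toy header; the real file is the multifasta above): everything after the first run of spaces or a tab is dropped from headers, while sequence lines are untouched.

```shell
# Header cleanup: strip the description part of FASTA headers.
printf '>GCA_000001735#1#CP002684.1 Arabidopsis thaliana\nACGT\n' \
  | sed -E '/^>/s/( +|\t).*//'
# -> >GCA_000001735#1#CP002684.1
#    ACGT
```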

We then should remove organelle scaffolds from our input file. Only three assemblies include mitochondrial (Mt) and plastid (Plt) sequences, and the corresponding scaffolds are:

>GCA_000001735#1#BK010421.1
>GCA_000001735#1#AP000423.1
>GCA_023115395#1#CP096029.1
>GCA_023115395#1#CP096030.1
>GCA_904420315#1#LR881472.1
>GCA_904420315#1#LR881471.1

We need to put these headers in a file that we will call Mt_and_Plt_headers.txt.

We can now exclude these scaffolds from our previous multifasta input file:

awk '(NR==FNR) { toRemove[$1]; next }     /^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }    p' Mt_and_Plt_headers.txt Arabidopsis.adj.pg.in.fasta > Arabidopsis.pg.input.fasta
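A toy run of this exclusion awk (hypothetical names): the first file loads the removal list, then records whose header matches an entry are dropped while everything else is kept.

```shell
# Toy demonstration of dropping listed FASTA records with awk.
workdir=$(mktemp -d)
printf '>ACC#1#MT001.1\n' > "$workdir/remove.txt"
printf '>ACC#1#CHR1.1\nAAAA\n>ACC#1#MT001.1\nCCCC\n' > "$workdir/in.fasta"
awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 }
     p' "$workdir/remove.txt" "$workdir/in.fasta" > "$workdir/out.fasta"
grep -c '>' "$workdir/out.fasta"   # 1: only the chromosome record remains
```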

We need to compress our input and to index it:

bgzip -@ 16 Arabidopsis.pg.input.fasta
samtools faidx Arabidopsis.pg.input.fasta.gz

Sequence Partitioning using partition-before-pggb

We can now run partition-before-pggb on our multifasta (note: if it is not possible to estimate the divergence, you should try different settings to identify the right value for -p):

partition-before-pggb -i Arabidopsis.pg.input.fasta.gz -o output_pbp_p90 -n 93 -t 100 -Y '#' -p 90 -s 10k -V 'GCA_000001735:#:1000' -m -D ./temp
  • -i input
  • -o output directory
  • -n number of haplotypes
  • -p percent identity for mapping/alignment
  • -s segment length for mapping [default: 5000]
  • -V to produce vcf against ref
  • -m generate MultiQC report of graphs' statistics and visualizations; automatically runs odgi stats
  • -D temporary files directory

Running pggb on each community

We can now run pggb on each community! :-)

To build the graph of each community we will use pggb -p 95.

community 0 - chr1

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.0.fa \
     -o output_pbp_p90/community.0.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 1 - chr2

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.1.fa \
     -o output_pbp_p90/community.1.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 2 - chr3

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.2.fa \
     -o output_pbp_p90/community.2.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 3 - chr4

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.3.fa \
     -o output_pbp_p90/community.3.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

community 4 - chr5

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.4.fa \
     -o output_pbp_p90/community.4.out \
     -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
     -k 23 -f 0 -B 10000000 \
     -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
     -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

Pangenome annotation - reference based

Get odgi untangle for annotation

To use odgi untangle for annotating the pangenome graph, we will switch to the untangle_for_annotation branch.

cd odgi
git checkout untangle_for_annotation && git pull && git submodule update --init --recursive
cmake -H. -Bbuild && cmake --build build -- -j 16

Prepare the annotation files for injection

Genes

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.gff.gz
gunzip GCA_000001735.2_TAIR10.1_genomic.gff.gz

Now, we will inject only the genes, so we need to extract only the gene rows from the TAIR10 gff and split them by chromosome.

Let's have a look at the gff file:

head GCA_000001735.2_TAIR10.1_genomic.gff
---
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10.1
#!genome-build-accession NCBI_Assembly:GCA_000001735.2
##sequence-region CP002684.1 1 30427671
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702
CP002684.1      Genbank region  1       30427671        .       +       .       ID=CP002684.1:1..30427671;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;mol_type=genomic DNA
CP002684.1      Genbank gene    3631    5899    .       +       .       ID=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
CP002684.1      Genbank mRNA    3631    5899    .       +       .       ID=rna-gnl|JCVI|mRNA.AT1G01010.1;Parent=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;locus_tag=AT1G01010;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1

community 0 - chr1

In order to inject the genes in the pangenome, we need to extract the information we need from the gff, and modify them according to the PanSN-spec naming:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002684.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr1.genes.adj.TAIR10.bed
head chr1.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002684.1  3631   5899   AT1G01010:3631-5899
GCA_000001735#1#CP002684.1  6788   9130   AT1G01020:6788-9130
GCA_000001735#1#CP002684.1  11101  11372  AT1G03987:11101-11372
GCA_000001735#1#CP002684.1  11649  13714  AT1G01030:11649-13714
GCA_000001735#1#CP002684.1  23121  31227  AT1G01040:23121-31227
GCA_000001735#1#CP002684.1  23312  24099  AT1G03993:23312-24099
GCA_000001735#1#CP002684.1  28500  28706  AT1G01046:28500-28706
GCA_000001735#1#CP002684.1  31170  33171  AT1G01050:31170-33171
GCA_000001735#1#CP002684.1  32727  33009  AT1G03997:32727-33009
GCA_000001735#1#CP002684.1  33365  37871  AT1G01060:33365-37871
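The field-splitting trick can be checked on a single toy gff row: splitting on tab, `ID=gene-` and `;Dbxref` leaves the locus tag in field $10 (the Dbxref value here is shortened for the example).

```shell
# One toy gff gene row through the same awk used for the real annotation.
printf 'CP002684.1\tGenbank\tgene\t3631\t5899\t.\t+\t.\tID=gene-AT1G01010;Dbxref=TAIR:AT1G01010\n' \
  | awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002684.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}'
# -> GCA_000001735#1#CP002684.1	3631	5899	AT1G01010:3631-5899
```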

Make sure that community 0 is chromosome 1 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.0.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002684.1

community 1 - chr2

In order to inject the genes in the pangenome, we need to extract the information we need from the gff, and modify them according to the PanSN-spec naming:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002685.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr2.genes.adj.TAIR10.bed
head chr2.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002685.1  1025   3173   AT2G01008:1025-3173
GCA_000001735#1#CP002685.1  2805   3176   AT2G03855:2805-3176
GCA_000001735#1#CP002685.1  3706   5513   AT2G01010:3706-5513
GCA_000001735#1#CP002685.1  5782   5945   AT2G01020:5782-5945
GCA_000001735#1#CP002685.1  6571   6672   AT2G01021:6571-6672
GCA_000001735#1#CP002685.1  7090   7505   AT2G03865:7090-7505
GCA_000001735#1#CP002685.1  7669   9399   AT2G03875:7669-9399
GCA_000001735#1#CP002685.1  9648   9767   AT2G01023:9648-9767
GCA_000001735#1#CP002685.1  9849   10177  AT2G00340:9849-10177
GCA_000001735#1#CP002685.1  50288  51487  AT2G01035:50288-51487

Make sure that community 1 is chromosome 2 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.1.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002685.1

community 2 - chr3

In order to inject the genes in the pangenome, we need to extract the information we need from the gff, and modify them according to the PanSN-spec naming:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002686.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr3.genes.adj.TAIR10.bed
head chr3.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002686.1  1609   4159   AT3G01015:1609-4159
GCA_000001735#1#CP002686.1  4342   4818   AT3G01010:4342-4818
GCA_000001735#1#CP002686.1  5104   6149   AT3G01020:5104-6149
GCA_000001735#1#CP002686.1  6657   7772   AT3G01030:6657-7772
GCA_000001735#1#CP002686.1  8723   12697  AT3G01040:8723-12697
GCA_000001735#1#CP002686.1  13046  15906  AT3G01050:13046-15906
GCA_000001735#1#CP002686.1  15934  16320  AT3G00970:15934-16320
GCA_000001735#1#CP002686.1  16631  18909  AT3G01060:16631-18909
GCA_000001735#1#CP002686.1  19409  20806  AT3G01070:19409-20806
GCA_000001735#1#CP002686.1  25355  27712  AT3G01080:25355-27712

Make sure that community 2 is chromosome 3 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.2.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002686.1

community 3 - chr4

In order to inject the genes in the pangenome, we need to extract the information we need from the gff, and modify them according to the PanSN-spec naming:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002687.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr4.genes.adj.TAIR10.bed
head chr4.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002687.1  1180   1536   AT4G00005:1180-1536
GCA_000001735#1#CP002687.1  2895   10504  AT4G00020:2895-10504
GCA_000001735#1#CP002687.1  10815  13359  AT4G00026:10815-13359
GCA_000001735#1#CP002687.1  13527  14413  AT4G00030:13527-14413
GCA_000001735#1#CP002687.1  14627  16079  AT4G00040:14627-16079
GCA_000001735#1#CP002687.1  17639  20183  AT4G00050:17639-20183
GCA_000001735#1#CP002687.1  21351  29064  AT4G00060:21351-29064
GCA_000001735#1#CP002687.1  29072  31426  AT4G00070:29072-31426
GCA_000001735#1#CP002687.1  32748  33756  AT4G00080:32748-33756
GCA_000001735#1#CP002687.1  33800  33872  AT4G00085:33800-33872

Make sure that community 3 is chromosome 4 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.3.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002687.1

community 4 - chr5

In order to inject the genes in the pangenome, we need to extract the information we need from the gff, and modify them according to the PanSN-spec naming:

awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002688.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr5.genes.adj.TAIR10.bed
head chr5.genes.adj.TAIR10.bed | column -t
---
GCA_000001735#1#CP002688.1  2      303    AT5G00730:2-303
GCA_000001735#1#CP002688.1  995    5156   AT5G01010:995-5156
GCA_000001735#1#CP002688.1  5256   5907   AT5G01015:5256-5907
GCA_000001735#1#CP002688.1  5339   5593   AT5G01017:5339-5593
GCA_000001735#1#CP002688.1  5917   8467   AT5G01020:5917-8467
GCA_000001735#1#CP002688.1  9780   13235  AT5G01030:9780-13235
GCA_000001735#1#CP002688.1  13128  16236  AT5G01040:13128-16236
GCA_000001735#1#CP002688.1  18086  20887  AT5G01050:18086-20887
GCA_000001735#1#CP002688.1  22684  24934  AT5G01060:22684-24934
GCA_000001735#1#CP002688.1  25012  25908  AT5G01070:25012-25908

Make sure that community 4 is chromosome 5 of TAIR10:

odgi paths -i Arabidopsis.pg.input.community.4.og -L | grep 'GCA_000001735'
---
GCA_000001735#1#CP002688.1

Pseudogenes

The original file from Mascagni et al., 2021 was modified as explained in scripts/bed_to_gff3.ipynb. The coordinates of pseudogenes annotated on the reverse strand were then converted to the forward strand for compatibility with the downstream analysis. This is what the final annotation file looks like:

head pseudogenes.txt
---
Chromosome  end     start   coverage  Genewise_score  position    size  Strand  Type  Pater        Species     Stop_codon  Frameshift
Chr1        6144    6269    0.09      56.39           UTR         129   plus    FRAG  AT1G01010.1  A.thaliana  0           0
Chr1        47087   47325   0.06      74.09           Intergenic  206   plus    FRAG  AT1G02730.1  A.thaliana  1           3
Chr1        89932   90153   0.02      24.75           Intergenic  51    plus    FRAG  AT5G41740.2  A.thaliana  1           3
Chr1        259011  259151  0.06      48.95           Intergenic  84    plus    FRAG  AT1G01690.1  A.thaliana  0           0
Chr1        266562  267495  0.99      68.96           UTR         979   plus    DUP   AT4G00525.1  A.thaliana  0           0
Chr1        426018  426176  0.06      17.28           Intergenic  36    plus    FRAG  AT3G04430.1  A.thaliana  1           1
Chr1        578305  578484  0.04      47.13           Intron_cds  102   plus    FRAG  AT1G05120.1  A.thaliana  0           0
Chr1        582401  582585  0.46      52.69           Intergenic  236   plus    SE    AT4G02160.1  A.thaliana  0           1
Chr1        582634  582971  0.57      117.33          Intergenic  247   plus    SE    AT5G61710.1  A.thaliana  0           1

community 0 - chr1

awk '$1 == "Chr1"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr1/GCA_000001735#1#CP002684.1/g' > Pseudogenes_TAIR10_chr1.bed
head Pseudogenes_TAIR10_chr1.bed
---
GCA_000001735#1#CP002684.1  6144    6269    GCA_000001735#1#CP002684.1:6144-6269
GCA_000001735#1#CP002684.1  47087   47325   GCA_000001735#1#CP002684.1:47087-47325
GCA_000001735#1#CP002684.1  89932   90153   GCA_000001735#1#CP002684.1:89932-90153
GCA_000001735#1#CP002684.1  259011  259151  GCA_000001735#1#CP002684.1:259011-259151
GCA_000001735#1#CP002684.1  266562  267495  GCA_000001735#1#CP002684.1:266562-267495
GCA_000001735#1#CP002684.1  426018  426176  GCA_000001735#1#CP002684.1:426018-426176
GCA_000001735#1#CP002684.1  578305  578484  GCA_000001735#1#CP002684.1:578305-578484
GCA_000001735#1#CP002684.1  582401  582585  GCA_000001735#1#CP002684.1:582401-582585
GCA_000001735#1#CP002684.1  582634  582971  GCA_000001735#1#CP002684.1:582634-582971
GCA_000001735#1#CP002684.1  893047  893493  GCA_000001735#1#CP002684.1:893047-893493
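The rename pipeline can be checked on one toy pseudogene row: a Chr1 interval becomes a PanSN-named BED record (the trailing columns stand in for the rest of the annotation table).

```shell
# One toy row through the Chr1 filter, BED projection, and PanSN rename.
printf 'Chr1\t6144\t6269\textra\tcolumns\n' \
  | awk '$1 == "Chr1"' \
  | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' \
  | sed 's/Chr1/GCA_000001735#1#CP002684.1/g'
# -> GCA_000001735#1#CP002684.1	6144	6269	GCA_000001735#1#CP002684.1:6144-6269
```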

community 1 - chr2

awk '$1 == "Chr2"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr2/GCA_000001735#1#CP002685.1/g' > Pseudogenes_TAIR10_chr2.bed
head Pseudogenes_TAIR10_chr2.bed
---
GCA_000001735#1#CP002685.1  2629    2769    GCA_000001735#1#CP002685.1:2629-2769
GCA_000001735#1#CP002685.1  2939    3079    GCA_000001735#1#CP002685.1:2939-3079
GCA_000001735#1#CP002685.1  126312  126472  GCA_000001735#1#CP002685.1:126312-126472
GCA_000001735#1#CP002685.1  167479  167678  GCA_000001735#1#CP002685.1:167479-167678
GCA_000001735#1#CP002685.1  184200  184421  GCA_000001735#1#CP002685.1:184200-184421
GCA_000001735#1#CP002685.1  214567  214668  GCA_000001735#1#CP002685.1:214567-214668
GCA_000001735#1#CP002685.1  249464  249634  GCA_000001735#1#CP002685.1:249464-249634
GCA_000001735#1#CP002685.1  287831  287959  GCA_000001735#1#CP002685.1:287831-287959
GCA_000001735#1#CP002685.1  338842  339105  GCA_000001735#1#CP002685.1:338842-339105
GCA_000001735#1#CP002685.1  343391  343537  GCA_000001735#1#CP002685.1:343391-343537

community 2 - chr3

awk '$1 == "Chr3"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr3/GCA_000001735#1#CP002686.1/g' > Pseudogenes_TAIR10_chr3.bed
head Pseudogenes_TAIR10_chr3.bed
---
GCA_000001735#1#CP002686.1  1009    1200    GCA_000001735#1#CP002686.1:1009-1200
GCA_000001735#1#CP002686.1  430020  430130  GCA_000001735#1#CP002686.1:430020-430130
GCA_000001735#1#CP002686.1  484416  484737  GCA_000001735#1#CP002686.1:484416-484737
GCA_000001735#1#CP002686.1  630152  630235  GCA_000001735#1#CP002686.1:630152-630235
GCA_000001735#1#CP002686.1  792374  792475  GCA_000001735#1#CP002686.1:792374-792475
GCA_000001735#1#CP002686.1  799113  799223  GCA_000001735#1#CP002686.1:799113-799223
GCA_000001735#1#CP002686.1  832148  832279  GCA_000001735#1#CP002686.1:832148-832279
GCA_000001735#1#CP002686.1  863502  863939  GCA_000001735#1#CP002686.1:863502-863939
GCA_000001735#1#CP002686.1  864622  864843  GCA_000001735#1#CP002686.1:864622-864843
GCA_000001735#1#CP002686.1  987088  990672  GCA_000001735#1#CP002686.1:987088-990672

community 3 - chr4

awk '$1 == "Chr4"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr4/GCA_000001735#1#CP002687.1/g' > Pseudogenes_TAIR10_chr4.bed
head Pseudogenes_TAIR10_chr4.bed
---
GCA_000001735#1#CP002687.1  11253   11351   GCA_000001735#1#CP002687.1:11253-11351
GCA_000001735#1#CP002687.1  62136   62339   GCA_000001735#1#CP002687.1:62136-62339
GCA_000001735#1#CP002687.1  70243   70416   GCA_000001735#1#CP002687.1:70243-70416
GCA_000001735#1#CP002687.1  99056   99229   GCA_000001735#1#CP002687.1:99056-99229
GCA_000001735#1#CP002687.1  99355   99464   GCA_000001735#1#CP002687.1:99355-99464
GCA_000001735#1#CP002687.1  102138  102411  GCA_000001735#1#CP002687.1:102138-102411
GCA_000001735#1#CP002687.1  102614  103287  GCA_000001735#1#CP002687.1:102614-103287
GCA_000001735#1#CP002687.1  103362  103703  GCA_000001735#1#CP002687.1:103362-103703
GCA_000001735#1#CP002687.1  105289  105375  GCA_000001735#1#CP002687.1:105289-105375
GCA_000001735#1#CP002687.1  121936  122058  GCA_000001735#1#CP002687.1:121936-122058

community 4 - chr5

awk '$1 == "Chr5"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr5/GCA_000001735#1#CP002688.1/g' > Pseudogenes_TAIR10_chr5.bed
head Pseudogenes_TAIR10_chr5.bed
---
GCA_000001735#1#CP002688.1  17952   18035   GCA_000001735#1#CP002688.1:17952-18035
GCA_000001735#1#CP002688.1  18038   18085   GCA_000001735#1#CP002688.1:18038-18085
GCA_000001735#1#CP002688.1  453356  453511  GCA_000001735#1#CP002688.1:453356-453511
GCA_000001735#1#CP002688.1  453793  454342  GCA_000001735#1#CP002688.1:453793-454342
GCA_000001735#1#CP002688.1  498633  498719  GCA_000001735#1#CP002688.1:498633-498719
GCA_000001735#1#CP002688.1  595390  595590  GCA_000001735#1#CP002688.1:595390-595590
GCA_000001735#1#CP002688.1  597848  597991  GCA_000001735#1#CP002688.1:597848-597991
GCA_000001735#1#CP002688.1  660001  660384  GCA_000001735#1#CP002688.1:660001-660384
GCA_000001735#1#CP002688.1  672914  673057  GCA_000001735#1#CP002688.1:672914-673057
GCA_000001735#1#CP002688.1  679891  680126  GCA_000001735#1#CP002688.1:679891-680126

Inject genes from the reference into the pangenome graph

Genes

community 0 - chr1

odgi inject -i Arabidopsis.pg.input.community.0.og -b chr1.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.0.inject.adj.og -t 100 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.0.inject.adj.og -L | grep 'AT1G' > chr1.genenames.txt
# sanity check
wc -l chr1.genenames.txt
---
8771 chr1.genenames.txt
wc -l chr1.genes.adj.TAIR10.bed
---
8771 chr1.genes.adj.TAIR10.bed

community 1 - chr2

odgi inject -i Arabidopsis.pg.input.community.1.og -b chr2.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.1.inject.adj.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.1.inject.adj.og -L | grep 'AT2G' > chr2.genenames.txt
# sanity check
wc -l chr2.genenames.txt
---
5265 chr2.genenames.txt
wc -l chr2.genes.adj.TAIR10.bed
---
5265 chr2.genes.adj.TAIR10.bed

community 2 - chr3

odgi inject -i Arabidopsis.pg.input.community.2.og -b chr3.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.2.inject.adj.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.2.inject.adj.og -L | grep 'AT3G' > chr3.genenames.txt
# sanity check
wc -l chr3.genenames.txt
---
6544 chr3.genenames.txt
wc -l chr3.genes.adj.TAIR10.bed
---
6544 chr3.genes.adj.TAIR10.bed

community 3 - chr4

odgi inject -i Arabidopsis.pg.input.community.3.og -b chr4.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.3.inject.adj.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.3.inject.adj.og -L | grep 'AT4G' > chr4.genenames.txt
# sanity check
wc -l chr4.genenames.txt
---
5007 chr4.genenames.txt
wc -l chr4.genes.adj.TAIR10.bed
---
5007 chr4.genes.adj.TAIR10.bed

community 4 - chr5

odgi inject -i Arabidopsis.pg.input.community.4.og -b chr5.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.4.inject.adj.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.4.inject.adj.og -L | grep 'AT5G' > chr5.genenames.txt
# sanity check
wc -l chr5.genenames.txt
---
7469 chr5.genenames.txt
wc -l chr5.genes.adj.TAIR10.bed
---
7469 chr5.genes.adj.TAIR10.bed

Pseudogenes

community 0 - chr1

odgi inject -i ../Arabidopsis.pg.input.community.0.og -b Pseudogenes_TAIR10_chr1.bed -o Arabidopsis.pg.community.0.inject.pseudogenes.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.0.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002684.1:' > chr1.pseudogenenames.txt
# sanity check
wc -l chr1.pseudogenenames.txt
---
1083 chr1.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr1.bed
---
1083 Pseudogenes_TAIR10_chr1.bed

community 1 - chr2

odgi inject -i ../Arabidopsis.pg.input.community.1.og -b Pseudogenes_TAIR10_chr2.bed -o Arabidopsis.pg.community.1.inject.pseudogenes.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.1.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002685.1:' > chr2.pseudogenenames.txt
# sanity check
wc -l chr2.pseudogenenames.txt
---
904 chr2.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr2.bed
---
904 Pseudogenes_TAIR10_chr2.bed

community 2 - chr3

odgi inject -i ../Arabidopsis.pg.input.community.2.og -b Pseudogenes_TAIR10_chr3.bed -o Arabidopsis.pg.community.2.inject.pseudogenes.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.2.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002686.1:' > chr3.pseudogenenames.txt
# sanity check
wc -l chr3.pseudogenenames.txt
---
965 chr3.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr3.bed
---
965 Pseudogenes_TAIR10_chr3.bed

community 3 - chr4

odgi inject -i ../Arabidopsis.pg.input.community.3.og -b Pseudogenes_TAIR10_chr4.bed -o Arabidopsis.pg.community.3.inject.pseudogenes.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.3.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002687.1:' > chr4.pseudogenenames.txt
# sanity check
wc -l chr4.pseudogenenames.txt
---
804 chr4.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr4.bed
---
804 Pseudogenes_TAIR10_chr4.bed

community 4 - chr5

odgi inject -i ../Arabidopsis.pg.input.community.4.og -b Pseudogenes_TAIR10_chr5.bed -o Arabidopsis.pg.community.4.inject.pseudogenes.og -t 50 -P

Extract gene names from injected pangenome:

odgi paths -i Arabidopsis.pg.community.4.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002688.1:' > chr5.pseudogenenames.txt
# sanity check
wc -l chr5.pseudogenenames.txt
---
919 chr5.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr5.bed
---
919 Pseudogenes_TAIR10_chr5.bed
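The per-chromosome count checks can be sketched as one loop. This is a toy version with hypothetical two-line files; a real run would compare each chrN names file against its matching BED:

```shell
# Sanity-check loop: injected path counts must equal BED record counts.
workdir=$(mktemp -d)
printf 'g1\ng2\n' > "$workdir/chr1.names.txt"
printf 'r1\nr2\n' > "$workdir/chr1.bed"
for chr in chr1; do
    a=$(wc -l < "$workdir/$chr.names.txt")
    b=$(wc -l < "$workdir/$chr.bed")
    if [ "$a" -eq "$b" ]; then echo "$chr OK ($a)"; else echo "$chr MISMATCH ($a vs $b)"; fi
done
```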

Untangle genes across all the assemblies of the pangenome

Genes

community 0 - chr1

odgi stepindex -i Arabidopsis.pg.community.0.inject.adj.og -a 0 -o Arabidopsis.pg.community.0.inject.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.0.inject.adj.og -a Arabidopsis.pg.community.0.inject.index -R chr1.genenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.0.inject.paf

community 1 - chr2

odgi stepindex -i Arabidopsis.pg.community.1.inject.adj.og -a 0 -o Arabidopsis.pg.community.1.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.1.inject.adj.og -a Arabidopsis.pg.community.1.inject.index -R chr2.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.1.inject.paf

community 2 - chr3

odgi stepindex -i Arabidopsis.pg.community.2.inject.adj.og -a 0 -o Arabidopsis.pg.community.2.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.2.inject.adj.og -a Arabidopsis.pg.community.2.inject.index -R chr3.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.2.inject.paf

community 3 - chr4

odgi stepindex -i Arabidopsis.pg.community.3.inject.adj.og -a 0 -o Arabidopsis.pg.community.3.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.3.inject.adj.og -a Arabidopsis.pg.community.3.inject.index -R chr4.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.3.inject.paf

community 4 - chr5

odgi stepindex -i Arabidopsis.pg.community.4.inject.adj.og -a 0 -o Arabidopsis.pg.community.4.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.4.inject.adj.og -a Arabidopsis.pg.community.4.inject.index -R chr5.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.4.inject.paf

Pseudogenes

community 0 - chr1

odgi stepindex -i Arabidopsis.pg.community.0.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.0.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.0.inject.pseudogenes.og -a Arabidopsis.pg.community.0.inject.pseudogenes.index -R chr1.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.0.inject.paf

community 1 - chr2

odgi stepindex -i Arabidopsis.pg.community.1.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.1.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.1.inject.pseudogenes.og -a Arabidopsis.pg.community.1.inject.pseudogenes.index -R chr2.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.1.inject.paf

community 2 - chr3

odgi stepindex -i Arabidopsis.pg.community.2.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.2.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.2.inject.pseudogenes.og -a Arabidopsis.pg.community.2.inject.pseudogenes.index -R chr3.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.2.inject.paf

community 3 - chr4

odgi stepindex -i Arabidopsis.pg.community.3.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.3.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.3.inject.pseudogenes.og -a Arabidopsis.pg.community.3.inject.pseudogenes.index -R chr4.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.3.inject.paf

community 4 - chr5

odgi stepindex -i Arabidopsis.pg.community.4.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.4.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.4.inject.pseudogenes.og -a Arabidopsis.pg.community.4.inject.pseudogenes.index -R chr5.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.4.inject.paf

Untangle PAF processing

collapse and merge

community 0 - chr1

# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.0.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20

# pseudogenes
# The pseudogene paf needs one extra step before processing, due to the way
# we named pseudogenes: we must exclude the alignments of pseudogenes against
# each other. For genes this happens automatically, because their name
# pattern does not match the reference name.
grep -v '^GCA_000001735#1#CP002684.1:' Arabidopsis.pg.community.0.inject.paf > Arabidopsis.pg.community.0.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.0.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
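A toy check of this self-alignment filter (hypothetical PAF rows, truncated to a few columns): rows whose query name starts with the reference prefix are dropped.

```shell
# Drop pseudogene-vs-pseudogene alignments by reference-prefix match.
workdir=$(mktemp -d)
printf 'GCA_000001735#1#CP002684.1:6144-6269\t125\t0\t125\nACC2#1#ctg1\t125\t0\t125\n' > "$workdir/toy.paf"
grep -v '^GCA_000001735#1#CP002684.1:' "$workdir/toy.paf" > "$workdir/toy.filtered.paf"
wc -l < "$workdir/toy.filtered.paf"   # 1: only the ACC2 row remains
```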

community 1 - chr2

```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.1.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# The pseudogene PAF needs one extra step before processing, due to the way we named
# pseudogenes: we must exclude alignments of pseudogenes against each other. This happens
# automatically for genes, because their naming pattern doesn't match the reference name.
grep -v '^GCA_000001735#1#CP002685.1:' Arabidopsis.pg.community.1.inject.paf > Arabidopsis.pg.community.1.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.1.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

community 2 - chr3

```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.2.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# The pseudogene PAF needs one extra step before processing, due to the way we named
# pseudogenes: we must exclude alignments of pseudogenes against each other. This happens
# automatically for genes, because their naming pattern doesn't match the reference name.
grep -v '^GCA_000001735#1#CP002686.1:' Arabidopsis.pg.community.2.inject.paf > Arabidopsis.pg.community.2.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.2.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

community 3 - chr4

```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.3.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# The pseudogene PAF needs one extra step before processing, due to the way we named
# pseudogenes: we must exclude alignments of pseudogenes against each other. This happens
# automatically for genes, because their naming pattern doesn't match the reference name.
grep -v '^GCA_000001735#1#CP002687.1:' Arabidopsis.pg.community.3.inject.paf > Arabidopsis.pg.community.3.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.3.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

community 4 - chr5

```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.4.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# The pseudogene PAF needs one extra step before processing, due to the way we named
# pseudogenes: we must exclude alignments of pseudogenes against each other. This happens
# automatically for genes, because their naming pattern doesn't match the reference name.
grep -v '^GCA_000001735#1#CP002688.1:' Arabidopsis.pg.community.4.inject.paf > Arabidopsis.pg.community.4.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.4.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

genes classification

community 0 - chr1

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify correctness of matrix:

```
Maximum value: 49.0 found for Feature: AT1G40104 and Assembly: GCA_949796415
```

```
grep 'AT1G40104' filter_passed_features.csv | grep 'GCA_949796415' | wc -l
49
```

community 1 - chr2

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify correctness of matrix:

```
Maximum value: 129.0 found for Feature: AT2G01020 and Assembly: GCA_946412065
```

```
grep 'AT2G01020' filter_passed_features.csv | grep 'GCA_946412065' | wc -l
129
```

community 2 - chr3

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify correctness of matrix:

```
Maximum value: 42.0 found for Feature: AT3G41761 and Assembly: GCA_933208065
```

```
grep 'AT3G41761' filter_passed_features.csv | grep 'GCA_933208065' | wc -l
42
```

community 3 - chr4

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify correctness of matrix:

```
Maximum value: 6.0 found for Feature: AT4G20680 and Assembly: GCA_949796755
```

```
grep 'AT4G20680' filter_passed_features.csv | grep 'GCA_949796755' | wc -l
6
```

community 4 - chr5

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify correctness of matrix:

```
Maximum value: 8.0 found for Feature: AT5G05050 and Assembly: GCA_949796525
```

```
grep 'AT5G05050' filter_passed_features.csv | grep 'GCA_949796525' | wc -l
8
```
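Each of the checks above counts the rows of filter_passed_features.csv that mention both the reported feature and assembly, and compares the count with the maximum matrix value. A toy version of the same check, on a hypothetical three-row CSV:

```shell
# Toy version of the matrix check above, on a hypothetical three-row CSV:
# the matrix cell for a (feature, assembly) pair should equal the number of
# rows containing both identifiers.
n=$(printf 'AT1G40104,GCA_949796415\nAT1G40104,GCA_949796415\nAT1G40104,GCA_000001735\n' \
  | grep 'AT1G40104' | grep -c 'GCA_949796415')
echo "$n"   # 2 matching rows
```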

pseudogenes classification

community 0 - chr1

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002684.1' -ref 'GCA_000001735'
```

community 1 - chr2

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002685.1' -ref 'GCA_000001735'
```

community 2 - chr3

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002686.1' -ref 'GCA_000001735'
```

community 3 - chr4

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002687.1' -ref 'GCA_000001735'
```

community 4 - chr5

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002688.1' -ref 'GCA_000001735'
```
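The five classification runs above differ only in the -chr accession (CP002684.1 through CP002688.1, i.e. chr1 through chr5). A dry-run loop sketch that prints the commands instead of executing them:

```shell
# Dry-run sketch of the five pseudogene classification runs above: only the -chr
# accession changes per community. Printed, not executed.
cmds=$(for i in 0 1 2 3 4; do
  acc="CP00268$((i + 4)).1"
  echo "~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr '${acc}' -ref 'GCA_000001735'"
done)
printf '%s\n' "$cmds"
```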

Pangenome annotation - non-reference sequences

Get the FASTA file for annotation

community 0 - chr1

1. Extract non-reference sequences

  • find the reference paths using odgi paths + grep 'reference_name' and place them in reference.txt
  • extract non-reference sequences using odgi paths with the flag --non-reference-ranges

```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.0.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.0.og --non-reference-ranges reference.txt > non_reference.bed
```

2. Filter non_reference.bed

The filter_private_bed.py script is in the scripts directory. It takes the output from the previous command and separates sequences shorter than a given threshold from sequences longer than the threshold. It also prints the distribution of path lengths to stdout and to the log file.

```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
            1 16361432
         2-50  3136982
        51-99   168284
      100-499    93759
      500-999    22737
    1000-4999    25706
    5000-9999     5370
  10000-19999     1905
  20000-99999     1253
100000-499999      280
500000-999999        7
    from_1Mbp        0
```

3. Apply bedtools slop and merge

Bedtools needs a genome file for slop. We should use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend regions smaller than 500 bp by 500 bp at both ends.

  • filtered_out.bed is the bed file containing all the regions smaller than 500 bp
  • non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp

I will enlarge only the sequences contained in filtered_out.bed

```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004359.1      0       789
GCA_001651475#1#CM004359.1      0       882
GCA_001651475#1#CM004359.1      0       887
GCA_001651475#1#CM004359.1      0       912
GCA_001651475#1#CM004359.1      0       1013
GCA_001651475#1#CM004359.1      36      1037
GCA_001651475#1#CM004359.1      104     1105
GCA_001651475#1#CM004359.1      752     1753
GCA_001651475#1#CM004359.1      933     1934
GCA_001651475#1#CM004359.1      997     1998

head filtered_out.bed
---
GCA_001651475#1#CM004359.1      0       289
GCA_001651475#1#CM004359.1      381     382
GCA_001651475#1#CM004359.1      385     387
GCA_001651475#1#CM004359.1      411     412
GCA_001651475#1#CM004359.1      413     513
GCA_001651475#1#CM004359.1      536     537
GCA_001651475#1#CM004359.1      604     605
GCA_001651475#1#CM004359.1      1252    1253
GCA_001651475#1#CM004359.1      1433    1434
GCA_001651475#1#CM004359.1      1497    1498
```

Right, now the smaller regions have been expanded. Now I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed

Now we can apply bedtools merge to exclude overlappings and join close sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr1_for_annotation.bed
```
  • -d 100 means that two features separated by a gap of up to 100 bp are merged into one.
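To make the -d semantics concrete, here is a toy awk emulation (not bedtools itself) of merging sorted, single-chromosome intervals whose gap is at most 100 bp:

```shell
# Toy emulation of `bedtools merge -d 100` on a sorted, single-chromosome BED:
# [0,500] and [550,900] are 50 bp apart, so they merge; [1200,1300] is 300 bp
# away from the merged block, so it stays separate.
merged=$(printf 'chr1\t0\t500\nchr1\t550\t900\nchr1\t1200\t1300\n' \
  | awk -v d=100 'BEGIN{OFS="\t"}
      NR==1 {c=$1; s=$2; e=$3; next}
      $1==c && $2<=e+d {if ($3>e) e=$3; next}
      {print c,s,e; c=$1; s=$2; e=$3}
      END{print c,s,e}')
printf '%s\n' "$merged"
```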

4. Extract the FASTA

Now, with this bed file we should be able to extract the fasta that we want to annotate.

bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr1_for_annotation.bed -fo non_reference_chr1_for_annotation.fasta -name
```
head non_reference_chr1_for_annotation.fasta
---
>::GCA_001651475#1#CM004359.1:0-1998
GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGG...
>::GCA_001651475#1#CM004359.1:6363-7364
TATATGTAATAATAATTAGTGCATCGTTTTGTGGTGTAGTTTATATAAATAAAGTGA...
>::GCA_001651475#1#CM004359.1:8710-9736
AGCTAGAATCAGACAGGAACTAGCAATGCTTGAAATCAAGAACTTGAATTGAAATAG...
>::GCA_001651475#1#CM004359.1:10945-11946
TTTAACACTTAATTAACGCAATCTTACCATATATTCATGGACTACATGCAGAATAGT...
>::GCA_001651475#1#CM004359.1:13387-14389
CTGCTGCTTGCTCCGATGTTGGAGGTTAATTCTTGGTCTCTGTCTTGTTCTTGGTCG...
```

This is the fasta for annotation.

remove :: from the fasta headers (they have been automatically added):

sed -i 's/^>::/>/' non_reference_chr1_for_annotation.fasta

community 1 - chr2

1. Extract non-reference sequences

  • find the reference paths using odgi paths + grep 'reference_name' and place them in reference.txt
  • extract non-reference sequences using odgi paths with the flag --non-reference-ranges

```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.1.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.1.og --non-reference-ranges reference.txt > non_reference.bed
```

2. Filter non_reference.bed

The filter_private_bed.py script is in the scripts folder. It takes the output from the previous command and separates sequences shorter than a given threshold from longer sequences. It also prints the distribution of path lengths to stdout and to the log file.

```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
            1 11130257
         2-50  2121266
        51-99   119889
      100-499    58584
      500-999    15457
    1000-4999    19663
    5000-9999     4421
  10000-19999     1694
  20000-99999     1209
100000-499999      256
500000-999999       11
    from_1Mbp        0
```

3. Apply bedtools slop and merge

Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend regions smaller than 500 bp by 500 bp at both ends.

  • filtered_out.bed is the bed file containing all the regions smaller than 500 bp
  • non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp

I will enlarge only the sequences contained in filtered_out.bed

```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004360.1      0       614
GCA_001651475#1#CM004360.1      0       703
GCA_001651475#1#CM004360.1      0       780
GCA_001651475#1#CM004360.1      0       796
GCA_001651475#1#CM004360.1      0       966
GCA_001651475#1#CM004360.1      284     1285
GCA_001651475#1#CM004360.1      516     1517
GCA_001651475#1#CM004360.1      564     1565
GCA_001651475#1#CM004360.1      674     1675
GCA_001651475#1#CM004360.1      972     1973

head filtered_out.bed
---
GCA_001651475#1#CM004360.1      113     114
GCA_001651475#1#CM004360.1      202     203
GCA_001651475#1#CM004360.1      279     280
GCA_001651475#1#CM004360.1      295     296
GCA_001651475#1#CM004360.1      465     466
GCA_001651475#1#CM004360.1      784     785
GCA_001651475#1#CM004360.1      1016    1017
GCA_001651475#1#CM004360.1      1064    1065
GCA_001651475#1#CM004360.1      1174    1175
GCA_001651475#1#CM004360.1      1472    1473
```

Right, now the smaller regions have been expanded. Now I will join the expanded regions (expanded_regions.bed) with the regions that were already above 500 bp (non_reference_filtered.bed):

cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed

Now we can apply bedtools merge to exclude overlappings and join close sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr2_for_annotation.bed
```
  • -d 100 means that two features separated by a gap of up to 100 bp are merged into one.

4. Extract the FASTA

Now, with this bed file we should be able to extract the fasta that we want to annotate.

bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr2_for_annotation.bed -fo non_reference_chr2_for_annotation.fasta -name

remove :: from the fasta headers (they have been automatically added):

sed -i 's/^>::/>/' non_reference_chr2_for_annotation.fasta

community 2 - chr3

1. Extract non-reference sequences

  • find the reference paths using odgi paths + grep 'reference_name' and place them in reference.txt
  • extract non-reference sequences using odgi paths with the flag --non-reference-ranges

```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.2.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.2.og --non-reference-ranges reference.txt > non_reference.bed
```

2. Filter non_reference.bed

The filter_private_bed.py script is in the scripts folder. It takes the output from the previous command and separates sequences shorter than a given threshold from longer sequences. It also prints the distribution of path lengths to stdout and to the log file.

```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
            1 12646306
         2-50  2437703
        51-99   107639
      100-499    59027
      500-999    17275
    1000-4999    23841
    5000-9999     5898
  10000-19999     2585
  20000-99999     1596
100000-499999      324
500000-999999       12
    from_1Mbp        1
```

3. Apply bedtools slop and merge

Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb.

I will extend regions smaller than 500 bp by 500 bp at both ends.

  • filtered_out.bed is the bed file containing all the regions smaller than 500 bp
  • non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp

I will enlarge only the sequences contained in filtered_out.bed

```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004361.1      0       594
GCA_001651475#1#CM004361.1      0       677
GCA_001651475#1#CM004361.1      0       734
GCA_001651475#1#CM004361.1      0       742
GCA_001651475#1#CM004361.1      0       800
GCA_001651475#1#CM004361.1      0       808
GCA_001651475#1#CM004361.1      0       816
GCA_001651475#1#CM004361.1      0       823
GCA_001651475#1#CM004361.1      0       837
GCA_001651475#1#CM004361.1      0       841

head filtered_out.bed
---
GCA_001651475#1#CM004361.1      92      94
GCA_001651475#1#CM004361.1      168     177
GCA_001651475#1#CM004361.1      233     234
GCA_001651475#1#CM004361.1      241     242
GCA_001651475#1#CM004361.1      292     300
GCA_001651475#1#CM004361.1      306     308
GCA_001651475#1#CM004361.1      315     316
GCA_001651475#1#CM004361.1      322     323
GCA_001651475#1#CM004361.1      336     337
GCA_001651475#1#CM004361.1      340     341
```

Right, now the smaller regions have been expanded. Now I will join the expanded regions (expanded_regions.bed) with the regions that were already above 500 bp (non_reference_filtered.bed):

cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed

Now we can apply bedtools merge to exclude overlappings and join close sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr3_for_annotation.bed
```
  • -d 100 means that two features separated by a gap of up to 100 bp are merged into one.

4. Extract the FASTA

Now, with this bed file we should be able to extract the fasta that we want to annotate.

bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr3_for_annotation.bed -fo non_reference_chr3_for_annotation.fasta -name

remove :: from the fasta headers (they have been automatically added):

sed -i 's/^>::/>/' non_reference_chr3_for_annotation.fasta

community 3 - chr4

1. Extract non-reference sequences

  • find the reference paths using odgi paths + grep 'reference_name' and place them in reference.txt
  • extract non-reference sequences using odgi paths with the flag --non-reference-ranges

```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.3.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.3.og --non-reference-ranges reference.txt > non_reference.bed
```

2. Filter non_reference.bed

The filter_private_bed.py script is in the scripts folder. It takes the output from the previous command and separates sequences shorter than a given threshold from longer sequences. It also prints the distribution of path lengths to stdout and to the log file.

```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
            1 11148357
         2-50  2200442
        51-99   112329
      100-499    70063
      500-999    15581
    1000-4999    19389
    5000-9999     4439
  10000-19999     1948
  20000-99999     1331
100000-499999      392
500000-999999       25
    from_1Mbp        2
```

3. Apply bedtools slop and merge

Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend regions smaller than 500 bp by 500 bp at both ends.

  • filtered_out.bed is the bed file containing all the regions smaller than 500 bp
  • non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp

I will enlarge only the sequences contained in filtered_out.bed

```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004362.1      4567    5570
GCA_001651475#1#CM004362.1      4571    5574
GCA_001651475#1#CM004362.1      4575    5584
GCA_001651475#1#CM004362.1      4586    5612
GCA_001651475#1#CM004362.1      4614    5616
GCA_001651475#1#CM004362.1      4617    5619
GCA_001651475#1#CM004362.1      4620    5623
GCA_001651475#1#CM004362.1      4624    5628
GCA_001651475#1#CM004362.1      4630    5662
GCA_001651475#1#CM004362.1      4673    5674

head filtered_out.bed
---
GCA_001651475#1#CM004362.1      5067    5070
GCA_001651475#1#CM004362.1      5071    5074
GCA_001651475#1#CM004362.1      5075    5084
GCA_001651475#1#CM004362.1      5086    5112
GCA_001651475#1#CM004362.1      5114    5116
GCA_001651475#1#CM004362.1      5117    5119
GCA_001651475#1#CM004362.1      5120    5123
GCA_001651475#1#CM004362.1      5124    5128
GCA_001651475#1#CM004362.1      5130    5162
GCA_001651475#1#CM004362.1      5173    5174
```

Right, now the smaller regions have been expanded. Now I will join the expanded regions (expanded_regions.bed) with the regions that were already above 500 bp (non_reference_filtered.bed):

cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed

Now we can apply bedtools merge to exclude overlappings and join close sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr4_for_annotation.bed
```
  • -d 100 means that two features separated by a gap of up to 100 bp are merged into one.

4. Extract the FASTA

Now, with this bed file we should be able to extract the fasta that we want to annotate.

bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr4_for_annotation.bed -fo non_reference_chr4_for_annotation.fasta -name

remove :: from the fasta headers (they have been automatically added):

sed -i 's/^>::/>/' non_reference_chr4_for_annotation.fasta

community 4 - chr5

1. Extract non-reference sequences

  • find the reference paths using odgi paths + grep 'reference_name' and place them in reference.txt
  • extract non-reference sequences using odgi paths with the flag --non-reference-ranges

```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.4.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.4.og --non-reference-ranges reference.txt > non_reference.bed
```

2. Filter non_reference.bed

The filter_private_bed.py script is in the scripts folder. It takes the output from the previous command and separates sequences shorter than a given threshold from longer sequences. It also prints the distribution of path lengths to stdout and to the log file.

```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
            1 15088021
         2-50  2535317
        51-99    98407
      100-499   158317
      500-999    21568
    1000-4999    23741
    5000-9999     6136
  10000-19999     2140
  20000-99999     1024
100000-499999      180
500000-999999       10
    from_1Mbp        1
```

3. Apply bedtools slop and merge

Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend regions smaller than 500 bp by 500 bp at both ends.

  • filtered_out.bed is the bed file containing all the regions smaller than 500 bp
  • non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp

I will enlarge only the sequences contained in filtered_out.bed

```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004363.1      0       501
GCA_001651475#1#CM004363.1      0       503
GCA_001651475#1#CM004363.1      0       508
GCA_001651475#1#CM004363.1      0       510
GCA_001651475#1#CM004363.1      0       515
GCA_001651475#1#CM004363.1      0       517
GCA_001651475#1#CM004363.1      0       522
GCA_001651475#1#CM004363.1      0       524
GCA_001651475#1#CM004363.1      0       529
GCA_001651475#1#CM004363.1      0       531

head filtered_out.bed
---
GCA_001651475#1#CM004363.1      0       1
GCA_001651475#1#CM004363.1      2       3
GCA_001651475#1#CM004363.1      6       8
GCA_001651475#1#CM004363.1      9       10
GCA_001651475#1#CM004363.1      14      15
GCA_001651475#1#CM004363.1      16      17
GCA_001651475#1#CM004363.1      21      22
GCA_001651475#1#CM004363.1      23      24
GCA_001651475#1#CM004363.1      28      29
GCA_001651475#1#CM004363.1      30      31
```

Right, now the smaller regions have been expanded. Now I will join the expanded regions (expanded_regions.bed) with the regions that were already above 500 bp (non_reference_filtered.bed):

cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed

Now we can apply bedtools merge to exclude overlappings and join close sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr5_for_annotation.bed
```
  • -d 100 means that two features separated by a gap of up to 100 bp are merged into one.

4. Extract the FASTA

Now, with this bed file we should be able to extract the fasta that we want to annotate.

bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr5_for_annotation.bed -fo non_reference_chr5_for_annotation.fasta -name

remove :: from the fasta headers (they have been automatically added):

sed -i 's/^>::/>/' non_reference_chr5_for_annotation.fasta
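The chr1 to chr5 workflows above are identical apart from the community index and the chromosome label. As a sketch, a dry-run loop can print the full command sequence for each community; drop the `echo`s to execute:

```shell
# Dry-run sketch of the per-community extraction workflow above (community i
# holds chr(i+1)). Printed, not executed.
cmds=$(for i in 0 1 2 3 4; do
  chr="chr$((i + 1))"
  echo "odgi paths -i ../Arabidopsis.pg.input.community.${i}.og -L | grep 'GCA_000001735' > reference.txt"
  echo "odgi paths -i ../Arabidopsis.pg.input.community.${i}.og --non-reference-ranges reference.txt > non_reference.bed"
  echo "~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500"
  echo "bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed"
  echo "cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed"
  echo "bedtools merge -i slop_final.bed -d 100 > non_reference_${chr}_for_annotation.bed"
  echo "bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_${chr}_for_annotation.bed -fo non_reference_${chr}_for_annotation.fasta -name"
  echo "sed -i 's/^>::/>/' non_reference_${chr}_for_annotation.fasta"
done)
printf '%s\n' "$cmds"
```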

Annotation

Get UniRef_100 and format as diamond database

```
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz
# get taxonomy files
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
# format db
diamond makedb --in uniref100.fasta --taxonmap prot.accession2taxid.FULL.gz --taxonnodes nodes.dmp --taxonnames names.dmp -d uniref100 --threads 20
```

Download conversion table between UniProt names and Araport

We need to download the conversion table between UniProt names and Araport from the TAIR10 website: https://www.arabidopsis.org/download/file?path=Proteins/Id_conversions/TAIR2UniprotMapping.txt. We will need this table to use the parse_exonerate_results.py script.

Extract attributes from UniRef100 FASTA headers

For the same script, we also need to extract information from the UniRef100 headers:

grep '>' uniref100.fasta > headers_uniref100
sed -i 's/^>//' headers_uniref100
awk '{split($0, a, " "); printf "%s\t", a[1]; for (i=2; i<=NF; i++) printf "%s%s", $i, (i==NF ? "" : "_"); print ""}' headers_uniref100 > info_headers.txt
rm headers_uniref100
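What the awk one-liner produces, sketched in Python: the first token of each header is kept as the ID, and the remaining tokens are joined with underscores (the example header is hypothetical):

```python
def header_to_info(header):
    """First token kept as ID, tab, remaining tokens joined with '_'."""
    fields = header.split()
    return fields[0] + "\t" + "_".join(fields[1:])

print(header_to_info("UniRef100_P12345 Some protein n=1 Tax=Arabidopsis"))
# -> UniRef100_P12345	Some_protein_n=1_Tax=Arabidopsis
```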

community 0 - chr1

1. Diamond

diamond blastx -d uniref100 -q non_reference_chr1_for_annotation.fasta -o diamond_chr1_top0_results.tsv --range-culling --top 0 -F 15 -f 6  qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

2. Exonerate

We want to refine the annotations found by diamond blastx rather than repeat the whole analysis with exonerate. To do so, we prepare some input files and run exonerate using the PW_exonerate.pl script.

To speed up the analysis, we split the diamond results into chunks and, with the help of seqtk, create the protein and sequence FASTA files for each chunk, so that exonerate (which doesn't support multithreading) can be run in parallel.

For each chunk we need to create:

  • a file containing the full sequences in which diamond found proteins
  • a file containing the database proteins that matched our sequences.

To automate this step I created the split_and_extract_for_exonerate.sh script. It splits the diamond results into chunks, and for each chunk seqtk extracts the FASTA of the proteins and the FASTA of the sequences.
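The chunking idea can be sketched as follows (a simplification of the script's logic, assuming diamond's tab-separated output with qseqid and sseqid in the first two columns; all names here are ours):

```python
def chunk_ids(diamond_rows, chunk_size):
    """For each chunk of diamond rows, yield the query and protein ID lists
    that would be passed to seqtk subseq."""
    for i in range(0, len(diamond_rows), chunk_size):
        chunk = diamond_rows[i:i + chunk_size]
        qseqids = sorted({row.split("\t")[0] for row in chunk})
        sseqids = sorted({row.split("\t")[1] for row in chunk})
        yield qseqids, sseqids

rows = ["q1\tUniRef100_A", "q1\tUniRef100_B", "q2\tUniRef100_A"]
for seqs, prots in chunk_ids(rows, 2):
    print(seqs, prots)
```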

split diamond results in chunks

Here's how we can use it:

~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr1_top0_results.tsv -l 100000 -s ../non_reference_chr1_for_annotation.fasta -p uniref100.fasta
run exonerate

Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (change suffix for each chunk):

perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

# join all the results together
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff exonerate_results_aj.gff exonerate_results_ak.gff > chr1_exonerate_results.gff

# remove chunks to save space
rm exonerate_results_*

3. Parse exonerate results

To parse the results of exonerate we are going to use parse_exonerate.py.

# extract C4 alignments
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr1_exonerate_results.gff > chr1_C4_alignments.txt

# Parse the results
~/software/scripts/parse_exonerate.py -e chr1_exonerate_results.gff -c chr1_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt
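The awk range pattern can be mirrored in Python, which makes the selection rule explicit: keep lines from each "C4 Alignment:" marker up to, but excluding, the next "vulgar" line:

```python
def extract_c4(lines):
    """Keep lines from 'C4 Alignment:' up to, but excluding, 'vulgar'."""
    out, keep = [], False
    for line in lines:
        if "C4 Alignment:" in line:
            keep = True
        if "vulgar" in line:
            keep = False
        if keep:
            out.append(line)
    return out

print(extract_c4(["noise", "C4 Alignment:", "  query: AT1G...", "vulgar: ...", "noise"]))
# -> ['C4 Alignment:', '  query: AT1G...']
```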

community 1 - chr2

1. Diamond

diamond blastx -d uniref100 -q non_reference_chr2_for_annotation.fasta -o diamond_chr2_top0_results.tsv --range-culling --top 0 -F 15 -f 6  qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

2. Exonerate

split diamond results in chunks
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr2_top0_results.tsv -l 100000 -s ../non_reference_chr2_for_annotation.fasta -p uniref100.fasta
run exonerate

Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing the suffix for each chunk):

perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff > chr2_exonerate_results.gff
rm exonerate_results_*

3. Parse exonerate results

awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr2_exonerate_results.gff > chr2_C4_alignments.txt
~/software/scripts/parse_exonerate.py -e chr2_exonerate_results.gff -c chr2_C4_alignments.txt -t ~/Arabidopsis/pangenome/ara_pan_from_April/TAIR2UniprotMapping.txt -i info_headers.txt

community 2 - chr3

1. Diamond

diamond blastx -d uniref100 -q non_reference_chr3_for_annotation.fasta -o diamond_chr3_top0_results.tsv --range-culling --top 0 -F 15 -f 6  qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

2. Exonerate

split diamond results in chunks
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr3_top0_results.tsv -l 100000 -s ../non_reference_chr3_for_annotation.fasta -p uniref100.fasta
run exonerate

Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (change suffix for each chunk):

perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff > chr3_exonerate_results.gff
rm exonerate_results_*

3. Parse exonerate results

awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr3_exonerate_results.gff > chr3_C4_alignments.txt
~/software/scripts/parse_exonerate.py -e chr3_exonerate_results.gff -c chr3_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt

community 3 - chr4

1. Diamond

diamond blastx -d uniref100 -q non_reference_chr4_for_annotation.fasta -o diamond_chr4_top0_results.tsv --range-culling --top 0 -F 15 -f 6  qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

2. Exonerate

split diamond results in chunks
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr4_top0_results.tsv -l 100000 -s ../non_reference_chr4_for_annotation.fasta -p uniref100.fasta
run exonerate

Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing suffix for each chunk):

perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff > chr4_exonerate_results.gff
rm exonerate_results_*

3. Parse exonerate results

awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr4_exonerate_results.gff > chr4_C4_alignments.txt
~/software/scripts/parse_exonerate.py -e chr4_exonerate_results.gff -c chr4_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt

community 4 - chr5

1. Diamond

diamond blastx -d uniref100 -q non_reference_chr5_for_annotation.fasta -o diamond_chr5_top0_results.tsv --range-culling --top 0 -F 15 -f 6  qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

2. Exonerate

split diamond results in chunks
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr5_top0_results.tsv -l 100000 -s ../non_reference_chr5_for_annotation.fasta -p uniref100.fasta
run exonerate

Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing suffix for each chunk):

perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff exonerate_results_aj.gff > chr5_exonerate_results.gff
rm exonerate_results_*

3. Parse exonerate results

awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr5_exonerate_results.gff > chr5_C4_alignments.txt
~/software/scripts/parse_exonerate.py -e chr5_exonerate_results.gff -c chr5_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt

Annotation review

These are the result files we get from parse_exonerate.py:

  • corresponding_to_tair.tsv --> UniRef100 annotations that certainly have a corresponding TAIR10 annotation. From these, we need to exclude UniRef genes/pseudogenes that do not have a unique corresponding TAIR locus, as they may lead to incorrect statistics and downstream analyses.
  • new_results_not_tair.tsv --> the "new genes" found with this analysis, i.e. those without a corresponding TAIR10 annotation. To make the results robust, we filter them to keep only hits to reviewed proteins (i.e. those from UniProtKB/Swiss-Prot), since UniRef also contains unreviewed proteins.

Review results that have a corresponding TAIR annotation

Since the gene names were grouped by UniProt ID, and multiple corresponding TAIR names were concatenated into a single comma-separated field, we can simply remove the lines that contain a comma in the field holding the TAIR names:

# chr1
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
# chr2
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
# chr3
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
# chr4
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
# chr5
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
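The same filter expressed in Python, assuming (as in the awk command) that column 26 holds the comma-joined TAIR names; the toy rows are illustrative:

```python
def has_unique_tair(row, col=26):
    """True if the TAIR-names field (1-based column `col`) has no comma,
    i.e. the UniRef entry maps to exactly one TAIR locus."""
    fields = row.split("\t")
    return "," not in fields[col - 1]

rows = ["\t".join(["x"] * 25 + ["AT1G01010"]),
        "\t".join(["x"] * 25 + ["AT1G01010,AT1G01020"])]
print([r.split("\t")[25] for r in rows if has_unique_tair(r)])
# -> ['AT1G01010']
```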

Review results with no corresponding TAIR annotation

We need to review new_results_not_tair.tsv by checking on UniProt whether the proteins have been reviewed or not.

We will proceed in this way:

  • extract the unique UniRef IDs from new_results_not_tair.tsv (for genes and pseudogenes; unclassified entries are excluded)
  • go to UniProt ID mapping
  • select "from UniRef100 to UniProtKB/Swiss-Prot" --> this automatically returns only reviewed results
  • upload the ID list and submit the job
  • customise the results (TSV format) to also get Gene Ontology info (we will need this later)
  • be sure to include info about protein review status (it's under miscellaneous options)
  • also include the taxonomic lineage

community 0 - chr1

grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr1.txt

The script review_exonerate_results.py will drop all the results coming from proteins that were not reviewed.

~/software/scripts/review_exonerate_results.py -i chr1_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv
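A minimal sketch of what the review step does conceptually (the actual logic lives in review_exonerate_results.py; the column layout and file contents below are assumptions for illustration):

```python
def keep_reviewed(annotation_rows, reviewed_ids):
    """Keep rows whose UniRef ID (assumed first column) maps to a reviewed protein."""
    return [row for row in annotation_rows
            if row.split("\t")[0] in reviewed_ids]

reviewed = {"UniRef100_P12345"}  # IDs returned by the Swiss-Prot mapping job
rows = ["UniRef100_P12345\tgene", "UniRef100_Q99999\tgene"]
print(keep_reviewed(rows, reviewed))
# -> ['UniRef100_P12345\tgene']
```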

community 1 - chr2

grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr2.txt
~/software/scripts/review_exonerate_results.py -i chr2_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

community 2 - chr3

grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr3.txt
~/software/scripts/review_exonerate_results.py -i chr3_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

community 3 - chr4

grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr4.txt
~/software/scripts/review_exonerate_results.py -i chr4_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

community 4 - chr5

grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr5.txt
~/software/scripts/review_exonerate_results.py -i chr5_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

Final annotation screening

Now, we can screen the whole pangenome, using all the data coming from exonerate and odgi untangle.

genes

We'll need the following files:

  • filter_passed_features.csv --> obtained from core_dispensable_genes.py
  • pangenome_screening.csv --> obtained from core_dispensable_genes.py
  • reviewed_corresponding_to_tair.tsv --> obtained from parse_exonerate.py and reviewed with review_exonerate_results.py
  • reviewed_not_tair_results.tsv --> obtained from parse_exonerate.py and reviewed with review_exonerate_results.py

community 0 - chr1

~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 1 - chr2

~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 2 - chr3

~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 3 - chr4

~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 4 - chr5

~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

pseudogenes

We'll need the following files:

  • filter_passed_features.csv --> obtained from core_dispensable_pseudogenes.py
  • pangenome_screening.csv --> obtained from core_dispensable_pseudogenes.py
  • reviewed_corresponding_to_tair.tsv --> obtained from parse_exonerate.py and reviewed with review_exonerate_results.py
  • reviewed_not_tair_results.tsv --> obtained from parse_exonerate.py and reviewed with review_exonerate_results.py

community 0 - chr1

~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 1 - chr2

~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 2 - chr3

~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 3 - chr4

~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

community 4 - chr5

~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75

Gene ontology enrichment analysis

The universe for A. thaliana was obtained from: https://www.arabidopsis.org/download/list?dir=GO_and_PO_Annotations

The analysis was then performed using TopGO, as shown in scripts/TopGO.R.

Sequence-based pangenome analysis

Graphs characteristics

To obtain information about the structure of each graph, we can use the commandodgi stats:

odgi stats -i Arabidopsis.pg.input.community.0.og -m > chr1_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.1.og -m > chr2_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.2.og -m > chr3_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.3.og -m > chr4_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.4.og -m > chr5_odgi_stats.yaml

Obtain node length

We can use odgi paths --coverage-levels as in the following example:

odgi paths -i Arabidopsis.pg.input.community.0.og --coverage-levels 2,75,93 > chr1_nodes_classification.txt

odgi paths --coverage-levels classifies the nodes into genomic classes. In our case, we could not use this command for classification because the number of paths did not correspond to the number of assemblies included in the pangenome.
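The class boundaries used throughout this analysis (coverage levels 2, 75, 93, matching the -a 93 -s 75 flags above) can be sketched as a simple rule; the exact tie-breaking in our scripts may differ:

```python
def classify_node(n_assemblies, total=93, softcore=75):
    """Assign a genomic class from the number of assemblies covering a node."""
    if n_assemblies == total:
        return "core"
    if n_assemblies >= softcore:
        return "softcore"
    if n_assemblies == 1:
        return "private"
    return "dispensable"

print([classify_node(n) for n in (93, 80, 40, 1)])
# -> ['core', 'softcore', 'dispensable', 'private']
```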

Node coverage matrices

odgi paths -i Arabidopsis.pg.input.community.0.og -H -D "#" -p1 -t 100 > chr1_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.1.og -H -D "#" -p1 -t 100 > chr2_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.2.og -H -D "#" -p1 -t 100 > chr3_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.3.og -H -D "#" -p1 -t 100 > chr4_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.4.og -H -D "#" -p1 -t 100 > chr5_nodes_composition.txt

Matrices processing

~/software/scripts/node_matrix_processing.py -i chr1_nodes_composition.txt -p chr1 -o chr1_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr2_nodes_composition.txt -p chr2 -o chr2_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr3_nodes_composition.txt -p chr3 -o chr3_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr4_nodes_composition.txt -p chr4 -o chr4_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr5_nodes_composition.txt -p chr5 -o chr5_matrix.csv -na 93 -sl 75

Node-based similarity

We will join the graphs with odgi squeeze and subsequently run odgi similarity to obtain a similarity matrix.

odgi squeeze needs a file listing the input graphs:

nano graphs.txt

Arabidopsis.pg.input.community.0.og
Arabidopsis.pg.input.community.1.og
Arabidopsis.pg.input.community.2.og
Arabidopsis.pg.input.community.3.og
Arabidopsis.pg.input.community.4.og

Squeeze the graphs:

odgi squeeze -f graphs.txt -o squeezed_pangenome.og -t 100 -P

Obtain the matrix:

odgi similarity -i squeezed_pangenome.og -D '#' -d -t 100 -P > similarity_nodes.tsv
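The jaccard.distance values reported by odgi similarity follow the usual set definition; a sketch on toy node sets (odgi works on node coverage in the graph, here we use plain node-ID sets):

```python
def jaccard_distance(nodes_a, nodes_b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of node IDs."""
    a, b = set(nodes_a), set(nodes_b)
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({1, 2, 3, 4}, {3, 4, 5}))
# -> 0.6  (2 shared nodes out of 5 total)
```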

Growth curve simulation

# get dataset from gene matrix
perl shuffle_count.pl --file concatenated_matrix.csv --iterations 10 --campionamenti 10-20-30-40-50-60-70-80-90-93 --mode NOT_REI --output out_shuffling.txt

# get the figure
python3 curve_plot.py out_shuffling.txt

# for the zoom of the "All" curve
python3 curve_plot2.py out_shuffling.txt
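The sampling scheme (NOT_REI, i.e. each iteration draws assemblies without reintroduction) can be sketched as follows; the toy gene sets and the seed are ours, not data from the pangenome:

```python
import random

def growth_curve(assembly_genes, sizes, iterations=10, seed=42):
    """Mean number of distinct genes seen at each sample size,
    averaged over random draws of assemblies."""
    rng = random.Random(seed)
    names = list(assembly_genes)
    curve = {}
    for n in sizes:
        totals = []
        for _ in range(iterations):
            sample = rng.sample(names, n)  # draw n assemblies without replacement
            pan = set().union(*(assembly_genes[a] for a in sample))
            totals.append(len(pan))
        curve[n] = sum(totals) / iterations
    return curve

# toy data: three accessions, four distinct genes in total
toy = {"acc1": {"g1", "g2"}, "acc2": {"g2", "g3"}, "acc3": {"g1", "g4"}}
curve = growth_curve(toy, sizes=[1, 2, 3])
print(curve)
```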

Similarity analysis

Gene based

Join not transposed matrices

cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
mkdir gene_cluster
cd gene_cluster
ln -s ../../chr1/matrix_chr1.txt
ln -s ../../chr2/matrix_chr2.txt
ln -s ../../chr3/matrix_chr3.txt
ln -s ../../chr4/matrix_chr4.txt
ln -s ../../chr5/matrix_chr5.txt
./combine_notransp_matrices.py
mv concatenated_matrix.csv genes_concatenated_matrix.csv
cp genes_concatenated_matrix.csv ~/Arabidopsis_pangenome/R_plots/Cluster/

make the figure

R

library("ggplot2")
library("vegan")
library(viridis)
library(ggtree)
library(ape)

setwd("/home/lia/Arabidopsis_pangenome/R_plots")

# Load data
my_data <- read.csv("Cluster/genes_concatenated_matrix.csv", header = TRUE, sep = "\t")
head(my_data)

# Extract assembly names
assembly_names <- colnames(my_data)[2:94]

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen",
                       'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta",
                       'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black",
                       'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet",
                       'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab",
                       'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Calculate Jaccard distance
dist.jaccard <- vegdist(t(my_data[, 2:94]), method = "jaccard")
write.csv(as.matrix(dist.jaccard), "genes_PAV_jaccard_distance_matrix.csv", row.names = TRUE)

# Perform hierarchical clustering
res.hc <- hclust(d = dist.jaccard, method = "ward.D2")

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

phylo_tree <- as.phylo(res.hc)

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Plot the tree with colored tips
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(color = label), size = 4) +
  scale_color_manual(values = tip_colors$color) +
  theme_tree2() +
  theme(legend.position = "none",  # Adjust legend position
        plot.background = element_blank(),  # Customize background
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),  # Hide axis text
        axis.ticks = element_blank())  # Hide axis ticks
p

# Save the plot
ggsave(filename = "Cluster/phylo_jaccard_ward_PAV_genes.png", plot = p, width = 16, height = 16, dpi = 300)

Pseudogene based

Join not transposed matrices

cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/all
mkdir pseudogene_cluster
cd pseudogene_cluster
ln -s ../../chr1/matrix_chr1.txt
ln -s ../../chr2/matrix_chr2.txt
ln -s ../../chr3/matrix_chr3.txt
ln -s ../../chr4/matrix_chr4.txt
ln -s ../../chr5/matrix_chr5.txt
./combine_notransp_matrices.py
mv concatenated_matrix.csv pseudogenes_concatenated_matrix.csv
cp pseudogenes_concatenated_matrix.csv ~/Arabidopsis_pangenome/R_plots/Cluster/

make the figure

mamba activate R
R

library("ggplot2")
library("vegan")
library(ggtree)
library(ape)

setwd("/home/lia/Arabidopsis_pangenome/R_plots")

# Load data
my_data <- read.csv("Cluster/pseudogenes_concatenated_matrix.csv", header = TRUE, sep = "\t")
head(my_data)

# Extract assembly names
assembly_names <- colnames(my_data)[2:94]

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen",
                       'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta",
                       'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black",
                       'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet",
                       'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab",
                       'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Calculate Jaccard distance
dist.jaccard <- vegdist(t(my_data[, 2:94]), method = "jaccard")
write.csv(as.matrix(dist.jaccard), "pseudogenes_PAV_jaccard_distance_matrix.csv", row.names = TRUE)

# Hierarchical clustering
res.hc <- hclust(d = dist.jaccard, method = "ward.D2")

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

# Convert to phylo object
phylo_tree <- as.phylo(res.hc)

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Plot the tree
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(color = label), size = 4) +
  scale_color_manual(values = tip_colors$color) +
  theme_tree2() +
  theme(legend.position = "none",  # Adjust legend position
        plot.background = element_blank(),  # Customize background
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),  # Hide axis text
        axis.ticks = element_blank())  # Hide axis ticks
p

# Save the plot
ggsave(filename = "Cluster/phylo_jaccard_ward_PAV_pseudogenes.png", plot = p, width = 16, height = 16, dpi = 300)

Node based

For this figure we generated the input file using odgi similarity after joining all the graphs from each chromosome with odgi squeeze. To generate the phylogenetic tree we will follow these instructions: https://hackmd.io/@AndreaGuarracino/SyhbiKuE2#Primate-chromosome-6

mamba activate R
R

library(readr)
library(dplyr)
library(tidyr)
library(tibble)
library(ape)
library(ggtree)
library(ggplot2)

setwd("/home/lia/Arabidopsis_pangenome/FINAL/nodes_similarity")

path_dist_tsv <- 'similarity_nodes.tsv'

# Read sparse matrix
sparse_matrix_df <- read_tsv(path_dist_tsv)

# Prepare distance matrix
jaccard_dist_df <- sparse_matrix_df %>%
  arrange(group.a, group.b) %>%
  select(group.a, group.b, jaccard.distance) %>%
  pivot_wider(names_from = group.b, values_from = jaccard.distance) %>%
  column_to_rownames(var = "group.a")
dist.jaccard <- as.dist(jaccard_dist_df)
write.csv(as.matrix(dist.jaccard), "nodes_jaccard_distance_matrix.csv", row.names = TRUE)

# Perform hierarchical clustering
res.hc <- hclust(d = dist.jaccard)

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

# Convert hclust object to phylo
phylo_tree <- as.phylo(res.hc)

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

# extract assembly names
assembly_names <- locations$Assembly

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen",
                       'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta",
                       'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black",
                       'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet",
                       'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab",
                       'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Remove country information from tip_colors labels
tip_colors$label <- sub("\\..*", "", tip_colors$label)

# Ensure tip_colors is a named vector
tip_colors_named <- setNames(tip_colors$color, tip_colors$label)

# Plot the tree with color mapping
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(label = label, color = label), size = 4) +
  scale_color_manual(values = tip_colors_named) +
  theme_tree2() +
  theme(legend.position = "none",  # Adjust legend position
        plot.background = element_blank(),  # Customize background
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),  # Hide axis text
        axis.ticks = element_blank())  # Hide axis ticks

# Display the plot
print(p)

# Save the plot
ggsave(filename = "phylo_jaccard_nodes.png", plot = p, width = 16, height = 16, dpi = 300)

Other data visualisation

Assembly based gene classification - PAV

Obtain the files

We will need the following files:

  • presence_absence_matrix.csv with the results from all the chromosomes
  • A file called loci_classes.txt, holding the Feature/Pater and Label columns from final_screening_GO.tsv with the results from all the chromosomes
  • matrix_transposed.csv, to be obtained with transpose.pl
  • The dataset rearranged for R, to be obtained with comp_node_path.pl
  1. Final screening and presence-absence matrix per chromosome

# chr1
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr1
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr2
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr2.txt
# transpose matrix
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr3
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr3.txt
# transpose matrix
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr4
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr4.txt
# transpose matrix
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr5
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr5.txt
# transpose matrix
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt
  2. Join transposed matrices

cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
  3. Join loci classes

cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
  4. R dataset
perl ../comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt

This last script also produces `data_set_R.txt`, which is the file we will use for the figure. Let's rename it:

mv data_set_R.txt R_data_genes_PAV_total.txt
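`comp_node_path_PAV.pl` is specific to this repository; the aggregation it appears to perform, counting for every assembly how many present loci fall into each pangenome class, can be sketched in Python as follows (function name and data layout are assumptions for illustration):

```python
from collections import Counter

def count_classes_per_assembly(locus_class, matrix_rows, assemblies):
    """Per-assembly PAV summary: for each assembly, count how many loci
    present in it (matrix value != '0') belong to each pangenome class
    (core, softcore, dispensable, private).

    locus_class : dict mapping locus name -> class label
    matrix_rows : iterable of (locus, [value per assembly]) pairs
    assemblies  : list of assembly names, in matrix column order
    """
    counts = {a: Counter() for a in assemblies}
    for locus, values in matrix_rows:
        cls = locus_class.get(locus)
        if cls is None:  # locus missing from the class file: skip it
            continue
        for assembly, value in zip(assemblies, values):
            if value != "0":
                counts[assembly][cls] += 1
    return counts
```

The resulting per-assembly class counts are exactly the three-column (Assembly, Class, Count) layout the R figure expects.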

Make the figure

Install required packages

mamba activate R
mamba install r-ggplot2
mamba install r-factoextra
mamba install conda-forge::r-vegan
mamba install bioconda::bioconductor-ggtree
mamba install conda-forge::r-ape
mamba install bioconda::bioconductor-ggtreeextra

Open R

R
setwd("/home/lia/Arabidopsis_pangenome/R_plots")
data<-read.table(file="./Assembly_based_genes/R_data_genes_PAV_total.txt", sep="\t", header=FALSE)
head(data)
colnames(data)<-c("Assembly", "Class", "Count")
head(data)
library("ggplot2")
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)
ggsave(filename = "Assembly_based_genes/gene_plot_PAV_total.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)

Assembly based gene classification - CNV

This will be based on untangle results only.

Obtain the files

We will need the following files:

  • `matrix.csv` with the results from all the chromosomes
  • A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `pangenome_screening.csv`, with the results from all the chromosomes
  • `matrix_transposed.csv`, to be obtained with `transpose.pl`
  • The dataset rearranged for R, to be obtained with `comp_node_path.pl`
  1. Pangenome screening and matrix per chromosome
# chr1
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr1
cut -f 2,8 pangenome_screening.csv > chr1_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr2
cut -f 2,8 pangenome_screening.csv > chr2_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr3
cut -f 2,8 pangenome_screening.csv > chr3_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr4
cut -f 2,8 pangenome_screening.csv > chr4_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr5
cut -f 2,8 pangenome_screening.csv > chr5_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr5_transp_matrix.txt
  2. Join transposed matrices
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/
mkdir all
cd all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
  3. Join loci classes
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
  4. R dataset
perl ../comp_node_path_CNV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt

This last script also produces `data_set_R.txt`, which is the file we will use for the figure. Let's rename it:

mv data_set_R.txt R_data_genes_CNV_untangle.txt
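`comp_node_path_CNV.pl` works on copy numbers rather than 0/1 presence; a hedged sketch of the analogous aggregation, summing copies per class instead of counting presences (an assumption about the script's logic, with an illustrative function name), is:

```python
from collections import Counter

def sum_copies_per_assembly(locus_class, matrix_rows, assemblies):
    """Per-assembly CNV summary: for each assembly, sum the copy numbers
    of its loci per pangenome class.

    locus_class : dict mapping locus name -> class label
    matrix_rows : iterable of (locus, [copy number per assembly]) pairs
    assemblies  : list of assembly names, in matrix column order
    """
    totals = {a: Counter() for a in assemblies}
    for locus, copies in matrix_rows:
        cls = locus_class.get(locus)
        if cls is None:  # locus missing from the class file: skip it
            continue
        for assembly, n in zip(assemblies, copies):
            totals[assembly][cls] += int(n)
    return totals
```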

Make the figure

setwd("/home/lia/Arabidopsis_pangenome/R_plots")
data<-read.table(file="Assembly_based_genes/R_data_genes_CNV_untangle.txt", sep="\t", header=FALSE)
head(data)
colnames(data)<-c("Assembly", "Class", "Count")
head(data)
library("ggplot2")
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)
ggsave(filename = "Assembly_based_genes/gene_plot_CNV_untangle.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)

Assembly based pseudogene classification - PAV

Obtain the files

We will need the following files:

  • `presence_absence_matrix.csv` with the results from all the chromosomes
  • A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `final_screening_GO.tsv`, with the results from all the chromosomes
  • `matrix_transposed.csv`, to be obtained with `transpose.pl`
  • The dataset rearranged for R, to be obtained with `comp_node_path.pl`
  1. Final screening and Presence-absence matrix per chromosome
# chr1
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr1
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr2
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr2.txt
# transpose matrix
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr3
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr3.txt
# transpose matrix
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr4
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr4.txt
# transpose matrix
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr5
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr5.txt
# transpose matrix
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt
  2. Join transposed matrices
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/
mkdir all
cd all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
  3. Join loci classes
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
  4. R dataset
perl ../comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt

This last script also produces `data_set_R.txt`, which is the file we will use for the figure. Let's rename it:

mv data_set_R.txt R_data_pseudogenes_PAV_total.txt

Make the figure

Install required packages

mamba activate R
mamba install r-ggplot2
mamba install r-factoextra

Open R

R
setwd("/home/lia/Arabidopsis_pangenome/R_plots")
data<-read.table(file="./Assembly_based_pseudogenes/R_data_pseudogenes_PAV_total.txt", sep="\t", header=FALSE)
head(data)
colnames(data)<-c("Assembly", "Class", "Count")
head(data)
library("ggplot2")
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)
ggsave(filename = "Assembly_based_pseudogenes/pseudogene_plot_PAV_total.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)

Assembly based pseudogene classification - CNV

This will be based on untangle results only.

Obtain the files

We will need the following files:

  • `matrix.csv` with the results from all the chromosomes
  • A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `pangenome_screening.csv`, with the results from all the chromosomes
  • `matrix_transposed.csv`, to be obtained with `transpose.pl`
  • The dataset rearranged for R, to be obtained with `comp_node_path.pl`
  1. Pangenome screening and matrix per chromosome
# chr1
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr1
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr2
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' matrix.csv > matrix_chr2.txt
# transpose matrix
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr3
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' matrix.csv > matrix_chr3.txt
# transpose matrix
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr4
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' matrix.csv > matrix_chr4.txt
# transpose matrix
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr5
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' matrix.csv > matrix_chr5.txt
# transpose matrix
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt

  2. Join transposed matrices

cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/
mkdir all
cd all

ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt

../combine_transposed_matrices.py

  3. Join loci classes

cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/all

ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt

for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt

  4. R dataset

perl ../comp_node_path_CNV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt

This last script produces also `data_set_R.txt` which is the file we are going to use for the figure. Let's rename the file:

mv data_set_R.txt R_data_pseudogenes_CNV_untangle.txt

mamba activate R
R

Make the figure

setwd("/home/lia/Arabidopsis_pangenome/R_plots")
data<-read.table(file="Assembly_based_pseudogenes/R_data_pseudogenes_CNV_untangle.txt", sep="\t", header=FALSE)
head(data)
colnames(data)<-c("Assembly", "Class", "Count")
head(data)
library("ggplot2")
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)
ggsave(filename = "Assembly_based_pseudogenes/pseudogene_plot_CNV_untangle.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)

Assembly based node classification

We need to obtain one large matrix combining the node matrices from all chromosomes:

cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/all_graphs
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/chr1_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/chr2_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/chr3_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/chr4_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/chr5_matrix.csv

screen -S combine_matrices
mamba activate pandas
combine_transposed_matrices.py

cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr1_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr2_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr3_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr4_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr5_loci_classes.txt

cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/all_graphs
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/chr1_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/chr2_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/chr3_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/chr4_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt

screen -r combine_matrices
./comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes_pav.txt
mv data_set_R.txt pav_data_set_R.txt

Make the figures

mamba activate R
R
setwd("/home/lia/Arabidopsis_pangenome/R_plots")
data<-read.table(file="Assembly_based_nodes/pav_data_set_R.txt", sep="\t", header=FALSE)
head(data)
colnames(data)<-c("Assembly", "Class", "Count")
head(data)
library("ggplot2")
library(scales)
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
node_plot_pav <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors) +  # Apply the custom colors
  scale_y_continuous(labels = label_number())  # Format y-axis labels to avoid scientific notation

# Print the plot
print(node_plot_pav)
ggsave(filename = "Assembly_based_nodes/nodes_plot_PAV.png", plot = node_plot_pav, width = 10, height = 7, dpi = 300)

Pie charts

Refer to `scripts/python_plot.ipynb`.
