# LiaOb21/arabidopsis_pangenome
This repository collects all the commands and scripts used for the reference-free *Arabidopsis thaliana* pangenome. This work is part of the PhD project carried out by Lia Obinu.
- The reference-free *Arabidopsis thaliana* pangenome
- Table of contents
- Building the reference-free *Arabidopsis thaliana* pangenome graph
- Pangenome annotation - reference based
- Pangenome annotation - non-reference sequences
- Get the FASTA file for annotation
- Annotation
- Annotation review
- Final annotation screening
- Gene ontology enrichment analysis
- Sequence-based pangenome analysis
- Similarity analysis
- Other data visualisation
We found in total 93 suitable assemblies at chromosome or complete-genome level. They come from the following BioProjects: PRJEB55353, PRJEB55632, PRJEB50694, PRJEB30763, PRJEB31147, PRJEB37252, PRJEB37257, PRJEB37258, PRJEB37260, PRJEB37261, PRJEB40125, PRJEB51511, PRJNA777107, PRJNA779205, PRJNA828135, PRJNA834751, PRJNA10719, PRJNA915353, PRJNA311266, PRJCA005809, PRJCA007112.
We should exclude short contigs from the pangenome, but we need to set a threshold; for this we must inspect the quality of the assemblies and check their length distribution. We can do this using QUAST:

```
quast accessions/*.fna.gz accessions/*.fasta.gz -o ./output_quast -r accessions/GCA_000001735.2_TAIR10.1_genomic.fna.gz -t 40 --eukaryote --large -k --plots-format png
```
Based on the QUAST results, we decided to trim contigs smaller than 5 kbp. According to QUAST, the following assemblies will lose contigs:
Assembly | # contigs (>= 0 bp) | # contigs (>= 5000 bp) | Total length (>= 0 bp) | Total length (>= 5000 bp) |
---|---|---|---|---|
GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic | 109 | 78 | 119626746 | 119530365 |
GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic | 111 | 83 | 120129838 | 120059751 |
GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic | 102 | 71 | 119749512 | 119657772 |
GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic | 105 | 75 | 120338059 | 120246237 |
GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic | 94 | 74 | 120289901 | 120224609 |
GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic | 184 | 155 | 122202079 | 122112347 |
GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic | 142 | 115 | 120795191 | 120728700 |
For the filtering we can use bioawk.
```
bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna > GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna
# check:
grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.fna | wc -l          # 109
grep '>' GCA_900660825.1_Ath.Ler-0.MPIPZ.v1.0_genomic.trim.5k.fna | wc -l  # 78

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna > GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna
# check:
grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.fna | wc -l          # 111
grep '>' GCA_902460265.3_Arabidopsis_thaliana_An-1_chrom_genomic.trim.5k.fna | wc -l  # 83

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna > GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna
# check:
grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.fna | wc -l          # 102
grep '>' GCA_902460275.1_Arabidopsis_thaliana_Cvi-0_genomic.trim.5k.fna | wc -l  # 71

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna > GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna
# check:
grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.fna | wc -l          # 105
grep '>' GCA_902460285.1_Arabidopsis_thaliana_Ler_genomic.trim.5k.fna | wc -l  # 75

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna > GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna
# check:
grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.fna | wc -l          # 94
grep '>' GCA_902460295.1_Arabidopsis_thaliana_Sha_genomic.trim.5k.fna | wc -l  # 74

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna > GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna
# check:
grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.fna | wc -l          # 184
grep '>' GCA_902460305.1_Arabidopsis_thaliana_Kyo_genomic.trim.5k.fna | wc -l  # 155

bioawk -c fastx '(length($seq)>5000){print ">" $name ORS $seq}' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna > GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna
# check:
grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.fna | wc -l          # 142
grep '>' GCA_902460315.1_Arabidopsis_thaliana_Eri-1_genomic.trim.5k.fna | wc -l  # 115
```
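If bioawk is not available, the same length filter can be sketched in plain awk (handling multi-line FASTA records); the `filter_fasta` function name and the commented loop are made up for illustration, not part of the original pipeline:

```shell
# Plain-awk equivalent of the bioawk length filter: buffer each record and
# print it only if the sequence is longer than 5000 bp.
filter_fasta() {
  awk -v min=5000 '
    /^>/ { if (seq != "" && length(seq) > min) print hdr ORS seq
           hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END { if (seq != "" && length(seq) > min) print hdr ORS seq }
  ' "$1"
}

# Hypothetical application to all assemblies:
# for f in *_genomic.fna; do filter_fasta "$f" > "${f%.fna}.trim.5k.fna"; done
```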
Now we should gzip the .trim.5k.fna files:

```
for i in *trim.5k.fna ; do (echo "gzip $i"); done | bash
```
From now on, we are going to use the trimmed version of these assemblies.

We can follow the *S. cerevisiae* example:

```
ls *.gz | cut -f 1 -d '.' | uniq | while read f; do
    echo $f
    zcat $f.* > $f.fa
done
```
To change the sequence names according to PanSN-spec, we can use fastix:

```
ls *.fa | while read f; do
    sample_name=$(echo $f | cut -f 1 -d '.')
    echo ${sample_name}
    fastix -p "${sample_name}#1#" $f >> Arabidopsis.pg.in.fasta
done
```

With `"${sample_name}#1#"` we specify a haplotype_id of 1 for all the assemblies, as they are all haploid. The scaffold names now follow the pattern:

```
[sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
```
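As a quick illustration of the renaming, this sketch emulates what the fastix prefix does to a single header, using sed in place of fastix:

```shell
# What fastix -p "GCA_000001735#1#" does to each header, emulated with sed:
printf '>CP002684.1 Arabidopsis thaliana chromosome 1\nACGT\n' |
  sed 's/^>/>GCA_000001735#1#/'
# The header becomes: >GCA_000001735#1#CP002684.1 Arabidopsis thaliana chromosome 1
```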
We should remove everything that comes after a space or tab in the FASTA headers, as trailing description fields can lead to errors in subsequent steps:

```
cat Arabidopsis.pg.in.fasta | sed -E '/^>/s/( +|\t).*//' > Arabidopsis.adj.pg.in.fasta
```
We should then remove the organelle scaffolds from our input file. Only three assemblies have mitochondrial (Mt) and plastid (Plt) sequences, and the corresponding scaffolds are:

```
>GCA_000001735#1#BK010421.1
>GCA_000001735#1#AP000423.1
>GCA_023115395#1#CP096029.1
>GCA_023115395#1#CP096030.1
>GCA_904420315#1#LR881472.1
>GCA_904420315#1#LR881471.1
```

We need to put these headers in a file that we will call `Mt_and_Plt_headers.txt`.

We can now exclude these scaffolds from our previous multifasta input file:

```
awk '(NR==FNR) { toRemove[$1]; next } /^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 } p' Mt_and_Plt_headers.txt Arabidopsis.adj.pg.in.fasta > Arabidopsis.pg.input.fasta
```
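The awk filter above can be sanity-checked on a tiny input; the headers here are made up for the demonstration:

```shell
# Build a two-record FASTA and a removal list, then run the same awk filter.
printf '>keep#1#A\nAAAA\n>drop#1#B\nCCCC\n' > mini.fa
printf '>drop#1#B\n' > remove.txt
awk '(NR==FNR) { toRemove[$1]; next } /^>/ { p=1; for(h in toRemove) if ( h ~ $0) p=0 } p' \
  remove.txt mini.fa
# Only the >keep#1#A record survives.
```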
We need to compress our input and index it:

```
bgzip -@ 16 Arabidopsis.pg.input.fasta
samtools faidx Arabidopsis.pg.input.fasta.gz
```
We can now run partition-before-pggb on our multifasta (note: if it is not possible to estimate the divergence, you should try different settings to find the right value for `-p`):

```
partition-before-pggb -i Arabidopsis.pg.input.fasta.gz -o output_pbp_p90 -n 93 -t 100 -Y '#' -p 90 -s 10k -V 'GCA_000001735:#:1000' -m -D ./temp
```
- `-i` input
- `-o` output directory
- `-n` number of haplotypes
- `-p` percent identity for mapping/alignment
- `-s` segment length for mapping [default: 5000]
- `-V` to produce a VCF against the reference
- `-m` generate a MultiQC report of graph statistics and visualizations; automatically runs odgi stats
- `-D` temporary files directory
We can now run pggb on each community! :-)

To build the graph of each community we will use `pggb -p 95`.
```
pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.0.fa \
  -o output_pbp_p90/community.0.out \
  -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
  -k 23 -f 0 -B 10000000 \
  -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
  -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.1.fa \
  -o output_pbp_p90/community.1.out \
  -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
  -k 23 -f 0 -B 10000000 \
  -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
  -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.2.fa \
  -o output_pbp_p90/community.2.out \
  -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
  -k 23 -f 0 -B 10000000 \
  -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
  -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.3.fa \
  -o output_pbp_p90/community.3.out \
  -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
  -k 23 -f 0 -B 10000000 \
  -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
  -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100

pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.4.fa \
  -o output_pbp_p90/community.4.out \
  -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
  -k 23 -f 0 -B 10000000 \
  -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
  -Y "#" -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100
```
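The five invocations differ only in the community index, so they can be generated with a loop. This sketch only prints the commands (a dry run); pipe its output to bash to actually execute them:

```shell
# Print one pggb command per community; flags match the runs above.
for i in 0 1 2 3 4; do
  echo pggb -i output_pbp_p90/Arabidopsis.pg.input.fasta.gz.c325321.community.$i.fa \
    -o output_pbp_p90/community.$i.out \
    -s 10000 -l 50000 -p 95 -n 93 -K 19 -F 0.001 -g 30 \
    -k 23 -f 0 -B 10000000 \
    -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
    -Y '#' -V GCA_000001735:#:1000 --multiqc --temp-dir ./temp --threads 100 --poa-threads 100
done   # append "| bash" to run
```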
To use odgi untangle for annotating the pangenome graph, we will switch to the untangle_for_annotation branch.

```
cd odgi
git checkout untangle_for_annotation && git pull && git submodule update --init --recursive
cmake -H. -Bbuild && cmake --build build -- -j 16
```

```
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.gff.gz
gunzip GCA_000001735.2_TAIR10.1_genomic.gff.gz
```
Now, we will inject only the genes, so we need to extract only the gene rows from the TAIR10 gff and split them by chromosome.

Let's have a look at the gff file:

```
head GCA_000001735.2_TAIR10.1_genomic.gff
```
```
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build TAIR10.1
#!genome-build-accession NCBI_Assembly:GCA_000001735.2
##sequence-region CP002684.1 1 30427671
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702
CP002684.1  Genbank  region  1     30427671  .  +  .  ID=CP002684.1:1..30427671;Dbxref=taxon:3702;Name=1;chromosome=1;ecotype=Columbia;gbkey=Src;mol_type=genomic DNA
CP002684.1  Genbank  gene    3631  5899      .  +  .  ID=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;Name=NAC001;gbkey=Gene;gene=NAC001;gene_biotype=protein_coding;gene_synonym=ANAC001,NAC domain containing protein 1,T25K16.1,T25K16_1;locus_tag=AT1G01010
CP002684.1  Genbank  mRNA    3631  5899      .  +  .  ID=rna-gnl|JCVI|mRNA.AT1G01010.1;Parent=gene-AT1G01010;Dbxref=Araport:AT1G01010,TAIR:AT1G01010;gbkey=mRNA;gene=NAC001;inference=similar to RNA sequence%2C mRNA:INSD:BT001115.1%2CINSD:AF439834.1%2CINSD:AK226863.1;locus_tag=AT1G01010;orig_protein_id=gnl|JCVI|AT1G01010.1;orig_transcript_id=gnl|JCVI|mRNA.AT1G01010.1;product=NAC domain containing protein 1
```
To inject the genes into the pangenome, we need to extract the relevant information from the gff and adjust it to the PanSN-spec naming:

```
awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002684.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr1.genes.adj.TAIR10.bed
head chr1.genes.adj.TAIR10.bed | column -t
```
```
GCA_000001735#1#CP002684.1  3631   5899   AT1G01010:3631-5899
GCA_000001735#1#CP002684.1  6788   9130   AT1G01020:6788-9130
GCA_000001735#1#CP002684.1  11101  11372  AT1G03987:11101-11372
GCA_000001735#1#CP002684.1  11649  13714  AT1G01030:11649-13714
GCA_000001735#1#CP002684.1  23121  31227  AT1G01040:23121-31227
GCA_000001735#1#CP002684.1  23312  24099  AT1G03993:23312-24099
GCA_000001735#1#CP002684.1  28500  28706  AT1G01046:28500-28706
GCA_000001735#1#CP002684.1  31170  33171  AT1G01050:31170-33171
GCA_000001735#1#CP002684.1  32727  33009  AT1G03997:32727-33009
GCA_000001735#1#CP002684.1  33365  37871  AT1G01060:33365-37871
```

Make sure that community 0 is chromosome 1 of TAIR10:

```
odgi paths -i Arabidopsis.pg.input.community.0.og -L | grep 'GCA_000001735'
# GCA_000001735#1#CP002684.1
```
To inject the genes into the pangenome, we need to extract the relevant information from the gff and adjust it to the PanSN-spec naming:

```
awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002685.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr2.genes.adj.TAIR10.bed
head chr2.genes.adj.TAIR10.bed | column -t
```
```
GCA_000001735#1#CP002685.1  1025   3173   AT2G01008:1025-3173
GCA_000001735#1#CP002685.1  2805   3176   AT2G03855:2805-3176
GCA_000001735#1#CP002685.1  3706   5513   AT2G01010:3706-5513
GCA_000001735#1#CP002685.1  5782   5945   AT2G01020:5782-5945
GCA_000001735#1#CP002685.1  6571   6672   AT2G01021:6571-6672
GCA_000001735#1#CP002685.1  7090   7505   AT2G03865:7090-7505
GCA_000001735#1#CP002685.1  7669   9399   AT2G03875:7669-9399
GCA_000001735#1#CP002685.1  9648   9767   AT2G01023:9648-9767
GCA_000001735#1#CP002685.1  9849   10177  AT2G00340:9849-10177
GCA_000001735#1#CP002685.1  50288  51487  AT2G01035:50288-51487
```

Make sure that community 1 is chromosome 2 of TAIR10:

```
odgi paths -i Arabidopsis.pg.input.community.1.og -L | grep 'GCA_000001735'
# GCA_000001735#1#CP002685.1
```
To inject the genes into the pangenome, we need to extract the relevant information from the gff and adjust it to the PanSN-spec naming:

```
awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002686.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr3.genes.adj.TAIR10.bed
head chr3.genes.adj.TAIR10.bed | column -t
```
```
GCA_000001735#1#CP002686.1  1609   4159   AT3G01015:1609-4159
GCA_000001735#1#CP002686.1  4342   4818   AT3G01010:4342-4818
GCA_000001735#1#CP002686.1  5104   6149   AT3G01020:5104-6149
GCA_000001735#1#CP002686.1  6657   7772   AT3G01030:6657-7772
GCA_000001735#1#CP002686.1  8723   12697  AT3G01040:8723-12697
GCA_000001735#1#CP002686.1  13046  15906  AT3G01050:13046-15906
GCA_000001735#1#CP002686.1  15934  16320  AT3G00970:15934-16320
GCA_000001735#1#CP002686.1  16631  18909  AT3G01060:16631-18909
GCA_000001735#1#CP002686.1  19409  20806  AT3G01070:19409-20806
GCA_000001735#1#CP002686.1  25355  27712  AT3G01080:25355-27712
```

Make sure that community 2 is chromosome 3 of TAIR10:

```
odgi paths -i Arabidopsis.pg.input.community.2.og -L | grep 'GCA_000001735'
# GCA_000001735#1#CP002686.1
```
To inject the genes into the pangenome, we need to extract the relevant information from the gff and adjust it to the PanSN-spec naming:

```
awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002687.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr4.genes.adj.TAIR10.bed
head chr4.genes.adj.TAIR10.bed | column -t
```
```
GCA_000001735#1#CP002687.1  1180   1536   AT4G00005:1180-1536
GCA_000001735#1#CP002687.1  2895   10504  AT4G00020:2895-10504
GCA_000001735#1#CP002687.1  10815  13359  AT4G00026:10815-13359
GCA_000001735#1#CP002687.1  13527  14413  AT4G00030:13527-14413
GCA_000001735#1#CP002687.1  14627  16079  AT4G00040:14627-16079
GCA_000001735#1#CP002687.1  17639  20183  AT4G00050:17639-20183
GCA_000001735#1#CP002687.1  21351  29064  AT4G00060:21351-29064
GCA_000001735#1#CP002687.1  29072  31426  AT4G00070:29072-31426
GCA_000001735#1#CP002687.1  32748  33756  AT4G00080:32748-33756
GCA_000001735#1#CP002687.1  33800  33872  AT4G00085:33800-33872
```

Make sure that community 3 is chromosome 4 of TAIR10:

```
odgi paths -i Arabidopsis.pg.input.community.3.og -L | grep 'GCA_000001735'
# GCA_000001735#1#CP002687.1
```
To inject the genes into the pangenome, we need to extract the relevant information from the gff and adjust it to the PanSN-spec naming:

```
awk -F "\t|ID=gene-|;Dbxref" '$1 ~/^CP002688.1$/ && $3 ~/^gene$/ {print ("GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5)}' ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr5.genes.adj.TAIR10.bed
head chr5.genes.adj.TAIR10.bed | column -t
```
```
GCA_000001735#1#CP002688.1  2      303    AT5G00730:2-303
GCA_000001735#1#CP002688.1  995    5156   AT5G01010:995-5156
GCA_000001735#1#CP002688.1  5256   5907   AT5G01015:5256-5907
GCA_000001735#1#CP002688.1  5339   5593   AT5G01017:5339-5593
GCA_000001735#1#CP002688.1  5917   8467   AT5G01020:5917-8467
GCA_000001735#1#CP002688.1  9780   13235  AT5G01030:9780-13235
GCA_000001735#1#CP002688.1  13128  16236  AT5G01040:13128-16236
GCA_000001735#1#CP002688.1  18086  20887  AT5G01050:18086-20887
GCA_000001735#1#CP002688.1  22684  24934  AT5G01060:22684-24934
GCA_000001735#1#CP002688.1  25012  25908  AT5G01070:25012-25908
```

Make sure that community 4 is chromosome 5 of TAIR10:

```
odgi paths -i Arabidopsis.pg.input.community.4.og -L | grep 'GCA_000001735'
# GCA_000001735#1#CP002688.1
```
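The five per-chromosome extractions above differ only in the accession, so they can be wrapped in a helper; the `genes_bed` function name and the commented loop are illustrative, not part of the original scripts:

```shell
# Sketch: one function reproducing the per-chromosome awk calls above,
# with the target accession and gff path as parameters.
genes_bed() {  # usage: genes_bed <accession> <gff>  (BED on stdout)
  awk -F "\t|ID=gene-|;Dbxref" -v acc="$1" \
    '$1 == acc && $3 == "gene" {print "GCA_000001735#1#"$1"\t"$4"\t"$5"\t"$10":"$4"-"$5}' "$2"
}

# Hypothetical loop over the chromosome/accession pairs:
# for pair in 1:CP002684.1 2:CP002685.1 3:CP002686.1 4:CP002687.1 5:CP002688.1; do
#   n=${pair%%:*}; acc=${pair#*:}
#   genes_bed "$acc" ../../../GCA_000001735.2_TAIR10.1_genomic.gff > chr$n.genes.adj.TAIR10.bed
# done
```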
The original file from Mascagni et al., 2021 was modified as explained in scripts/bed_to_gff3.ipynb. The coordinates of pseudogenes annotated on the reverse strand were then converted to the forward strand to allow compatibility with the downstream analysis. This is how the final annotation file looks:

```
head pseudogenes.txt
```
```
Chromosome  end     start   coverage  Genewise_score  position    size  Strand  Type  Pater        Species     Stop_codon  Frameshift
Chr1        6144    6269    0.09      56.39           UTR         129   plus    FRAG  AT1G01010.1  A.thaliana  0           0
Chr1        47087   47325   0.06      74.09           Intergenic  206   plus    FRAG  AT1G02730.1  A.thaliana  1           3
Chr1        89932   90153   0.02      24.75           Intergenic  51    plus    FRAG  AT5G41740.2  A.thaliana  1           3
Chr1        259011  259151  0.06      48.95           Intergenic  84    plus    FRAG  AT1G01690.1  A.thaliana  0           0
Chr1        266562  267495  0.99      68.96           UTR         979   plus    DUP   AT4G00525.1  A.thaliana  0           0
Chr1        426018  426176  0.06      17.28           Intergenic  36    plus    FRAG  AT3G04430.1  A.thaliana  1           1
Chr1        578305  578484  0.04      47.13           Intron_cds  102   plus    FRAG  AT1G05120.1  A.thaliana  0           0
Chr1        582401  582585  0.46      52.69           Intergenic  236   plus    SE    AT4G02160.1  A.thaliana  0           1
Chr1        582634  582971  0.57      117.33          Intergenic  247   plus    SE    AT5G61710.1  A.thaliana  0           1
```
```
awk '$1 == "Chr1"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr1/GCA_000001735#1#CP002684.1/g' > Pseudogenes_TAIR10_chr1.bed
head Pseudogenes_TAIR10_chr1.bed
```
```
GCA_000001735#1#CP002684.1  6144    6269    GCA_000001735#1#CP002684.1:6144-6269
GCA_000001735#1#CP002684.1  47087   47325   GCA_000001735#1#CP002684.1:47087-47325
GCA_000001735#1#CP002684.1  89932   90153   GCA_000001735#1#CP002684.1:89932-90153
GCA_000001735#1#CP002684.1  259011  259151  GCA_000001735#1#CP002684.1:259011-259151
GCA_000001735#1#CP002684.1  266562  267495  GCA_000001735#1#CP002684.1:266562-267495
GCA_000001735#1#CP002684.1  426018  426176  GCA_000001735#1#CP002684.1:426018-426176
GCA_000001735#1#CP002684.1  578305  578484  GCA_000001735#1#CP002684.1:578305-578484
GCA_000001735#1#CP002684.1  582401  582585  GCA_000001735#1#CP002684.1:582401-582585
GCA_000001735#1#CP002684.1  582634  582971  GCA_000001735#1#CP002684.1:582634-582971
GCA_000001735#1#CP002684.1  893047  893493  GCA_000001735#1#CP002684.1:893047-893493
```
```
awk '$1 == "Chr2"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr2/GCA_000001735#1#CP002685.1/g' > Pseudogenes_TAIR10_chr2.bed
head Pseudogenes_TAIR10_chr2.bed
```
```
GCA_000001735#1#CP002685.1  2629    2769    GCA_000001735#1#CP002685.1:2629-2769
GCA_000001735#1#CP002685.1  2939    3079    GCA_000001735#1#CP002685.1:2939-3079
GCA_000001735#1#CP002685.1  126312  126472  GCA_000001735#1#CP002685.1:126312-126472
GCA_000001735#1#CP002685.1  167479  167678  GCA_000001735#1#CP002685.1:167479-167678
GCA_000001735#1#CP002685.1  184200  184421  GCA_000001735#1#CP002685.1:184200-184421
GCA_000001735#1#CP002685.1  214567  214668  GCA_000001735#1#CP002685.1:214567-214668
GCA_000001735#1#CP002685.1  249464  249634  GCA_000001735#1#CP002685.1:249464-249634
GCA_000001735#1#CP002685.1  287831  287959  GCA_000001735#1#CP002685.1:287831-287959
GCA_000001735#1#CP002685.1  338842  339105  GCA_000001735#1#CP002685.1:338842-339105
GCA_000001735#1#CP002685.1  343391  343537  GCA_000001735#1#CP002685.1:343391-343537
```
```
awk '$1 == "Chr3"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr3/GCA_000001735#1#CP002686.1/g' > Pseudogenes_TAIR10_chr3.bed
head Pseudogenes_TAIR10_chr3.bed
```
```
GCA_000001735#1#CP002686.1  1009    1200    GCA_000001735#1#CP002686.1:1009-1200
GCA_000001735#1#CP002686.1  430020  430130  GCA_000001735#1#CP002686.1:430020-430130
GCA_000001735#1#CP002686.1  484416  484737  GCA_000001735#1#CP002686.1:484416-484737
GCA_000001735#1#CP002686.1  630152  630235  GCA_000001735#1#CP002686.1:630152-630235
GCA_000001735#1#CP002686.1  792374  792475  GCA_000001735#1#CP002686.1:792374-792475
GCA_000001735#1#CP002686.1  799113  799223  GCA_000001735#1#CP002686.1:799113-799223
GCA_000001735#1#CP002686.1  832148  832279  GCA_000001735#1#CP002686.1:832148-832279
GCA_000001735#1#CP002686.1  863502  863939  GCA_000001735#1#CP002686.1:863502-863939
GCA_000001735#1#CP002686.1  864622  864843  GCA_000001735#1#CP002686.1:864622-864843
GCA_000001735#1#CP002686.1  987088  990672  GCA_000001735#1#CP002686.1:987088-990672
```
```
awk '$1 == "Chr4"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr4/GCA_000001735#1#CP002687.1/g' > Pseudogenes_TAIR10_chr4.bed
head Pseudogenes_TAIR10_chr4.bed
```
```
GCA_000001735#1#CP002687.1  11253   11351   GCA_000001735#1#CP002687.1:11253-11351
GCA_000001735#1#CP002687.1  62136   62339   GCA_000001735#1#CP002687.1:62136-62339
GCA_000001735#1#CP002687.1  70243   70416   GCA_000001735#1#CP002687.1:70243-70416
GCA_000001735#1#CP002687.1  99056   99229   GCA_000001735#1#CP002687.1:99056-99229
GCA_000001735#1#CP002687.1  99355   99464   GCA_000001735#1#CP002687.1:99355-99464
GCA_000001735#1#CP002687.1  102138  102411  GCA_000001735#1#CP002687.1:102138-102411
GCA_000001735#1#CP002687.1  102614  103287  GCA_000001735#1#CP002687.1:102614-103287
GCA_000001735#1#CP002687.1  103362  103703  GCA_000001735#1#CP002687.1:103362-103703
GCA_000001735#1#CP002687.1  105289  105375  GCA_000001735#1#CP002687.1:105289-105375
GCA_000001735#1#CP002687.1  121936  122058  GCA_000001735#1#CP002687.1:121936-122058
```
```
awk '$1 == "Chr5"' ../../../../pseudogenes.txt | awk -v OFS='\t' '{print($1,$2,$3,$1":"$2"-"$3)}' | sed 's/Chr5/GCA_000001735#1#CP002688.1/g' > Pseudogenes_TAIR10_chr5.bed
head Pseudogenes_TAIR10_chr5.bed
```
```
GCA_000001735#1#CP002688.1  17952   18035   GCA_000001735#1#CP002688.1:17952-18035
GCA_000001735#1#CP002688.1  18038   18085   GCA_000001735#1#CP002688.1:18038-18085
GCA_000001735#1#CP002688.1  453356  453511  GCA_000001735#1#CP002688.1:453356-453511
GCA_000001735#1#CP002688.1  453793  454342  GCA_000001735#1#CP002688.1:453793-454342
GCA_000001735#1#CP002688.1  498633  498719  GCA_000001735#1#CP002688.1:498633-498719
GCA_000001735#1#CP002688.1  595390  595590  GCA_000001735#1#CP002688.1:595390-595590
GCA_000001735#1#CP002688.1  597848  597991  GCA_000001735#1#CP002688.1:597848-597991
GCA_000001735#1#CP002688.1  660001  660384  GCA_000001735#1#CP002688.1:660001-660384
GCA_000001735#1#CP002688.1  672914  673057  GCA_000001735#1#CP002688.1:672914-673057
GCA_000001735#1#CP002688.1  679891  680126  GCA_000001735#1#CP002688.1:679891-680126
```
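The five pseudogene-BED commands above share one pattern, which can be captured in a helper; the `pseudo_bed` function name is made up for illustration:

```shell
# Sketch: reproduce the per-chromosome pseudogene BED pipeline above as a
# function taking the chromosome label, the PanSN accession, and the input file.
pseudo_bed() {  # usage: pseudo_bed Chr1 'GCA_000001735#1#CP002684.1' pseudogenes.txt
  awk -v c="$1" '$1 == c' "$3" |
    awk -v OFS='\t' '{print $1, $2, $3, $1":"$2"-"$3}' |
    sed "s/$1/$2/g"
}

# Hypothetical use:
# pseudo_bed Chr1 'GCA_000001735#1#CP002684.1' ../../../../pseudogenes.txt > Pseudogenes_TAIR10_chr1.bed
```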
```
odgi inject -i Arabidopsis.pg.input.community.0.og -b chr1.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.0.inject.adj.og -t 100 -P
```

Extract gene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.0.inject.adj.og -L | grep 'AT1G' > chr1.genenames.txt
# sanity check
wc -l chr1.genenames.txt         # 8771 chr1.genenames.txt
wc -l chr1.genes.adj.TAIR10.bed  # 8771 chr1.genes.adj.TAIR10.bed
```
```
odgi inject -i Arabidopsis.pg.input.community.1.og -b chr2.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.1.inject.adj.og -t 50 -P
```

Extract gene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.1.inject.adj.og -L | grep 'AT2G' > chr2.genenames.txt
# sanity check
wc -l chr2.genenames.txt         # 5265 chr2.genenames.txt
wc -l chr2.genes.adj.TAIR10.bed  # 5265 chr2.genes.adj.TAIR10.bed
```
```
odgi inject -i Arabidopsis.pg.input.community.2.og -b chr3.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.2.inject.adj.og -t 50 -P
```

Extract gene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.2.inject.adj.og -L | grep 'AT3G' > chr3.genenames.txt
# sanity check
wc -l chr3.genenames.txt         # 6544 chr3.genenames.txt
wc -l chr3.genes.adj.TAIR10.bed  # 6544 chr3.genes.adj.TAIR10.bed
```
```
odgi inject -i Arabidopsis.pg.input.community.3.og -b chr4.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.3.inject.adj.og -t 50 -P
```

Extract gene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.3.inject.adj.og -L | grep 'AT4G' > chr4.genenames.txt
# sanity check
wc -l chr4.genenames.txt         # 5007 chr4.genenames.txt
wc -l chr4.genes.adj.TAIR10.bed  # 5007 chr4.genes.adj.TAIR10.bed
```
```
odgi inject -i Arabidopsis.pg.input.community.4.og -b chr5.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.4.inject.adj.og -t 50 -P
```

Extract gene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.4.inject.adj.og -L | grep 'AT5G' > chr5.genenames.txt
# sanity check
wc -l chr5.genenames.txt         # 7469 chr5.genenames.txt
wc -l chr5.genes.adj.TAIR10.bed  # 7469 chr5.genes.adj.TAIR10.bed
```
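Since community i holds chromosome i+1 (with gene IDs prefixed AT1G through AT5G), the inject/extract pairs can be scripted. This sketch only prints the commands (a dry run, assuming the files above exist); pipe its output to bash to run them:

```shell
# Print the odgi inject / odgi paths command pair for each community.
for i in 0 1 2 3 4; do
  n=$((i + 1))
  echo "odgi inject -i Arabidopsis.pg.input.community.$i.og -b chr$n.genes.adj.TAIR10.bed -o Arabidopsis.pg.community.$i.inject.adj.og -t 50 -P"
  echo "odgi paths -i Arabidopsis.pg.community.$i.inject.adj.og -L | grep 'AT${n}G' > chr$n.genenames.txt"
done   # append "| bash" to run
```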
```
odgi inject -i ../Arabidopsis.pg.input.community.0.og -b Pseudogenes_TAIR10_chr1.bed -o Arabidopsis.pg.community.0.inject.pseudogenes.og -t 50 -P
```

Extract pseudogene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.0.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002684.1:' > chr1.pseudogenenames.txt
# sanity check
wc -l chr1.pseudogenenames.txt     # 1083 chr1.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr1.bed  # 1083 Pseudogenes_TAIR10_chr1.bed
```
```
odgi inject -i ../Arabidopsis.pg.input.community.1.og -b Pseudogenes_TAIR10_chr2.bed -o Arabidopsis.pg.community.1.inject.pseudogenes.og -t 50 -P
```

Extract pseudogene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.1.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002685.1:' > chr2.pseudogenenames.txt
# sanity check
wc -l chr2.pseudogenenames.txt     # 904 chr2.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr2.bed  # 904 Pseudogenes_TAIR10_chr2.bed
```
```
odgi inject -i ../Arabidopsis.pg.input.community.2.og -b Pseudogenes_TAIR10_chr3.bed -o Arabidopsis.pg.community.2.inject.pseudogenes.og -t 50 -P
```

Extract pseudogene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.2.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002686.1:' > chr3.pseudogenenames.txt
# sanity check
wc -l chr3.pseudogenenames.txt     # 965 chr3.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr3.bed  # 965 Pseudogenes_TAIR10_chr3.bed
```
```
odgi inject -i ../Arabidopsis.pg.input.community.3.og -b Pseudogenes_TAIR10_chr4.bed -o Arabidopsis.pg.community.3.inject.pseudogenes.og -t 50 -P
```

Extract pseudogene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.3.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002687.1:' > chr4.pseudogenenames.txt
# sanity check
wc -l chr4.pseudogenenames.txt     # 804 chr4.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr4.bed  # 804 Pseudogenes_TAIR10_chr4.bed
```
```
odgi inject -i ../Arabidopsis.pg.input.community.4.og -b Pseudogenes_TAIR10_chr5.bed -o Arabidopsis.pg.community.4.inject.pseudogenes.og -t 50 -P
```

Extract pseudogene names from the injected pangenome:

```
odgi paths -i Arabidopsis.pg.community.4.inject.pseudogenes.og -L | grep 'GCA_000001735#1#CP002688.1:' > chr5.pseudogenenames.txt
# sanity check
wc -l chr5.pseudogenenames.txt     # 919 chr5.pseudogenenames.txt
wc -l Pseudogenes_TAIR10_chr5.bed  # 919 Pseudogenes_TAIR10_chr5.bed
```
```
odgi stepindex -i Arabidopsis.pg.community.0.inject.adj.og -a 0 -o Arabidopsis.pg.community.0.inject.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.0.inject.adj.og -a Arabidopsis.pg.community.0.inject.index -R chr1.genenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.0.inject.paf

odgi stepindex -i Arabidopsis.pg.community.1.inject.adj.og -a 0 -o Arabidopsis.pg.community.1.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.1.inject.adj.og -a Arabidopsis.pg.community.1.inject.index -R chr2.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.1.inject.paf

odgi stepindex -i Arabidopsis.pg.community.2.inject.adj.og -a 0 -o Arabidopsis.pg.community.2.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.2.inject.adj.og -a Arabidopsis.pg.community.2.inject.index -R chr3.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.2.inject.paf

odgi stepindex -i Arabidopsis.pg.community.3.inject.adj.og -a 0 -o Arabidopsis.pg.community.3.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.3.inject.adj.og -a Arabidopsis.pg.community.3.inject.index -R chr4.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.3.inject.paf

odgi stepindex -i Arabidopsis.pg.community.4.inject.adj.og -a 0 -o Arabidopsis.pg.community.4.inject.index -t 50 -P
odgi untangle -i Arabidopsis.pg.community.4.inject.adj.og -a Arabidopsis.pg.community.4.inject.index -R chr5.genenames.txt -j 0 -t 50 -P -p > Arabidopsis.pg.community.4.inject.paf
```
```
odgi stepindex -i Arabidopsis.pg.community.0.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.0.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.0.inject.pseudogenes.og -a Arabidopsis.pg.community.0.inject.pseudogenes.index -R chr1.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.0.inject.paf

odgi stepindex -i Arabidopsis.pg.community.1.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.1.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.1.inject.pseudogenes.og -a Arabidopsis.pg.community.1.inject.pseudogenes.index -R chr2.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.1.inject.paf

odgi stepindex -i Arabidopsis.pg.community.2.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.2.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.2.inject.pseudogenes.og -a Arabidopsis.pg.community.2.inject.pseudogenes.index -R chr3.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.2.inject.paf

odgi stepindex -i Arabidopsis.pg.community.3.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.3.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.3.inject.pseudogenes.og -a Arabidopsis.pg.community.3.inject.pseudogenes.index -R chr4.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.3.inject.paf

odgi stepindex -i Arabidopsis.pg.community.4.inject.pseudogenes.og -a 0 -o Arabidopsis.pg.community.4.inject.pseudogenes.index -t 100 -P
odgi untangle -i Arabidopsis.pg.community.4.inject.pseudogenes.og -a Arabidopsis.pg.community.4.inject.pseudogenes.index -R chr5.pseudogenenames.txt -j 0 -t 100 -P -p > Arabidopsis.pg.community.4.inject.paf
```
```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.0.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# The pseudogenes PAF needs one extra step before processing, due to the way we
# named pseudogenes: we have to exclude the alignments of pseudogenes against
# each other. This happens automatically for genes, because their pattern
# doesn't match the reference name.
grep -v '^GCA_000001735#1#CP002684.1:' Arabidopsis.pg.community.0.inject.paf > Arabidopsis.pg.community.0.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.0.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```
```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.1.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# Exclude the alignments of pseudogenes against each other (see community 0).
grep -v '^GCA_000001735#1#CP002685.1:' Arabidopsis.pg.community.1.inject.paf > Arabidopsis.pg.community.1.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.1.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```
```
# genes
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.2.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```

```
# pseudogenes
# Exclude the alignments of pseudogenes against each other (see community 0).
grep -v '^GCA_000001735#1#CP002686.1:' Arabidopsis.pg.community.2.inject.paf > Arabidopsis.pg.community.2.inject.filtered.paf
~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.2.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
```
# genes~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.3.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
# pseudogenes# pseudogenes paf needs one more step before of the processing, due to the way we named pseudogenes. We have to exclude the alignments of pseudogenes against eachother. This is done automatically for genes, because their pattern doesn't match with the reference namegrep -v '^GCA_000001735#1#CP002687.1:' Arabidopsis.pg.community.3.inject.paf > Arabidopsis.pg.community.3.inject.filtered.paf~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.3.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
# genes~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.4.inject.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
# pseudogenes# pseudogenes paf needs one more step before of the processing, due to the way we named pseudogenes. We have to exclude the alignments of pseudogenes against eachother. This is done automatically for genes, because their pattern doesn't match with the reference namegrep -v '^GCA_000001735#1#CP002688.1:' Arabidopsis.pg.community.4.inject.paf > Arabidopsis.pg.community.4.inject.filtered.paf~/software/scripts/run_collapse_and_merge.sh -i ../Arabidopsis.pg.community.4.inject.filtered.paf -c ~/software/scripts/collapse_paf.py -m ~/software/scripts/merge_paf.py -d 100 -j 20
Chromosome 1:

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify the correctness of the matrix:

```
# Maximum value: 49.0 found for Feature: AT1G40104 and Assembly: GCA_949796415
grep 'AT1G40104' filter_passed_features.csv | grep 'GCA_949796415' | wc -l
49
```

Chromosome 2:

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify the correctness of the matrix:

```
# Maximum value: 129.0 found for Feature: AT2G01020 and Assembly: GCA_946412065
grep 'AT2G01020' filter_passed_features.csv | grep 'GCA_946412065' | wc -l
129
```

Chromosome 3:

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify the correctness of the matrix:

```
# Maximum value: 42.0 found for Feature: AT3G41761 and Assembly: GCA_933208065
grep 'AT3G41761' filter_passed_features.csv | grep 'GCA_933208065' | wc -l
42
```

Chromosome 4:

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify the correctness of the matrix:

```
# Maximum value: 6.0 found for Feature: AT4G20680 and Assembly: GCA_949796755
grep 'AT4G20680' filter_passed_features.csv | grep 'GCA_949796755' | wc -l
6
```

Chromosome 5:

```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_genes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff /GCA_000001735.2_TAIR10.1_genomic.gff -ft gene
```

To verify the correctness of the matrix:

```
# Maximum value: 8.0 found for Feature: AT5G05050 and Assembly: GCA_949796525
grep 'AT5G05050' filter_passed_features.csv | grep 'GCA_949796525' | wc -l
8
```
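The `grep | grep | wc -l` checks above can be expressed as a tiny Python helper; this is a hedged equivalent that, like the greps, only assumes the feature and assembly identifiers appear on the same line of `filter_passed_features.csv`:

```python
def count_hits(csv_path, feature, assembly):
    """Count lines mentioning both identifiers, like grep A file | grep B | wc -l."""
    with open(csv_path) as fh:
        return sum(1 for line in fh if feature in line and assembly in line)
```

For chromosome 1, for instance, `count_hits("filter_passed_features.csv", "AT1G40104", "GCA_949796415")` should reproduce the matrix maximum of 49.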
```
mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002684.1' -ref 'GCA_000001735'

mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002685.1' -ref 'GCA_000001735'

mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002686.1' -ref 'GCA_000001735'

mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002687.1' -ref 'GCA_000001735'

mkdir screening_cov95_id95
cd screening_cov95_id95
~/software/scripts/core_dispensable_pseudogenes.py -i ../collapse_and_merge/collapsed_and_merged.paf -id 95 -jc 0.95 -cov 95 -na 93 -sl 75 -gff pseudogenes.gff -ft pseudogene -chr 'CP002688.1' -ref 'GCA_000001735'
```
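For readers unfamiliar with the flags above: `-na 93` is the number of assemblies and `-sl 75` the soft-core percentage. The classification they imply can be sketched as follows; the category names and exact boundaries are my assumptions about what the screening scripts do, not taken from their code:

```python
def classify(n_present, n_assemblies=93, softcore_pct=75):
    # Assumed scheme: present in all assemblies -> core; in at least
    # softcore_pct% of them -> softcore; in more than one -> dispensable;
    # in a single assembly -> private.
    if n_present >= n_assemblies:
        return "core"
    if n_present >= n_assemblies * softcore_pct / 100:
        return "softcore"
    if n_present > 1:
        return "dispensable"
    return "private"
```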
- find the reference paths using `odgi paths` + `grep 'reference_name'` and place them in `reference.txt`
- extract the non-reference sequences using `odgi paths` with the flag `--non-reference-ranges`
```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.0.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.0.og --non-reference-ranges reference.txt > non_reference.bed
```
The `filter_private_bed.py` script is in the scripts directory. It takes the output from the previous command and separates the sequences shorter than a given threshold from the longer ones. It also prints the distribution of path lengths to stdout and to the log file.
```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
1                16361432
2-50             3136982
51-99            168284
100-499          93759
500-999          22737
1000-4999        25706
5000-9999        5370
10000-19999      1905
20000-99999      1253
100000-499999    280
500000-999999    7
from_1Mbp        0
```
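A minimal sketch of what `filter_private_bed.py` does, assuming plain three-column BED input; the bin edges mirror the `size_category` table above:

```python
def split_bed(lines, thr=500):
    """Partition BED records at a length threshold and tally a size distribution."""
    bins = [(1, 1), (2, 50), (51, 99), (100, 499), (500, 999), (1000, 4999),
            (5000, 9999), (10000, 19999), (20000, 99999),
            (100000, 499999), (500000, 999999)]
    labels = ["1", "2-50", "51-99", "100-499", "500-999", "1000-4999",
              "5000-9999", "10000-19999", "20000-99999",
              "100000-499999", "500000-999999"]
    dist = {label: 0 for label in labels}
    dist["from_1Mbp"] = 0
    kept, filtered_out = [], []
    for line in lines:
        chrom, start, end = line.split()[:3]
        length = int(end) - int(start)
        # records >= thr go to the filtered BED kept for annotation,
        # shorter ones go to filtered_out.bed
        (kept if length >= thr else filtered_out).append(line)
        for (lo, hi), label in zip(bins, labels):
            if lo <= length <= hi:
                dist[label] += 1
                break
        else:  # no bin matched: length is 1 Mbp or more
            dist["from_1Mbp"] += 1
    return kept, filtered_out, dist
```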
Bedtools needs a genome file for slop. We should use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend the regions smaller than 500 bp by 500 bp at both ends.
- filtered_out.bed is the bed file containing all the regions smaller than 500 bp
- non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp
I will enlarge only the sequences contained in filtered_out.bed
```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004359.1    0      789
GCA_001651475#1#CM004359.1    0      882
GCA_001651475#1#CM004359.1    0      887
GCA_001651475#1#CM004359.1    0      912
GCA_001651475#1#CM004359.1    0      1013
GCA_001651475#1#CM004359.1    36     1037
GCA_001651475#1#CM004359.1    104    1105
GCA_001651475#1#CM004359.1    752    1753
GCA_001651475#1#CM004359.1    933    1934
GCA_001651475#1#CM004359.1    997    1998

head filtered_out.bed
---
GCA_001651475#1#CM004359.1    0      289
GCA_001651475#1#CM004359.1    381    382
GCA_001651475#1#CM004359.1    385    387
GCA_001651475#1#CM004359.1    411    412
GCA_001651475#1#CM004359.1    413    513
GCA_001651475#1#CM004359.1    536    537
GCA_001651475#1#CM004359.1    604    605
GCA_001651475#1#CM004359.1    1252   1253
GCA_001651475#1#CM004359.1    1433   1434
GCA_001651475#1#CM004359.1    1497   1498
```
Now that the smaller regions have been expanded, I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

```
cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed
```

Now we can apply bedtools merge to remove overlaps and join nearby sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr1_for_annotation.bed
```

- `-d 100` means that regions separated by a gap of up to 100 bp are merged.

Now, with this bed file, we can extract the fasta that we want to annotate:

```
bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr1_for_annotation.bed -fo non_reference_chr1_for_annotation.fasta -name
```
```
head non_reference_chr1_for_annotation.fasta
---
>::GCA_001651475#1#CM004359.1:0-1998
GTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGG...
>::GCA_001651475#1#CM004359.1:6363-7364
TATATGTAATAATAATTAGTGCATCGTTTTGTGGTGTAGTTTATATAAAT...
>::GCA_001651475#1#CM004359.1:8710-9736
AGCTAGAATCAGACAGGAACTAGCAATGCTTGAAATCAAGAACTTGAATT...
>::GCA_001651475#1#CM004359.1:10945-11946
TTTAACACTTAATTAACGCAATCTTACCATATATTCATGGACTACATGCA...
>::GCA_001651475#1#CM004359.1:13387-14389
CTGCTGCTTGCTCCGATGTTGGAGGTTAATTCTTGGTCTCTGTCTTGTTC...
```
This is the fasta for annotation.
Remove `::` from the fasta headers (they have been automatically added):

```
sed -i 's/^>::/>/' non_reference_chr1_for_annotation.fasta
```
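The slop/sort/merge steps above are plain interval arithmetic; this hedged Python sketch shows what bedtools is doing under the hood: expansion clamped to the chromosome bounds taken from the `.fai`, then merging of sorted intervals whose gap is at most `-d`:

```python
def slop(intervals, chrom_len, pad=500):
    # like bedtools slop -b pad: extend both ends, clamped to [0, chrom_len]
    return [(max(0, s - pad), min(chrom_len, e + pad)) for s, e in intervals]

def merge(intervals, d=100):
    # like bedtools merge -d: join intervals separated by at most d bp
    merged = []
    for s, e in sorted(intervals):
        if merged and s - merged[-1][1] <= d:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(iv) for iv in merged]
```

For example, `slop([(0, 289)], chrom_len=30_000_000)` gives `[(0, 789)]`, matching the first record of expanded_regions.bed above (the left end is clamped at 0, the chromosome length here is an arbitrary placeholder).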
- find the reference paths using `odgi paths` + `grep 'reference_name'` and place them in `reference.txt`
- extract the non-reference sequences using `odgi paths` with the flag `--non-reference-ranges`
```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.1.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.1.og --non-reference-ranges reference.txt > non_reference.bed
```
The `filter_private_bed.py` script is in the scripts folder. It takes the output from the previous command and separates the sequences shorter than a given threshold from the longer ones. It also prints the distribution of path lengths to stdout and to the log file.
```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
1                11130257
2-50             2121266
51-99            119889
100-499          58584
500-999          15457
1000-4999        19663
5000-9999        4421
10000-19999      1694
20000-99999      1209
100000-499999    256
500000-999999    11
from_1Mbp        0
```
Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend the regions smaller than 500 bp by 500 bp at both ends.
- filtered_out.bed is the bed file containing all the regions smaller than 500 bp
- non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp
I will enlarge only the sequences contained in filtered_out.bed
```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004360.1    0      614
GCA_001651475#1#CM004360.1    0      703
GCA_001651475#1#CM004360.1    0      780
GCA_001651475#1#CM004360.1    0      796
GCA_001651475#1#CM004360.1    0      966
GCA_001651475#1#CM004360.1    284    1285
GCA_001651475#1#CM004360.1    516    1517
GCA_001651475#1#CM004360.1    564    1565
GCA_001651475#1#CM004360.1    674    1675
GCA_001651475#1#CM004360.1    972    1973

head filtered_out.bed
---
GCA_001651475#1#CM004360.1    113    114
GCA_001651475#1#CM004360.1    202    203
GCA_001651475#1#CM004360.1    279    280
GCA_001651475#1#CM004360.1    295    296
GCA_001651475#1#CM004360.1    465    466
GCA_001651475#1#CM004360.1    784    785
GCA_001651475#1#CM004360.1    1016   1017
GCA_001651475#1#CM004360.1    1064   1065
GCA_001651475#1#CM004360.1    1174   1175
GCA_001651475#1#CM004360.1    1472   1473
```
Now that the smaller regions have been expanded, I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

```
cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed
```

Now we can apply bedtools merge to remove overlaps and join nearby sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr2_for_annotation.bed
```

- `-d 100` means that regions separated by a gap of up to 100 bp are merged.

Now, with this bed file, we can extract the fasta that we want to annotate:

```
bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr2_for_annotation.bed -fo non_reference_chr2_for_annotation.fasta -name
```
Remove `::` from the fasta headers (they have been automatically added):

```
sed -i 's/^>::/>/' non_reference_chr2_for_annotation.fasta
```
- find the reference paths using `odgi paths` + `grep 'reference_name'` and place them in `reference.txt`
- extract the non-reference sequences using `odgi paths` with the flag `--non-reference-ranges`
```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.2.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.2.og --non-reference-ranges reference.txt > non_reference.bed
```
The `filter_private_bed.py` script is in the scripts folder. It takes the output from the previous command and separates the sequences shorter than a given threshold from the longer ones. It also prints the distribution of path lengths to stdout and to the log file.
```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
1                12646306
2-50             2437703
51-99            107639
100-499          59027
500-999          17275
1000-4999        23841
5000-9999        5898
10000-19999      2585
20000-99999      1596
100000-499999    324
500000-999999    12
from_1Mbp        1
```
Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend the regions smaller than 500 bp by 500 bp at both ends.
- filtered_out.bed is the bed file containing all the regions smaller than 500 bp
- non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp
I will enlarge only the sequences contained in filtered_out.bed
```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004361.1    0      594
GCA_001651475#1#CM004361.1    0      677
GCA_001651475#1#CM004361.1    0      734
GCA_001651475#1#CM004361.1    0      742
GCA_001651475#1#CM004361.1    0      800
GCA_001651475#1#CM004361.1    0      808
GCA_001651475#1#CM004361.1    0      816
GCA_001651475#1#CM004361.1    0      823
GCA_001651475#1#CM004361.1    0      837
GCA_001651475#1#CM004361.1    0      841

head filtered_out.bed
---
GCA_001651475#1#CM004361.1    92     94
GCA_001651475#1#CM004361.1    168    177
GCA_001651475#1#CM004361.1    233    234
GCA_001651475#1#CM004361.1    241    242
GCA_001651475#1#CM004361.1    292    300
GCA_001651475#1#CM004361.1    306    308
GCA_001651475#1#CM004361.1    315    316
GCA_001651475#1#CM004361.1    322    323
GCA_001651475#1#CM004361.1    336    337
GCA_001651475#1#CM004361.1    340    341
```
Now that the smaller regions have been expanded, I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

```
cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed
```

Now we can apply bedtools merge to remove overlaps and join nearby sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr3_for_annotation.bed
```

- `-d 100` means that regions separated by a gap of up to 100 bp are merged.

Now, with this bed file, we can extract the fasta that we want to annotate:

```
bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr3_for_annotation.bed -fo non_reference_chr3_for_annotation.fasta -name
```
Remove `::` from the fasta headers (they have been automatically added):

```
sed -i 's/^>::/>/' non_reference_chr3_for_annotation.fasta
```
- find the reference paths using `odgi paths` + `grep 'reference_name'` and place them in `reference.txt`
- extract the non-reference sequences using `odgi paths` with the flag `--non-reference-ranges`
```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.3.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.3.og --non-reference-ranges reference.txt > non_reference.bed
```
The `filter_private_bed.py` script is in the scripts folder. It takes the output from the previous command and separates the sequences shorter than a given threshold from the longer ones. It also prints the distribution of path lengths to stdout and to the log file.
```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
1                11148357
2-50             2200442
51-99            112329
100-499          70063
500-999          15581
1000-4999        19389
5000-9999        4439
10000-19999      1948
20000-99999      1331
100000-499999    392
500000-999999    25
from_1Mbp        2
```
Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend the regions smaller than 500 bp by 500 bp at both ends.
- filtered_out.bed is the bed file containing all the regions smaller than 500 bp
- non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp
I will enlarge only the sequences contained in filtered_out.bed
```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004362.1    4567   5570
GCA_001651475#1#CM004362.1    4571   5574
GCA_001651475#1#CM004362.1    4575   5584
GCA_001651475#1#CM004362.1    4586   5612
GCA_001651475#1#CM004362.1    4614   5616
GCA_001651475#1#CM004362.1    4617   5619
GCA_001651475#1#CM004362.1    4620   5623
GCA_001651475#1#CM004362.1    4624   5628
GCA_001651475#1#CM004362.1    4630   5662
GCA_001651475#1#CM004362.1    4673   5674

head filtered_out.bed
---
GCA_001651475#1#CM004362.1    5067   5070
GCA_001651475#1#CM004362.1    5071   5074
GCA_001651475#1#CM004362.1    5075   5084
GCA_001651475#1#CM004362.1    5086   5112
GCA_001651475#1#CM004362.1    5114   5116
GCA_001651475#1#CM004362.1    5117   5119
GCA_001651475#1#CM004362.1    5120   5123
GCA_001651475#1#CM004362.1    5124   5128
GCA_001651475#1#CM004362.1    5130   5162
GCA_001651475#1#CM004362.1    5173   5174
```
Now that the smaller regions have been expanded, I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

```
cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed
```

Now we can apply bedtools merge to remove overlaps and join nearby sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr4_for_annotation.bed
```

- `-d 100` means that regions separated by a gap of up to 100 bp are merged.

Now, with this bed file, we can extract the fasta that we want to annotate:

```
bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr4_for_annotation.bed -fo non_reference_chr4_for_annotation.fasta -name
```
Remove `::` from the fasta headers (they have been automatically added):

```
sed -i 's/^>::/>/' non_reference_chr4_for_annotation.fasta
```
- find the reference paths using `odgi paths` + `grep 'reference_name'` and place them in `reference.txt`
- extract the non-reference sequences using `odgi paths` with the flag `--non-reference-ranges`
```
mkdir extract_non_reference_paths
cd extract_non_reference_paths/
odgi paths -i ../Arabidopsis.pg.input.community.4.og -L | grep 'GCA_000001735' > reference.txt
odgi paths -i ../Arabidopsis.pg.input.community.4.og --non-reference-ranges reference.txt > non_reference.bed
```
The `filter_private_bed.py` script is in the scripts folder. It takes the output from the previous command and separates the sequences shorter than a given threshold from the longer ones. It also prints the distribution of path lengths to stdout and to the log file.
```
~/software/scripts/filter_private_bed.py -i non_reference.bed -o non_reference_filtered.bed -thr 500
---
size_category    count
1                15088021
2-50             2535317
51-99            98407
100-499          158317
500-999          21568
1000-4999        23741
5000-9999        6136
10000-19999      2140
20000-99999      1024
100000-499999    180
500000-999999    10
from_1Mbp        1
```
Bedtools needs a genome file for slop. We can use the fasta.fai of the multifasta used for building the pangenome with pggb. I will extend the regions smaller than 500 bp by 500 bp at both ends.
- filtered_out.bed is the bed file containing all the regions smaller than 500 bp
- non_reference_filtered.bed is the bed file containing all the regions larger than 500 bp
I will enlarge only the sequences contained in filtered_out.bed
```
bedtools slop -i filtered_out.bed -g ../../Arabidopsis.pg.input.fasta.gz.fai -b 500 > expanded_regions.bed

head expanded_regions.bed
---
GCA_001651475#1#CM004363.1    0    501
GCA_001651475#1#CM004363.1    0    503
GCA_001651475#1#CM004363.1    0    508
GCA_001651475#1#CM004363.1    0    510
GCA_001651475#1#CM004363.1    0    515
GCA_001651475#1#CM004363.1    0    517
GCA_001651475#1#CM004363.1    0    522
GCA_001651475#1#CM004363.1    0    524
GCA_001651475#1#CM004363.1    0    529
GCA_001651475#1#CM004363.1    0    531

head filtered_out.bed
---
GCA_001651475#1#CM004363.1    0     1
GCA_001651475#1#CM004363.1    2     3
GCA_001651475#1#CM004363.1    6     8
GCA_001651475#1#CM004363.1    9     10
GCA_001651475#1#CM004363.1    14    15
GCA_001651475#1#CM004363.1    16    17
GCA_001651475#1#CM004363.1    21    22
GCA_001651475#1#CM004363.1    23    24
GCA_001651475#1#CM004363.1    28    29
GCA_001651475#1#CM004363.1    30    31
```
Now that the smaller regions have been expanded, I will join the expanded regions (expanded_regions.bed) with the regions that were already longer than 500 bp (non_reference_filtered.bed):

```
cat expanded_regions.bed non_reference_filtered.bed | bedtools sort -i - > slop_final.bed
```

Now we can apply bedtools merge to remove overlaps and join nearby sequences:

```
bedtools merge -i slop_final.bed -d 100 > non_reference_chr5_for_annotation.bed
```

- `-d 100` means that regions separated by a gap of up to 100 bp are merged.

Now, with this bed file, we can extract the fasta that we want to annotate:

```
bedtools getfasta -fi ../../Arabidopsis.pg.input.fasta.gz -bed non_reference_chr5_for_annotation.bed -fo non_reference_chr5_for_annotation.fasta -name
```
Remove `::` from the fasta headers (they have been automatically added):

```
sed -i 's/^>::/>/' non_reference_chr5_for_annotation.fasta
```
```
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz

# get taxonomy files
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip

# format db
diamond makedb --in uniref100.fasta --taxonmap prot.accession2taxid.FULL.gz --taxonnodes nodes.dmp --taxonnames names.dmp -d uniref100 --threads 20
```
We need to download the conversion table between UniProt names and Araport identifiers from the TAIR10 website: https://www.arabidopsis.org/download/file?path=Proteins/Id_conversions/TAIR2UniprotMapping.txt. We will need this table to use the parse_exonerate_results.py script.
For the same script, we also need to extract information from the uniref100 headers:
```
grep '>' uniref100.fasta > headers_uniref100
sed -i 's/^>//' headers_uniref100
awk '{split($0, a, " "); printf "%s\t", a[1]; for (i=2; i<=NF; i++) printf "%s%s", $i, (i==NF ? "" : "_"); print ""}' headers_uniref100 > info_headers.txt
rm headers_uniref100
```
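For clarity, here is a hedged Python equivalent of the grep/sed/awk pipeline above: it maps each UniRef100 header to a tab-separated accession plus an underscore-joined description, which is the format of info_headers.txt:

```python
def parse_headers(fasta_lines):
    """Turn '>ACC description words' headers into 'ACC<TAB>description_words'."""
    rows = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].rstrip("\n").split()
            rows.append(fields[0] + "\t" + "_".join(fields[1:]))
    return rows
```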
```
diamond blastx -d uniref100 -q non_reference_chr1_for_annotation.fasta -o diamond_chr1_top0_results.tsv --range-culling --top 0 -F 15 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log
```
We want to refine the annotations found by diamond blastx rather than repeat the whole analysis with exonerate. To do so, we prepare some input files and run exonerate through the 'PW_exonerate.pl' script.

To speed up the analysis, we split the diamond results into chunks and, with the help of seqtk, create the protein and sequence fasta files, so that exonerate (which doesn't support multithreading) can be run in parallel.
For each chunk we need to create:
- a file containing the whole sequences where diamond found proteins
- a file containing the proteins from the database that matched our sequences.
To automate this step I created the split_and_extract_for_exonerate.sh script. It splits the diamond results into chunks, and for each chunk seqtk extracts the fasta of the proteins and the fasta of the sequences.
Here's how we can use it:
```
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr1_top0_results.tsv -l 100000 -s ../non_reference_chr1_for_annotation.fasta -p uniref100.fasta
```
Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (change suffix for each chunk):
```
perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

# join all the results together
cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff exonerate_results_aj.gff exonerate_results_ak.gff > chr1_exonerate_results.gff

# remove chunks to save space
rm exonerate_results_*
```
To parse the results of exonerate we are going to use parse_exonerate.py.
```
# extract C4 alignments
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr1_exonerate_results.gff > chr1_C4_alignments.txt

# parse the results
~/software/scripts/parse_exonerate.py -e chr1_exonerate_results.gff -c chr1_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt
```
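The awk range pattern used above is compact but cryptic; this hedged Python sketch shows the same logic, emitting every line from a "C4 Alignment:" marker up to, but excluding, the next "vulgar" line:

```python
def extract_c4(lines):
    """Keep lines from each 'C4 Alignment:' marker until the next 'vulgar' line."""
    out, inside = [], False
    for line in lines:
        if "C4 Alignment:" in line:
            inside = True          # range starts (marker line is kept)
        if inside and "vulgar" in line:
            inside = False         # range ends (vulgar line is dropped)
            continue
        if inside:
            out.append(line)
    return out
```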
```
diamond blastx -d uniref100 -q non_reference_chr2_for_annotation.fasta -o diamond_chr2_top0_results.tsv --range-culling --top 0 -F 15 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log
```
```
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr2_top0_results.tsv -l 100000 -s ../non_reference_chr2_for_annotation.fasta -p uniref100.fasta
```
Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing suffix for each chunk):
```
perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff > chr2_exonerate_results.gff
rm exonerate_results_*
```

```
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr2_exonerate_results.gff > chr2_C4_alignments.txt
~/software/scripts/parse_exonerate.py -e chr2_exonerate_results.gff -c chr2_C4_alignments.txt -t ~/Arabidopsis/pangenome/ara_pan_from_April/TAIR2UniprotMapping.txt -i info_headers.txt
```
```
diamond blastx -d uniref100 -q non_reference_chr3_for_annotation.fasta -o diamond_chr3_top0_results.tsv --range-culling --top 0 -F 15 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log
```

```
~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr3_top0_results.tsv -l 100000 -s ../non_reference_chr3_for_annotation.fasta -p uniref100.fasta
```
Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (change suffix for each chunk):
```bash
perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff > chr3_exonerate_results.gff

rm exonerate_results_*
```
```bash
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr3_exonerate_results.gff > chr3_C4_alignments.txt

~/software/scripts/parse_exonerate.py -e chr3_exonerate_results.gff -c chr3_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt
```
```bash
diamond blastx -d uniref100 -q non_reference_chr4_for_annotation.fasta -o diamond_chr4_top0_results.tsv --range-culling --top 0 -F 15 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr4_top0_results.tsv -l 100000 -s ../non_reference_chr4_for_annotation.fasta -p uniref100.fasta
```
Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing suffix for each chunk):
```bash
perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff > chr4_exonerate_results.gff

rm exonerate_results_*
```
```bash
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr4_exonerate_results.gff > chr4_C4_alignments.txt

~/software/scripts/parse_exonerate.py -e chr4_exonerate_results.gff -c chr4_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt
```
```bash
diamond blastx -d uniref100 -q non_reference_chr5_for_annotation.fasta -o diamond_chr5_top0_results.tsv --range-culling --top 0 -F 15 -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxids sscinames sskingdoms skingdoms sphylums -t ./ -p 50 --log

~/software/scripts/split_and_extract_for_exonerate.sh -i ../diamond_chr5_top0_results.tsv -l 100000 -s ../non_reference_chr5_for_annotation.fasta -p uniref100.fasta
```
Then, we can run exonerate in parallel on the different chunks with the following command for each chunk (changing suffix for each chunk):
```bash
perl ~/software/scripts/PW_exonerate.pl --protein_fasta diamond_chunk_aa_proteins.fasta --genome_fasta diamond_chunk_aa_seqs.fasta --diamond_results diamond_chunk_aa --suffix aa

cat exonerate_results_aa.gff exonerate_results_ab.gff exonerate_results_ac.gff exonerate_results_ad.gff exonerate_results_ae.gff exonerate_results_af.gff exonerate_results_ag.gff exonerate_results_ah.gff exonerate_results_ai.gff exonerate_results_aj.gff > chr5_exonerate_results.gff

rm exonerate_results_*
```
```bash
awk '/C4 Alignment:/,/vulgar/{if (!/vulgar/) print}' chr5_exonerate_results.gff > chr5_C4_alignments.txt

~/software/scripts/parse_exonerate.py -e chr5_exonerate_results.gff -c chr5_C4_alignments.txt -t TAIR2UniprotMapping.txt -i info_headers.txt
```
These are the result files we get from `parse_exonerate.py`:

- `corresponding_to_tair.tsv` --> the UniRef100 annotations that certainly have a corresponding TAIR10 annotation. From these, we will need to exclude the UniRef genes/pseudogenes that do not have a unique corresponding TAIR locus, as they may lead to incorrect calculation of statistics and downstream analyses.
- `new_results_not_tair.tsv` --> the "new genes" found by the analysis, i.e. those without a counterpart in the TAIR10 annotation. To obtain robust results, we need to filter them to keep only the results corresponding to reviewed proteins (i.e. those coming from UniProtKB/Swiss-Prot), as UniRef also contains unreviewed proteins.
Since the gene names were grouped by UniProt ID, and, when there is more than one, the corresponding TAIR names were concatenated in a single field separated by a `,`, we can simply remove the lines whose TAIR-name field contains a comma:
```bash
# chr1
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv

# chr2
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv

# chr3
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv

# chr4
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv

# chr5
awk -F'\t' '$26 !~ /,/' corresponding_to_tair.tsv > reviewed_corresponding_to_tair.tsv
```
We need to review `new_results_not_tair.tsv` by checking in UniProt whether the proteins have been confirmed or not.
We will proceed in this way:
- extract the unique UniRef IDs from `new_results_not_tair.tsv` (for genes and pseudogenes; unclassified entries will be excluded)
- go to UniProt ID mapping
- select from UniRef100 to UniProtKB/Swiss-Prot --> this will automatically give us only reviewed results
- upload the IDs list and submit the job
- customise the results (TSV format) to also get the Gene Ontology info (we will need this later)
- be sure to include the info about protein review status (it's under miscellaneous options)
- also include the taxonomic lineage
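The same mapping can also be scripted against the UniProt REST API. As a hedged sketch (endpoint and field names as per the public UniProt ID-mapping REST documentation; verify them before relying on this), submitting a job is a POST of the following form payload to `https://rest.uniprot.org/idmapping/run`:

```python
def build_idmapping_payload(uniref_ids):
    """Form fields for POST https://rest.uniprot.org/idmapping/run,
    mapping UniRef100 cluster IDs to reviewed UniProtKB/Swiss-Prot
    entries. Field names follow the UniProt REST documentation."""
    return {
        "from": "UniRef100",
        "to": "UniProtKB-Swiss-Prot",
        "ids": ",".join(uniref_ids),  # comma-separated ID list
    }

payload = build_idmapping_payload(["UniRef100_P12345", "UniRef100_Q67890"])
```

The returned job ID would then be polled at `/idmapping/status/<jobId>`; the web-interface steps above remain the route actually used in this work.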
```bash
grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr1.txt
```

The script `review_exonerate_results.py` will drop all the results coming from proteins that were not reviewed.

```bash
~/software/scripts/review_exonerate_results.py -i chr1_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv
```
```bash
# chr2
grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr2.txt
~/software/scripts/review_exonerate_results.py -i chr2_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

# chr3
grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr3.txt
~/software/scripts/review_exonerate_results.py -i chr3_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

# chr4
grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr4.txt
~/software/scripts/review_exonerate_results.py -i chr4_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv

# chr5
grep 'gene' new_results_not_tair.tsv | cut -f 1 | sort | uniq > IDs_chr5.txt
~/software/scripts/review_exonerate_results.py -i chr5_reviewed_proteins.tsv -a new_results_not_tair.tsv -o reviewed_not_tair_results.tsv
```
Now we can screen the whole pangenome, using all the data coming from `exonerate` and `odgi untangle`.
We'll need the following files:
- `filter_passed_features.csv` --> obtained from `core_dispensable_genes.py`
- `pangenome_screening.csv` --> obtained from `core_dispensable_genes.py`
- `reviewed_corresponding_to_tair.tsv` --> obtained from `parse_exonerate.py` and reviewed with `review_exonerate_results.py`
- `reviewed_not_tair_results.tsv` --> obtained from `parse_exonerate.py` and reviewed with `review_exonerate_results.py`
```bash
# run once per chromosome (chr1-chr5), from the corresponding directory
~/software/scripts/final_genes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75
```
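Here `-a 93` is the number of assemblies in the pangenome and `-s 75` the soft-core threshold. The script itself is authoritative, but the conventional classification implied by these two parameters can be sketched as:

```python
def classify_locus(n_present, n_assemblies=93, softcore_min=75):
    """Classify a locus by the number of assemblies it occurs in, using
    the usual core / soft-core / dispensable / private bins (a sketch of
    the convention; the exact rules live in final_genes_screening.py)."""
    if n_present == n_assemblies:
        return "core"
    if n_present >= softcore_min:
        return "softcore"
    if n_present == 1:
        return "private"
    return "dispensable"
```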
We'll need the following files:

- `filter_passed_features.csv` --> obtained from `core_dispensable_pseudogenes.py`
- `pangenome_screening.csv` --> obtained from `core_dispensable_pseudogenes.py`
- `reviewed_corresponding_to_tair.tsv` --> obtained from `parse_exonerate.py` and reviewed with `review_exonerate_results.py`
- `reviewed_not_tair_results.tsv` --> obtained from `parse_exonerate.py` and reviewed with `review_exonerate_results.py`
```bash
# run once per chromosome (chr1-chr5), from the corresponding directory
~/software/scripts/final_pseudogenes_screening.py -u filter_passed_features.csv -p pangenome_screening.csv -t reviewed_corresponding_to_tair.tsv -n reviewed_not_tair_results.tsv -a 93 -s 75
```
The universe for *A. thaliana* was obtained from: https://www.arabidopsis.org/download/list?dir=GO_and_PO_Annotations

The analysis was then performed using TopGO, as shown in `scripts/TopGO.R`.
To obtain information about the structure of each graph, we can use the command `odgi stats`:
```bash
odgi stats -i Arabidopsis.pg.input.community.0.og -m > chr1_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.1.og -m > chr2_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.2.og -m > chr3_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.3.og -m > chr4_odgi_stats.yaml
odgi stats -i Arabidopsis.pg.input.community.4.og -m > chr5_odgi_stats.yaml
```
We can use `odgi paths --coverage-levels` as in the following example:
```bash
odgi paths -i Arabidopsis.pg.input.community.0.og --coverage-levels 2,75,93 > chr1_nodes_classification.txt
```
`odgi paths --coverage-levels` classifies the nodes into genomic classes. In our case, we could not use this command for classification, because the number of paths did not correspond to the number of assemblies included in the pangenome.
```bash
odgi paths -i Arabidopsis.pg.input.community.0.og -H -D "#" -p1 -t 100 > chr1_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.1.og -H -D "#" -p1 -t 100 > chr2_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.2.og -H -D "#" -p1 -t 100 > chr3_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.3.og -H -D "#" -p1 -t 100 > chr4_nodes_composition.txt
odgi paths -i Arabidopsis.pg.input.community.4.og -H -D "#" -p1 -t 100 > chr5_nodes_composition.txt
```
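`odgi paths -H` emits a path-by-node table (with `-D "#"` the path names are grouped by the sample prefix). The per-node assembly coverage that `node_matrix_processing.py` works from can be illustrated on a toy table (column names here are illustrative, not the exact odgi header):

```python
import csv
import io

# Toy stand-in for a grouped path-by-node table: one row per sample,
# one column per node, values are coverage counts (0 = absent).
tsv = (
    "group.name\tnode.1\tnode.2\tnode.3\n"
    "sampleA\t1\t1\t0\n"
    "sampleB\t1\t0\t0\n"
    "sampleC\t1\t2\t1\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
nodes = [k for k in rows[0] if k.startswith("node.")]

# Number of assemblies covering each node -- the quantity the
# core/soft-core/dispensable node classification is based on.
coverage = {n: sum(int(r[n]) > 0 for r in rows) for n in nodes}
```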
```bash
~/software/scripts/node_matrix_processing.py -i chr1_nodes_composition.txt -p chr1 -o chr1_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr2_nodes_composition.txt -p chr2 -o chr2_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr3_nodes_composition.txt -p chr3 -o chr3_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr4_nodes_composition.txt -p chr4 -o chr4_matrix.csv -na 93 -sl 75
~/software/scripts/node_matrix_processing.py -i chr5_nodes_composition.txt -p chr5 -o chr5_matrix.csv -na 93 -sl 75
```
We will join the graphs with `odgi squeeze` and subsequently run `odgi similarity` to obtain a similarity matrix. `odgi squeeze` needs a file listing the input graphs:
```bash
nano graphs.txt
```

`graphs.txt` should contain:

```
Arabidopsis.pg.input.community.0.og
Arabidopsis.pg.input.community.1.og
Arabidopsis.pg.input.community.2.og
Arabidopsis.pg.input.community.3.og
Arabidopsis.pg.input.community.4.og
```
Squeeze the graphs:
```bash
odgi squeeze -f graphs.txt -o squeezed_pangenome.og -t 100 -P
```
Obtain the matrix:
```bash
odgi similarity -i squeezed_pangenome.og -D '#' -d -t 100 -P > similarity_nodes.tsv
```
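Conceptually, the `jaccard.distance` values in `similarity_nodes.tsv` are one minus the Jaccard similarity of the node sets of two path groups:

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|: the Jaccard distance between
    the node sets of two path groups (conceptually what the
    jaccard.distance column of `odgi similarity` reports)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```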
```bash
# get dataset from gene matrix
perl shuffle_count.pl --file concatenated_matrix.csv --iterations 10 --campionamenti 10-20-30-40-50-60-70-80-90-93 --mode NOT_REI --output out_shuffling.txt

# get the figure
python3 curve_plot.py out_shuffling.txt

# for the zoom of the "All" curve
python3 curve_plot2.py out_shuffling.txt
```
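`shuffle_count.pl` repeatedly subsamples assemblies at the sizes given by `--campionamenti` and counts genes at each size to draw the pangenome growth curve. The core idea of one rarefaction draw, sketched in Python on a toy matrix (the real matrix layout may differ):

```python
import random

def pan_size(matrix, cols, k, rng):
    """Number of genes present in at least one of k randomly sampled
    assembly columns -- a single rarefaction draw of the pan curve."""
    sample = rng.sample(cols, k)
    return sum(1 for row in matrix if any(row[c] for c in sample))

rng = random.Random(42)  # fixed seed for reproducibility

# Toy matrix: one dict per gene, presence (1/0) per assembly.
matrix = [
    {"a1": 1, "a2": 1, "a3": 1, "a4": 1},  # a core gene
    {"a1": 1, "a2": 0, "a3": 0, "a4": 0},  # a private gene
]
cols = ["a1", "a2", "a3", "a4"]

# The real analysis averages many iterations per sample size;
# a single draw per size is shown here.
sizes = [pan_size(matrix, cols, k, rng) for k in (1, 2, 4)]
```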
Join the non-transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
mkdir gene_cluster
cd gene_cluster
ln -s ../../chr1/matrix_chr1.txt
ln -s ../../chr2/matrix_chr2.txt
ln -s ../../chr3/matrix_chr3.txt
ln -s ../../chr4/matrix_chr4.txt
ln -s ../../chr5/matrix_chr5.txt
./combine_notransp_matrices.py
mv concatenated_matrix.csv genes_concatenated_matrix.csv
cp genes_concatenated_matrix.csv ~/Arabidopsis_pangenome/R_plots/Cluster/
```
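`combine_notransp_matrices.py` is assumed to stack the per-chromosome matrices row-wise while keeping a single shared header; a minimal sketch of that operation:

```python
def combine_matrices(matrix_texts):
    """Row-wise concatenation of matrices that share the same assembly
    columns: keep the header once, then append the data rows of every
    file (assumed behaviour of combine_notransp_matrices.py)."""
    header, rows = None, []
    for text in matrix_texts:
        lines = text.rstrip("\n").split("\n")
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("matrices must share the same columns")
        rows.extend(lines[1:])
    return "\n".join([header] + rows)
```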
```bash
R
```

```R
library("ggplot2")
library("vegan")
library(viridis)
library(ggtree)
library(ape)

setwd("/home/lia/Arabidopsis_pangenome/R_plots")

# Load data
my_data <- read.csv("Cluster/genes_concatenated_matrix.csv", header = TRUE, sep = "\t")
head(my_data)

# Extract assembly names
assembly_names <- colnames(my_data)[2:94]

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen", 'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta", 'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black", 'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet", 'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab", 'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Calculate Jaccard distance
dist.jaccard <- vegdist(t(my_data[, 2:94]), method = "jaccard")
write.csv(as.matrix(dist.jaccard), "genes_PAV_jaccard_distance_matrix.csv", row.names = TRUE)

# Perform hierarchical clustering
res.hc <- hclust(d = dist.jaccard, method = "ward.D2")

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

phylo_tree <- as.phylo(res.hc)

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Plot the tree with colored tips
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(color = label), size = 4) +
  scale_color_manual(values = tip_colors$color) +
  theme_tree2() +
  theme(legend.position = "none",            # Adjust legend position
        plot.background = element_blank(),   # Customize background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),         # Hide axis text
        axis.ticks = element_blank())        # Hide axis ticks
p

# Save the plot
ggsave(filename = "Cluster/phylo_jaccard_ward_PAV_genes.png", plot = p, width = 16, height = 16, dpi = 300)
```
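The cophenetic correlation printed above is simply the Pearson correlation between the original Jaccard distances and the cophenetic (tree-implied) distances; values close to 1 indicate the dendrogram preserves the input distances well. A pure-stdlib illustration with made-up distance values:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# toy pairwise distances vs the distances implied by a tree
d_orig = [0.10, 0.40, 0.45]
d_coph = [0.10, 0.42, 0.42]
corr = pearson(d_orig, d_coph)  # the "cophenetic correlation" of the toy tree
```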
Join the non-transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/all
mkdir pseudogene_cluster
cd pseudogene_cluster
ln -s ../../chr1/matrix_chr1.txt
ln -s ../../chr2/matrix_chr2.txt
ln -s ../../chr3/matrix_chr3.txt
ln -s ../../chr4/matrix_chr4.txt
ln -s ../../chr5/matrix_chr5.txt
./combine_notransp_matrices.py
mv concatenated_matrix.csv pseudogenes_concatenated_matrix.csv
cp pseudogenes_concatenated_matrix.csv ~/Arabidopsis_pangenome/R_plots/Cluster/
```
```bash
mamba activate R
R
```

```R
library("ggplot2")
library("vegan")
library(ggtree)
library(ape)

setwd("/home/lia/Arabidopsis_pangenome/R_plots")

# Load data
my_data <- read.csv("Cluster/pseudogenes_concatenated_matrix.csv", header = TRUE, sep = "\t")
head(my_data)

# Extract assembly names
assembly_names <- colnames(my_data)[2:94]

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen", 'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta", 'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black", 'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet", 'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab", 'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Calculate Jaccard distance
dist.jaccard <- vegdist(t(my_data[, 2:94]), method = "jaccard")
write.csv(as.matrix(dist.jaccard), "pseudogenes_PAV_jaccard_distance_matrix.csv", row.names = TRUE)

# Hierarchical clustering
res.hc <- hclust(d = dist.jaccard, method = "ward.D2")

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

# Convert to phylo object
phylo_tree <- as.phylo(res.hc)

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Plot the tree
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(color = label), size = 4) +
  scale_color_manual(values = tip_colors$color) +
  theme_tree2() +
  theme(legend.position = "none",            # Adjust legend position
        plot.background = element_blank(),   # Customize background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),         # Hide axis text
        axis.ticks = element_blank())        # Hide axis ticks
p

# Save the plot
ggsave(filename = "Cluster/phylo_jaccard_ward_PAV_pseudogenes.png", plot = p, width = 16, height = 16, dpi = 300)
```
For this figure we generated the input file using `odgi similarity` after joining all the graphs from each chromosome with `odgi squeeze`. To generate the phylogenetic tree we will follow these instructions: https://hackmd.io/@AndreaGuarracino/SyhbiKuE2#Primate-chromosome-6
```bash
mamba activate R
R
```

```R
library(readr)
library(dplyr)
library(tidyr)
library(tibble)
library(ape)
library(ggtree)
library(ggplot2)

setwd("/home/lia/Arabidopsis_pangenome/FINAL/nodes_similarity")

path_dist_tsv <- 'similarity_nodes.tsv'

# Read sparse matrix
sparse_matrix_df <- read_tsv(path_dist_tsv)

# Prepare distance matrix
jaccard_dist_df <- sparse_matrix_df %>%
  arrange(group.a, group.b) %>%
  select(group.a, group.b, jaccard.distance) %>%
  pivot_wider(names_from = group.b, values_from = jaccard.distance) %>%
  column_to_rownames(var = "group.a")
dist.jaccard <- as.dist(jaccard_dist_df)
write.csv(as.matrix(dist.jaccard), "nodes_jaccard_distance_matrix.csv", row.names = TRUE)

# Perform hierarchical clustering
res.hc <- hclust(d = dist.jaccard)

# Calculate the cophenetic distance matrix from the clustering result
res.coph <- cophenetic(res.hc)

# Compute the cophenetic correlation
cophenetic_corr <- cor(as.dist(dist.jaccard), as.dist(res.coph))

# Print cophenetic correlation to check the quality of the clustering
print(cophenetic_corr)

# Convert hclust object to phylo
phylo_tree <- as.phylo(res.hc)

# Load locations
locations <- read.csv("genebank_country_1.csv", sep="\t", stringsAsFactors = FALSE, header = TRUE)
head(locations)
colnames(locations) <- c("Assembly", "Location")

# Extract assembly names
assembly_names <- locations$Assembly

country_to_colour <- c('Afghanistan' = "salmon2", 'Belgium' = "gold3", 'Cape Verde' = "darkgreen", 'Czech Republic' = "grey46", 'France' = "red", 'Germany' = "magenta", 'Italy' = "brown", 'Japan' = "springgreen4", 'Lithuania' = "black", 'Netherlands' = "green", 'Poland' = "orange", 'Madeira' = "darkviolet", 'Romania' = "pink", 'Spain' = 'deepskyblue3', 'Sweden' = "olivedrab", 'Tanzania' = 'blue', 'United Kingdom' = 'mediumpurple', 'USA' = 'cyan')

# Create a mapping from assembly names to countries
assembly_to_country <- setNames(locations$Location, locations$Assembly)

# Map assembly names to colors through their countries
assembly_to_color <- sapply(assembly_names, function(assembly) {
  country <- assembly_to_country[assembly]
  color <- country_to_colour[country]
  return(color)
})

# Prepare a data frame that ggtree can use to map colors to tips
# The row names of 'my_data' are assumed to be the assembly names used in 'phylo_tree$tip.label'
# Ensure these names match those in 'assembly_to_color'
tip_colors <- data.frame(label = names(assembly_to_color), color = assembly_to_color)

# Remove country information from tip_colors labels
tip_colors$label <- sub("\\..*", "", tip_colors$label)

# Ensure tip_colors is a named vector
tip_colors_named <- setNames(tip_colors$color, tip_colors$label)

# Plot the tree with color mapping
p <- ggtree(phylo_tree, layout="circular") +
  geom_tiplab(aes(label = label, color = label), size = 4) +
  scale_color_manual(values = tip_colors_named) +
  theme_tree2() +
  theme(legend.position = "none",            # Adjust legend position
        plot.background = element_blank(),   # Customize background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),  # Remove grid lines
        axis.text = element_blank(),         # Hide axis text
        axis.ticks = element_blank())        # Hide axis ticks

# Display the plot
print(p)

# Save the plot
ggsave(filename = "phylo_jaccard_nodes.png", plot = p, width = 16, height = 16, dpi = 300)
```
We will need the following files:

- `presence_absence_matrix.csv` with the results from all the chromosomes
- A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `final_screening_GO.tsv`, with the results from all the chromosomes
- `matrix_transposed.csv`, to be obtained with `transpose.pl`
- The dataset rearranged for R, to be obtained with `comp_node_path.pl`
- Final screening and Presence-absence matrix per chromosome
```bash
# chr1
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr1
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr2
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr2.txt
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr3
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr3.txt
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr4
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr4.txt
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/chr5
cut -f 1,4 final_genes_screening_GO.tsv > loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr5.txt
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt
```
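The transposition done by `transpose.pl` is a plain row/column swap; for illustration, the same operation in Python:

```python
def transpose(rows):
    """Swap rows and columns of a rectangular matrix (the operation
    transpose.pl performs to turn the loci-by-assembly table into an
    assembly-by-loci one)."""
    return [list(col) for col in zip(*rows)]
```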
- Join Transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
```
- Join loci classes
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/genes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
```
- R dataset
```bash
perl ../comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt
```

This last script also produces `data_set_R.txt`, which is the file we are going to use for the figure. Let's rename it:

```bash
mv data_set_R.txt R_data_genes_PAV_total.txt
```
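`comp_node_path_PAV.pl` combines the class of each locus with the presence/absence matrix to produce the per-assembly counts (Assembly, Class, Count) used for the stacked-bar figure. A minimal sketch of that aggregation (the data layout here is an assumption, for illustration only):

```python
from collections import Counter

def pav_counts(classes, matrix):
    """classes: locus -> class label; matrix: assembly -> {locus: 0/1}.
    Returns a Counter mapping (assembly, class) -> number of present loci,
    i.e. the (Assembly, Class, Count) triples of the R dataset."""
    counts = Counter()
    for assembly, presence in matrix.items():
        for locus, present in presence.items():
            if present:
                counts[(assembly, classes[locus])] += 1
    return counts

classes = {"g1": "core", "g2": "private"}
matrix = {"asm1": {"g1": 1, "g2": 1}, "asm2": {"g1": 1, "g2": 0}}
counts = pav_counts(classes, matrix)
```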
Install the required packages:

```bash
mamba activate R
mamba install r-ggplot2
mamba install r-factoextra
mamba install conda-forge::r-vegan
mamba install bioconda::bioconductor-ggtree
mamba install conda-forge::r-ape
mamba install bioconda::bioconductor-ggtreeextra
```
Open R
```bash
R
```

```R
setwd("/home/lia/Arabidopsis_pangenome/R_plots")

data <- read.table(file="./Assembly_based_genes/R_data_genes_PAV_total.txt", sep="\t", header=FALSE)
head(data)

colnames(data) <- c("Assembly", "Class", "Count")
head(data)

library("ggplot2")

custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(), # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)

# Save the plot
ggsave(filename = "Assembly_based_genes/gene_plot_PAV_total.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)
```
This will be based on untangle results only.
We will need the following files:

- `matrix.csv` with the results from all the chromosomes
- A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `pangenome_screening.csv`, with the results from all the chromosomes
- `matrix_transposed.csv`, to be obtained with `transpose.pl`
- The dataset rearranged for R, to be obtained with `comp_node_path.pl`
- Pangenome screening and matrix per chromosome
```bash
# chr1
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr1
cut -f 2,8 pangenome_screening.csv > chr1_loci_classes.txt
# transpose matrix
perl ../transpose.pl matrix.csv > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr2
cut -f 2,8 pangenome_screening.csv > chr2_loci_classes.txt
perl ../transpose.pl matrix.csv > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr3
cut -f 2,8 pangenome_screening.csv > chr3_loci_classes.txt
perl ../transpose.pl matrix.csv > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr4
cut -f 2,8 pangenome_screening.csv > chr4_loci_classes.txt
perl ../transpose.pl matrix.csv > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/chr5
cut -f 2,8 pangenome_screening.csv > chr5_loci_classes.txt
perl ../transpose.pl matrix.csv > chr5_transp_matrix.txt
```
- Join Transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/
mkdir all
cd all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
```
- Join loci classes
```bash
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/genes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
```
- R dataset
```bash
perl ../comp_node_path_CNV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt
```

This last script also produces `data_set_R.txt`, which is the file we are going to use for the figure. Let's rename it:

```bash
mv data_set_R.txt R_data_genes_CNV_untangle.txt
```
```R
setwd("/home/lia/Arabidopsis_pangenome/R_plots")

data <- read.table(file="Assembly_based_genes/R_data_genes_CNV_untangle.txt", sep="\t", header=FALSE)
head(data)

colnames(data) <- c("Assembly", "Class", "Count")
head(data)

library("ggplot2")

custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")  # Replace with your actual classes and desired colors

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))  # Replace with your actual class order

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +  # Set a white background
  theme(text = element_text(size = 6),
        #panel.grid.major = element_blank(), # Remove major grid lines
        panel.grid.minor = element_blank(),  # Remove minor grid lines
        panel.background = element_blank(),  # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)  # Apply the custom colors

# Print the plot
print(gene_plot_impr)

# Save the plot
ggsave(filename = "Assembly_based_genes/gene_plot_CNV_untangle.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)
```
We will need the following files:

- `presence_absence_matrix.csv` with the results from all the chromosomes
- A file called `loci_classes.txt`, containing the `Feature/Pater` and `Label` columns from `final_screening_GO.tsv`, with the results from all the chromosomes
- `matrix_transposed.csv`, to be obtained with `transpose.pl`
- The dataset rearranged for R, to be obtained with `comp_node_path.pl`
- Final screening and Presence-absence matrix per chromosome
```bash
# chr1
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr1
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr2
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr2.txt
# transpose matrix
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr3
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr3.txt
# transpose matrix
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr4
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr4.txt
# transpose matrix
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/chr5
cut -f 1,4 final_pseudogenes_screening_GO.tsv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' presence_absence_matrix.csv > matrix_chr5.txt
# transpose matrix
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt
```
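The `transpose.pl` helper simply swaps the rows and columns of a tab-separated matrix. As a rough illustration (not the actual Perl script), the same operation in Python, assuming the first column holds locus names and the remaining columns hold one assembly each:

```python
# Minimal sketch of a matrix transposition equivalent to transpose.pl;
# the row/column layout assumed here is an illustration, not taken
# from the actual Perl script.
def transpose_rows(rows):
    """Turn a list of rows into a list of columns."""
    return [list(col) for col in zip(*rows)]

rows = [
    ["Locus", "asm1", "asm2"],        # header: locus + one column per assembly
    ["AT1G01010_chr1", "1", "0"],
    ["AT1G01020_chr1", "1", "1"],
]
for row in transpose_rows(rows):
    print("\t".join(row))
```

After transposition, each row corresponds to one assembly, which is the orientation the later combining step needs.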
2. Join Transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/
mkdir all
cd all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
```
3. Join loci classes
```bash
cd ~/Arabidopsis_pangenome/FINAL/final_screening/pseudogenes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
```
4. R dataset
```bash
perl ../comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt
```
This script also produces `data_set_R.txt`, which is the file we will use for the figure. Let's rename it:
```bash
mv data_set_R.txt R_data_pseudogenes_PAV_total.txt
```
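The per-assembly dataset written to `data_set_R.txt` pairs each assembly with the number of loci of each class it carries. A hypothetical Python re-implementation of that counting step (the exact rule used by `comp_node_path_PAV.pl`, and the data layout below, are assumptions):

```python
# Hypothetical counting step behind data_set_R.txt: for each assembly,
# count the loci of each class (core/softcore/dispensable/private)
# that are present in it. Input layout is an assumption.
from collections import Counter

def class_counts(matrix, classes):
    """matrix: {assembly: {locus: present?}}; classes: {locus: class}."""
    return {
        asm: Counter(classes[locus] for locus, present in loci.items() if present)
        for asm, loci in matrix.items()
    }

matrix = {"asm1": {"g1": 1, "g2": 0}, "asm2": {"g1": 1, "g2": 1}}
classes = {"g1": "core", "g2": "private"}
print(class_counts(matrix, classes))
```

Flattening this dictionary into `Assembly`, `Class`, `Count` rows gives exactly the three-column table that the R plotting code below expects.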
Install required packages
```bash
mamba activate R
mamba install r-ggplot2
mamba install r-factoextra
```
Open R
```bash
R
```
```{r}
setwd("/home/lia/Arabidopsis_pangenome/R_plots")

data <- read.table(file = "./Assembly_based_pseudogenes/R_data_pseudogenes_PAV_total.txt", sep = "\t", header = FALSE)
head(data)

colnames(data) <- c("Assembly", "Class", "Count")
head(data)

library("ggplot2")

# Replace with your actual classes and desired colors
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  theme(text = element_text(size = 6),
        # panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),    # Remove minor grid lines
        panel.background = element_blank(),    # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)    # Apply the custom colors

# Print the plot
print(gene_plot_impr)

ggsave(filename = "Assembly_based_pseudogenes/pseudogene_plot_PAV_total.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)
```
This will be based on untangle results only.
We will need the following files:

- A file called `matrix.csv` with the results from all the chromosomes
- A file called `loci_classes.txt`, where we have the `Feature/Pater` and `Label` columns from `pangenome_screening.csv` with the results from all the chromosomes
- `matrix_transposed.csv`, to be obtained with `transpose.pl`
- The dataset rearranged for R, to be obtained with `comp_node_path.pl`

1. Pangenome screening and matrix per chromosome
```bash
# chr1
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr1
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' loci_classes.txt > chr1_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr1"; print $0 }' OFS='\t' matrix.csv > matrix_chr1.txt
# transpose matrix
perl ../transpose.pl matrix_chr1.txt > chr1_transp_matrix.txt

# chr2
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr2
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' loci_classes.txt > chr2_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr2"; print $0 }' OFS='\t' matrix.csv > matrix_chr2.txt
# transpose matrix
perl ../transpose.pl matrix_chr2.txt > chr2_transp_matrix.txt

# chr3
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr3
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' loci_classes.txt > chr3_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr3"; print $0 }' OFS='\t' matrix.csv > matrix_chr3.txt
# transpose matrix
perl ../transpose.pl matrix_chr3.txt > chr3_transp_matrix.txt

# chr4
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr4
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' loci_classes.txt > chr4_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr4"; print $0 }' OFS='\t' matrix.csv > matrix_chr4.txt
# transpose matrix
perl ../transpose.pl matrix_chr4.txt > chr4_transp_matrix.txt

# chr5
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/chr5
cut -f 2,8 pangenome_screening.csv > loci_classes.txt
# append _chr* to gene names
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' loci_classes.txt > chr5_loci_classes.txt
awk -F'\t' 'NR==1 {print; next} { $1 = $1 "_chr5"; print $0 }' OFS='\t' matrix.csv > matrix_chr5.txt
# transpose matrix
perl ../transpose.pl matrix_chr5.txt > chr5_transp_matrix.txt
```

2. Join Transposed matrices
```bash
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/
mkdir all
cd all
ln -s ../chr1/chr1_transp_matrix.txt
ln -s ../chr2/chr2_transp_matrix.txt
ln -s ../chr3/chr3_transp_matrix.txt
ln -s ../chr4/chr4_transp_matrix.txt
ln -s ../chr5/chr5_transp_matrix.txt
../combine_transposed_matrices.py
```
3. Join loci classes
```bash
cd ~/Arabidopsis_pangenome/FINAL/untangle_based_results/pseudogenes/all
ln -s ../chr1/chr1_loci_classes.txt
ln -s ../chr2/chr2_loci_classes.txt
ln -s ../chr3/chr3_loci_classes.txt
ln -s ../chr4/chr4_loci_classes.txt
ln -s ../chr5/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt
```
4. R dataset
```bash
perl ../comp_node_path_CNV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes.txt
```
This script also produces `data_set_R.txt`, which is the file we will use for the figure. Let's rename it:
```bash
mv data_set_R.txt R_data_pseudogenes_CNV_untangle.txt
```
```bash
mamba activate R
R
```
### Make the figure

```{r}
setwd("/home/lia/Arabidopsis_pangenome/R_plots")

data <- read.table(file = "Assembly_based_pseudogenes/R_data_pseudogenes_CNV_untangle.txt", sep = "\t", header = FALSE)
head(data)

colnames(data) <- c("Assembly", "Class", "Count")
head(data)

library("ggplot2")

# Replace with your actual classes and desired colors
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
gene_plot_impr <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  theme(text = element_text(size = 6),
        # panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),    # Remove minor grid lines
        panel.background = element_blank(),    # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors)    # Apply the custom colors

# Print the plot
print(gene_plot_impr)

ggsave(filename = "Assembly_based_pseudogenes/pseudogene_plot_CNV_untangle.png", plot = gene_plot_impr, width = 10, height = 7, dpi = 300)
```
We need to obtain a huge matrix:
```bash
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/all_graphs
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/chr1_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/chr2_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/chr3_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/chr4_matrix.csv
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/chr5_matrix.csv

screen -S combine_matrices
mamba activate pandas
combine_transposed_matrices.py

cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr1_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr2_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr3_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr4_loci_classes.txt
cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/
cut -f 1,4 PAV_CNV.csv > chr5_loci_classes.txt

cd /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/all_graphs
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr1/node_matrix/chr1_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr2/node_matrix/chr2_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr3/node_matrix/chr3_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr4/node_matrix/chr4_loci_classes.txt
ln -s /home/edg01/edg01/lia/Arabidopsis/pangenome/ara_pan_from_April/chr5/node_matrix/chr5_loci_classes.txt
for f in chr*_loci_classes.txt; do tail -n +2 "$f"; done >> combined_loci_classes.txt

screen -r combine_matrices
./comp_node_path_PAV.pl combined_loci_classes.txt concatenated_matrix.csv > loci_comp_classes_pav.txt
mv data_set_R.txt pav_data_set_R.txt
```
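The class labels in `PAV_CNV.csv` come from the upstream node-classification step. A common pangenome convention, shown here with illustrative thresholds that are not necessarily the ones used in this pipeline, classifies each node by how many of the 93 assemblies contain it:

```python
# Illustrative pangenome classification of a node by the number of
# assemblies containing it. The 95% softcore threshold is a common
# convention and an assumption, not taken from this pipeline.
def classify(n_present, n_total, soft_frac=0.95):
    if n_present == n_total:
        return "core"            # present in every assembly
    if n_present >= soft_frac * n_total:
        return "softcore"        # present in nearly all assemblies
    if n_present == 1:
        return "private"         # present in a single assembly
    return "dispensable"         # everything in between

print([classify(n, 93) for n in (93, 90, 40, 1)])
# → ['core', 'softcore', 'dispensable', 'private']
```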
```bash
mamba activate R
R
```
```{r}
setwd("/home/lia/Arabidopsis_pangenome/R_plots")

data <- read.table(file = "Assembly_based_nodes/pav_data_set_R.txt", sep = "\t", header = FALSE)
head(data)

colnames(data) <- c("Assembly", "Class", "Count")
head(data)

library("ggplot2")
library(scales)  # for label_number()

# Replace with your actual classes and desired colors
custom_colors <- c("private" = "black", "dispensable" = "beige", "softcore" = "olivedrab", "core" = "firebrick")

# Order the classes in a custom order
data$Class <- factor(data$Class, levels = c("private", "dispensable", "softcore", "core"))

# Assign names to the colors based on the levels of Class
names(custom_colors) <- levels(data$Class)

# Create the plot
node_plot_pav <- ggplot(data = data, aes(x = Assembly, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  theme(text = element_text(size = 6),
        # panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank(),    # Remove minor grid lines
        panel.background = element_blank(),    # Ensure the background is white
        plot.background = element_blank()) +
  scale_fill_manual(values = custom_colors) +  # Apply the custom colors
  scale_y_continuous(labels = label_number())  # Avoid scientific notation on the axis

# Print the plot
print(node_plot_pav)

ggsave(filename = "Assembly_based_nodes/nodes_plot_PAV.png", plot = node_plot_pav, width = 10, height = 7, dpi = 300)
```
Refer to `scripts/python_plot.ipynb`.