
cubar

Comprehensive Codon Usage Bias Analysis in R

Overview

Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window or differential usage analyses, and optimizing sequences for heterologous expression.

Features

🧬 Codon-Level Analysis

RSCU calculation: Relative synonymous codon usage analysis
Amino acid usage: Frequency of each amino acid in sequences
Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
Optimal codon inference: Machine learning-based identification of optimal codons
Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
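
To make the RSCU item concrete, here is a minimal base-R sketch of the standard definition: a codon's observed count divided by the mean count of its synonymous family. The counts and the `codon_counts` and `amino_acid` names are made up for illustration. This is not cubar's implementation, which works on full codon tables and genetic codes.

```r
# Toy codon counts for two amino acids (hypothetical numbers).
codon_counts <- c(GCT = 30, GCC = 10, GCA = 15, GCG = 5,  # Ala, 4 codons
                  AAA = 45, AAG = 15)                     # Lys, 2 codons
amino_acid <- c(GCT = "Ala", GCC = "Ala", GCA = "Ala", GCG = "Ala",
                AAA = "Lys", AAG = "Lys")

# RSCU = observed count / mean count within the synonymous family.
family_mean <- ave(codon_counts, amino_acid, FUN = mean)
rscu <- codon_counts / family_mean

print(round(rscu, 2))
```

An RSCU of 1 means the codon is used exactly as often as expected under uniform synonymous usage. Values above 1 indicate preference, and the values within a family sum to the family size.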

📊 Gene-Level Metrics

Codon frequency tabulation: Count codon occurrences across sequences
CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
ENC (Effective Number of Codons): Assess codon usage bias strength
Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
GC content metrics: Overall GC, GC3s (3rd codon positions), GC4d (4-fold degenerate sites)
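
As an illustration of how a gene-level metric is built from codon-level weights, the sketch below computes CAI by hand in base R using the classic definition: the geometric mean of the relative adaptiveness of each codon in the gene. The reference counts, the toy gene, and all variable names are hypothetical. A real analysis would use count_codons(), est_rscu() and get_cai() as in the example workflow, and published CAI variants usually exclude single-codon amino acids and stop codons, which this sketch ignores.

```r
# Hypothetical codon counts from a reference set of highly expressed genes.
ref_counts <- c(GCT = 30, GCC = 10, GCA = 15, GCG = 5,
                AAA = 45, AAG = 15)
amino_acid <- c(GCT = "Ala", GCC = "Ala", GCA = "Ala", GCG = "Ala",
                AAA = "Lys", AAG = "Lys")

# Relative adaptiveness: count divided by the count of the most frequent
# synonymous codon in the reference set.
w <- ref_counts / ave(ref_counts, amino_acid, FUN = max)

# Split a toy coding sequence into codons.
gene <- "GCTAAAGCCAAG"
starts <- seq(1, nchar(gene), by = 3)
codons <- substring(gene, starts, starts + 2)

# CAI is the geometric mean of the weights of the codons in the gene.
cai <- exp(mean(log(w[codons])))
print(cai)
```

The gene uses two preferred codons (weight 1) and two rare ones (weight 1/3), so its CAI falls between those extremes.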

🛠️ Utilities & Tools

Sliding window analysis: Positional codon usage patterns within genes
Sequence optimization: Redesign sequences for optimal expression
Differential codon usage: Statistical comparison between sequence sets
Quality control: Comprehensive CDS validation and preprocessing
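
The idea behind a sliding-window analysis can be sketched in base R: split a coding sequence into codons, score each codon, and average the score over overlapping windows. The example tracks plain GC content at third codon positions. The sequence, window size, and step are arbitrary choices for illustration, and the code does not reproduce cubar's own sliding-window functions or its GC3s definition.

```r
# Toy coding sequence of 8 codons (hypothetical).
gene <- "ATGGCTGCCAAAAAGGCGGCATAA"
starts <- seq(1, nchar(gene), by = 3)
codons <- substring(gene, starts, starts + 2)

# Score each codon: 1 if its third base is G or C, otherwise 0.
gc3 <- as.integer(substr(codons, 3, 3) %in% c("G", "C"))

# Window of 4 codons, moved 2 codons at a time (assumed values).
window <- 4
step <- 2
win_start <- seq(1, length(codons) - window + 1, by = step)

# Mean GC3 within each window.
gc3_by_window <- vapply(
  win_start,
  function(i) mean(gc3[i:(i + window - 1)]),
  numeric(1)
)
print(data.frame(first_codon = win_start, gc3 = gc3_by_window))
```

Any per-codon score, such as a CAI weight, can be substituted for `gc3` to obtain a positional profile of that metric along the gene.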

Why Choose cubar?

🚀 High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
🧬 Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
🔗 R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
📚 Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
🔬 Research Ready: Implements established metrics with proper citations and validation

Installation

Stable Release (Recommended)

Install the latest stable version from CRAN:

install.packages("cubar")

Development Version

Install the latest development version from GitHub:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)

Dependencies

System Requirements:

R (≥ 4.1.0)

Required Packages:

Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
data.table (≥ 1.14.0) - High-performance data manipulation
ggplot2 (≥ 3.3.5) - Data visualization
rlang (≥ 0.4.11) - Language tools

Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
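
If installation runs into trouble, it can help to check which of the packages listed above are actually available. The sketch below only reports missing packages and installs nothing. The `required` and `missing_pkgs` names are illustrative, and the commented-out BiocManager calls are a common manual fallback for Bioconductor packages rather than a step this README prescribes.

```r
# Packages listed under "Required Packages" above.
required <- c("Biostrings", "IRanges", "data.table", "ggplot2", "rlang")

# requireNamespace() returns TRUE if the package can be loaded.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Possible manual fallback for the Bioconductor packages:
  # install.packages("BiocManager")
  # BiocManager::install(c("Biostrings", "IRanges"))
} else {
  message("All required packages are available.")
}
```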

Documentation & Tutorials

📖 Complete documentation is available within R (?function_name) and on our package website.

🎯 Getting Started

Introduction to cubar - Basic usage and core functionality
Non-standard Genetic Codes - Working with alternative genetic codes
Codon Optimization - Sequence optimization strategies

📚 Advanced Topics

Mathematical Foundations - Detailed theory behind the metrics
Function Reference - Complete function documentation

Example Workflow

Here’s a typical analysis workflow demonstrating key functionality:

library(cubar)
library(ggplot2)

# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)

# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)

# 3. Calculate multiple metrics
enc <- get_enc(codon_freq)    # Effective number of codons
gc3s <- get_gc3s(codon_freq)  # GC content at 3rd positions

# 4. Analyze highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)

# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons",
       y = "Codon Adaptation Index")

🆘 Getting Help

📋 GitHub Issues: Report bugs, request features, or ask questions
📖 Documentation: Check function help (?function_name) and online docs

Related Packages

For complementary analysis, consider these R packages:

Biostrings - Sequence input/output and manipulation
Peptides - Peptide and protein property calculations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

GitHub Copilot was used to suggest code snippets during development
GitHub Education for providing free access to development tools
The R and Bioconductor communities for excellent foundational packages
Contributors and users who have provided feedback and improvements

📚 Documentation • 🐛 Report Bug • 💡 Request Feature
