MetaX Cookbook

This is the guidebook for the MetaX GUI version. If you are using the CLI, we recommend reading the documentation for each MetaX module for instructions on how to use it from the command line.

Overview

MetaX is a novel tool for linking peptide sequences with taxonomic and functional information in Metaproteomics. We introduce the Operational Taxon-Function (OTF) concept to explore microbial roles and interactions ("who is doing what and how") within ecosystems.

MetaX also features statistical modules and plotting tools for analyzing peptides, taxa, functions, proteins, and taxon-function contributions across groups.

Project Page

Visit GitHub to get more information:

https://github.com/byemaxx/MetaX

Getting Started

The main window of MetaX

Click 'Tools Menu' to switch between modules

Exploring Data with MetaX

See the Preparing Your Data section to build the database and annotate peptides to OTFs before starting.

Module 1. OTF Analyzer

After obtaining the Operational Taxa-Functions (OTF) Table using the Peptide Annotator, you can perform downstream analysis with the OTF Analyzer.

1. Data Preparation

OTFs (Operational Taxa-Functions) Table: Obtained from the Peptide Annotator module.

Meta Table: The first column is sample names, and the other columns represent different groups. If no meta table is provided, meta info will be generated automatically: (1) all samples are in the same group; (2) each sample is a separate group.

Example Meta Table:

samples	Individuals	Treatment	Sweetener
sample_1	V1	Treatment	XYL
sample_2	V1	Treatment	XYL
sample_3	V1	Treatment	XYL
sample_4	V1	Control	PBS
sample_5	V1	Control	PBS
sample_6	V1	Control	PBS
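The two automatic fallbacks used when no meta table is provided can be pictured with a small sketch. The function name `default_meta` and the column names `all_in_one_group` and `each_sample_a_group` are hypothetical and chosen for illustration only; they are not MetaX identifiers.

```python
def default_meta(sample_names):
    """Build fallback meta rows: one shared group and one group per sample."""
    return [
        {
            "samples": name,
            "all_in_one_group": "all",    # (1) every sample in the same group
            "each_sample_a_group": name,  # (2) each sample is its own group
        }
        for name in sample_names
    ]


meta = default_meta(["sample_1", "sample_2", "sample_3"])
for row in meta:
    print(row)
```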

You can load example data by clicking the button.

Then, click Go to start the analysis.

Advanced Settings
Peptide Column Name: Specifies the column in the OTF table that contains peptide information.
Protein Column Name: Specifies the column in the OTF table that contains protein information (only required if protein summation is performed in downstream analysis).
Sample Column Prefix: Identifies the prefix of sample columns to determine intensity columns in the OTF table.
Any Data Mode: Allows analysis of any table using MetaX, not limited to OTF tables (only partial tool functionality is available).
- Customized Table Item Column Name: Specifies the column containing item names in any data mode. If left empty, the first column will be selected by default.
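As a rough illustration of how a sample column prefix separates intensity columns from the rest of the table, consider the sketch below. The header names and the prefix `Intensity_` are made up for the example, and `split_columns` is a hypothetical helper, not a MetaX function.

```python
def split_columns(header, sample_prefix):
    """Separate intensity (sample) columns from the other columns by prefix."""
    sample_cols = [col for col in header if col.startswith(sample_prefix)]
    other_cols = [col for col in header if not col.startswith(sample_prefix)]
    return sample_cols, other_cols


# Hypothetical header of an OTF table.
header = ["Sequence", "Proteins", "Intensity_sample_1", "Intensity_sample_2"]
sample_cols, other_cols = split_columns(header, "Intensity_")
print(sample_cols)
print(other_cols)
```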

2. Data Overview

The Data Overview provides basic information about your data, such as the number of taxa, functions, and proportions.

Set the threshold for linked peptides and the differences between them to plot figures.

Select different functions to plot the proportion distribution.

Filter out samples for downstream analysis.

3. Set TaxaFunc

Data Selection

Function: Select a function for downstream analysis (None in the list means no function is selected, focusing only on peptides and taxa).
Function Filter Threshold: The function with the highest proportion among the proteins in a peptide's protein group is taken as the representative function for that peptide, provided its proportion reaches this threshold. The default threshold is 1.00 (100%).

Taxa Level: Select a taxa level for downstream analysis (Life in the list means no filtering by any taxa, so the following analysis focuses on functions).
Split Function: Split annotations that contain multiple functions.
KO Intensity
ko:K00625,ko:K13788 10
to
KO Intensity
ko:K00625 10
ko:K13788 10
If Share Intensity is checked, the intensity above is divided equally, giving 5 to each split KO.
Peptide Number Threshold: Keep only the taxa (functions or OTFs) that have at least the set number of peptides.
Create Taxa and Func only from OTFs:
Without selection (checkbox not checked):
- Taxa table: Peptides are filtered based solely on taxa levels, without considering any functional categories.
- Function table: Peptides are filtered solely by functional categories and thresholds, regardless of their taxa levels.
- Taxa-Function (OTFs) table: Peptides are filtered by both taxa levels and functional categories simultaneously.
With selection (checkbox checked):
All tables are filtered by both taxa levels and functional categories simultaneously.
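The Split Function option and its Share Intensity variant can be sketched in a few lines. This is an illustrative reading of the KO example above, not the MetaX implementation; `split_function` is a hypothetical name, and the comma is assumed to be the separator between functions.

```python
def split_function(rows, share_intensity=False):
    """Split multi-function annotations into one row per function."""
    result = []
    for annotation, intensity in rows:
        # Fall back to the unsplit annotation if nothing remains after splitting.
        parts = [p.strip() for p in annotation.split(",") if p.strip()] or [annotation]
        # Either copy the full intensity to each part or divide it equally.
        value = intensity / len(parts) if share_intensity else intensity
        result.extend((part, value) for part in parts)
    return result


rows = [("ko:K00625,ko:K13788", 10)]
copied = split_function(rows)
shared = split_function(rows, share_intensity=True)
print(copied)  # [('ko:K00625', 10), ('ko:K13788', 10)]
print(shared)  # [('ko:K00625', 5.0), ('ko:K13788', 5.0)]
```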

Sum Proteins Intensity

Click Create Proteins Intensity Table to sum peptides to proteins if the Protein column is in the original table.

Occam's Razor, Anti-Razor and Rank: Methods available for inferring shared peptides.
Razor:
1. Build a minimal set of proteins to cover all peptides.
2. For each peptide, choose the protein with the most peptides (if multiple proteins have the same number of peptides, the intensity is shared among them).
Anti-Razor:
- All proteins share the intensity of each peptide.
Rank:
1. Build the rank of proteins.
2. Choose the protein with a higher rank for the shared peptide.
Methods to Build Protein Rank:
- unique_counts: Use the counts of proteins inferred by unique peptides.
- all_count: Use the counts of all proteins.
- unique_intensity: Use the intensity of proteins inferred by unique peptides.
- shared_intensity: Use the intensity divided by the number of shared peptides for each protein.
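The two Razor steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the MetaX code: the minimal protein set is approximated with a greedy set cover, "the most peptides" is counted over all peptides mapped to a protein, and ties split the intensity equally. The name `razor_assign` and the example data are hypothetical.

```python
from collections import defaultdict


def razor_assign(peptide_proteins, intensities):
    """Distribute peptide intensities to proteins with a greedy razor sketch."""
    # Map each protein to the set of peptides it could explain.
    protein_peptides = defaultdict(set)
    for peptide, proteins in peptide_proteins.items():
        for protein in proteins:
            protein_peptides[protein].add(peptide)

    # Step 1: greedy approximation of a minimal protein set covering all peptides.
    uncovered = {pep for pep, prots in peptide_proteins.items() if prots}
    cover = set()
    while uncovered:
        best = max(sorted(protein_peptides),
                   key=lambda p: len(protein_peptides[p] & uncovered))
        cover.add(best)
        uncovered -= protein_peptides[best]

    # Step 2: give each peptide to the covering protein with the most peptides,
    # sharing the intensity equally when several proteins tie.
    totals = defaultdict(float)
    for peptide, proteins in peptide_proteins.items():
        candidates = [p for p in proteins if p in cover]
        if not candidates:
            continue
        top = max(len(protein_peptides[p]) for p in candidates)
        winners = [p for p in candidates if len(protein_peptides[p]) == top]
        for protein in winners:
            totals[protein] += intensities[peptide] / len(winners)
    return dict(totals)


# Made-up example: protein B is redundant, pep5 is tied between A and C.
peptide_proteins = {
    "pep1": ["A"], "pep2": ["A", "B"], "pep3": ["B", "C"],
    "pep4": ["C"], "pep5": ["A", "C"],
}
intensities = {"pep1": 10, "pep2": 20, "pep3": 30, "pep4": 40, "pep5": 8}
result = razor_assign(peptide_proteins, intensities)
print(result)  # {'A': 34.0, 'C': 74.0}
```

The total intensity is preserved (34 + 74 = 108), which is a quick sanity check for any peptide-to-protein summation.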

Data preprocessing

Quantitative Method:
Sum: Sum the peptide intensities directly to obtain the Taxa, Functions or OTFs intensity.
DirectLFQ: Use DirectLFQ to normalize the peptides and then estimate the intensity using intensity traces.
Outlier handling:

There are several methods for detecting and handling outliers.

Two steps will be applied:
Outlier Detection: Users can select a method to mark outlier values as NaN. Rows that contain only NaN values and zeros are then removed. The remaining NaN values are handled in the next step.
Outlier Handling: Users can choose a method to fill the remaining NaN values.
Outliers Detection:
IQR: In a group, if the value is greater than Q3+1.5*IQR or less than Q1-1.5*IQR, the value will be marked as NaN.
Missing-Value: Detects missing (NaN) values in the data; they remain marked as NaN for the handling step.
Half-Zero: This rule applies to groups of data. If more than half of the values in a group are 0, while the rest are non-zero, then the non-zero values are marked as NaN. Conversely, if less than half of the values are 0, then the zero values are marked as NaN. If the group contains an equal number of 0 and non-zero values, all values in the group are marked as NaN.
Zero-Dominant: This rule applies to groups of data. If more than half of the values in a group are 0, then the non-zero values are marked as NaN.
Zero-Inflated Poisson: This method is based on the Zero-Inflated Poisson (ZIP) model, which is a type of model that is used when the data contains a lot of zeros, more than what is expected in a standard Poisson model. In this context, the ZIP model is used to detect outliers in the data. The process involves fitting the ZIP model to the data and then predicting the data values. If the predicted value is less than 0.01, then the data point is marked as an outlier (NaN).
Negative Binomial: This method is based on the Negative Binomial model, which is a type of model used when the variance of the data is greater than the mean. Similar to the ZIP method, the Negative Binomial model is fitted to the data and then used to predict the data values. If the predicted value is less than 0.01, then the data point is marked as an outlier (NaN).
Z-Score: Z-score is a statistical measure that tells how far a data point is from the mean in terms of standard deviations. Outliers are often identified as points with Z-scores greater than 2.5 or less than -2.5.
Mahalanobis Distance: Mahalanobis distance measures the distance between a point and a distribution, considering the correlation among variables. Outliers can be identified as points with a Mahalanobis distance that exceeds a certain threshold.
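Two of the simpler rules above, IQR and Half-Zero, can be sketched for a single group of values. This is only an illustration: the quartile convention (linear interpolation, via `statistics.quantiles` with `method="inclusive"`) is an assumption, the input is assumed to contain no NaN yet, and the function names are hypothetical.

```python
import math
from statistics import quantiles

NAN = float("nan")


def mark_iqr(values):
    """Mark values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as NaN."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [NAN if (v < low or v > high) else v for v in values]


def mark_half_zero(values):
    """Apply the Half-Zero rule to one group of values."""
    n_zero = sum(1 for v in values if v == 0)
    n_nonzero = len(values) - n_zero
    if n_zero > n_nonzero:   # mostly zeros: non-zero values are suspicious
        return [v if v == 0 else NAN for v in values]
    if n_zero < n_nonzero:   # mostly non-zero: zeros are suspicious
        return [NAN if v == 0 else v for v in values]
    return [NAN] * len(values)  # exactly half: mark the whole group


iqr_result = mark_iqr([10, 11, 12, 13, 100])
half_zero_result = mark_half_zero([0, 0, 0, 5])
print(iqr_result)        # [10, 11, 12, 13, nan]
print(half_zero_result)  # [0, 0, 0, nan]
```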

For all methods, you can choose one meta column for detecting outliers and another for handling them.

Outliers Imputation:
Drop: Remove peptides that contain any NaN values.
Original: Remove peptides that contain any NaN values.
Mean: Outliers will be imputed by the mean.
Median: Outliers will be imputed by the median.
KNN: Outliers will be imputed by KNN (K=5). The K-Nearest Neighbors algorithm uses the mean or median of the nearest neighbours to fill in missing values.
Regression: Outliers will be imputed by using IterativeImputer with regression method. This method uses round-robin linear regression, modelling each feature with missing values as a function of other features.
Multiple: Outliers will be imputed by using IterativeImputer with multiple imputations method. It uses the IterativeImputer with a specified number (K=5) of the nearest features.

You can choose to impute outliers by each group or by all samples.
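The difference between imputing by each group and by all samples can be seen in a short sketch for the Mean and Median options. The function `impute_row` and the example values are hypothetical; a group in which every value is NaN is left unchanged here, which is an assumption of this sketch.

```python
import math
from statistics import mean, median


def impute_row(values, groups, method="mean", by_group=True):
    """Fill NaN values with the mean or median of a group or of all samples."""
    aggregate = mean if method == "mean" else median
    keys = list(groups) if by_group else ["all"] * len(values)
    filled = list(values)
    for key in dict.fromkeys(keys):  # unique keys, in order of appearance
        members = [i for i, k in enumerate(keys) if k == key]
        observed = [values[i] for i in members if not math.isnan(values[i])]
        if not observed:
            continue  # nothing to estimate from; leave the NaN values as they are
        fill = aggregate(observed)
        for i in members:
            if math.isnan(values[i]):
                filled[i] = fill
    return filled


nan = float("nan")
values = [1.0, nan, 3.0, 10.0, nan, 20.0]
groups = ["A", "A", "A", "B", "B", "B"]
per_group = impute_row(values, groups, by_group=True)
all_samples = impute_row(values, groups, by_group=False)
print(per_group)    # [1.0, 2.0, 3.0, 10.0, 15.0, 20.0]
print(all_samples)  # [1.0, 8.5, 3.0, 10.0, 8.5, 20.0]
```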

Remove Batch Effect:
Here, you can choose a meta group that represents the batch, then use reComBat (https://github.com/BorgwardtLab/reComBat) to correct for it.
Data Transformation:
Log2, Log10, Square root transformation, Cube root transformation and Box-Cox.
Data Normalization:
Trace Shifting: Reframing the normalization problem with intensity traces (inspired by DirectLFQ).
- Note: If both trace shifting and transformation are applied, normalization will be done before transformation.
Standard Scaling (Z-Score), Min-Max Scaling, Pareto Scaling, Mean centring and Normalization by Percentage.

If you use Z-Score, Mean centring or Pareto Scaling for data normalization, a minimum offset is added to the data afterwards to avoid negative values.
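The scaling options can be written out for a single vector of intensities as a rough sketch. Several details are assumptions of this example rather than documented MetaX behaviour: the sample standard deviation is used, Pareto scaling divides by the square root of the standard deviation, and the "minimum offset" is read as shifting the values so that the smallest one becomes 0. The function name `normalize` is hypothetical.

```python
import math
from statistics import mean, stdev


def normalize(values, method):
    """Scale one vector of intensities with the chosen method."""
    mu = mean(values)
    if method == "minmax":
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]
    if method == "percentage":
        total = sum(values)
        return [100 * v / total for v in values]
    if method == "zscore":
        sd = stdev(values)
        scaled = [(v - mu) / sd for v in values]
    elif method == "pareto":
        root_sd = math.sqrt(stdev(values))
        scaled = [(v - mu) / root_sd for v in values]
    elif method == "mean_centering":
        scaled = [v - mu for v in values]
    else:
        raise ValueError(f"unknown method: {method}")
    # These three methods produce negative values, so shift the minimum to 0.
    offset = min(scaled)
    return [v - offset for v in scaled]


print(normalize([1, 2, 3], "minmax"))          # [0.0, 0.5, 1.0]
print(normalize([1, 2, 3], "mean_centering"))  # [0, 1, 2]
print(normalize([1, 1, 2], "percentage"))      # [25.0, 25.0, 50.0]
```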

Drag an item's name to change the order of data preprocessing.

Then, click Go to create a TaxaFunc object for analysis.

You can then check the tables in the Table Review part and export them.

4. Basic Stats

PCA, Correlation and Box Plot

We can select meta groups or samples (default all) to plot PCA, Correlation, and Box Plot for [Taxa, Function, Taxa-Func, Peptide table, Protein table]

Setting and modifying the plot
Show or hide labels in the figure by checking the checkbox Show Labels
Select Sub Meta to plot with two meta
Change settings in the PLOT PARAMETER tab
Select specific Groups with condition
e.g., select the PBS, BAS and other groups only in Individual V1
Select specific Samples to analyze
Number stats
We can plot a bar chart of the number of items in each table by groups or by samples

Taxa Specific
Alpha/Beta Diversity
Sunburst
TreeMap
Sankey

Heatmap and Bar Plot

Select items (Taxa, Function, Taxa-Func and Peptide) to plot:
Add All Taxa, or select the one we are interested in.

Add items to Top List: select the top items to plot by a statistical method.
Checking filter with threshold filters items by the padj of the ANOVA and T-Test results, and by the padj and Log2FC of the DESeq2 result (thresholds are set on the corresponding page).

Add a list for plotting:
Make sure there is one item per row

Setting:
Change the settings to fit your data.
Rename Samples: Add group info to each sample name
Rename Taxa: Only keep the last taxonomic level to shorten the name
Plot Mean: Calculate the mean of each group before plotting
Sub Meta: Select a second meta, then combine the two meta by mean for the Heatmap and 3D bar plot
Right-click Theme to plot all color maps for preview
Plot:

Resize the picture to fit the window to get the perfect picture:
Bar Plot:

Interactive function:

Change to line plot:
3D Bar Plot
Plot a 3D bar by selecting a sub meta.

Peptide Query

Query all information about a peptide

5. Cross Test

T-TEST

Select two groups to run a T-Test on [Taxa, Function, Taxa-Func, Peptide table and Proteins Table]

ANOVA-TEST

Select some groups or all groups to run an ANOVA Test on [Taxa, Function, Taxa-Func and Peptide table]

Significant Taxa-Func

The significance comparison finds taxa that show no significant difference between the two groups while their related functions differ significantly, as well as functions that show no significant difference while their related taxa do.
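The idea of contrasting taxon-level and function-level significance can be sketched with plain dictionaries. Everything here is illustrative: the function name, the adjusted p-values, the taxon-function links and the 0.05 cutoff are assumptions made for this example and do not describe how MetaX stores or tests its results.

```python
def contrast_taxa_and_functions(taxa_padj, func_padj, links, alpha=0.05):
    """Split taxon-function links by which side is significant."""
    taxa_unchanged, func_unchanged = [], []
    for taxon, function in links:
        taxon_sig = taxa_padj[taxon] < alpha
        func_sig = func_padj[function] < alpha
        if not taxon_sig and func_sig:
            taxa_unchanged.append((taxon, function))  # taxon stable, function differs
        elif taxon_sig and not func_sig:
            func_unchanged.append((taxon, function))  # function stable, taxon differs
    return taxa_unchanged, func_unchanged


# Made-up adjusted p-values for two taxa and two functions.
taxa_padj = {"taxon_1": 0.60, "taxon_2": 0.01}
func_padj = {"func_a": 0.02, "func_b": 0.70}
links = [("taxon_1", "func_a"), ("taxon_2", "func_b"), ("taxon_2", "func_a")]
taxa_unchanged, func_unchanged = contrast_taxa_and_functions(taxa_padj, func_padj, links)
print(taxa_unchanged)  # [('taxon_1', 'func_a')]
print(func_unchanged)  # [('taxon_2', 'func_b')]
```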

Plot Cross Heatmap

The results of the T-Test and ANOVA Test are shown in a new window

Plot Heatmap for results
Choose a table to plot the top differences heatmap or to get the top table

Taxa-Func cross heatmap:
An orange cell means that the corresponding function (X-axis) and taxon (Y-axis) are significantly different between groups.

Func(Taxa) Heatmap:
The colour shows the intensity of the significant Func(Taxa) between groups.

Significant Taxa-Func Heatmap:
The colored tiles represent the taxa which were not significantly different between groups but the related functions were.

Group-Control TEST

Dunnett's Test

Set a group as "Control", then compare all groups to Control

Comparing in Each Condition: Select a meta such as individual, then compare groups to control in each individual.
DESeq2 Test

Bingo! You noticed the hidden function of MetaX: click Help -> About -> Like 3 times to unlock the function that compares all groups to the control.

Result of Dunnett's Test:
- The T-statistic value is shown in the heatmap

DESeq2

Select two groups to calculate the fold change with PyDESeq2 (https://github.com/owkin/PyDESeq2).

Select the p-adjust and log2FC thresholds to plot

(Ultra-Up or Ultra-Down: |log2FC| > Max log2FC)
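How the p-adjust, log2FC and Max log2FC thresholds could combine into plot categories is sketched below. The cutoff values, the label strings other than Ultra-Up and Ultra-Down, and the function `classify` are assumptions made for illustration; the actual thresholds are whatever you set in the GUI.

```python
def classify(log2fc, padj, padj_cutoff=0.05, min_log2fc=1.0, max_log2fc=10.0):
    """Assign a volcano-plot category from log2FC and adjusted p-value."""
    # `not (padj < cutoff)` also treats a NaN padj as not significant.
    if not (padj < padj_cutoff) or abs(log2fc) < min_log2fc:
        return "Not significant"
    direction = "Up" if log2fc > 0 else "Down"
    if abs(log2fc) > max_log2fc:
        return f"Ultra-{direction}"
    return direction


print(classify(2.5, 0.001))    # Up
print(classify(-12.0, 0.001))  # Ultra-Down
print(classify(3.0, 0.2))      # Not significant
```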

Volcano:
Sankey:
- The last node level is the functions linked to each Taxon (When plotting Taxa-Func)

TUKEY_TEST

Select a function:
Test the significant groups in this function.
Select a Taxon:
Test the significant groups in this taxon.
Select both function and taxon:
Test the significant groups in this function and this taxon.

Show Linked Taxa Only: only shows the taxa linked with the current function in the taxa combo box.
Show Linked Func Only: only shows the function linked with the current taxon in the function combo box.
Do not forget to click Reset Function Taxa List to reset all items after the filtering
Tukey result plot:
The dots and lines show the difference in the mean value of the Tukey test

6. Expression Analysis

Co-Expression Networks & Heatmap

Select groups or samples to calculate the correlation and plot the network

Select a table, then set the correlation method and threshold

Add some items to the focus list (Optional)

Network Plot
The Red dots are focus items
The depth of color and the width of edges represent the correlation value
The size of the dot indicates the number of connections
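The construction of such a network from intensity profiles can be sketched as follows: compute pairwise correlations, keep pairs at or above the threshold as edges, and count connections per node to size the dots. Pearson correlation, the rule `r >= threshold`, the item names and the intensities are all assumptions of this example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long sequences (NaN if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def build_network(profiles, threshold):
    """Return edges (a, b, r) with r >= threshold and the degree of each node."""
    edges = []
    degree = {name: 0 for name in profiles}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:  # a NaN correlation never passes this test
            edges.append((a, b, r))
            degree[a] += 1
            degree[b] += 1
    return edges, degree


# Made-up intensity profiles across four samples.
profiles = {
    "taxon_a": [1, 2, 3, 4],
    "taxon_b": [2, 4, 6, 8],
    "func_c": [4, 3, 2, 1],
}
edges, degree = build_network(profiles, threshold=0.8)
print(edges)
print(degree)
```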

Correlation of expression

Expression Trends

Add items to the list window to plot the clusters with similar trends of intensity

Clusters plot (clustered by k-means)
The coloured line is the average

Select a specific cluster to plot interactive lines or get the table
The dashed red line is the average

7. Taxa-Func Link

Taxa-Func Link Plot

Check all taxa in one function (or Check all functions in a taxon)
Select a function, and click the button Show Linked Taxa Only
- Linked Number: The number of taxa linked to this function
- The number at the start of each taxon: The number of peptides in this Taxa-Func

Filter items of the Taxa and Func list

Plot Heatmap or Bar
Select some groups (default all) to get the intensity of each taxon of this function

Plot the peptides in one function of a taxon

Switch the bar plot between stacked and unstacked (or lines)

Change the bar plot to lines

Taxa-Func Network

Select some groups or samples (default all)
Add some taxa, functions or taxa-functions to the focus window (optional)

Plot list only
Plot List Only: Show the items only in the list and the items linked to them
Without Links: Only show the items in the focus list
Network plot
The yellow dots are taxa and the grey dots are functions; the size of a dot represents its intensity
The red dots are the taxa we focused on
The green dots are the functions we focused on
More parameters can be set in Dev -> Settings -> Others (e.g. node shape, color, line style)

8. Restore Last TaxaFunc Object

Once you create a TaxaFunc object, it is saved automatically, and you can restore it next time.
You can also export the current MetaX object to a file and reload it later.

Preparing Your Data

Module 2. Database Builder

Note: The results from the MetaLab v2.3 MaxQuant workflow do not require database building. However, we do not recommend using these results as input to MetaX, as many peptides may be discarded.

Build the database for the first time using the Database Builder.

Option 1: Build Database Using MGnify Data

Ensure you download the correct database type corresponding to your data.

Option 2: Build Database Using Own Data

Annotation Table: A TSV table (tab-separated), with the first column as protein name joined with Genome by "_", e.g., "Genome1_protein1", and other columns containing annotation information.

Taxa Table: A TSV table (tab-separated), with the first column as Genome name, e.g., "Genome1", and the second column as taxa.

Example Annotation Table:

Query	Preferred_name	EC	KEGG_ko
MGYG000000001_00696	mfd	-	ko:K03723
MGYG000000001_02838	hxlR	-	-
MGYG000000001_01674	ispG	1.17.7.1,1.17.7.3	ko:K03526
MGYG000000001_02710	glsA	3.5.1.2	ko:K01425
MGYG000000001_01356	mutS2	-	ko:K07456
MGYG000000001_02630	-	-	-
MGYG000000001_02418	ackA	2.7.2.1	ko:K00925
MGYG000000001_00728	atpA	3.6.3.14	ko:K02111
MGYG000000001_00695	pth	3.1.1.29	ko:K01056
MGYG000000001_02907	-	-	ko:K03086
MGYG000000001_02592	rplC	-	ko:K02906
MGYG000000001_00137	-	-	ko:K03480,ko:K03488

Example Taxa Table:

Genome	Lineage
MGYG000000001	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;f_Peptostreptococcaceae;g_GCA-900066495;s_GCA-900066495 sp902362365
MGYG000000002	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Lachnospirales;f_Lachnospiraceae;g_Blautia_A;s_Blautia_A faecis
MGYG000000003	d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Bacteroidales;f_Rikenellaceae;g_Alistipes;s_Alistipes shahii
MGYG000000004	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Oscillospirales;f_Ruminococcaceae;g_Anaerotruncus;s_Anaerotruncus colihominis
MGYG000000005	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;f_Peptostreptococcaceae;g_Terrisporobacter;s_Terrisporobacter glycolicus_A
MGYG000000006	d_Bacteria;p_Firmicutes;c_Bacilli;o_Staphylococcales;f_Staphylococcaceae;g_Staphylococcus;s_Staphylococcus xylosus
MGYG000000007	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus intestinalis
MGYG000000008	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus johnsonii
MGYG000000009	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Ligilactobacillus;s_Ligilactobacillus murinus
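To show how the two tables relate, the sketch below reads a few rows from each example and links every protein to its genome and lineage. It assumes the genome name is everything before the last underscore of the protein ID, which holds for the examples above but is only an assumption for other naming schemes; the helper names are hypothetical.

```python
import csv
import io

# Two rows of each example table, written as tab-separated text.
ANNOTATION_TSV = (
    "Query\tPreferred_name\tEC\tKEGG_ko\n"
    "MGYG000000001_00696\tmfd\t-\tko:K03723\n"
    "MGYG000000001_01674\tispG\t1.17.7.1,1.17.7.3\tko:K03526\n"
)
TAXA_TSV = (
    "Genome\tLineage\n"
    "MGYG000000001\td_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;"
    "f_Peptostreptococcaceae;g_GCA-900066495;s_GCA-900066495 sp902362365\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def genome_of(protein_id):
    """Take the part before the last underscore as the genome name."""
    return protein_id.rpartition("_")[0]


lineage_by_genome = {row["Genome"]: row["Lineage"] for row in read_tsv(TAXA_TSV)}
linked = [
    {
        "protein": row["Query"],
        "genome": genome_of(row["Query"]),
        "lineage": lineage_by_genome.get(genome_of(row["Query"])),
        "KEGG_ko": row["KEGG_ko"],
    }
    for row in read_tsv(ANNOTATION_TSV)
]
for entry in linked:
    print(entry["protein"], entry["genome"], entry["KEGG_ko"])
```

A protein whose genome is missing from the Taxa Table gets a lineage of `None` in this sketch, which makes such mismatches easy to spot before building the database.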

Module 3. Database Updater

The Database Updater allows updating the database built by the Database Builder or adding more annotations. This step is optional.

Update the built database and extend annotations.

Option 1: Built-in Mode

We recommend some extended databases, such as dbCAN_seq.

Option 2: TSV Table

Extend the database by adding a new database to the database table. Ensure the column separator is a tab and the first column is the Protein name, with other columns containing function annotations.

Example:

Protein ID	COG	KEGG	...
MGYG000000001_02630	Function 1	Function 1	...
MGYG000000001_01475	Function 2	Function 1	...
MGYG000000001_01539	Function 3	Function 1	...

Module 4. Peptide Annotator

1. Results from MAG Workflow

These peptide results come from workflows that use metagenome-assembled genomes (MAGs) as the reference database for protein searches, e.g., MetaLab-MAG, MetaLab-DIA and other workflows that use MAG databases such as MGnify or a customized MAG database.

Annotate the peptides to the Operational Taxa-Functions (OTF) Table before analysis using the Peptide Annotator.

Required:

Database: The database created by the Database Builder

Peptide Table:

Option 1: From MetaLab-MAG results (final_peptides.tsv)
Option 2: Create it manually, with the first column as the ID (e.g., peptide sequence), the second column as the protein IDs from MGnify (e.g., MGYG000003683_00301; MGYG000001490_01143) or your own database, and the other columns as the intensity of each sample.
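A manually created peptide table with this layout could be produced as in the sketch below. The column names `Sequence` and `Proteins`, the peptide sequence, the intensities and the tab separator are assumptions made for illustration; only the column order and the semicolon-separated protein IDs come from the description above.

```python
import csv
import io


def parse_protein_ids(cell):
    """Split a semicolon-separated protein cell into clean IDs."""
    return [p.strip() for p in cell.split(";") if p.strip()]


def write_peptide_table(rows, sample_names):
    """Write ID, protein IDs and per-sample intensities as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sequence", "Proteins", *sample_names])
    for sequence, proteins, intensities in rows:
        writer.writerow([sequence, ";".join(proteins), *intensities])
    return buffer.getvalue()


proteins = parse_protein_ids("MGYG000003683_00301; MGYG000001490_01143")
# Made-up peptide sequence and intensities for two samples.
table = write_peptide_table(
    [("PEPTIDEK", proteins, [100.0, 0.0])],
    ["sample_1", "sample_2"],
)
print(table)
```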