Comparison of protein learning


sinc-lab/Comparison-of-Protein-learning


This repository contains the data and code used in the review of protein sequence embeddings entitled "Transfer learning in proteomics: comparison of novel learned representations for protein sequences," by E. Fenoy, A. Edera and G. Stegmayer (under review). Research Institute for Signals, Systems and Computational Intelligence, sinc(i).

In the figure above, points depict 2D non-linear projections calculated from the 12 protein sequence embeddings studied. Orange points highlight protein sequences having the Immunoglobulin C1-set domain (PF07654).

The figures above show the performance of the 12 embeddings used for predicting the GO terms annotating protein sequences. Performance is measured with the F1 score and predictions are grouped according to the three sub-ontologies of the GO terms: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).
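As a rough illustration of the metric, the sketch below computes a per-protein F1 score between true and predicted GO term sets and averages it within each sub-ontology. The function names, the per-protein averaging, and the toy term labels are all assumptions made for illustration; the exact evaluation protocol used in the review is not described in this README.

```python
def f1_score(true_terms, predicted_terms):
    """F1 between the true and predicted GO term sets of one protein."""
    true_terms, predicted_terms = set(true_terms), set(predicted_terms)
    true_positives = len(true_terms & predicted_terms)
    if true_positives == 0:
        # Covers empty predictions, empty annotations, and no overlap.
        return 0.0
    precision = true_positives / len(predicted_terms)
    recall = true_positives / len(true_terms)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_ontology(records):
    """Average per-protein F1 within each sub-ontology (BP, CC, MF).

    `records` is an iterable of (sub_ontology, true_terms, predicted_terms).
    This grouping scheme is a hypothetical choice, not the review's protocol.
    """
    scores = {}
    for ontology, true_terms, predicted_terms in records:
        scores.setdefault(ontology, []).append(
            f1_score(true_terms, predicted_terms)
        )
    return {ontology: sum(v) / len(v) for ontology, v in scores.items()}


if __name__ == "__main__":
    # Toy records with made-up term labels, not real annotations.
    toy_records = [
        ("BP", {"term_a", "term_b"}, {"term_a", "term_c"}),
        ("BP", {"term_a"}, {"term_a"}),
        ("MF", {"term_x"}, {"term_y"}),
    ]
    print(mean_f1_by_ontology(toy_records))
```

Grouping by sub-ontology before averaging keeps scores for BP, CC and MF separate, which mirrors how the figures report them.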

Introduction

Recently, representation learning techniques have been proposed for encoding different types of protein information (sequence, domains, interactions, etc.) as low-dimensional vectors. In this review, we performed a detailed experimental comparison of several protein sequence embeddings on several bioinformatics tasks:

  • determining similarities between proteins in the embeddings projected space.

  • inferring protein domains.

  • predicting GO ontology-based protein functions.
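To make the first two tasks concrete, here is a minimal sketch of comparing proteins by the similarity of their embedding vectors and finding the most similar known protein for a query. Cosine similarity, the nearest-neighbour lookup, and the protein identifiers are assumptions chosen for illustration; this README does not state which similarity measure or inference method the review used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Similarity is undefined for a zero vector; report 0.
    return dot / (norm_u * norm_v)


def nearest_neighbour(query, embeddings):
    """Return the id of the protein whose embedding is most similar to `query`.

    `embeddings` maps a (hypothetical) protein id to its embedding vector.
    """
    return max(
        embeddings,
        key=lambda protein_id: cosine_similarity(query, embeddings[protein_id]),
    )


if __name__ == "__main__":
    # Toy 2D vectors; real embeddings have 50 to 8,192 dimensions.
    toy_embeddings = {"protein_1": [0.9, 0.1], "protein_2": [0.0, 1.0]}
    print(nearest_neighbour([1.0, 0.0], toy_embeddings))
```

Under this sketch, a domain label could be transferred from the nearest neighbour to the query protein, which is one simple way to probe whether an embedding captures domain information.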

Notebook

This notebook reproduces the visual comparative analysis of the 12 embeddings, which evaluates the capability of protein sequence embeddings to capture protein domain information.

Protein sequence embeddings

The review used 9,479 human protein sequences to build embeddings with 12 embedding methods.

Note: Click the method name below to download the embeddings used in this review.

Embedding    Dimensionality    Reference

CPCProt      512               Lu et al., 2020
DeepGOCNN    8,192             Kulmanov & Hoehndorf, 2019
ESM          1,280             Rives et al., 2021
GP           64                Yang et al., 2018
Plus-RNN     1,024             Min et al., 2021
ProtTrans    1,024             Elnaggar et al., 2021
ProtVec      300               Asgari & Mofrad, 2015
rawMSA       50                Mirabello & Wallner, 2019
RBM          100               Tubiana et al., 2019
SeqVec       1,024             Heinzinger et al., 2019
TAPE         768               Rao et al., 2019
UniRep       1,900             Alley et al., 2019
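After downloading an embedding file, it can be useful to check that every vector has the dimensionality listed in the table above. The sketch below assumes the vectors have already been loaded into a list of lists; the file format of the downloads is not described here, so loading is left out, and the helper name `check_dimensionality` is hypothetical.

```python
# Dimensionalities copied from the table in this README.
EXPECTED_DIMENSIONS = {
    "CPCProt": 512,
    "DeepGOCNN": 8192,
    "ESM": 1280,
    "GP": 64,
    "Plus-RNN": 1024,
    "ProtTrans": 1024,
    "ProtVec": 300,
    "rawMSA": 50,
    "RBM": 100,
    "SeqVec": 1024,
    "TAPE": 768,
    "UniRep": 1900,
}


def check_dimensionality(method, vectors):
    """Return the indices of vectors whose length differs from the table.

    `vectors` is any sequence of per-protein embedding vectors that has
    already been loaded (how to load them is an assumption left to the user).
    """
    expected = EXPECTED_DIMENSIONS[method]
    return [i for i, vector in enumerate(vectors) if len(vector) != expected]


if __name__ == "__main__":
    # Toy data: the second vector is one element short.
    toy_vectors = [[0.0] * 64, [0.0] * 63]
    print(check_dimensionality("GP", toy_vectors))
```

An empty result means every vector matches the expected size for that method.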

