LucaProt

LucaProt(DeepProtFunc) is an open source project developed by Alibaba and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses. See the NOTICE file for more information.

Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structure information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence composition and structure to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RdRP.

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

Figure 1 The Architecture of LucaProt

3) Model Input/Output

The model takes the amino acid letter sequence as its input. It outputs the function label of the input protein, which is a single tag (binary-class classification or multi-class classification) or a set of tags (multi-label classification).

2. Dependence

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: anaconda
Cuda: cuda11.7 (torch==1.13.1)

# Select 'YES' during installation for initializing the conda environment
sh Anaconda3-2022.10-Linux-x86_64.sh
# Source the environment
source ~/.bashrc
# Verification
conda
# Install env and python 3.9.13
conda create -n lucaprot python=3.9.13
# activate env
conda activate lucaprot
# Install git
sudo apt-get update
sudo apt install git-all

# Enter the project
cd LucaProt

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

3. Inference

1) Prediction from one sample

cd LucaProt/src/prediction/
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is predicted in real time.

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5

--protein_id
str, the protein id.
--sequence
str, the protein sequence.
--truncation_seq_length
int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024, default: 4096.
--emb_dir(optional)
path, the saved dirpath of the protein predicted embedding matrix or vector during prediction, optional.
--dataset_name
str, the dataset name for building of our trained model(rdrp_40_extend).
--dataset_type
str, the dataset type for building of our trained model(protein).
--task_type
str, the task name for building of our trained model(binary_class).
--model_type
str, the model name for building of our trained model(sefn).
--time_str
str, the running time string(yyyymmddHimiss) for building of our trained model(20230201140320).
--step
int, the training global step of model finalization(100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
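
As a rough illustration of what the threshold does, the sketch below turns predicted probabilities into labels for the three task types. It is not the code of the prediction script: the function names are hypothetical, and using ">=" rather than ">" for the comparison is an assumption.

```python
from typing import List


def binary_label(prob: float, threshold: float = 0.5) -> int:
    """Binary-class: 1 (positive) if the sigmoid probability reaches the threshold."""
    # Assumption: ">=" is used; the real script may compare with ">" instead.
    return 1 if prob >= threshold else 0


def multi_label(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Multi-label: every label index whose sigmoid probability reaches the threshold."""
    return [idx for idx, p in enumerate(probs) if p >= threshold]


def multi_class(probs: List[float]) -> int:
    """Multi-class: the most probable label index; no threshold is involved."""
    return max(range(len(probs)), key=lambda idx: probs[idx])
```

Raising the threshold trades recall for precision on the positive class, which is why it only applies to the sigmoid-based task types.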

2) Prediction from many samples

The samples are in *.fasta, and prediction runs sample by sample.

--fasta_file
str, the samples fasta file
--save_file
str, file path, save the predicted results into the file.
--print_per_number
int, print progress information for every number of samples completed, default: 100.

cd LucaProt/src/prediction/
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0  

python predict_many_samples.py \
    --fasta_file ../data/rdrp/test/test.fasta \
    --save_file ../result/rdrp/test/test_result.csv \
    --emb_dir ../emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --print_per_number 10
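
For readers unfamiliar with the input format, here is a minimal sketch of reading a FASTA file record by record, which is the shape of data a sample-by-sample loop consumes. The helper names are hypothetical, and truncating from the right (keeping the first residues) is an assumption, not a statement about the actual script.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs from FASTA-formatted lines."""
    protein_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            header = line[1:].split()
            protein_id = header[0] if header else ""
            chunks = []
        elif protein_id is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def truncate(seq: str, max_len: int = 4096) -> str:
    """Keep at most max_len residues (assumption: truncation drops the tail)."""
    return seq[:max_len]
```

A file can be streamed with `with open(path) as fh: for pid, seq in read_fasta(fh): ...`, so only one sample is held in memory at a time.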

3) Prediction from the file

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.

The test data includes 50 viral RdRPs and 50 non-viral RdRPs.

cd LucaProt/src/prediction/
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 2

--data_path
path, the file path of prediction data, including the 9 columns described in the Dataset section. The value of Column Label can be null.
--emb_dir
path, the saved dirpath of all sample's structural embedding information prepared in advance.
--dataset_name
str, the dataset name for building of our trained model(rdrp_40_extend).
--dataset_type
str, the dataset type for building of our trained model(protein).
--task_type
str, the task name for building of our trained model(binary_class).
--model_type
str, the model name for building of our trained model(sefn).
--time_str
str, the running time string(yyyymmddHimiss) for building of our trained model(20230201140320).
--step
int, the training global step of model finalization(100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
--evaluate(optional)
store_true, whether to evaluate the predicted results.
--ground_truth_col_index(optional)
int, the ground truth col index of the ${data_path}, default: None.
--batch_size
int, batch size per GPU/CPU for evaluation, default: 16.
--print_per_batch
int, how many batches are completed every time for printing progress information, default: 1000.

Note: the embedding matrices of all the proteins in this file need to be prepared in advance($emb_dir).

4. Inference Time

LucaProt is suitably speedy because it only needs to predict the structural representation matrix rather than the complete 3D structure of the protein sequence.

Benchmark: For each sequence length range, we selected 50 viral RdRPs and 50 non-viral RdRPs for inference time cost calculation.

Note: The reported time includes the inference of the structural representation matrix and excludes model loading.

1) GPU(Nvidia A100, Cuda: 11.7)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	0.20s	0.24s	0.16s
500 <= Len < 800	0.30s	0.39s	0.24s
800 <= Len < 1,000	0.42s	0.46s	0.39s
1,000 <= Len < 1,500	0.59s	0.74s	0.45s
1,500 <= Len < 2,000	0.87s	1.02s	0.73s
2,000 <= Len < 3,000	1.31s	1.69s	1.01s
3,000 <= Len < 5,000	2.14s	2.78s	1.72s
5,000 <= Len < 8,000	3.03s	3.45s	2.65s
8,000 <= Len < 10,000	3.77s	4.24s	3.32s
10,000 <= Len	9.92s	17.66s	4.30s

2) CPU (16 cores, 64G memory of Alibaba Cloud ECS)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	3.97s	5.71s	2.77s
500 <= Len < 800	5.78s	7.50s	4.48s
800 <= Len < 1,000	8.23s	9.41s	7.41s
1,000 <= Len < 1,500	11.49s	16.42s	9.22s
1,500 <= Len < 2,000	17.71s	22.36s	14.93s
2,000 <= Len < 3,000	26.97s	36.68s	20.99s
3,000 <= Len < 5,000	45.56s	58.42s	35.82s
5,000 <= Len < 8,000	56.57s	58.17s	55.55s
8,000 <= Len < 10,000	57.76s	58.86s	56.66s
10,000 <= Len	66.49s	76.80s	58.42s

3) CPU (96 cores, 768G memory of Alibaba Cloud ECS)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	1.89s	2.55s	1.10s
500 <= Len < 800	2.68s	3.44s	2.13s
800 <= Len < 1,000	3.45s	4.25s	2.65s
1,000 <= Len < 1,500	4.27s	5.90s	3.54s
1,500 <= Len < 2,000	5.81s	7.44s	4.76s
2,000 <= Len < 3,000	8.14s	10.74s	6.37s
3,000 <= Len < 5,000	13.25s	17.69s	10.06s
5,000 <= Len < 8,000	17.03s	18.20s	15.98s
8,000 <= Len < 10,000	17.90s	18.99s	16.92s
10,000 <= Len	25.90s	35.02s	18.66s

5. Dataset for Virus RdRP

1) Fasta

Viral RdRP (Positive: 5,979)

The positive sequence fasta file is in data/rdrp/all_dataset_positive.fasta.zip
all_dataset_positive.fasta.zip
Non-viral RdRP (Negative: 229,434)

The negative sequence fasta file is in dataset/rdrp/all_dataset_negative.fasta.zip
including:
- other proteins of the virus
- other protein domains of the virus
- non-viral proteins
all_dataset_negative.fasta.zip

2) Structural embedding(matrix and vector)

All structural embedding files of the dataset for model building are available at: embs
The structural embedding files of the prediction data are still being prepared for release (because of the amount of data).

3) PDB (3D Structure)

The 3D-structure PDB files of the model building dataset and the predicted data are still being prepared for release (because of the amount of data).

4) Vocab

structure vocab
This vocab file is struct_vocab/rdrp_40_extend/protein/binary_class/struct_vocab.txt
struct_vocab.txt
subword-level vocab
The size of the vocab of sequence we use is 20,000.
This vocab file is vocab/rdrp_40_extend/protein/binary_class/subword_vocab_20000.txt
subword_vocab_20000.txt
char-level vocab
This vocab file is vocab/rdrp_40_extend/protein/binary_class/vocab.txt
vocab.txt

5) Label

Viral RdRP identification is a binary-class classification task, including positive and negative classes, using 0 and 1 to represent a negative and positive sample, respectively. The label list file is dataset/rdrp_40_extend/protein/binary_class/label.txt
label.txt

6) Dataset

We constructed a data set with 235,413 samples for model building, which included 5,979 positive samples of known viral RdRPs (i.e. the well-curated RdRP database described in the previous section of Methods), and 229,434 (to maintain a 1:40 ratio for viral RdRP and non-virus RdRPs) negative samples of confirmed non-virus RdRPs. The non-virus RdRPs contained proteins from Eukaryota DNA dependent RNA polymerase (Eu DdRP, N=1,184), Eukaryota RNA dependent RNA polymerase (Eu RdRP, N=2,233), Reverse Transcriptase (RT, N=48,490), proteins obtained from DNA viruses (N=1,533), non-RdRP proteins obtained from RNA viruses (N=1,574), and a wide array of cellular proteins from different functional categories (N=174,420). We randomly divided the dataset into training, validation, and testing sets with a ratio of 8.5:1:1, which were used for model fitting, model finalization (based on the best F1-score training iteration), and performance reporting (including accuracy, precision, recall, F1-score, and Area under the ROC Curve (AUC)), respectively.
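
The random split can be sketched as follows. This is only an illustration of a ratio-based split, not the procedure actually used to build the released files: the function name, the fixed seed, and the absence of stratification by label are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    ratios: Tuple[float, float, float] = (8.5, 1.0, 1.0),
    seed: int = 42,  # hypothetical seed, only for reproducibility of the sketch
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle samples and cut them into train/dev/test by the given ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = round(len(items) * ratios[0] / total)
    n_dev = round(len(items) * ratios[1] / total)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the testing set
    return train, dev, test
```

With the 235,413 samples described above (5,979 positive + 229,434 negative), this arithmetic gives 190,572 training, 22,420 validation, and 22,421 testing samples; the sizes of the released files may differ.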

Entire Dataset
This file is dataset/rdrp/all_dataset_with_pdb_emb.csv.zip
all_dataset_with_pdb_emb.csv.zip
Training set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/train_with_pdb_emb.csv
train_with_pdb_emb.csv
Validation set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/dev_with_pdb_emb.csv
dev_with_pdb_emb.csv
Testing set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/test_with_pdb_emb.csv
test_with_pdb_emb.csv

One row in all the above files represents one sample. All three files consist of 9 columns, including prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source. The details of these columns are as follows:

prot_id
the protein id
seq
the amino acid(aa) sequence
seq_len
the length of the protein sequence.
pdb_filename
the PDB filename of the 3D-structure, which is predicted by a computational model or obtained by experiments.
ptm
the pTM of the predicted 3D-structure.
mean_plddt
the mean pLDDT of the predicted 3D-structure.
emb_filename
The filename of the embedding matrix or vector of protein structure.
Note: the embedding matrices of the dataset need to be prepared in advance.
label
the sample label, 0 or 1 for binary-class classification, [0, 1, …, N-1] for multi-class classification, a list of [0, 1, …, N-1] for multi-label classification.
source
optional, the sample source (such as RdRP, RT, DdRP, non-virus RdRP, and Other).

Note: if using strategy one in structure encoder, the pdb_filename, the ptm, and the mean_plddt can be null.

6. Supported Task Types

binary-class classification
The label is 0 or 1 for binary-class classification, such as viral RdRP identification.
multi-class classification
The label is 0~N-1 for multi-class classification, such as the species prediction for proteins.
multi-label classification
The labels form a list of 0~N-1 for multi-label classification, such as Gene Ontology annotation for proteins.
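
To make the three label shapes concrete, the sketch below validates a label and, for the multi-label case, expands the index list into a multi-hot vector. The function is hypothetical and only illustrates the conventions listed above; how LucaProt encodes labels internally is not shown here.

```python
from typing import List, Sequence, Union


def encode_label(
    label: Union[int, Sequence[int]], task_type: str, num_labels: int
) -> Union[int, List[int]]:
    """Validate a label and return the target for the given task type."""
    if task_type == "binary_class":
        if label not in (0, 1):
            raise ValueError("binary-class labels must be 0 or 1")
        return int(label)
    if task_type == "multi_class":
        if not 0 <= label < num_labels:
            raise ValueError("multi-class labels must be in 0..N-1")
        return int(label)
    if task_type == "multi_label":
        # A list of label indices becomes a multi-hot vector of length N.
        target = [0] * num_labels
        for idx in label:
            if not 0 <= idx < num_labels:
                raise ValueError("multi-label indices must be in 0..N-1")
            target[idx] = 1
        return target
    raise ValueError(f"unknown task type: {task_type}")
```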

7. Building Your Model

1) Prediction of protein 3D-structure(Optional)

The script structure_from_esm_v1.py is in the directory "src/protein_structure", and it uses ESMFold (esmfold_v1) to predict the 3D-structure of a protein.

I. Prediction from file

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -i data/rdrp/rdrp.fasta \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

-i (input filepaths)
- fasta filepath
- csv filepath
  the first row is the header
  column 0: protein_id
  column 1: sequence
- multi filepaths
  comma-concatenation
-o (save dirpath)
The dir path for saving the predicted 3D-structure data. Each protein is stored in a PDB file, and each PDB file is named as "protein_" + an auto-increment id + ".pdb", such as "protein_1.pdb".
The mapping between protein ids and auto-increment ids is stored in the file "result_info.csv" (including: "index", "protein_id(uuid)", "seq_len", "ptm", "mean_plddt") in this dir path.
For failed samples(CUDA out of memory), this script will save their protein ids in the "uncompleted.txt", and you can reduce the value of "truncation_seq_length" and add "--try_failure" for retry.
--batch_size
the batch size of running, default: 1.
--truncation_seq_length
truncate sequences longer than the given value, recommended values: 4096, 2048, 1984, 1792, 1536, 1280, 1152, 1022.
--num-recycles
number of recycles to run.
--chunk-size
chunks axial attention computation to reduce memory usage from O(L^2) to O(L), recommended values: 128, 64, 32.
--try_failure
retry the failed samples when reducing the "truncation_seq_length" value.

II. Prediction from input sequences

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

-name
protein ids, comma-concatenation for multi proteins.
-seq
protein sequences, comma-concatenation for multi proteins.

2) Prediction of protein structural embedding

The script embedding_from_esmfold.py is in "src/protein_structure", and it uses ESMFold (esm2_t36_3B_UR50D) to predict protein structural embedding matrices or vectors.

I. Prediction from file

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    --file data/rdrp.fasta \
    --output_dir emb/rdrp/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

--model_name
the model name, default: "esm2_t36_3B_UR50D"
-i/--file (input filepath)
- fasta filepath
- csv filepath
  the first row is the header
  column 0: protein_id
  column 1: sequence
-o/--output_dir (save dirpath)
The dir path for saving the predicted structural embedding data. Each protein is stored in a pickle file, and each embedding file is named as "embedding_" + auto-increment id + ".pt", such as "embedding_1.pt". The mapping between protein ids and auto-increment ids is stored in the file "{}embedfastaid2idx.csv"(including: "index", "protein_id(uuid)") in this dir path. For failed samples(CUDA out of memory), this script will save their protein ids in the "{}embeduncompleted.txt", and you can reduce the "truncation_seq_length" value and add "--try_failure" for retry.
--truncation_seq_length
truncate sequences longer than the given value. Recommended values: 4094, 2046, 1982, 1790, 1534, 1278, 1150, 1022.
--include
The embedding matrix or vector type of the predicted structural embedding data, including per_tok, mean, contacts, and bos.
- per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- mean includes the embeddings averaged over the full sequence, per layer.
- bos includes the embeddings from the beginning-of-sequence token.
- contacts includes the attention value between two amino acids of the full sequence.
Reference: https://github.com/facebookresearch/esm [Compute embeddings in bulk from FASTA]
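
The relation between the per-token matrix and the averaged vector can be shown with a toy example. This is a plain-Python sketch on nested lists for a single layer; the function name is hypothetical, and the real files store tensors rather than lists.

```python
from typing import List


def mean_embedding(per_tok: List[List[float]]) -> List[float]:
    """Average a (seq_len x hidden_dim) matrix over the sequence axis."""
    if not per_tok:
        raise ValueError("the per-token matrix is empty")
    seq_len = len(per_tok)
    hidden_dim = len(per_tok[0])
    # One averaged value per hidden dimension, giving a hidden_dim-sized vector.
    return [sum(row[j] for row in per_tok) / seq_len for j in range(hidden_dim)]
```

The matrix keeps one row per amino acid, so its size grows with the sequence length, while the averaged vector always has hidden_dim entries regardless of length.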

II. Prediction from input sequences

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    --output_dir embs/rdrp/test/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

-name
protein ids, comma-concatenation for multi proteins.
-seq
protein sequences, comma-concatenation for multi proteins.

3) Construct dataset for model building

Construct your dataset and randomly divide the dataset into training, validation, and testing sets with a specified ratio, and save the three sets in dataset/${dataset_name}/${dataset_type}/${task_type}, including train.csv, dev.csv, test_*.csv.

The file format can be .csv (must include the header) or .txt (does not need to have the header).

Each file line is a sample containing 9 columns, including prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source.

Column seq is the sequence. Column pdb_filename is the saved PDB filename for structure encoder strategy 2. Column ptm and Column mean_plddt are optional and are obtained from the 3D-structure prediction model. Column emb_filename is the saved embedding filename for structure encoder strategy 1. Column label is the sample class (a single value or a list value of label index or label name). Column source is the sample source (optional).

For example:

like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp
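
A row in this 9-column layout can be read as sketched below. The `Sample` class and `parse_row` helper are hypothetical names used only for illustration; empty fields are mapped to None, and the label is kept as raw text because it may be a single value or a list depending on the task type.

```python
import csv
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    prot_id: str
    seq: str
    seq_len: int
    pdb_filename: Optional[str]
    ptm: Optional[float]
    mean_plddt: Optional[float]
    emb_filename: Optional[str]
    label: str  # raw text; its interpretation depends on the task type
    source: Optional[str]


def parse_row(line: str) -> Sample:
    """Parse one comma-separated data line (not the header) into a Sample."""
    fields = next(csv.reader([line]))
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    prot_id, seq, seq_len, pdb, ptm, plddt, emb, label, source = fields
    return Sample(
        prot_id=prot_id,
        seq=seq,
        seq_len=int(seq_len),
        pdb_filename=pdb or None,
        ptm=float(ptm) if ptm else None,
        mean_plddt=float(plddt) if plddt else None,
        emb_filename=emb or None,
        label=label,
        source=source or None,
    )
```

Using the csv module rather than `str.split(",")` keeps quoted fields intact, which matters if a multi-label value is written as a quoted list.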

Note: if your dataset takes too much space to load into memory at once, use "src/data_process/data_preprocess_into_tfrecords_for_rdrp.py" to convert the dataset into "tfrecords", and then create an index file: python -m tfrecord.tools.tfrecord2idx xxxx.tfrecords xxxx.index

4) Training the model

run.py
the main script for building model.
Parameters
- data_dir: path, the dataset dirpath
- filename_pattern: the dataset filename pattern, such as "{}_with_pdb_emb.csv", including train_with_pdb_emb.csv, dev_with_pdb_emb.csv, and test_with_pdb_emb.csv in ${data_dir}
- separate_file: store_true, load the entire dataset into memory, the names of the pdb and embedding files are listed in the train/dev/test.csv, and need to load them.
- tfrecords: store_true, whether the dataset is in the tfrecords, when true, only the specified number of samples(${shuffle_queue_size}) are loaded into memory at once. The tfrecords must consist of "${data_dir}/tfrecords/train/xxx.tfrecords", "${data_dir}/tfrecords/dev/xxx.tfrecords" and "${data_dir}/tfrecords/test/xxx.tfrecords". "xxx.tfrecords" is one of 01-of-01.tfrecords(only including sequence), 01-of-01_emb.records (including sequence and structural embedding), and 01-of-01_pdb_emb.records (including sequence, 3D-structure contact map, and structural embedding).
- shuffle_queue_size: int, how many samples are loaded into memory at once, default: 5000.
- dataset_name: str, your dataset name, such as "rdrp_40_extend"
- dataset_type: str, your dataset type, such as "protein"
- task_type: choices=["multi_label", "multi_class", "binary_class"], your task type, such as "binary_class"
- model_type: choices=["sequence", "structure", "embedding", "sefn", "ssfn"], they represent only the sequence for input, only the 3D-structure contact map for input, only the structural embedding for input, the sequence and the structural embedding for input, and the sequence and the 3D-structure contact map for input, respectively
- subword: store_true, whether to process for sequence at the subword level.
- codes_file: path, subword codes filepath when using subword, such as "../subword/rdrp/protein_codes_rdrp_20000.txt"
- label_type: str, the label type name, such as "rdrp"
- label_filepath: path, the label list filepath
- cmap_type: choices=["C_alpha", "C_bert"], the calculation type of 3D-structure contact map
- cmap_thresh: the distance threshold (Unit: Angstrom) in contact map calculation. Two amino acids are linked if the distance between them is equal to or less than the threshold, default: 10.0.
- output_dir: path, the output dirpath
- log_dir: path, the logger savepath
- tb_log_dir: path, the save path of metric evaluation records in model training, the tensorboardX can be used to show these metrics.
- config_path: path, the configuration filepath of the model.
- seq_vocab_path: path, the vocab filepath of sequence tokenizer
- struct_vocab_path: path, the vocab filepath of 3D-structure node (Structural Encoder Strategy 2)
- seq_pooling_type: choices=["none", "max", "value_attention"], the sequence representation matrix pooling type, "none" represents that \ vector is used.
- struct_pooling_type: choices=["max", "value_attention"], the 3D-structure representation matrix pooling type.
- embedding_pooling_type: choices=["none", "max", "value_attention"], the structural embedding representation matrix pooling type, "none" represents that \ vector is used.
- evaluate_during_training: store_true, whether to evaluate the validation set and the testing set during training.
- do_eval: store_true, whether to use the best saved model to evaluate the validation set.
- do_predict: store_true, whether to use the best saved model to evaluate the testing set.
- do_lower_case: store_true, whether to lowercase the input when tokenizing.
- per_gpu_train_batch_size: int, batch size per GPU/CPU for training, default: 16
- per_gpu_eval_batch_size: int, batch size per GPU/CPU for evaluation, default: 16
- gradient_accumulation_steps: int, number of updates steps to accumulate before performing a backward/update pass, default: 1.
- learning_rate: float, the initial learning rate for Adam, default: 1e-4.
- num_train_epochs: int, the total number of training epochs to perform, default: 50.
- logging_steps: log every X updates steps, default: 1000.
- loss_type: choices=["focal_loss", "bce", "multilabel_cce", "asl", "cce"], loss-function type of model training, default: "bce".
- max_metric_type: choices=["acc", "jaccard", "prec", "recall", "f1", "fmax", "roc_auc", "pr_auc"], which metric is used for model finalization, default: "f1".
- pos_weight: float, positive samples weight for "bce".
- focal_loss_alpha: float, alpha for focal loss, default: 0.7.
- focal_loss_gamma: float, gamma for focal loss, default: 2.0.
- focal_loss_reduce: store_true, "mean" for one sample when in multi-label classification, default: "sum".
- asl_gamma_neg: float, negative gamma for asl, default: 4.0.
- asl_gamma_pos: float, positive gamma for asl, default: 1.0.
- seq_max_length: int, the length of input sequence more than max length will be truncated, shorter will be padded, default: 2048.
- struct_max_length: int, the length of input contact map more than max length will be truncated, shorter will be padded, default: 2048.
- trunc_type: choices=["left", "right"], the truncate type for whole input sequence, default: "right".
- no_position_embeddings: store_true, whether not use position embedding for the sequence.
- no_token_type_embeddings: store_true, whether not use token type embedding for the sequence.
- embedding_input_size: int, the dim of the structural embedding vector/matrix, default: 2560, {"esm2_t30_150M_UR50D": 640, "esm2_t33_650M_UR50D": 1280, "esm2_t36_3B_UR50D": 2560, "esm2_t48_15B_UR50D": 5120}.
- embedding_type: choices=[None, "contacts", "bos", "matrix"], the type of the structural embedding info, default: "matrix".
- embedding_max_length: int, the length of input embedding matrix more than max length will be truncated, shorter will be padded, default: 2048.
- save_all: store_true, the model for each evaluation is saved.
- delete_old: store_true, only save the best metric (${max_metric_type}) model of all evaluation on testing set during training.
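
The cmap_thresh rule (two amino acids are linked when their distance does not exceed the threshold) can be sketched as follows. This is a toy implementation on plain coordinate tuples, not the code in "src/biotoolbox"; the function name and the choice to mark each residue as in contact with itself are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: Sequence[Coord], thresh: float = 10.0) -> List[List[int]]:
    """Binary contact map from one 3D coordinate (in Angstrom) per residue."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Linked if the Euclidean distance is at most the threshold.
            if math.dist(coords[i], coords[j]) <= thresh:
                cmap[i][j] = cmap[j][i] = 1
    return cmap
```

For example, residues at (0, 0, 0) and (3, 4, 0) are 5 Angstrom apart and therefore linked under the default threshold, while a residue at (0, 0, 20) is linked to neither.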
Training
```shell
#!/bin/bash

export CUDA_VISIBLE_DEVICES=0

DATASET_NAME="rdrp_40_extend"
DATASET_TYPE="protein"
TASK_TYPE="binary_class"

# sequence + structural embedding
MODEL_TYPE="sefn"
CONFIG_NAME="sefn_config.json"
INPUT_MODE="single"
LABEL_TYPE="rdrp"
embedding_input_size=2560
embedding_type="matrix"
SEQ_MAX_LENGTH="2048"
embedding_max_length="2048"
TRUNCT_TYPE="right"

# none, max, value_attention
SEQ_POOLING_TYPE="value_attention"

# max, value_attention
embedding_pooling_type="value_attention"
VOCAB_NAME="subword_vocab_20000.txt"
SUBWORD_CODES_NAME="protein_codes_rdrp_20000.txt"
MAX_METRIC_TYPE="f1"
time_str=$(date "+%Y%m%d%H%M%S")

python run.py \
  --data_dir ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE \
  --tfrecords \
  --filename_pattern {}_with_pdb_emb.csv \
  --dataset_name $DATASET_NAME \
  --dataset_type $DATASET_TYPE \
  --task_type $TASK_TYPE \
  --model_type $MODEL_TYPE \
  --subword \
  --codes_file ../subword/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$SUBWORD_CODES_NAME \
  --input_mode $INPUT_MODE \
  --label_type $LABEL_TYPE \
  --label_filepath ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/label.txt \
  --output_dir ../models/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --log_dir ../logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --tb_log_dir ../tb-logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --config_path ../config/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$CONFIG_NAME \
  --seq_vocab_path ../vocab/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$VOCAB_NAME \
  --seq_pooling_type $SEQ_POOLING_TYPE \
  --embedding_pooling_type $embedding_pooling_type \
  --do_train \
  --do_eval \
  --do_predict \
  --evaluate_during_training \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --num_train_epochs=50 \
  --logging_steps=1000 \
  --save_steps=1000 \
  --overwrite_output_dir \
  --sigmoid \
  --loss_type bce \
  --max_metric_type $MAX_METRIC_TYPE \
  --seq_max_length=$SEQ_MAX_LENGTH \
  --embedding_max_length=$embedding_max_length \
  --trunc_type=$TRUNCT_TYPE \
  --no_token_type_embeddings \
  --embedding_input_size $embedding_input_size \
  --embedding_type $embedding_type \
  --shuffle_queue_size 10000 \
  --save_all
```
Configuration file
The configuration files of all methods are in "config/rdrp_40_extend/protein/binary_class/". If training your model, please put the configuration file in "config/${dataset_name}/${dataset_type}/${task_type}/"
Value meaning in configuration file
Refer to "src/SSFN/README.md".
Baselines
- LGBM (using the embedding vector: \ as the input)
```
  cd src/baselines/
  sh run_lgbm.sh
```
- XGBoost (using the embedding vector: \ as the input)
```
  cd src/baselines/
  sh run_xgb.sh
```
- DNN (using the embedding vector: \ as the input)
```
  cd src/baselines/
  sh run_dnn.sh
```
Or:
```
  cd src/training
  sh run_subword_rdrp_emb.sh
```
- Transformer-Char Level (using the sequence as the input)
```
  cd src/training
  sh run_char_rdrp_seq.sh
```
- Transformer-Subword Level (using the sequence as the input)
```
  cd src/training
  sh run_subword_rdrp_seq.sh
```
- DNN2 (VALP + DNN, using the embedding matrix as the input)
```
  cd src/training
  sh run_subword_rdrp_emb_v2.sh
```
Ours
- Ours (the sequence + the 3D-structure)
  coming soon…
- Ours (the sequence + the embedding matrix)
```
  cd src/training
  sh run_subword_rdrp_sefn.sh
```

5) Training Logging Information

logs

The running information is saved in "logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/logs.txt".

The information includes the model configuration, model layers, running parameters, and evaluation information.

models

The checkpoints are saved in "models/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}/". This directory includes "pytorch_model.bin", "config.json", "training_args.bin", and tokenizer information "sequence" or "struct". The details are shown in Figure 2.

Figure 2: The File List in Checkpoint Dir Path

tb-logs

The metrics are recorded in "tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/events.out.tfevents.xxxxx.xxxxx"

run: tensorboard --logdir=tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str} --bind_all

predicts

The predicted results are saved in "predicts/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}", including:

pred_confusion_matrix.png
pred_metrics.txt
pred_result.csv
seq_length_distribution.png

The details are shown in Figure 3.

Figure 3: The File List in Prediction Dir Path

Note: when using the saved model to predict, the "logs.txt" and the checkpoint dirpath will be used.
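
Since all four output locations share the same run-specific suffix, they can be derived from one set of values. The helper below is a hypothetical convenience sketch that mirrors the path templates above; it is not part of the project code.

```python
from pathlib import PurePosixPath
from typing import Dict


def run_paths(
    dataset_name: str,
    dataset_type: str,
    task_type: str,
    model_type: str,
    time_str: str,
    global_step: int,
) -> Dict[str, str]:
    """Build the log, checkpoint, tensorboard, and prediction paths of one run."""
    run = PurePosixPath(dataset_name, dataset_type, task_type, model_type, time_str)
    checkpoint = f"checkpoint-{global_step}"
    return {
        "log_file": str(PurePosixPath("logs") / run / "logs.txt"),
        "checkpoint_dir": str(PurePosixPath("models") / run / checkpoint),
        "tb_log_dir": str(PurePosixPath("tb-logs") / run),
        "predict_dir": str(PurePosixPath("predicts") / run / checkpoint),
    }
```

The paths are relative to the project root, so the prediction options for the time string and the step select exactly one checkpoint directory.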

8. Related to the Project

1) ClstrSearch

A conventional approach that clusters all proteins based on their sequence homology.

See ClstrSearch/README.md for details.

2) src

Construct RdRP Dataset for Model Building

*.py in "src/data_preprocess"

Model

*.py in "src/SSFN"

Prediction Shell Script

*.sh in "src/prediction"
including:

run_predict_from_file.sh
run prediction for many samples from a file, with the structural embedding information prepared in advance.
run_predict_one_sample.sh
run prediction for one sample from the input.
run_predict_many_samples.sh
run prediction for many samples from the input.

We perform ablation studies on our model by removing specific modules (sequence-specific and embedding-specific) one at a time to explore their relative importance.

run_predict_only_seq_from_file.sh
only using the sequence to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs by prediction SRA.
run_predict_only_emb_from_file.sh
only using the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs by prediction SRA.
run_predict_seq_emb_from_file.sh
using the sequential info and the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs by prediction SRA.

Baselines

*.py in "src/baselines", using the embedding vector as the input, including:

DNN
LGBM
XGBoost

Baselines for Deep Learning

*.py in "src/deep_baselines", including:

CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep learning(2021). code: CHEER

VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data(2022). code: VirHunter

Virtifier: a deep learning-based identifier for viral sequences from metagenomes(2022). code: Virtifier

RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. code: RNN-VirSeeker

run_deep_baselines.sh
the script to train deep baseline models.
run_predict_deep_baselines.sh
use trained deep baseline models to predict three positive test datasets, three negative test datasets, and our checked RdRP datasets.
run.py
the main script for training deep baseline models.
statistics
the script that computes the accuracy on three kinds of test datasets (positive, negative, our checked) after prediction by deep baselines.

Contact Map Generator

*.py in "src/biotoolbox"

Loss & Metrics

*.py in "src/common"

Training Model

*.sh in "src/training"

Prediction of Model

*.sh in "src/prediction"

3) Data

Raw Data

the raw data is in "data/".

Dataset

the files of the dataset are in "dataset/${dataset_name}/${dataset_type}/${task_type}/".

4) Model Configuration

the configuration files of all methods are in "config/${dataset_name}/${dataset_type}/${task_type}/".

5) Pic

some pictures are in "pics/".

6) Plot

the scripts for plotting pictures are in "src/plot".

7) Spider

the codes and results of Geo information Spider are in "src/geo_map".

9. Open Resource

The open resources of our study include six subdirectories: Known_RdRPs, Results, All_Contigs, All_Protein_Sequences, LucaProt, and Self_Sequencing_Reads.

LucaProt/ includes some resources related to LucaProt, including code, model building dataset, model testing datasets, and our trained model.

1) Code

As mentioned above.

2) Dataset

Model Building Dataset

sequential info
train_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

dev_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

test_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/
structural info
embs
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/embs/
tfrecords
train
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/train/

dev
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/dev/

test
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/test/
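
After copying, it can help to confirm that everything landed in the expected place. The following is a minimal sketch under stated assumptions: `missing_entries` is a hypothetical helper, and the expected entries are simply the file and directory names listed above, with the underscores in the CSV names assumed.

```python
from pathlib import Path
from typing import List

# Entries expected inside the binary_class dataset directory, per the list above.
EXPECTED = [
    "train_with_pdb_emb.csv",
    "dev_with_pdb_emb.csv",
    "test_with_pdb_emb.csv",
    "embs",
    "tfrecords/train",
    "tfrecords/dev",
    "tfrecords/test",
]


def missing_entries(dataset_dir: str) -> List[str]:
    """Return the expected files or directories that do not exist yet."""
    base = Path(dataset_dir)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


target = "LucaProt/dataset/rdrp_40_extend/protein/binary_class"
for rel in missing_entries(target):
    print("missing:", rel)
```

An empty result means every listed file and directory is present; it does not check the contents of the files.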

Model Testing (Validation) Dataset

Three Positive Testing Datasets
- sequential info
  Neri RdRP
  copy to LucaProt/data/rdrp
  Reference: Expansion of the global RNA virome reveals diverse clades of bacteriophages

  Zayed RdRP
  copy to LucaProt/data/rdrp
  Reference: Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome

  Chen RdRP
  copy to LucaProt/data/rdrp
  Reference: RNA viromes from terrestrial sites across China expand environmental viral diversity
- structural info
  Neri RdRP

  Zayed RdRP

  Chen RdRP
Three Negative Testing Datasets
- sequential info
  RT
  copy to LucaProt/data/rdrp

  Eu DdRP
  copy to LucaProt/data/rdrp

  Eu RdRP
  copy to LucaProt/data/rdrp
- structural info
  RT
  
  Eu DdRP
  
  Eu RdRP

Results

Our Checked RdRP Dataset (Our Results)
- sequential info
  ours_checked_rdrp_final.csv
- structural info
  embs
- PDB
  The 3D-structure PDB files of all our predicted results are being prepared for release.

Self-Samples

Our Sampled Dataset
- fasta
  00self_sequencing_300aa.pep

3) Trained Model

The trained model for RdRP identification is available at:

logs
logs
copy to LucaProt/logs/
models
models
copy to LucaProt/models/

10. Contributor

LucaTeam:
Yong He, Zhaorong Li, Xin Hou, Mang Shi

11. FTP

All data of LucaProt are available at the website: Open Resources

12. Citation

The pre-print version:

@article { lucaprot,
author = {Xin Hou and Yong He and Pan Fang and Shi-Qiang Mei and Zan Xu and Wei-Chen Wu and Jun-Hua Tian and Shun Zhang and Zhen-Yu Zeng and Qin-Yu Gou and Gen-Yang Xin and Shi-Jia Le and Yin-Yue Xia and Yu-Lan Zhou and Feng-Ming Hui and Yuan-Fei Pan and John-Sebastian Eden and Zhao-Hui Yang and Chong Han and Yue-Long Shu and Deyin Guo and Jun Li and Edward C Holmes and Zhao-Rong Li and Mang Shi},
title = {Artificial intelligence redefines RNA virus discovery},
elocation-id = {2023.04.18.537342},
year = {2023},
doi = {10.1101/2023.04.18.537342},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342},
eprint = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342.full.pdf},
journal = {bioRxiv}
}

13. Pip

name: lucaprot
channels:
  - defaults
dependencies:
  - pip:
        - h5py==3.8.0
        - biopython==1.80
        - biotite==0.35.0
        - brotlipy==0.7.0
        - numpy==1.24.2
        - obonet==0.3.1
        - pandas==1.5.3
        - pickle5==0.0.11
        - Pillow==9.3.0
        - scikit-learn==1.2.1
        - scipy==1.10.1
        - seaborn==0.12.2
        - six==1.16.0
        - subword-nmt==0.3.8
        - tensorboard==2.11.2
        - tensorboardX==2.5.1
        - tensorflow==2.11.0
        - tensorflow-estimator==2.11.0
        - tfrecord==1.14.1
        - tokenizers==0.13.2
        - torch==1.13.1
        - torchaudio==0.13.1
        - torchvision==0.14.1
        - tqdm==4.64.1
        - transformers==4.26.0
        - huggingface-hub==0.12.0
        - matplotlib==3.6.3
        - Werkzeug==2.2.2
        - wget==3.2
        - wrapt==1.14.1
        - xgboost==1.7.3
        - zipp==3.12.0
        - lightgbm==3.3.5
        - BeautifulSoup4==4.11.1
        - requests==2.24.0
        - gemmi==0.5.8
        - networkx==3.0
        - fair-esm[esmfold]
        - dllogger @ git+https://github.com/NVIDIA/dllogger.git
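
To check that an existing environment matches these pins, the installed versions can be compared against the list. This is a minimal sketch, not part of the project: `parse_pin` and `mismatches` are hypothetical helpers, and only the standard-library `importlib.metadata` module (Python 3.8+) is used. Entries without an `==` pin, such as the `git+` dependency or extras like `fair-esm[esmfold]`, are not handled and would need separate treatment.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_pin(line: str) -> Tuple[str, str]:
    """Split a '- name==version' list entry into (name, version)."""
    name, _, version = line.strip().lstrip("- ").partition("==")
    return name.strip(), version.strip()


def mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (wanted, installed).

    `installed` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


# Two entries from the list above, used as an example.
pins = dict(parse_pin(entry) for entry in ["- torch==1.13.1", "- numpy==1.24.2"])
print(mismatches(pins))
```

An empty dictionary means every checked package is installed at the pinned version.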
