LucaProt
LucaProt(DeepProtFunc) is an open source project developed by Alibaba and licensed under the Apache License (Version 2.0).
This product contains various third-party components under other open source licenses.
See the NOTICE file for more information.
Introduction
LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structure information to predict protein function.
1. Model
1) Model Introduction
We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence composition and structure to facilitate the accurate annotation of protein function.
Here, we applied LucaProt to identify viral RdRP.
2) Model Architecture
We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.
Figure 1 The Architecture of LucaProt
3) Model Input/Output
The model takes the amino acid letter sequence as its input. It outputs the function label of the input protein, which is a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).
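The difference between "a single tag" and "a set of tags" can be illustrated with a minimal sketch. It is not LucaProt's own code: the function name `decode_prediction` is hypothetical, and it assumes the model has already produced per-class probabilities.

```python
from typing import List, Sequence, Union


def decode_prediction(probs: Sequence[float], task_type: str,
                      threshold: float = 0.5) -> Union[int, List[int]]:
    """Turn per-class probabilities into a single tag or a set of tags.

    Hypothetical helper for illustration; the task type names follow
    the values used elsewhere in this README.
    """
    if task_type == "binary_class":
        # One sigmoid probability for the positive class.
        return int(probs[0] >= threshold)
    if task_type == "multi_class":
        # Exactly one class wins: take the arg-max, no threshold needed.
        return max(range(len(probs)), key=lambda i: probs[i])
    if task_type == "multi_label":
        # Every class whose probability reaches the threshold is kept.
        return [i for i, p in enumerate(probs) if p >= threshold]
    raise ValueError(f"unknown task type: {task_type}")
```

For example, `decode_prediction([0.8, 0.3, 0.6], "multi_label")` keeps classes 0 and 2, while the same probabilities under `"multi_class"` yield only class 0.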
2. Dependence
System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: anaconda
Cuda: cuda11.7 (torch==1.13.1)
# Select 'YES' during installation for initializing the conda environment
sh Anaconda3-2022.10-Linux-x86_64.sh
# Source the environment
source ~/.bashrc
# Verification
conda
# Install env and python 3.9.13
conda create -n lucaprot python=3.9.13
# activate env
conda activate lucaprot
# Install git
sudo apt-get update
sudo apt install git-all
# Enter the project
cd LucaProt
# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
3. Inference
1) Prediction from one sample
cd LucaProt/src/prediction/
sh run_predict_one_sample.sh
Note: the embedding matrix of the sample is predicted in real time.
Or:
cd LucaProt/src/
export CUDA_VISIBLE_DEVICES=0
python predict_one_sample.py \
--protein_id protein_1 \
--sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
--emb_dir ./emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5
--protein_id
str, the protein id.
--sequence
str, the protein sequence.
--truncation_seq_length
int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024, default: 4096.
--emb_dir(optional)
path, the directory in which the predicted embedding matrix or vector of the protein is saved during prediction, optional.
--dataset_name
str, the dataset name used to build our trained model (rdrp_40_extend).
--dataset_type
str, the dataset type used to build our trained model (protein).
--task_type
str, the task name used to build our trained model (binary_class).
--model_type
str, the model name used to build our trained model (sefn).
--time_str
str, the running time string (yyyymmddHimiss) of our trained model (20230201140320).
--step
int, the training global step at which the model was finalized (100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
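As a rough illustration of what the truncation parameter does, the sketch below cuts a sequence down to a maximum length. The function name `truncate_sequence` is hypothetical, and the reading of "right" as "drop residues from the right end" is an assumption, not something this README states.

```python
def truncate_sequence(seq: str, max_len: int = 4096,
                      trunc_type: str = "right") -> str:
    """Shorten sequences longer than max_len; leave shorter ones untouched.

    Assumption: "right" drops residues from the right (C-terminal) end and
    "left" drops them from the left (N-terminal) end.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(seq) <= max_len:
        return seq
    if trunc_type == "right":
        return seq[:max_len]
    if trunc_type == "left":
        return seq[-max_len:]
    raise ValueError(f"unknown trunc_type: {trunc_type}")
```

Lowering the limit (for example from 4096 to 2048) trades some sequence context for lower memory use on long proteins.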
2) Prediction from many samples
The samples are in a *.fasta file and are predicted one by one.
--fasta_file
str, the samples fasta file
--save_file
str, file path, save the predicted results into the file.
--print_per_number
int, print progress information each time this number of samples is completed, default: 100.
cd LucaProt/src/prediction/
sh run_predict_many_samples.sh
Or:
cd LucaProt/src/
export CUDA_VISIBLE_DEVICES=0
python predict_many_samples.py \
--fasta_file ../data/rdrp/test/test.fasta \
--save_file ../result/rdrp/test/test_result.csv \
--emb_dir ../emb/ \
--truncation_seq_length 4096 \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--threshold 0.5 \
--print_per_number 10
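To make the "sample by sample" input concrete, here is a minimal FASTA reader using only the standard library. It is a sketch, not the parser used by predict_many_samples.py; `read_fasta` is a hypothetical name, and taking the first whitespace-delimited token of the header as the protein id is an assumption.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs, one sample at a time."""
    protein_id: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            header = line[1:].strip()
            # Assumption: the id is the first token of the header line.
            protein_id = header.split()[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


demo = io.StringIO(">protein_1 demo\nVGGLFDYY\nSVPIMT\n>protein_2\nLPDSWENK\n")
for pid, seq in read_fasta(demo):
    print(pid, len(seq))
```

Because the reader is a generator, only one record is held in memory at a time, which matches the one-by-one prediction mode described above.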
3) Prediction from the file
The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.
The test data includes 50 viral RdRPs and 50 non-viral RdRPs.
cd LucaProt/src/prediction/
sh run_predict_from_file.sh
Or:
cd LucaProt/src/
export CUDA_VISIBLE_DEVICES=0
python predict.py \
--data_path ../data/rdrp/demo/demo.csv \
--emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
--dataset_name rdrp_40_extend \
--dataset_type protein \
--task_type binary_class \
--model_type sefn \
--time_str 20230201140320 \
--step 100000 \
--evaluate \
--threshold 0.5 \
--batch_size 16 \
--print_per_batch 2
--data_path
path, the file path of the prediction data, including the 9 columns described in the Dataset section. The value of the label column can be null.
--emb_dir
path, the directory holding the structural embedding information of all samples, prepared in advance.
--dataset_name
str, the dataset name used to build our trained model (rdrp_40_extend).
--dataset_type
str, the dataset type used to build our trained model (protein).
--task_type
str, the task name used to build our trained model (binary_class).
--model_type
str, the model name used to build our trained model (sefn).
--time_str
str, the running time string (yyyymmddHimiss) of our trained model (20230201140320).
--step
int, the training global step at which the model was finalized (100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
--evaluate(optional)
store_true, whether to evaluate the predicted results.
--ground_truth_col_index(optional)
int, the ground truth column index of the ${data_path}, default: None.
--batch_size
int, batch size per GPU/CPU for evaluation, default: 16.
--print_per_batch
int, print progress information each time this number of batches is completed, default: 1000.
Note: the embedding matrices of all the proteins in this file need to be prepared in advance ($emb_dir).
4. Inference Time
LucaProt is fast because it only needs to predict the structural representation matrix rather than the complete 3D structure of the protein sequence.
Benchmark: For each sequence length range, we selected 50 viral RdRPs and 50 non-viral RdRPs to measure the inference time.
Note: The reported time includes the inference time of the structural representation matrix and excludes the model loading time.
1) GPU(Nvidia A100, Cuda: 11.7)
| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 0.20s | 0.24s | 0.16s |
| 500 <= Len < 800 | 0.30s | 0.39s | 0.24s |
| 800 <= Len < 1,000 | 0.42s | 0.46s | 0.39s |
| 1,000 <= Len < 1,500 | 0.59s | 0.74s | 0.45s |
| 1,500 <= Len < 2,000 | 0.87s | 1.02s | 0.73s |
| 2,000 <= Len < 3,000 | 1.31s | 1.69s | 1.01s |
| 3,000 <= Len < 5,000 | 2.14s | 2.78s | 1.72s |
| 5,000 <= Len < 8,000 | 3.03s | 3.45s | 2.65s |
| 8,000 <= Len < 10,000 | 3.77s | 4.24s | 3.32s |
| 10,000 <= Len | 9.92s | 17.66s | 4.30s |
2) CPU (16 cores, 64G memory of Alibaba Cloud ECS)
| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 3.97s | 5.71s | 2.77s |
| 500 <= Len < 800 | 5.78s | 7.50s | 4.48s |
| 800 <= Len < 1,000 | 8.23s | 9.41s | 7.41s |
| 1,000 <= Len < 1,500 | 11.49s | 16.42s | 9.22s |
| 1,500 <= Len < 2,000 | 17.71s | 22.36s | 14.93s |
| 2,000 <= Len < 3,000 | 26.97s | 36.68s | 20.99s |
| 3,000 <= Len < 5,000 | 45.56s | 58.42s | 35.82s |
| 5,000 <= Len < 8,000 | 56.57s | 58.17s | 55.55s |
| 8,000 <= Len < 10,000 | 57.76s | 58.86s | 56.66s |
| 10,000 <= Len | 66.49s | 76.80s | 58.42s |
3) CPU (96 cores, 768G memory of Alibaba Cloud ECS)
| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 1.89s | 2.55s | 1.10s |
| 500 <= Len < 800 | 2.68s | 3.44s | 2.13s |
| 800 <= Len < 1,000 | 3.45s | 4.25s | 2.65s |
| 1,000 <= Len < 1,500 | 4.27s | 5.90s | 3.54s |
| 1,500 <= Len < 2,000 | 5.81s | 7.44s | 4.76s |
| 2,000 <= Len < 3,000 | 8.14s | 10.74s | 6.37s |
| 3,000 <= Len < 5,000 | 13.25s | 17.69s | 10.06s |
| 5,000 <= Len < 8,000 | 17.03s | 18.20s | 15.98s |
| 8,000 <= Len < 10,000 | 17.90s | 18.99s | 16.92s |
| 10,000 <= Len | 25.90s | 35.02s | 18.66s |
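The per-range statistics in the timing tables of this section can be reproduced from raw measurements with a small aggregation step. The sketch below is only an illustration of that bookkeeping: the names `length_range` and `summarize` are hypothetical, and the timings passed in are assumed to be (sequence length, seconds) pairs measured elsewhere.

```python
from bisect import bisect_right
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Lower bounds of the length ranges used in the benchmark tables.
BOUNDS = [300, 500, 800, 1000, 1500, 2000, 3000, 5000, 8000, 10000]


def length_range(seq_len: int) -> str:
    """Map a sequence length to its range label."""
    idx = bisect_right(BOUNDS, seq_len)
    if idx == 0:
        return f"Len < {BOUNDS[0]:,}"
    if idx == len(BOUNDS):
        return f"{BOUNDS[-1]:,} <= Len"
    return f"{BOUNDS[idx - 1]:,} <= Len < {BOUNDS[idx]:,}"


def summarize(timings: Iterable[Tuple[int, float]]) -> Dict[str, Tuple[float, float, float]]:
    """Return {range label: (average, maximum, minimum)} in seconds."""
    grouped: Dict[str, List[float]] = {}
    for seq_len, seconds in timings:
        grouped.setdefault(length_range(seq_len), []).append(seconds)
    return {label: (mean(v), max(v), min(v)) for label, v in grouped.items()}
```

Calling `summarize([(350, 0.20), (420, 0.24)])` groups both measurements under the "300 <= Len < 500" label.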
5. Dataset for Virus RdRP
1) Fasta
Viral RdRP (Positive: 5,979)
The positive sequence fasta file is in data/rdrp/all_dataset_positive.fasta.zip
all_dataset_positive.fasta.zip
Non-viral RdRP (Negative: 229,434)
The negative sequence fasta file is in dataset/rdrp/all_dataset_negative.fasta.zip
including:
- other proteins of the virus
- other protein domains of the virus
- non-viral proteins
all_dataset_negative.fasta.zip
2) Structural embedding (matrix and vector)
All structural embedding files of the dataset for model building are available at: embs
The structural embedding files of the prediction data are still being prepared for release (because of the amount of data).
3) PDB (3D Structure)
The 3D-structure PDB files of the model building dataset and the predicted data are still being prepared for release (because of the amount of data).
4) Vocab
structure vocab
This vocab file is struct_vocab/rdrp_40_extend/protein/binary_class/struct_vocab.txt
struct_vocab.txt
subword-level vocab
The size of the sequence vocab we use is 20,000.
This vocab file is vocab/rdrp_40_extend/protein/binary_class/subword_vocab_20000.txt
subword_vocab_20000.txt
char-level vocab
This vocab file is vocab/rdrp_40_extend/protein/binary_class/vocab.txt
vocab.txt
5) Label
Viral RdRP identification is a binary-class classification task, including positive and negative classes, using 0 and 1 to represent a negative and positive sample, respectively.
The label list file is dataset/rdrp_40_extend/protein/binary_class/label.txt
label.txt
6) Dataset
We constructed a dataset with 235,413 samples for model building, which included 5,979 positive samples of known viral RdRPs (i.e. the well-curated RdRP database described in the previous section of Methods), and 229,434 (to maintain a 1:40 ratio for viral RdRP and non-virus RdRPs) negative samples of confirmed non-virus RdRPs. The non-virus RdRPs contained proteins from Eukaryota DNA dependent RNA polymerase (Eu DdRP, N=1,184), Eukaryota RNA dependent RNA polymerase (Eu RdRP, N=2,233), Reverse Transcriptase (RT, N=48,490), proteins obtained from DNA viruses (N=1,533), non-RdRP proteins obtained from RNA viruses (N=1,574), and a wide array of cellular proteins from different functional categories (N=174,420). We randomly divided the dataset into training, validation, and testing sets with a ratio of 8.5:1:1, which were used for model fitting, model finalization (based on the best F1-score training iteration), and performance reporting (including accuracy, precision, recall, F1-score, and Area under the ROC Curve (AUC)), respectively.
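A random 8.5:1:1 split like the one described can be sketched as follows. This is not the project's preprocessing script: `split_dataset` and the seed value are hypothetical, and the sketch does a plain shuffle without stratifying by label, which may differ from how the published split was produced.

```python
import random
from typing import Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Iterable[T],
                  ratios: Sequence[float] = (8.5, 1.0, 1.0),
                  seed: int = 1234) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into training, validation, and testing sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    total = sum(ratios)
    n_train = int(len(items) * ratios[0] / total)
    n_dev = int(len(items) * ratios[1] / total)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the testing set
    return train, dev, test
```

With 1,050 samples this gives 850 training, 100 validation, and 100 testing samples; rounding leftovers always land in the testing set.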
Entire Dataset
This file is dataset/rdrp/all_dataset_with_pdb_emb.csv.zip
all_dataset_with_pdb_emb.csv.zip
Training set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/train_with_pdb_emb.csv
train_with_pdb_emb.csv
Validation set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/dev_with_pdb_emb.csv
dev_with_pdb_emb.csv
Testing set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/test_with_pdb_emb.csv
test_with_pdb_emb.csv
One row in each of the above files represents one sample. All three files consist of 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source. The details of these columns are as follows:
- prot_id
the protein id
- seq
the amino acid (aa) sequence
- seq_len
the length of the protein sequence.
- pdb_filename
The PDB filename of the 3D structure, which is predicted by a computational model or obtained by experiments.
- ptm
the pTM of the predicted 3D-structure.
- mean_plddt
the mean pLDDT of the predicted 3D-structure.
- emb_filename
The filename of the embedding matrix or vector of the protein structure.
Note: the embedding matrices of the dataset need to be prepared in advance.
- label
the sample label, 0 or 1 for binary-class classification, [0, 1, …, N-1] for multi-class classification, a list of [0, 1, …, N-1] for multi-label classification.
- source
optional, the sample source (such as RdRP, RT, DdRP, non-virus RdRP, and Other).
Note: if using strategy one in the structure encoder, the pdb_filename, the ptm, and the mean_plddt can be null.
6. Supported Task Types
binary-class classification
The label is 0 or 1 for binary-class classification, such as viral RdRP identification.
multi-class classification
The label is 0~N-1 for multi-class classification, such as the species prediction for proteins.
multi-label classification
The labels form a list of 0~N-1 for multi-label classification, such as Gene Ontology annotation for proteins.
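The three label formats can be validated and normalized with a short helper. This is a sketch under assumptions: `encode_label` is a hypothetical name, and representing a multi-label target as a multi-hot vector is a common convention rather than something this README specifies.

```python
from typing import List, Sequence, Union


def encode_label(label: Union[int, Sequence[int]], task_type: str,
                 num_labels: int) -> Union[int, List[int]]:
    """Check a raw label and return it in a training-friendly form."""
    if task_type == "binary_class":
        if label not in (0, 1):
            raise ValueError("binary-class label must be 0 or 1")
        return int(label)
    if task_type == "multi_class":
        if not 0 <= label < num_labels:
            raise ValueError("multi-class label must be in 0..N-1")
        return int(label)
    if task_type == "multi_label":
        # Assumption: a list of label indices becomes a multi-hot vector.
        multi_hot = [0] * num_labels
        for idx in label:
            if not 0 <= idx < num_labels:
                raise ValueError("multi-label index must be in 0..N-1")
            multi_hot[idx] = 1
        return multi_hot
    raise ValueError(f"unknown task type: {task_type}")
```

For instance, with N = 5 the multi-label annotation `[0, 3]` becomes `[1, 0, 0, 1, 0]`.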
7. Building Your Model
1) Prediction of protein 3D-structure (Optional)
The script structure_from_esm_v1.py
is in the directory "src/protein_structure", and it uses ESMFold (esmfold_v1) to predict the 3D structure of a protein.
I. Prediction from file
cd LucaProt/src/protein_structure/
export CUDA_VISIBLE_DEVICES=0
python structure_from_esm_v1.py \
-i data/rdrp/rdrp.fasta \
-o pdbs/rdrp/ \
--num-recycles 4 \
--truncation_seq_length 4096 \
--chunk-size 64 \
--cpu-offload \
--batch_size 1
Parameters:
-i (input filepaths)
- fasta filepath
- csv filepath
the first row is the header
column 0: protein_id
column 1: sequence
- multiple filepaths
comma-separated
-o (save dirpath)
The directory in which the predicted 3D-structure data is saved. Each protein is stored in a PDB file, and each PDB file is named "protein_" + an auto-increment id + ".pdb", such as "protein_1.pdb".
The mapping between protein ids and auto-increment ids is stored in the file "result_info.csv" (including: "index", "protein_id(uuid)", "seq_len", "ptm", "mean_plddt") in this directory.
For failed samples (CUDA out of memory), this script saves their protein ids in "uncompleted.txt"; you can reduce the value of "truncation_seq_length" and add "--try_failure" to retry.
--batch_size
the batch size for running, default: 1.
--truncation_seq_length
truncate sequences longer than the given value, recommended values: 4096, 2048, 1984, 1792, 1536, 1280, 1152, 1022.
--num-recycles
number of recycles to run.
--chunk-size
chunks axial attention computation to reduce memory usage from O(L^2) to O(L), recommended values: 128, 64, 32.
--try_failure
retry the failed samples after reducing the "truncation_seq_length" value.
II. Prediction from input sequences
cd LucaProt/src/protein_structure/
export CUDA_VISIBLE_DEVICES=0
python structure_from_esm_v1.py \
-name protein_id1,protein_id2 \
-seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
-o pdbs/rdrp/ \
--num-recycles 4 \
--truncation_seq_length 4096 \
--chunk-size 64 \
--cpu-offload \
--batch_size 1
Parameters:
- -name
protein ids, comma-separated for multiple proteins.
- -seq
protein sequences, comma-separated for multiple proteins.
2) Prediction of protein structural embedding
The script embedding_from_esmfold.py
is in "src/protein_structure", and it uses ESMFold (esm2_t36_3B_UR50D) to predict protein structural embedding matrices or vectors.
I. Prediction from file
cd LucaProt/src/protein_structure/
export CUDA_VISIBLE_DEVICES=0
python embedding_from_esmfold.py \
--model_name esm2_t36_3B_UR50D \
--file data/rdrp.fasta \
--output_dir emb/rdrp/ \
--include per_tok contacts bos \
--truncation_seq_length 4094
Parameters:
--model_name
the model name, default: "esm2_t36_3B_UR50D"
-i/--file (input filepath)
- fasta filepath
- csv filepath
the first row is the header
column 0: protein_id
column 1: sequence
-o/--output_dir (save dirpath)
The directory in which the predicted structural embedding data is saved. Each protein is stored in a pickle file, and each embedding file is named "embedding_" + an auto-increment id + ".pt", such as "embedding_1.pt".
The mapping between protein ids and auto-increment ids is stored in the file "{}_embed_fasta_id_2_idx.csv" (including: "index", "protein_id(uuid)") in this directory.
For failed samples (CUDA out of memory), this script saves their protein ids in "{}_embed_uncompleted.txt"; you can reduce the "truncation_seq_length" value and add "--try_failure" to retry.
--truncation_seq_length
truncate sequences longer than the given value. Recommended values: 4094, 2046, 1982, 1790, 1534, 1278, 1150, 1022.
--include
The embedding matrix or vector type of the predicted structural embedding data, including per_tok, mean, contacts, and bos.
- per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- mean includes the embeddings averaged over the full sequence, per layer.
- bos includes the embeddings from the beginning-of-sequence token.
- contacts includes the attention value between two amino acids of the full sequence.
Reference: https://github.com/facebookresearch/esm [Compute embeddings in bulk from FASTA]
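The relationship between the per-token matrix, the mean vector, and the bos vector can be shown on a toy example. This sketch uses plain Python lists rather than the tensors ESM produces, `summarize_embedding` is a hypothetical name, and it assumes the first row of the input belongs to the beginning-of-sequence token.

```python
from typing import Dict, List

Matrix = List[List[float]]


def summarize_embedding(token_matrix: Matrix) -> Dict[str, object]:
    """Derive per_tok, mean, and bos views from one toy token matrix.

    Assumption: row 0 is the beginning-of-sequence token and every
    following row is one amino acid (hidden_dim values per row).
    """
    if len(token_matrix) < 2:
        raise ValueError("need the bos row plus at least one residue row")
    bos = token_matrix[0]
    per_tok = token_matrix[1:]  # seq_len x hidden_dim
    hidden_dim = len(bos)
    # Average each hidden dimension over the full sequence.
    mean = [sum(row[j] for row in per_tok) / len(per_tok)
            for j in range(hidden_dim)]
    return {"per_tok": per_tok, "mean": mean, "bos": bos}
```

The matrix form keeps one vector per residue, while mean and bos each compress the protein into a single vector of size hidden_dim.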
II. Prediction from input sequences
cd LucaProt/src/protein_structure/
export CUDA_VISIBLE_DEVICES=0
python embedding_from_esmfold.py \
--model_name esm2_t36_3B_UR50D \
-name protein_id1,protein_id2 \
-seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
--output_dir embs/rdrp/test/ \
--include per_tok contacts bos \
--truncation_seq_length 4094
Parameters:
-name
protein ids, comma-separated for multiple proteins.
-seq
protein sequences, comma-separated for multiple proteins.
3) Construct dataset for model building
Construct your dataset, randomly divide it into training, validation, and testing sets with a specified ratio, and save the three sets in dataset/${dataset_name}/${dataset_type}/${task_type}, including train.csv, dev.csv, test_*.csv.
The file format can be .csv (must include the header) or .txt (does not need a header).
Each file line is a sample containing 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source.
Column seq is the sequence; Column pdb_filename is the saved PDB filename for structure encoder strategy 2; Column ptm and Column mean_plddt are optional and are obtained from the 3D-structure prediction model; Column emb_filename is the saved embedding filename for structure encoder strategy 1; Column label is the sample class (a single value or a list of label indices or label names); Column source is the sample source (optional).
For example:
like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp
Note: if your dataset takes too much space to load into memory at once,
use "src/data_process/data_preprocess_into_tfrecords_for_rdrp.py" to convert the dataset into "tfrecords", and create an index file: python -m tfrecord.tools.tfrecord2idx xxxx.tfrecords xxxx.index
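The 9-column row format can be read with the standard csv module. The sketch below is illustrative only: the `Sample` dataclass, `parse_row`, and `read_samples` are hypothetical names, empty optional fields are mapped to None, and the label is kept as a raw string because it may be a single value or a list.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO

COLUMNS = ["prot_id", "seq", "seq_len", "pdb_filename", "ptm",
           "mean_plddt", "emb_filename", "label", "source"]


@dataclass
class Sample:
    prot_id: str
    seq: str
    seq_len: int
    pdb_filename: Optional[str]
    ptm: Optional[float]
    mean_plddt: Optional[float]
    emb_filename: Optional[str]
    label: str  # raw text: a single value or a list, depending on the task
    source: Optional[str]


def _optional(value: str, cast=str):
    """Empty cells become None; everything else is converted with cast."""
    value = value.strip()
    return cast(value) if value else None


def parse_row(row: List[str]) -> Sample:
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
    f = dict(zip(COLUMNS, row))
    return Sample(f["prot_id"], f["seq"], int(f["seq_len"]),
                  _optional(f["pdb_filename"]), _optional(f["ptm"], float),
                  _optional(f["mean_plddt"], float),
                  _optional(f["emb_filename"]), f["label"],
                  _optional(f["source"]))


def read_samples(handle: TextIO, has_header: bool = True) -> List[Sample]:
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # .csv files carry a header row, .txt files do not
    return [parse_row(row) for row in reader if row]


line = "like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp"
sample = read_samples(io.StringIO(line), has_header=False)[0]
print(sample.prot_id, sample.seq_len, sample.emb_filename)
```

The example row has empty pdb_filename, ptm, and mean_plddt fields, which is the case the README allows for structure encoder strategy 1.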
4) Training the model
run.py
the main script for building the model.
Parameters
- data_dir: path, the dataset dirpath
- filename_pattern: the dataset filename pattern, such as "{}_with_pdb_emb.csv", including train_with_pdb_emb.csv, dev_with_pdb_emb.csv, and test_with_pdb_emb.csv in ${data_dir}
- separate_file: store_true, load the entire dataset into memory; the names of the pdb and embedding files are listed in the train/dev/test.csv and need to be loaded.
- tfrecords: store_true, whether the dataset is in tfrecords format; when true, only the specified number of samples (${shuffle_queue_size}) are loaded into memory at once. The tfrecords must consist of "${data_dir}/tfrecords/train/xxx.tfrecords", "${data_dir}/tfrecords/dev/xxx.tfrecords" and "${data_dir}/tfrecords/test/xxx.tfrecords". "xxx.tfrecords" is one of 01-of-01.tfrecords (only including sequence), 01-of-01_emb.records (including sequence and structural embedding), and 01-of-01_pdb_emb.records (including sequence, 3D-structure contact map, and structural embedding).
- shuffle_queue_size: int, how many samples are loaded into memory at once, default: 5000.
- dataset_name: str, your dataset name, such as "rdrp_40_extend"
- dataset_type: str, your dataset type, such as "protein"
- task_type: choices=["multi_label", "multi_class", "binary_class"], your task type, such as "binary_class"
- model_type: choices=["sequence", "structure", "embedding", "sefn", "ssfn"], they represent only the sequence for input, only the 3D-structure contact map for input, only the structural embedding for input, the sequence and the structural embedding for input, and the sequence and the 3D-structure contact map for input, respectively
- subword: store_true, whether to process the sequence at the subword level.
- codes_file: path, subword codes filepath when using subword, such as "../subword/rdrp/protein_codes_rdrp_20000.txt"
- label_type: str, the label type name, such as "rdrp"
- label_filepath: path, the label list filepath
- cmap_type: choices=["C_alpha", "C_bert"], the calculation type of 3D-structure contact map
- cmap_thresh: the distance threshold (Unit: Angstrom) in contact map calculation. Two amino acids are linked if the distance between them is less than or equal to the threshold, default: 10.0.
- output_dir: path, the output dirpath
- log_dir: path, the logger savepath
- tb_log_dir: path, the save path of metric evaluation records in model training; tensorboardX can be used to show these metrics.
- config_path: path, the configuration filepath of the model.
- seq_vocab_path: path, the vocab filepath of sequence tokenizer
- struct_vocab_path: path, the vocab filepath of 3D-structure node (Structural Encoder Strategy 2)
- seq_pooling_type: choices=["none", "max", "value_attention"], the sequence representation matrix pooling type; "none" means that a single special-token vector is used instead of pooling the matrix.
- struct_pooling_type: choices=["max", "value_attention"], the 3D-structure representation matrix pooling type.
- embedding_pooling_type: choices=["none", "max", "value_attention"], the structural embedding representation matrix pooling type; "none" means that a single special-token vector is used instead of pooling the matrix.
- evaluate_during_training: store_true, whether to evaluate the validation set and the testing set during training.
- do_eval: store_true, whether to use the best saved model to evaluate the validation set.
- do_predict: store_true, whether to use the best saved model to evaluate the testing set.
- do_lower_case: store_true, whether to lowercase the input when tokenizing.
- per_gpu_train_batch_size: int, batch size per GPU/CPU for training, default: 16
- per_gpu_eval_batch_size: int, batch size per GPU/CPU for evaluation, default: 16
- gradient_accumulation_steps: int, number of update steps to accumulate before performing a backward/update pass, default: 1.
- learning_rate: float, the initial learning rate for Adam, default: 1e-4.
- num_train_epochs: int, the total number of training epochs to perform, default: 50.
- logging_steps: log every X update steps, default: 1000.
- loss_type: choices=["focal_loss", "bce", "multilabel_cce", "asl", "cce"], loss-function type of model training, default: "bce".
- max_metric_type: choices=["acc", "jaccard", "prec", "recall", "f1", "fmax", "roc_auc", "pr_auc"], which metric is used for model finalization, default: "f1".
- pos_weight: float, positive samples weight for "bce".
- focal_loss_alpha: float, alpha for focal loss, default: 0.7.
- focal_loss_gamma: float, gamma for focal loss, default: 2.0.
- focal_loss_reduce: store_true, "mean" for one sample in multi-label classification, default: "sum".
- asl_gamma_neg: float, negative gamma for asl, default: 4.0.
- asl_gamma_pos: float, positive gamma for asl, default: 1.0.
- seq_max_length: int, input sequences longer than the max length are truncated and shorter ones are padded, default: 2048.
- struct_max_length: int, input contact maps longer than the max length are truncated and shorter ones are padded, default: 2048.
- trunc_type: choices=["left", "right"], the truncation type for the whole input sequence, default: "right".
- no_position_embeddings: store_true, whether to skip position embeddings for the sequence.
- no_token_type_embeddings: store_true, whether to skip token type embeddings for the sequence.
- embedding_input_size: int, the dim of the structural embedding vector/matrix, default: 2560, {"esm2_t30_150M_UR50D": 640, "esm2_t33_650M_UR50D": 1280, "esm2_t36_3B_UR50D": 2560, "esm2_t48_15B_UR50D": 5120}.
- embedding_type: choices=[None, "contacts", "bos", "matrix"], the type of the structural embedding info, default: "matrix".
- embedding_max_length: int, input embedding matrices longer than the max length are truncated and shorter ones are padded, default: 2048.
- save_all: store_true, the model for each evaluation is saved.
- delete_old: store_true, only save the best metric (${max_metric_type}) model of all evaluations on the testing set during training.
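The cmap_thresh rule in the parameter list (two amino acids are linked if their distance is at most the threshold) can be sketched in a few lines. This is not the generator in "src/biotoolbox": `contact_map` is a hypothetical name, the input is assumed to be one C-alpha coordinate per residue, and marking each residue as in contact with itself is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(ca_coords: Sequence[Coord],
                cmap_thresh: float = 10.0) -> List[List[int]]:
    """Binary contact map from C-alpha coordinates (distances in Angstrom)."""
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Linked if the Euclidean distance is <= the threshold.
            if math.dist(ca_coords[i], ca_coords[j]) <= cmap_thresh:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap


coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
```

In the toy example the first two residues are 3.8 Angstrom apart and are linked, while the third lies more than 10 Angstrom from both and is only in contact with itself.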
Training
```shell
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
DATASET_NAME="rdrp_40_extend"
DATASET_TYPE="protein"
TASK_TYPE="binary_class"
# sequence + structural embedding
MODEL_TYPE="sefn"
CONFIG_NAME="sefn_config.json"
INPUT_MODE="single"
LABEL_TYPE="rdrp"
embedding_input_size=2560
embedding_type="matrix"
SEQ_MAX_LENGTH="2048"
embedding_max_length="2048"
TRUNCT_TYPE="right"
# none, max, value_attention
SEQ_POOLING_TYPE="value_attention"
# max, value_attention
embedding_pooling_type="value_attention"
VOCAB_NAME="subword_vocab_20000.txt"
SUBWORD_CODES_NAME="protein_codes_rdrp_20000.txt"
MAX_METRIC_TYPE="f1"
time_str=$(date "+%Y%m%d%H%M%S")
python run.py \
  --data_dir ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE \
  --tfrecords \
  --filename_pattern {}_with_pdb_emb.csv \
  --dataset_name $DATASET_NAME \
  --dataset_type $DATASET_TYPE \
  --task_type $TASK_TYPE \
  --model_type $MODEL_TYPE \
  --subword \
  --codes_file ../subword/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$SUBWORD_CODES_NAME \
  --input_mode $INPUT_MODE \
  --label_type $LABEL_TYPE \
  --label_filepath ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/label.txt \
  --output_dir ../models/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --log_dir ../logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --tb_log_dir ../tb-logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --config_path ../config/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$CONFIG_NAME \
  --seq_vocab_path ../vocab/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$VOCAB_NAME \
  --seq_pooling_type $SEQ_POOLING_TYPE \
  --embedding_pooling_type $embedding_pooling_type \
  --do_train \
  --do_eval \
  --do_predict \
  --evaluate_during_training \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --num_train_epochs=50 \
  --logging_steps=1000 \
  --save_steps=1000 \
  --overwrite_output_dir \
  --sigmoid \
  --loss_type bce \
  --max_metric_type $MAX_METRIC_TYPE \
  --seq_max_length=$SEQ_MAX_LENGTH \
  --embedding_max_length=$embedding_max_length \
  --trunc_type=$TRUNCT_TYPE \
  --no_token_type_embeddings \
  --embedding_input_size $embedding_input_size \
  --embedding_type $embedding_type \
  --shuffle_queue_size 10000 \
  --save_all
```
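The training script above uses the "bce" loss, but with a heavily imbalanced dataset the focal loss option listed among the parameters may also be of interest. The sketch below shows the standard binary focal loss for a single sample, using the documented defaults alpha 0.7 and gamma 2.0. It is the textbook formula, not necessarily the exact implementation in "src/common", and `binary_focal_loss` is a hypothetical name.

```python
import math


def binary_focal_loss(prob: float, target: int,
                      alpha: float = 0.7, gamma: float = 2.0,
                      eps: float = 1e-12) -> float:
    """Standard binary focal loss for one sample.

    prob is the predicted probability of the positive class and target
    is 0 or 1. alpha weights positives against negatives; gamma
    down-weights samples the model already classifies confidently.
    """
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    if target == 1:
        return -alpha * (1.0 - prob) ** gamma * math.log(prob)
    return -(1.0 - alpha) * prob ** gamma * math.log(1.0 - prob)


# A confident correct positive costs far less than an uncertain one.
print(binary_focal_loss(0.95, 1), binary_focal_loss(0.55, 1))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted binary cross-entropy.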
Configuration file
The configuration files of all methods are in "config/rdrp_40_extend/protein/binary_class/".
If training your model, please put the configuration file in "config/${dataset_name}/${dataset_type}/${task_type}/"
Value meaning in configuration file
refer to "src/SSFN/README.md"
Baselines
- LGBM (using the embedding vector: <bos> as the input)
cd src/baselines/
sh run_lgbm.sh
- XGBoost (using the embedding vector: <bos> as the input)
cd src/baselines/
sh run_xgb.sh
- DNN (using the embedding vector: <bos> as the input)
cd src/baselines/
sh run_dnn.sh
Or:
cd src/training
sh run_subword_rdrp_emb.sh
- Transformer-Char Level (using the sequence as the input)
cd src/training
sh run_char_rdrp_seq.sh
- Transformer-Subword Level (using the sequence as the input)
cd src/training
sh run_subword_rdrp_seq.sh
- DNN2 (VALP + DNN, using the embedding matrix as the input)
cd src/training
sh run_subword_rdrp_emb_v2.sh
Ours
cd src/training
sh run_subword_rdrp_sefn.sh
5) Training Logging Information
logs
The running information is saved in "logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/logs.txt".
The information includes the model configuration, model layers, running parameters, and evaluation information.
models
The checkpoints are saved in "models/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}/"; this directory includes "pytorch_model.bin", "config.json", "training_args.bin", and tokenizer information "sequence" or "struct". The details are shown in Figure 2.
Figure 2: The File List in Checkpoint Dir Path
tb-logs
The metrics are recorded in "tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/events.out.tfevents.xxxxx.xxxxx"
run:
tensorboard --logdir=tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str} --bind_all
predicts
The predicted results are saved in "predicts/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}", including:
- pred_confusion_matrix.png
- pred_metrics.txt
- pred_result.csv
- seq_length_distribution.png
The details are shown in Figure 3.
Figure 3: The File List in Prediction Dir Path
Note: when using the saved model to predict, the "logs.txt" and the checkpoint dirpath will be used.
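All four output locations share the same dataset/type/task/model/time layout, so they can be derived from the values passed to the prediction scripts. The helper below is a hypothetical convenience sketch, not part of the project; `run_paths` and its `root` argument are assumptions.

```python
from pathlib import PurePosixPath
from typing import Dict


def run_paths(dataset_name: str, dataset_type: str, task_type: str,
              model_type: str, time_str: str, global_step: int,
              root: str = ".") -> Dict[str, PurePosixPath]:
    """Build the log, checkpoint, TensorBoard, and prediction paths of one run."""
    base = PurePosixPath(root)
    rel = PurePosixPath(dataset_name, dataset_type, task_type,
                        model_type, time_str)
    checkpoint = f"checkpoint-{global_step}"
    return {
        "log_file": base / "logs" / rel / "logs.txt",
        "checkpoint_dir": base / "models" / rel / checkpoint,
        "tb_log_dir": base / "tb-logs" / rel,
        "predict_dir": base / "predicts" / rel / checkpoint,
    }


paths = run_paths("rdrp_40_extend", "protein", "binary_class",
                  "sefn", "20230201140320", 100000)
for name, path in paths.items():
    print(name, path)
```

The example uses the time string and step of the released model, which are the same values given to --time_str and --step in the inference commands.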
8. Related to the Project
1) ClstrSearch
A conventional approach that clusters all proteins based on their sequence homology.
See ClstrSearch/README.md for details.
2) src
Construct RdRP Dataset for Model Building
*.py in "src/data_preprocess"
Model
*.py in "src/SSFN"
Prediction Shell Script
*.sh in "src/prediction"
including:
run_predict_from_file.sh
run prediction for many samples from a file, with the structural embedding information prepared in advance.
run_predict_one_sample.sh
run prediction for one sample from the input.
run_predict_many_samples.sh
run prediction for many samples from the input.
We perform ablation studies on our model by removing specific modules (sequence-specific and embedding-specific) one at a time to explore their relative importance.
run_predict_only_seq_from_file.sh
only using the sequence to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.
run_predict_only_emb_from_file.sh
only using the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.
run_predict_seq_emb_from_file.sh
using the sequential info and the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.
Baselines
*.py in "src/baselines", using the embedding vector as the input, including LGBM, XGBoost, and DNN.
Baselines for Deep Learning
*.py in "src/deep_baselines", including:
CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep learning (2021). code: CHEER
VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data (2022). code: VirHunter
Virtifier: a deep learning-based identifier for viral sequences from metagenomes (2022). code: Virtifier
RNN-VirSeeker: RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. code: RNN-VirSeeker
run_deep_baselines.sh
the script to train deep baseline models.
run_predict_deep_baselines.sh
use trained deep baseline models to predict three positive test datasets, three negative test datasets, and our checked RdRP datasets.
run.py
the main script for training deep baseline models.
statistics
the script to compute the accuracy on three kinds of test datasets (positive, negative, our checked) after prediction by deep baselines.
Contact Map Generator
*.py in "src/biotoolbox"
Loss & Metrics
*.py in "src/common"
Training Model
*.sh in "src/training"
Prediction of Model
*.sh in "src/prediction"
3) Data
Raw Data
the raw data is in "data/".
Dataset
the files of the dataset are in "dataset/${dataset_name}/${dataset_type}/${task_type}/".
4) Model Configuration
the configuration files of all methods are in "config/${dataset_name}/${dataset_type}/${task_type}/".
5) Pic
some pictures are in "pics/".
6) Plot
the scripts for plotting pictures are in "src/plot".
7) Spider
the codes and results of the Geo information Spider are in "src/geo_map".
9. Open Resource
The open resources of our study include six subdirectories: Known_RdRPs, Results, All_Contigs, All_Protein_Sequences, LucaProt, and Self_Sequencing_Reads.
LucaProt/
includes some resources related to LucaProt, including code, model building dataset, model testing datasets, and our trained model.
1) Code
As mentioned above.
2) Dataset
Model Building Dataset
sequential info
train_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/
dev_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/
test_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/
structural info
embs
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/embs/
tfrecords
train
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/train/
dev
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/dev/
test
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/test/
Model Testing (Validation) Dataset
Results
Self-Samples
3) Trained Model
The trained model for RdRP identification is available at:
10. Contributor
LucaTeam:
Yong He, Zhaorong Li, Xin Hou, Mang Shi
11. FTP
All data of LucaProt are available on the website: Open Resources
12. Citation
the pre-print version:
@article { lucaprot,
author = {Xin Hou and Yong He and Pan Fang and Shi-Qiang Mei and Zan Xu and Wei-Chen Wu and Jun-Hua Tian and Shun Zhang and Zhen-Yu Zeng and Qin-Yu Gou and Gen-Yang Xin and Shi-Jia Le and Yin-Yue Xia and Yu-Lan Zhou and Feng-Ming Hui and Yuan-Fei Pan and John-Sebastian Eden and Zhao-Hui Yang and Chong Han and Yue-Long Shu and Deyin Guo and Jun Li and Edward C Holmes and Zhao-Rong Li and Mang Shi},
title = {Artificial intelligence redefines RNA virus discovery},
elocatio-id = {2023.04.18.537342},
year = {2023},
doi = {10.1101/2023.04.18.537342},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342},
eprint = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342.full.pdf},
journal = {bioRxiv}
}
13. Pip
name: lucaprot
channels:
  - defaults
dependencies:
  - pip:
    - h5py==3.8.0
    - biopython==1.80
    - biotite==0.35.0
    - brotlipy==0.7.0
    - numpy==1.24.2
    - obonet==0.3.1
    - pandas==1.5.3
    - pickle5==0.0.11
    - Pillow==9.3.0
    - scikit-learn==1.2.1
    - scipy==1.10.1
    - seaborn==0.12.2
    - six==1.16.0
    - subword-nmt==0.3.8
    - tensorboard==2.11.2
    - tensorboardX==2.5.1
    - tensorflow==2.11.0
    - tensorflow-estimator==2.11.0
    - tfrecord==1.14.1
    - tokenizers==0.13.2
    - torch==1.13.1
    - torchaudio==0.13.1
    - torchvision==0.14.1
    - tqdm==4.64.1
    - transformers==4.26.0
    - huggingface-hub==0.12.0
    - matplotlib==3.6.3
    - Werkzeug==2.2.2
    - wget==3.2
    - wrapt==1.14.1
    - xgboost==1.7.3
    - zipp==3.12.0
    - lightgbm==3.3.5
    - BeautifulSoup4==4.11.1
    - requests==2.24.0
    - gemmi==0.5.8
    - networkx==3.0
    - fair-esm[esmfold]
    - dllogger @ git+https://github.com/NVIDIA/dllogger.git