linux集群上的SCHRODINGER 分子对接工作流程

介绍schrodinger在无图形化linux集群上的分子对接流程

贺俊宏

7062人浏览 · 2022-08-26 10:52:00

贺俊宏 · 2022-08-26 10:52:00 发布

文章目录

前言
一、总体工作流简介
二、ligprep参数介绍
三、grid_gen_from_pv.py 参数介绍
四、GVSrun参数介绍
五、相关文件
六、时间预估
六、特别感谢

前言

glide作为广为人知的分子对接软件，提供了非常方便的各类型对接工作，本篇文章旨在介绍如何方便的在无图形化界面的liinux集群中运行整个分子对接流程

一、总体工作流简介

在这里插入图片描述
总体流程大致可以输入，预处理和分子对接三个部分
• 输入为蛋白文件和小分子2D结构数据库。
• 输入文件经过ligprep和grid_gen_from_pv.py进行预处理，其中ligprep是将小分子2D结构转3D。grid_gen_from_pv.py是从蛋白复合物中的已知配体生成蛋白结结合位点网格。
• 分子对接依赖于GSVrun程序进行，GSVrun包括多种常用的对接模式，以及对接过程中的可调参数。详见第4节。
• 输出文件包括每个对接阶段的小分子3D构象输出以及得分。

二、ligprep参数介绍

Bash
usage: ligprep [options] (-ismi|-icsv|-imae|-isd) infile (-osd|-omae) outfile

General:
  -h, -help             Print brief help message.
  -long_help            Print exhaustive help message.
  -inp <filename>       Read arguments from the supplied input file.
  -sif_docs             List supported simplified input file format (SIF)
                        keywords.

Ionization:
  -epik                 Use Epik for ionization and tautomerization
                        (Recommended, overrides -i).
  -emb, -epik_metal_binding
                        Run Epik with the metal_binding option so that states
                        appropriate for interactions with metal ions in
                        protein binding pockets are also generated.
  -i {0,1,2}            Ionization treatment: 0 - do not neutralize or ionize,
                        1 - neutralize only, 2 - neutralize and ionize
                        (Default: 1). Note that -epik option overrides -i and
                        always implies ionization and tautomerization.
  -ph <number>          Effective/target pH.
  -pht <number>         pH tolerance for generated structures.

Stereoisomers:
  -ac                   Do not respect existing chirality properties and do
                        not respect chiralities from the input geometry.
                        Generate stereoisomers for all chiral centers up to
                        the number permitted (specified using the -s option).
                        This is equivalent to "Generate all combinations" in
                        the Ligand Preparation user interface. Default
                        behavior is to respect only explicitly indicated
                        chiralities.
  -g                    Respect chiralities from input geometry when
                        generating stereoisomers.
  -s #                  Generate up to this many (#) stereoisomers per input
                        structure. (Default: 32).

Force-field based geometry optimization:
  -bff {14,16}          Force-field to be used for the final geometry
                        optimization. Default: 14 (OPLS_2005). For S-OPLS
                        specify 16.

Standard and job control options:
  -NJOBS NJOBS          Divide the overall job into NJOBS subjobs.
  -NSTRUCTS NSTRUCTS    Divide the overall job into subjobs with no more than
                        NSTRUCTS structures.
  -HOST <hostname>      Run job remotely on the indicated host entry.
  -WAIT                 Do not return a prompt until the job completes.
  -LOCAL                Do not use temporary directory for job files. Keep
                        files in the current directory.

Typical usage:
  Sample protonation states and stereoisomers, optimize geometry
  using OPLS force-field, and store results in Maestro format:

    ligprep <input> -omae outfile.maegz -epik

  where <input> is one of:
    -ismi infile.smi
    -icsv infile.csv
    -isd  infile.sd
    -imae infile.mae

一般情况下默认参数即可，在slurm集群上采用如下代码提交任务(ligprep的cpu利用率不高，一般建议Njobs设置为cpu核数的2~3倍)：

srun -c 4 -J ${name}_ligprep /home/xiaxichen/hyd/Schrodinger-soft/ligprep -ismi ${name}.smi -omae ${name}.maegz -epik -NJOBS 12 -WAIT

三、grid_gen_from_pv.py 参数介绍

SCHRODINGER本身其实就提供了很多python脚本以实现各种功能
脚本路径：path/to/schrodinger/mmshare-v5.4/python/script
脚本执行方式 $SCHRODINGER/run XXX.py [Options]
脚本查询：https://www.schrodinger.com/scriptcenter

Bash
Options:
  -v, -version          Show the program version and exit.
  -h, -help             Show this help message and exit.
  -o OUT_BASE, -outbase OUT_BASE
                        Base name for gridgen input to be produced.
  -a, -allprint         Report all levels of status message.

可以看到参数非常少，连基本的网格大小都不可以设置。所以建议还是在本地创建好grid文件在传到集群上为宜，首先是所需时间其实也非常短，其次就是能够加深对蛋白结构的理解和更多的网格设置选项。

四、GVSrun参数介绍

Bash
Warning: The GVSrun need a $compound_library environmental variable, or you can use -d to specify a library.

Perform Virtual Screening Workflow.

Usage: GVSrun [OPTION] <parameter>

Input parameter:
  -i        Gird file input.
        Use a file name (Multiple files are wrapped in "", and split by ' ') or regular expression to represent your input Grid file, default is *.zip.
  -D    Which databases do you want to screen? The Database basic path is <>.
        These database are Supported:
            Custom_DB-ab-site-Fast  GVSrun  GVXrun_help.txt  plmd  README.md  Scripts  XDock  
  -d    Provide your own database path, the compounds files are recommended as maegz or SDF file format.
  -R    Optional, reference Ligands correlated with the grid or use for RMSD calculation. 
            As *.mae, *.maegz (for query structure) or *.phypo (for pharmacophore hypothesis).
  -m    Running Mode: a serial combination of different computing tasks.
            Show all mode and task in this option using "-O".
    @ Available Running Mode: <Fast>
        Fast: Fast Virtual Screening.
        Normal: Filter The Drug-like compounds and dock screening.
        Reference: Virtual Screening with reference ligand restrain.
        Prep_Normal: Virtual Screening for un-prepared compounds database.
        *Normal_MMGBSA: Virtual Screening and MMGBSA re-scoring.
        *Cov_Screening: Virtual Screening to discover covalent durg.
        *Induce_Fit_Screening: Induce fit screening.
        *QM_Screening: Virtual Screening and QMMM re-docking.
        Local: Local docking and screening were carried out using the input ligand structure.
        Shape_Screening: perform screening based on reference ligand shape.
        Advance: dock screening and clustering.
        Advance_Shape_Screening: perform screening based on reference ligand shape.
    @ or Custom Running Mode, e.g. -m "EDL+R+HTVS+CD"

Control parameters:
  -F    Force Fields for docking stage (SP/XP/MMGBSA...), OPLS_2005, OPLS3e or OPLS4. <OPLS4>
  -f    Force Fields for other stage (HTVS/LigPrep...), OPLS_2005, OPLS3e or OPLS4. <OPLS_2005>
  -W    Define a Max/Min Molecular weight for MW module, such as "100:400" <100:400>
  -T    Set a Job Name. Default is "Grid_name-Database_name-Run_Mode".
  -C    Aattach residue number on receptor, required in Covalent Docking.
            e.g. "cys:A:1425", the A is chain name and 145 is the atom number(Heavy atom). 
            The cys is residue name for A:1425, Supported residues: cys, ser, lys.
  -q    Define a "DFT:Basis_Set" to QM/QMMM, default is "B3LYP-D3(BJ):6-311G**".
            Other QM setting: "B3LYP-D3M(BJ):6-311G+**","M06‑2X:def2-tzvpp(-g)","wB97M‑V:cc-pVTZ-pp"
  -p    set a PH:PHT for ligPrep, default is 7.0:2.0, means 7.0±2.0. <7.0:2.0>
  -s    Set a SMARTs Expression for compounds filter at first step. such as [B]([O])[O].
  -E    Shape Screening Options. <10000:mmod:rapid:1200>
            The paramter means: "keep_num:atomtypes:shape_sample_method:max_confs";
            NOTE: element and qsar are alternative atomtypes; thorough are alternative shape_sample_method.

OUTPUT parameters:
  -a    The number of Output compounds per Screening Task. <5%>
            e.g. 10% means reatin top 10% compounds.
                10000 means reatin top 10000 compounds.
  -b    The number of Output compounds after Standard docking Task. <4000>
  -c    The number of conformations generated by each ligand in the docking task. <1>
  -e    The number of candidates to IFT/CD/MMGBSA/QMMM. <500>

Job control:
  -H        Host Name of your Queue, defult is CPU.
  -N    The max number of subjobs. <100>
  -G    The number of Glide subjobs. <90>
  -P    The number of Prime subjobs. <10>
  -L    The number of LigPrep subjobs. <30>
  -A    The number of phase subjobs. <10>
  -Q    The number of Qsite subjobs. <10>
  -K    The number of QIKPROP subjobs. <10>
  -M    The number of MACROMODEL subjobs. <10>
  -S        Your Schrodinger path. </home/xiaxichen/hyd/Schrodinger-soft>

GVSrun通过环境变量compound_library来获取系统中化合物数据库的位置；如果不存在该变量需要利用-d参数给定化合物数据库路径（纯SDF或MAEGZ格式的多个文件，不可套叠文件夹）除此以外，脚本需要知道安装好的薛定谔路径：

~/.bashrc中指定好**$SCHRODINGER**环境变量，这也是正常安装薛定谔已经完成的操作；
也可以修改脚本，直接更改默认的SCHRODINGER路径（脚本第22行修改);
可以使用各种参数进行组合进行所需分子对接的定制，在这里重点介绍Running mode参数（其余参数如上所示就不加赘述了），GVSrun一共拥有12种Running mode分别对应这常见的12种分子对接方案，他们分别是：

在这里插入图片描述

以及如何制定新的Running_model（以Advance为例），每个Running_model其实就是个各项任务的组合，对于新的Running_model制定所需要添加的内容有三个部分：方法注释，运行逻辑，任务运行代码：

#===========================
#添加方法注释
#===========================
@Available Running Mode:
     Advance: dock screening.
    "HTVS_Normal+SP_ExtensionA+SP_Enhanced" #任务组合，代码93行

@Screening: Reduce the number of compounds rough docking. Related to -a option.
    HTVS_Normal: The standard HTVS screening.

@Standard_Docking: Standard docking was performed to rank the compounds. Related to -b and -c option.
    SP_ExtensionA: Reward intramolecular ligand hydrogen bonds, and accept halogens as H-bond acceptors.
    SP_Enhanced: Increase the depth of sampling for larger grid or More accurate poses. 

#===========================
#添加运行逻辑
#===========================
Parse_Running_Mode(){
    # Parse_Running_Mode Running_Mode
    Running_Mode=$1
    if [ $Running_Mode == "Normal" ]; then
        Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA"
    elif [ $Running_Mode == "Prep_Normal" ]; then
        Pipeline="No_Dup+RDL+IONIZE+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA"
    elif [ $Running_Mode == "Normal_MMGBSA" ]; then
        Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA+MMGBSA_EN"
    elif [ $Running_Mode == "Reference" ]; then
        Pipeline="HTVS_REF+SP_REF+QIKPROP+R5R"
    elif [ $Running_Mode == "Induce_Fit_Screening" ]; then
        Pipeline="IFT_pre+IFT"
    elif [ $Running_Mode == "Cov_Screening" ]; then
        Pipeline="R+RDL+HTVS_Normal+SP_ExtensionA+CD"
    elif [ $Running_Mode == "Fast" ]; then
        Pipeline="HTVS_Normal+SP_Normal"
    elif [ $Running_Mode == "QM_Screening" ]; then
        Pipeline="RDL+HTVS_Normal+QIKPROP+R5R+SP_ExtensionA+QM_redock+RMSD"
    elif [ $Running_Mode == "Shape_Screening" ];then
        Pipeline="PhaseShape+HTVS_local+SP_local"
    elif [ $Running_Mode == "Local" ];then
        Pipeline="HTVS_local+SP_local+MMGBSA_OPT"
    elif [ $Running_Mode == "Advance" ];then
        Pipeline="HTVS_Normal+SP_ExtensionA+SP_Enhanced"
    elif [ $Running_Mode == "Advance_Shape_Screening" ];then
        Pipeline="HTVS_Shape+SP_Shape"
    else 
        Pipeline=$Running_Mode
        Running_Mode="User_Defined"
    fi
}

#===========================
#添加任务运行代码
#===========================
HTVS_Normal(){
    KEEP_NUM=`parse_num_or_percentage ${HTVS_out_num}`
cat<<HTVS >> ${Inp_Name}
[STAGE:PRE_DOCK_HTVS]
    STAGECLASS   gencodes.RecombineStage
    INPUTS   ${Ligand_name},
    OUTPUTS   HTVS_RECOMBINE_OUT,
    NUMOUT   njobs
    OUTFORMAT   maegz
    MIN_SUBJOB_STS   4000
    MAX_SUBJOB_STS   40000
    GENCODES   YES
    OUTCOMPOUNDFIELD   s_vsw_compound_code
    OUTVARIANTFIELD   s_vsw_variant
    UNIQUEFIELD   NONE
[STAGE:DOCK_HTVS]
    STAGECLASS   glide.DockingStage
    INPUTS   HTVS_RECOMBINE_OUT, GRID
    OUTPUTS   HTVS_OUT,
    RECOMBINE   NO
    PRECISION   HTVS
    UNIQUEFIELD   s_vsw_compound_code
    ${KEEP_NUM}

六、时间预估

流程名	速度	1万个分子所需时长
ligprep	1.5 秒/分子	4.1 小时/1万分子
glide_HTVS	0.5 秒/分子	1.5 小时/1万分子
glide_SP	9 秒/分子	25 小时/1万分子
glide_XP	100 秒/分子	280 小时/1万分子