[Python] Fetching Compound Data by CID (Using the Official PubChem API)


Xavier Jiezou · Published 2021-07-30 16:13:29

Introduction

This tool retrieves compound data from PubChem by CID, using the PubChem PUG REST API. Data for more than a thousand CIDs can be fetched in 2 to 3 seconds.

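Download

I have packaged the program as an executable that is ready to use after download: pubchem-1.0.2-win64.zip

Demo

[Image: demo of the program]

If you are not a developer, just download and use the packaged program; you do not need to read past this point. Contact me if you run into problems.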

Installation

pip install requests

Usage

Clone the repository and change into its root directory.

git clone https://github.com/XavierJiezou/python-pubchem-api.git

cd python-pubchem-api

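Copy your CID list into cid.txt, one CID per line.
Run the command python pubchem.py.
The results are saved to both data.json and data.csv.
You can also change which properties are fetched by editing the variable self.property_list in pubchem.py, choosing from the compound property table below: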

self.property_list = [
    'IUPACName',
    'IsomericSMILES',
    'MolecularFormula',
    'MolecularWeight',
    'HBondDonorCount',
    'HBondAcceptorCount'
]
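Once the script has run, the results can be read back for further processing. The sketch below assumes the JSON layout that pubchem.py writes (an InfoList object holding an Info list of per-compound records); the helper names load_compounds and index_by_cid are hypothetical, and the sample record is illustrative rather than real output.

```python
import json


def load_compounds(json_text):
    """Return the list of per-compound records from the script's JSON output."""
    doc = json.loads(json_text)
    # Layout written by pubchem.py: {"InfoList": {"Info": [{...}, ...]}}
    return doc["InfoList"]["Info"]


def index_by_cid(records):
    """Map each CID to its record for direct lookup."""
    return {rec["CID"]: rec for rec in records}


# Illustrative sample in the same layout (not captured from a real run).
sample = json.dumps({
    "InfoList": {"Info": [
        {"CID": 962, "MolecularFormula": "H2O", "Synonym": ["water"]},
    ]}
})

records = load_compounds(sample)
by_cid = index_by_cid(records)
print(by_cid[962]["MolecularFormula"])  # prints: H2O
```

For the real file, the text could be read with pathlib.Path('data.json').read_text() and passed to load_compounds.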

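Related

Compound property table

To request several properties at once, put a comma-separated list of property tags in the URL. Valid output formats for the property table are XML, ASNT/B, JSON(P), CSV, and TXT (TXT is limited to a single property). The available properties are: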
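As a minimal sketch of how such a URL can be assembled, the function below joins the CIDs and the property tags with commas and appends the output format. The name build_property_url and its parameters are hypothetical; only the URL layout comes from the PUG REST API.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"


def build_property_url(cids, properties, output_format="JSON"):
    """Build a PUG REST property URL for the given CIDs and property tags."""
    # TXT output is limited to a single property.
    if output_format.upper() == "TXT" and len(properties) != 1:
        raise ValueError("TXT output supports exactly one property")
    cid_part = ",".join(str(cid) for cid in cids)
    property_part = ",".join(properties)
    return f"{BASE_URL}{cid_part}/property/{property_part}/{output_format}"


url = build_property_url(
    [1, 2, 3, 4, 5],
    ["MolecularFormula", "MolecularWeight"],
)
print(url)
```

Switching the last argument to "CSV" or "XML" requests the same table in another format.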

Property	Description
MolecularFormula	Molecular formula.
MolecularWeight	The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location.
CanonicalSMILES	Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a “canonicalization” algorithm.
IsomericSMILES	Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications.
InChI	Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string.
InChIKey	Hashed version of the full standard InChI, consisting of 27 characters.
IUPACName	Chemical name systematically determined according to the IUPAC nomenclatures.
Title	The title used for the compound summary page.
XLogP	Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule.
ExactMass	The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum.
MonoisotopicMass	The mass of a molecule, calculated using the mass of the most abundant isotope of each element.
TPSA	Topological polar surface area, computed by the algorithm described in the paper by Ertl et al.
Complexity	The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula.
Charge	The total (or net) charge of a molecule.
HBondDonorCount	Number of hydrogen-bond donors in the structure.
HBondAcceptorCount	Number of hydrogen-bond acceptors in the structure.
RotatableBondCount	Number of rotatable bonds.
HeavyAtomCount	Number of non-hydrogen atoms.
IsotopeAtomCount	Number of atoms with enriched isotope(s)
AtomStereoCount	Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration].
DefinedAtomStereoCount	Number of atoms with defined tetrahedral (sp3) stereo.
UndefinedAtomStereoCount	Number of atoms with undefined tetrahedral (sp3) stereo.
BondStereoCount	Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
DefinedBondStereoCount	Number of bonds with defined planar (sp2) stereo.
UndefinedBondStereoCount	Number of bonds with undefined planar (sp2) stereo.
CovalentUnitCount	Number of covalently bound units.
Volume3D	Analytic volume of the first diverse conformer (default conformer) for a compound.
XStericQuadrupole3D	The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound.
YStericQuadrupole3D	The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound.
ZStericQuadrupole3D	The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound.
FeatureCount3D	Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D)
FeatureAcceptorCount3D	Number of hydrogen-bond acceptors of a conformer
FeatureDonorCount3D	Number of hydrogen-bond donors of a conformer.
FeatureAnionCount3D	Number of anionic centers (at pH 7) of a conformer.
FeatureCationCount3D	Number of cationic centers (at pH 7) of a conformer.
FeatureRingCount3D	Number of rings of a conformer.
FeatureHydrophobeCount3D	Number of hydrophobes of a conformer.
ConformerModelRMSD3D	Conformer sampling RMSD in Å.
EffectiveRotorCount3D	Effective rotor count of a compound, a measure of its conformational flexibility used for the 3D conformer model.
ConformerCount3D	The number of conformers in the conformer model for a compound.
Fingerprint2D	Base64-encoded PubChem Substructure Fingerprint of a molecule.

Property API

Get properties by CID.

Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON
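The JSON returned by this endpoint nests the rows under PropertyTable and Properties, which is also how the source code later in this post reads it. The following sketch shows one way to turn such a response into a lookup keyed by CID; the helper name properties_by_cid is hypothetical, and the payload is a hand-written illustration in the response layout, not a captured API response.

```python
def properties_by_cid(payload, wanted):
    """Turn a PropertyTable response into {cid: {property: value}}."""
    table = {}
    for row in payload["PropertyTable"]["Properties"]:
        # A compound may lack a requested property, so default to None.
        table[row["CID"]] = {name: row.get(name) for name in wanted}
    return table


# Illustrative payload; the second row deliberately omits MolecularWeight.
sample_payload = {
    "PropertyTable": {
        "Properties": [
            {"CID": 962, "MolecularFormula": "H2O", "MolecularWeight": "18.015"},
            {"CID": 280, "MolecularFormula": "CO2"},
        ]
    }
}

result = properties_by_cid(sample_payload, ["MolecularFormula", "MolecularWeight"])
print(result[962]["MolecularWeight"])  # prints: 18.015
print(result[280]["MolecularWeight"])  # prints: None
```

Using row.get instead of row[...] keeps one missing value from aborting the whole batch.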

Synonym API

Get synonyms by CID.

Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
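The synonym response nests its entries under InformationList and Information, and an entry may come back without a Synonym key, which the source code later in this post also guards against. A minimal sketch of handling that case follows; the helper names are hypothetical and the payload is illustrative, with a made-up second entry that has no synonyms.

```python
def synonyms_by_cid(payload):
    """Turn a synonyms response into {cid: [synonym, ...]}."""
    result = {}
    for entry in payload["InformationList"]["Information"]:
        # Entries without synonyms simply omit the key.
        result[entry["CID"]] = entry.get("Synonym", [])
    return result


def first_synonym(synonyms, default=""):
    """Return the first listed synonym, or a default if there is none."""
    return synonyms[0] if synonyms else default


# Illustrative payload, not a captured API response.
sample_payload = {
    "InformationList": {
        "Information": [
            {"CID": 962, "Synonym": ["water", "dihydrogen oxide"]},
            {"CID": 123456789},  # made-up entry without a Synonym key
        ]
    }
}

synonyms = synonyms_by_cid(sample_payload)
print(first_synonym(synonyms[962]))               # prints: water
print(first_synonym(synonyms[123456789], "n/a"))  # prints: n/a
```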

Packaging

git clone https://github.com/XavierJiezou/python-pubchem-api.git
cd python-pubchem-api
pip install pipenv
pipenv install
pipenv shell
pip install requests
pip install pyinstaller
pyinstaller -F -i favicon.ico pubchem.py

Source code

https://github.com/XavierJiezou/python-pubchem-api

import csv
import json
import os

import requests


class PubchemCrawlFast():
    def __init__(self, cid_path, out_path):
        """Initialization function.

        Args:
            cid_path (str): Input file path of cid list
            out_path (str): Output file path of crawled data 
        """
        self.cid_path = cid_path
        self.out_path = out_path
        self.property_list = [
            'IUPACName',
            'IsomericSMILES',
            'MolecularFormula',
            'MolecularWeight',
            'HBondDonorCount',
            'HBondAcceptorCount'
        ]

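    def get_cid_list(self):
        """Get the cid list from the local file, or from user input if the file is missing
        """
        if os.path.exists(self.cid_path):
            with open(self.cid_path) as f:
                # Skip blank lines so that no empty CID ends up in the request URL
                self.cid_list = [i.strip() for i in f if i.strip()]
        else:
            self.cid_list = []
            # Read one CID per line until an empty line is entered
            cid = input('Please input the CID list below: \n').strip()
            while cid != '':
                self.cid_list.append(cid)
                cid = input().strip()
        self.length = len(self.cid_list)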

    def get_property_from_cid(self):
        """Get the property from cid
        """
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        property_str = ','.join(self.property_list)
        return_type = 'json'
        self.prp = []
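        for i in range(limit, self.length+limit, limit):
            # Send at most `limit` CIDs per request
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/property/{property_str}/{return_type}'
            res = requests.get(url, timeout=30)
            # Raise on an HTTP error instead of failing later with a KeyError
            res.raise_for_status()
            self.prp += res.json()['PropertyTable']['Properties']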

    def get_synonyms_from_cid(self):
        """Get the synonym from cid
        """
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        return_type = 'json'
        self.syn = []
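        for i in range(limit, self.length+limit, limit):
            # Send at most `limit` CIDs per request
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/synonyms/{return_type}'
            res = requests.get(url, timeout=30)
            # Raise on an HTTP error instead of failing later with a KeyError
            res.raise_for_status()
            self.syn += res.json()['InformationList']['Information']
        # Compounds without synonyms omit the key, so add an empty list for them
        for item in self.syn:
            item.setdefault('Synonym', [])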

    def save_as_csv(self, data):
        """Save the crawled data in CSV format
        """
        # splitext only strips the final extension, so dots elsewhere in the path are safe
        csv_name = os.path.splitext(self.out_path)[0]+'.csv'
        header_list = ['CID']+self.property_list+['Synonym']
        # Write UTF-8 explicitly, since synonyms may contain non-ASCII characters
        with open(csv_name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, header_list)
            writer.writeheader()
            writer.writerows(data)

    def __main__(self):
        print('Getting CID list: ')
        self.get_cid_list()
        print('CID list acquisition is complete!')
        print('--------------------------------------------')
        print('Querying property list: ')
        self.get_property_from_cid()
        print('Property list query is complete!')
        print('--------------------------------------------')
        print('Querying synonym: ')
        self.get_synonyms_from_cid()
        print('Synonym query is complete!')
        print('--------------------------------------------')
        dt = {
            'InfoList': {
                'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
            }
        }
        json_str = json.dumps(dt, indent=2)
        print('The data is being written to the JSON file: ')
        with open(self.out_path, 'w') as f:
            f.write(json_str)
        print('Finished writing the JSON file! ')
        print('--------------------------------------------')
        print('The data is being written to the CSV file: ')
        self.save_as_csv(dt['InfoList']['Info'])
        print('Finished writing the CSV file! ')
        # Keep the console window open; unlike `pause`, this also works outside Windows
        input('Press Enter to exit...')


if __name__ == '__main__':
    PubchemCrawlFast('cid.txt', 'data.json').__main__()
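The script above relies on two ideas: sending the CIDs in batches of 300, and pairing each property record with its synonym record by position using zip. The sketch below isolates both steps and, as an alternative design rather than the script's own method, pairs the records by CID so the result does not depend on the two lists arriving in the same order. The function names chunked and merge_by_cid are hypothetical, and the sample records are illustrative.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_by_cid(properties, synonyms):
    """Attach each compound's synonym list to its property record, matched by CID."""
    lookup = {rec["CID"]: rec.get("Synonym", []) for rec in synonyms}
    return [dict(prop, Synonym=lookup.get(prop["CID"], [])) for prop in properties]


# 650 CIDs split into batches of 300 give batches of 300, 300 and 50.
batch_sizes = [len(batch) for batch in chunked(list(range(1, 651)), 300)]
print(batch_sizes)  # prints: [300, 300, 50]

# Illustrative records; the synonym list is in a different order on purpose.
props = [
    {"CID": 962, "MolecularFormula": "H2O"},
    {"CID": 280, "MolecularFormula": "CO2"},
]
syns = [
    {"CID": 280, "Synonym": ["carbon dioxide"]},
    {"CID": 962, "Synonym": ["water"]},
]
merged = merge_by_cid(props, syns)
print(merged[0])
```

Matching by CID costs one extra dictionary but avoids silently pairing the wrong records if one of the responses ever omits or reorders an entry.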