Introduction

Fetch compound data from PubChem by CID, using the PubChem PUG REST API. Data for more than a thousand CIDs can be retrieved in 2 to 3 seconds.

Download

The program has been packaged as an executable file that can be used right after downloading: pubchem-1.0.2-win64.zip

Demo

[Demo image]


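Non-developers can simply download the packaged program and use it; there is no need to read past this point. If you run into any problems, please contact me.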


Installation

pip install requests

Usage

  1. Clone the repository.
git clone https://github.com/XavierJiezou/python-pubchem-api.git
  2. Change to the repository root.
cd python-pubchem-api
  3. Copy your CID list into cid.txt, one CID per line.
  4. Run the command python pubchem.py.
  5. The results are saved to data.json and data.csv.
  6. Optionally, edit the variable self.property_list in pubchem.py, using the compound property table below as a guide:
self.property_list = [
    'IUPACName',
    'IsomericSMILES',
    'MolecularFormula',
    'MolecularWeight',
    'HBondDonorCount',
    'HBondAcceptorCount'
]
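Once a run has finished, the results can be loaded back for further processing. The following is a minimal sketch, assuming the JSON layout that pubchem.py writes (an `InfoList` object holding an `Info` list, as in the source code further down). `load_results` and `index_by_cid` are hypothetical helper names and are not part of the repository.

```python
import json


def load_results(path="data.json"):
    """Return the list of per-compound records stored in the output file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # pubchem.py nests the records under InfoList -> Info.
    return data["InfoList"]["Info"]


def index_by_cid(records):
    """Map each CID to its record for quick lookup."""
    return {record["CID"]: record for record in records}
```

For example, `index_by_cid(load_results())[2244]` would return the record for CID 2244, provided that CID was in cid.txt.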

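Related

Compound property table

Multiple properties can be requested in one call by putting a comma-separated list of property tags in the URL. Valid output formats for the property table are XML, ASNT/B, JSON(P), CSV, and TXT (limited to a single property). The available properties are: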

| Property | Description |
| --- | --- |
| MolecularFormula | Molecular formula. |
| MolecularWeight | The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location. |
| CanonicalSMILES | Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a "canonicalization" algorithm. |
| IsomericSMILES | Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications. |
| InChI | Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string. |
| InChIKey | Hashed version of the full standard InChI, consisting of 27 characters. |
| IUPACName | Chemical name systematically determined according to the IUPAC nomenclatures. |
| Title | The title used for the compound summary page. |
| XLogP | Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule. |
| ExactMass | The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. |
| MonoisotopicMass | The mass of a molecule, calculated using the mass of the most abundant isotope of each element. |
| TPSA | Topological polar surface area, computed by the algorithm described in the paper by Ertl et al. |
| Complexity | The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula. |
| Charge | The total (or net) charge of a molecule. |
| HBondDonorCount | Number of hydrogen-bond donors in the structure. |
| HBondAcceptorCount | Number of hydrogen-bond acceptors in the structure. |
| RotatableBondCount | Number of rotatable bonds. |
| HeavyAtomCount | Number of non-hydrogen atoms. |
| IsotopeAtomCount | Number of atoms with enriched isotope(s). |
| AtomStereoCount | Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration]. |
| DefinedAtomStereoCount | Number of atoms with defined tetrahedral (sp3) stereo. |
| UndefinedAtomStereoCount | Number of atoms with undefined tetrahedral (sp3) stereo. |
| BondStereoCount | Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration]. |
| DefinedBondStereoCount | Number of bonds with defined planar (sp2) stereo. |
| UndefinedBondStereoCount | Number of bonds with undefined planar (sp2) stereo. |
| CovalentUnitCount | Number of covalently bound units. |
| Volume3D | Analytic volume of the first diverse conformer (default conformer) for a compound. |
| XStericQuadrupole3D | The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound. |
| YStericQuadrupole3D | The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound. |
| ZStericQuadrupole3D | The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound. |
| FeatureCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D). |
| FeatureAcceptorCount3D | Number of hydrogen-bond acceptors of a conformer. |
| FeatureDonorCount3D | Number of hydrogen-bond donors of a conformer. |
| FeatureAnionCount3D | Number of anionic centers (at pH 7) of a conformer. |
| FeatureCationCount3D | Number of cationic centers (at pH 7) of a conformer. |
| FeatureRingCount3D | Number of rings of a conformer. |
| FeatureHydrophobeCount3D | Number of hydrophobes of a conformer. |
| ConformerModelRMSD3D | Conformer sampling RMSD in Å. |
| EffectiveRotorCount3D | Effective rotor count of the conformer model, a measure of molecular flexibility. |
| ConformerCount3D | The number of conformers in the conformer model for a compound. |
| Fingerprint2D | Base64-encoded PubChem Substructure Fingerprint of a molecule. |
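The comma-separated URL form described above can be assembled with a small helper. This is a sketch only: `build_property_url` and `BASE_URL` are hypothetical names chosen for illustration, and the function merely formats a string following the URL pattern used elsewhere in this post; it does not contact PubChem.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"


def build_property_url(cids, properties, output="JSON"):
    """Build a PUG REST property URL for the given CIDs and property tags."""
    if not cids or not properties:
        raise ValueError("at least one CID and one property are required")
    if output.upper() == "TXT" and len(properties) != 1:
        # TXT output is limited to a single property.
        raise ValueError("TXT output supports only a single property")
    cid_part = ",".join(str(cid) for cid in cids)
    property_part = ",".join(properties)
    return f"{BASE_URL}/{cid_part}/property/{property_part}/{output}"


print(build_property_url([1, 2, 3], ["MolecularFormula", "MolecularWeight"]))
```

The single-property check mirrors the TXT restriction stated above; whether a given tag exists is not validated here.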

Property API

Get properties by CID.

Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON
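A JSON response from this endpoint nests its rows under `PropertyTable` and then `Properties`, which is how the source code in this post reads it. The sketch below reshapes such a response into a dictionary keyed by CID. `properties_by_cid` is a hypothetical helper, and the sample payload is hand-written to mimic the response shape; it was not fetched from the live API.

```python
def properties_by_cid(response):
    """Turn a PUG REST property response into a {CID: properties} dict."""
    rows = response["PropertyTable"]["Properties"]
    result = {}
    for row in rows:
        # Copy everything except the CID itself into the per-compound dict.
        result[row["CID"]] = {k: v for k, v in row.items() if k != "CID"}
    return result


# Hand-written payload that mimics the response shape (not fetched live).
sample = {
    "PropertyTable": {
        "Properties": [
            {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"},
        ]
    }
}
print(properties_by_cid(sample))
```

With requests installed, the same function could be applied to a live response via `properties_by_cid(requests.get(url, timeout=60).json())`.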

Synonym API

Get synonyms by CID.

Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
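The synonym response is read from `InformationList` and then `Information` in the source code of this post, which also guards against entries that have no `Synonym` key. A small sketch of that handling follows; `synonyms_by_cid` is a hypothetical name, and the payload is hand-written for illustration rather than taken from the live API.

```python
def synonyms_by_cid(response):
    """Turn a PUG REST synonyms response into a {CID: [synonyms]} dict."""
    rows = response["InformationList"]["Information"]
    # Entries without a Synonym key are mapped to an empty list.
    return {row["CID"]: row.get("Synonym", []) for row in rows}


# Hand-written payload that mimics the response shape (not fetched live).
sample = {
    "InformationList": {
        "Information": [
            {"CID": 2244, "Synonym": ["aspirin", "acetylsalicylic acid"]},
            {"CID": 999999999},  # illustrative entry with no synonyms
        ]
    }
}
print(synonyms_by_cid(sample))
```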

Packaging

git clone https://github.com/XavierJiezou/python-pubchem-api.git
cd python-pubchem-api
pip install pipenv
pipenv install
pipenv shell
pip install requests
pip install pyinstaller
pyinstaller -F -i favicon.ico pubchem.py

Source code

https://github.com/XavierJiezou/python-pubchem-api

import csv
import json
import os

import requests


class PubchemCrawlFast:
    def __init__(self, cid_path, out_path):
        """Initialization function.

        Args:
            cid_path (str): Input file path of cid list
            out_path (str): Output file path of crawled data 
        """
        self.cid_path = cid_path
        self.out_path = out_path
        self.property_list = [
            'IUPACName',
            'IsomericSMILES',
            'MolecularFormula',
            'MolecularWeight',
            'HBondDonorCount',
            'HBondAcceptorCount'
        ]

    def get_cid_list(self):
        """Get the CID list from the local file, or from stdin if it is missing."""
        if os.path.exists(self.cid_path):
            with open(self.cid_path) as f:
                # Skip blank lines so they do not become empty CIDs in the URL.
                self.cid_list = [i.strip() for i in f if i.strip()]
        else:
            self.cid_list = []
            # Read one CID per line until an empty line is entered.
            cid = input('Please input the CID list below: \n').strip()
            while cid != '':
                self.cid_list.append(cid)
                cid = input().strip()
        self.length = len(self.cid_list)

    def get_property_from_cid(self):
        """Get the properties for every CID, in batches of `limit` CIDs."""
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        property_str = ','.join(self.property_list)
        return_type = 'json'
        self.prp = []
        # Each iteration requests the slice cid_list[i-limit:i].
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/property/{property_str}/{return_type}'
            res = requests.get(url, timeout=60)
            res.raise_for_status()  # fail early on HTTP errors
            self.prp += res.json()['PropertyTable']['Properties']

    def get_synonyms_from_cid(self):
        """Get the synonyms for every CID, in batches of `limit` CIDs."""
        limit = 300
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        return_type = 'json'
        self.syn = []
        # Each iteration requests the slice cid_list[i-limit:i].
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/synonyms/{return_type}'
            res = requests.get(url, timeout=60)
            res.raise_for_status()  # fail early on HTTP errors
            self.syn += res.json()['InformationList']['Information']
        # Entries without synonyms lack the key; give them an empty list.
        for item in self.syn:
            item.setdefault('Synonym', [])

    def save_as_csv(self, data):
        """Save the crawled data in CSV format."""
        # splitext keeps directory or file names that contain dots intact.
        csv_name = os.path.splitext(self.out_path)[0]+'.csv'
        header_list = ['CID']+self.property_list+['Synonym']
        with open(csv_name, 'w', newline='') as f:
            writer = csv.DictWriter(f, header_list)
            writer.writeheader()
            writer.writerows(data)

    def main(self):
        """Run the full pipeline: read CIDs, query PubChem, write JSON and CSV."""
        print('Getting CID list: ')
        self.get_cid_list()
        print('CID list acquisition is complete!')
        print('--------------------------------------------')
        print('Querying property list: ')
        self.get_property_from_cid()
        print('Property list query is complete!')
        print('--------------------------------------------')
        print('Querying synonym: ')
        self.get_synonyms_from_cid()
        print('Synonym query is complete!')
        print('--------------------------------------------')
        # Pair each property record with the synonym record at the same position.
        dt = {
            'InfoList': {
                'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
            }
        }
        json_str = json.dumps(dt, indent=2)
        print('The data is being written to the JSON file: ')
        with open(self.out_path, 'w') as f:
            f.write(json_str)
        print('Finished writing the JSON file! ')
        print('--------------------------------------------')
        print('The data is being written to the CSV file: ')
        self.save_as_csv(dt['InfoList']['Info'])
        print('Finished writing the CSV file! ')
        # Windows only: wait for a key press before the console closes.
        if os.name == 'nt':
            os.system('pause')


if __name__ == '__main__':
    PubchemCrawlFast('cid.txt', 'data.json').main()
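Two ideas in the script can be isolated as small helpers: splitting the CID list into batches of at most 300, and combining the property and synonym records. The sketch below is not the repository's code. `chunked` and `merge_by_cid` are hypothetical names, and joining on the `CID` field is a swapped-in alternative to the position-based `zip` used above, chosen so the result does not depend on both responses listing compounds in the same order.

```python
def chunked(items, size=300):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_by_cid(properties, synonyms):
    """Join property and synonym records on their CID field."""
    synonym_map = {row["CID"]: row.get("Synonym", []) for row in synonyms}
    return [
        # Property rows with no matching synonym entry get an empty list.
        dict(row, Synonym=synonym_map.get(row["CID"], []))
        for row in properties
    ]


batches = list(chunked([str(n) for n in range(1, 651)]))
print([len(batch) for batch in batches])
```

Each batch can then be joined with commas to form the CID part of one request URL.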

References

https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
