[OCR] Tencent Cloud API: extracting table data and generating an Excel file

寒泉Hq · Published 2020-03-20 08:51:47

I. Tools and Python packages used

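Tencent Cloud API
All the large Chinese internet companies, such as Alibaba, Baidu and Tencent, offer cloud services. This article uses Tencent Cloud because its API documentation is detailed enough to use after a single read. Better still, it provides an online testing tool, so you can check the results with almost no code.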

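Python packages used
pandas: the standard data-analysis package, used here to arrange the recognized cells into a two-dimensional table.
os: the operating-system interface, used to list the files in the working directory, change the working directory, and so on.
json: parses JSON data and converts strings and other data to and from JSON.
base64: base64-encodes the image, as the API requires.
xlwings: interacts with Excel and can replace VBA for most tasks; easy to learn. It appears only in the commented-out alternative in the code.
tencentcloud: the Tencent Cloud SDK, which offers many other services worth exploring.
re: the regular-expression package, used to strip spaces and similar characters from strings.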
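
To make the roles of base64 and json concrete, here is a minimal sketch of how image bytes can be turned into a JSON request body. The ImageBase64 field name is the one used by the code in section III; the helper name build_payload and the sample bytes are made up for illustration. In Python 3, b64encode returns bytes, so the result has to be decoded to str before it goes into the JSON.

```python
import base64
import json


def build_payload(image_bytes):
    """Return a JSON string carrying the image as base64 text.

    build_payload is a hypothetical helper, not part of any SDK.
    """
    # b64encode returns bytes in Python 3; decode to get plain text
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    # json.dumps takes care of quoting and escaping
    return json.dumps({"ImageBase64": encoded})


# Stand-in bytes (the PNG file signature), not a real screenshot
sample = b"\x89PNG\r\n\x1a\n"
payload = build_payload(sample)
print(payload)
```

Building the string with json.dumps rather than by concatenation avoids quoting mistakes if more fields are added later.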

II. Preparation

1. Register for Tencent Cloud and obtain a SecretId and SecretKey.
Create a new API key in the console to obtain the SecretId and SecretKey.
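
Since the SecretId and SecretKey grant access to a paid account, it can be safer to keep them out of the source file. The following is a minimal sketch that reads them from environment variables; the helper name load_credentials and the variable names TENCENTCLOUD_SECRET_ID and TENCENTCLOUD_SECRET_KEY are assumptions made for this example, so use whatever names suit your setup.

```python
import os


def load_credentials(environ=os.environ):
    """Read the key pair from environment variables.

    The variable names below are an assumption for this sketch.
    """
    secret_id = environ.get("TENCENTCLOUD_SECRET_ID")
    secret_key = environ.get("TENCENTCLOUD_SECRET_KEY")
    if not secret_id or not secret_key:
        raise RuntimeError(
            "Set TENCENTCLOUD_SECRET_ID and TENCENTCLOUD_SECRET_KEY first"
        )
    return secret_id, secret_key
```

The returned pair can then be passed as the SecretId and SecretKey arguments of the excelFromPictures function in section III instead of hard-coded strings.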
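2. Activate the resource pack (Tencent Cloud - OCR)

The service will require payment later. The official billing notice: the Tencent Cloud OCR (free public beta) service is free for a limited time. After the free period ends, it will be upgraded to a paid production service, with monthly billing starting at 00:00 on April 1, 2020.

So it is worth trying while it is still free…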
3. Install the dependencies with pip

Dependencies to install:

pandas==0.24.2
tencentcloud_sdk_python==3.0.142
numpy==1.16.5

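The code in section III also needs XlsxWriter, because it writes the file with engine='xlsxwriter'. If pip is slow on networks in mainland China, pass -i to use the Tsinghua mirror. For example: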
pip install XlsxWriter -i https://pypi.tuna.tsinghua.edu.cn/simple

(Aside: a requirements.txt for the current project's dependencies can be generated as well, for example with pip freeze > requirements.txt.)
4. Prepare a few reasonably sharp screenshots

Screenshots are used directly for testing here.
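
The script in section III sends every file in the pictures folder to the API, so a stray non-image file would cost a failed request. Below is a small sketch that keeps only files that look like images. The helper names and the extension set are assumptions for this example; check the API documentation for the formats it actually accepts.

```python
import os

# Assumed set of extensions; adjust to the formats you actually use
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def is_image_name(name):
    """Return True if the file name ends in a known image extension."""
    ext = os.path.splitext(name)[1].lower()
    return ext in IMAGE_EXTENSIONS


def list_images(folder):
    """Return the sorted names of image files directly inside folder."""
    return sorted(
        name
        for name in os.listdir(folder)
        if is_image_name(name) and os.path.isfile(os.path.join(folder, name))
    )
```

Calling list_images("./pictures/") would then give the list of names to loop over.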

III. Code

# -*- coding: utf-8 -*-
# from PIL import Image
# import pytesseract
# General-purpose packages
import numpy as np
import pandas as pd
import os
import json
import re
import base64
# import xlwings as xw
# Tencent Cloud SDK
from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.ocr.v20181119 import ocr_client, models


# Recognize the table in one picture and write it to ../tables/ as an .xlsx file
def excelFromPictures(picture, SecretId, SecretKey):
    resp = None
    # try:
    with open(picture, "rb") as f:
        img_data = f.read()
    img_base64 = base64.b64encode(img_data)
    cred = credential.Credential(SecretId, SecretKey)  # SecretId and SecretKey are issued by Tencent Cloud
    httpProfile = HttpProfile()
    httpProfile.endpoint = "ocr.tencentcloudapi.com"

    clientProfile = ClientProfile()
    clientProfile.httpProfile = httpProfile
    client = ocr_client.OcrClient(cred, "ap-shanghai", clientProfile)

    req = models.TableOCRRequest()
    # b64encode returns bytes; decode to str so the JSON does not contain "b'...'"
    params = json.dumps({"ImageBase64": img_base64.decode("utf-8")})
    req.from_json_string(params)
    resp = client.TableOCR(req)
    #     print(resp.to_json_string())

    # except TencentCloudSDKException as err:
    #     print(err)

    # Parse the JSON response and pull out each cell's row, column and text
    result1 = json.loads(resp.to_json_string())

    rowIndex = []
    colIndex = []
    content = []

    for item in result1['TextDetections']:
        rowIndex.append(item['RowTl'])
        colIndex.append(item['ColTl'])
        content.append(item['Text'])

    # Export to Excel
    # ExcelWriter approach
    rowIndex = pd.Series(rowIndex)
    colIndex = pd.Series(colIndex)

    index = rowIndex.unique()
    index.sort()

    columns = colIndex.unique()
    columns.sort()

    data = pd.DataFrame(index=index, columns=columns)
    for i in range(len(rowIndex)):
        data.loc[rowIndex[i], colIndex[i]] = re.sub(" ", "", content[i])

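    # Make sure the output directory exists, then name the file after the picture
    os.makedirs("../tables", exist_ok=True)
    out_path = "../tables/" + os.path.splitext(os.path.basename(picture))[0] + ".xlsx"
    # The context manager saves on exit (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
        data.to_excel(writer, sheet_name='Sheet1', index=False, header=False)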

    # xlwings approach
    # wb = xw.Book()
    # sht = wb.sheets('Sheet1')
    # for i in range(len(rowIndex)):
    #     sht[rowIndex[i],colIndex[i]].value = re.sub(" ",'',content[i])
    # wb.save("../tables/" + re.match(".*\.",f.name).group() + "xlsx")
    # wb.close()


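# Run from the project root: pictures are read from ./pictures/ and
# results are written to ./tables/ (../tables/ after the chdir below)
os.makedirs("./tables/", exist_ok=True)

os.chdir("./pictures/")
pictures = os.listdir('.')
for pic in pictures:
    try:
        excelFromPictures(pic, "YOUR_SECRET_ID", "YOUR_SECRET_KEY")
        print("Finished extracting " + pic + ".")
    except TencentCloudSDKException as err:
        # Report the failure and carry on with the next picture
        print("Failed on " + pic + ": " + str(err))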
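
The core of the script is the step that places each recognized cell at its row and column position. The sketch below shows that idea without pandas or a network call, so it can be tried offline. The function name detections_to_grid and the sample data are made up for illustration; only the RowTl, ColTl and Text keys come from the code above. Unlike the DataFrame in the script, which leaves missing cells as NaN, this version fills them with empty strings.

```python
def detections_to_grid(detections):
    """Place each cell's text at (RowTl, ColTl) in a dense list of lists."""
    # Distinct row and column indices, in ascending order
    rows = sorted({d["RowTl"] for d in detections})
    cols = sorted({d["ColTl"] for d in detections})
    # Map each original index to its position in the output grid
    row_pos = {r: i for i, r in enumerate(rows)}
    col_pos = {c: j for j, c in enumerate(cols)}
    grid = [["" for _ in cols] for _ in rows]
    for d in detections:
        # Strip spaces, as the script does with re.sub
        text = d["Text"].replace(" ", "")
        grid[row_pos[d["RowTl"]]][col_pos[d["ColTl"]]] = text
    return grid


# Made-up sample in the shape the script reads from TextDetections
sample = [
    {"RowTl": 0, "ColTl": 0, "Text": "Name"},
    {"RowTl": 0, "ColTl": 1, "Text": "Score"},
    {"RowTl": 1, "ColTl": 0, "Text": "Li Lei"},
    {"RowTl": 1, "ColTl": 1, "Text": "9 5"},
]
grid = detections_to_grid(sample)
for row in grid:
    print(row)
```

Note that stripping every space also joins words that were meant to be separate, which suits Chinese text but may not suit English cell contents.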