Processing PDFs with Python

73826669 · Published 2021-08-12 00:01:41

Table of contents

PDF page operations
Extracting PDF content

PDF page operations

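This section mainly uses PyPDF4. PyPDF2 is the more popular package, but at the time of writing it was no longer being maintained. PyPDF4 is the newer of the two, and hopefully its author will keep maintaining it.
Install: pip install PyPDF4
GitHub: https://github.com/claird/PyPDF4
PyPI: https://www.cnpython.com/pypi/pypdf4
The current version is 1.27.0, and its API is essentially the same as PyPDF2's.
PyPDF2 documentation: https://pythonhosted.org/PyPDF2/

PyPDF4 has two main classes, PdfFileReader and PdfFileWriter. As the names suggest, the former reads PDFs and the latter writes them.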

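PdfFileReader

Reading a PDF

from PyPDF4 import PdfFileReader

pdf_path = r"F:\test.pdf"
# PdfFileReader takes a path or a binary file object.
# Its second argument is `strict`, not a file mode, so no 'rb' is passed here.
pdf = PdfFileReader(pdf_path)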

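Some methods

pdf.getDocumentInfo()  # get the document information (metadata)
pdf.getIsEncrypted()   # whether the file is encrypted
pdf.getNumPages()      # get the number of pages
pdf.getPage(index)     # get the page at index (counting from 0)
pdf.getOutlines()      # get the outline (bookmarks)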

PdfFileWriter

Writing a PDF

from PyPDF4 import PdfFileWriter
output = PdfFileWriter()
# Open the target in binary write mode and let the writer serialize into it
with open(r'F:\output.pdf', 'wb') as f:
    output.write(f)

Some methods

output.addPage(page)                 # append a page
output.addBlankPage()                # append a blank page
output.addBookmark(title, pagenum)   # add a bookmark; pagenum counts from 0
output.cloneDocumentFromReader(reader)   # deep-copy a whole document from a PdfFileReader instance
output.insertBlankPage(index=pos)    # insert a blank page at position pos
output.insertPage(page, pos)         # insert page at position pos

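output.getNumPages()                 # get the number of pages
output.getPage(index)                # get the page at index
output.getOutlineRoot()              # get the outline root (in the PyPDF2 1.26 API, getOutlines() is a PdfFileReader method)
output.encrypt(user_pwd)             # encrypt with a user password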

PdfFileMerger

A class for merging multiple PDF files. Its main methods are merge and append. I have not yet worked out how to use it in detail.
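
As a rough illustration of how append might be used, here is a minimal sketch. It assumes PyPDF4 1.27.0 keeps the PyPDF2 1.26 merger interface, where append(fileobj) adds all pages of a file at the end and merge(position, fileobj) inserts them at a given page position. The helper names sorted_pdf_paths and merge_pdfs are made up for this example and are not part of the library.

```python
import os


def sorted_pdf_paths(folder, exclude=()):
    """Return full paths of the PDF files in `folder`, sorted by file name."""
    names = sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf") and name not in exclude
    )
    return [os.path.join(folder, name) for name in names]


def merge_pdfs(paths, out_path):
    """Concatenate the PDFs in `paths` into `out_path`, in the given order."""
    # Imported here so the helper above also works without PyPDF4 installed.
    from PyPDF4 import PdfFileMerger

    merger = PdfFileMerger()
    try:
        for path in paths:
            # append() adds every page of the file after the pages merged so far
            merger.append(path)
        with open(out_path, "wb") as f:
            merger.write(f)
    finally:
        # close() releases the input files that append() opened
        merger.close()
```

A call such as merge_pdfs(sorted_pdf_paths("F:/", exclude=("new.pdf",)), "F:/new.pdf") would merge every PDF in the folder in file-name order while skipping a previous output file.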

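Examples

Deleting a given page

import os
from PyPDF4 import PdfFileWriter, PdfFileReader

# Use "F:/" rather than "F:"; joining onto a bare drive letter gives a drive-relative path
path = "F:/"
index = 1  # index of the page to delete, counting from 0
infile = PdfFileReader(os.path.join(path, 'test.pdf'))
output = PdfFileWriter()

# Copy every page except the one at `index`
for i in range(infile.getNumPages()):
    if i != index:
        p = infile.getPage(i)
        output.addPage(p)

with open(os.path.join(path, 'new_test.pdf'), 'wb') as f:
    output.write(f)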
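
The same loop extends naturally to dropping several pages at once. The following is a minimal sketch under that assumption; indices_to_keep and delete_pages are hypothetical helper names, and only getNumPages, getPage, addPage and write come from PyPDF4.

```python
def indices_to_keep(num_pages, drop):
    """Return the zero-based page indices that remain after removing `drop`."""
    drop = set(drop)
    return [i for i in range(num_pages) if i not in drop]


def delete_pages(in_path, out_path, drop):
    """Write a copy of `in_path` to `out_path` without the pages in `drop`."""
    # Imported here so indices_to_keep() can be used without PyPDF4 installed.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    with open(in_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for i in indices_to_keep(reader.getNumPages(), drop):
            writer.addPage(reader.getPage(i))
        # Write while the source file is still open, since page data
        # may be read from it only when the output is serialized.
        with open(out_path, "wb") as dst:
            writer.write(dst)
```

For instance, delete_pages("F:/test.pdf", "F:/new_test.pdf", [0, 3]) would drop the first and fourth pages. Indices outside the document are simply ignored.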

Merging multiple PDFs

import os
from PyPDF4 import PdfFileWriter, PdfFileReader

path = "F:/"
# Keep only PDF files, skip a previous output file, and sort so the merge order is predictable
pdf_list = sorted(f for f in os.listdir(path)
                  if f.lower().endswith('.pdf') and f != 'new.pdf')

output = PdfFileWriter()

for pdf in pdf_list:
    infile = PdfFileReader(os.path.join(path, pdf))
    # output.cloneDocumentFromReader(infile)  # merge order not tested
    for i in range(infile.getNumPages()):
        p = infile.getPage(i)
        output.addPage(p)

with open(os.path.join(path, 'new.pdf'), 'wb') as f:
    output.write(f)

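Extracting PDF content

This section mainly uses pdfplumber. Other packages can also extract content, but most give unsatisfactory results. This one works somewhat better and is still being maintained.
Install: pip install pdfplumber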

Reading a PDF

import pdfplumber
pdf = pdfplumber.open(r"F:\test.pdf")
pdf.metadata    # basic document information
pdf.pages       # list with one entry per page

Inspecting a page

page = pdf.pages[0]
page.page_number # page number: 1 (counting from 1)
page.width       # page width
page.height      # page height
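
Page width and height are reported in PDF units, which by default are points (1/72 inch). As a small sketch under that assumption, the following converts page sizes to millimetres. The names points_to_mm and page_sizes_mm are invented for this example, and pdfplumber.open is used as a context manager so the file is closed afterwards.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_mm(points):
    """Convert a length in PDF points to millimetres."""
    # float() also handles Decimal values, which some pdfplumber versions return
    return float(points) * MM_PER_INCH / POINTS_PER_INCH


def page_sizes_mm(pdf_path):
    """Return (page_number, width_mm, height_mm) for each page of the PDF."""
    # Imported here so points_to_mm() can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            (page.page_number,
             round(points_to_mm(page.width), 1),
             round(points_to_mm(page.height), 1))
            for page in pdf.pages
        ]
```

An A4 page is roughly 595 x 842 points, so it should come out close to 210 x 297 mm.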

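Extracting content

page.extract_text()   # extract text; returns a str
page.extract_words()  # extract words; returns a list of dicts with fields such as x0, x1, top, bottom, text
page.extract_table()  # extract the largest table on the page; returns a list of rows, or None if no table is found
page.extract_tables() # extract all tables; returns a list of tables -> rows -> cells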
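
Extracted tables usually need a little cleanup, because empty cells come back as None. The sketch below collects the tables from every page and normalizes the cells. clean_table and extract_all_tables are hypothetical helpers written for this example; only pdfplumber.open, pdf.pages, page.page_number and page.extract_tables come from the library.

```python
def clean_table(table):
    """Replace None cells with '' and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]


def extract_all_tables(pdf_path):
    """Return a list of (page_number, rows) pairs for every table in the PDF."""
    # Imported here so clean_table() can be used without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per table found on the page
            for table in page.extract_tables():
                tables.append((page.page_number, clean_table(table)))
    return tables
```

Each rows value is a plain list of lists of strings, so it can be passed directly to csv.writer(...).writerows(rows) if the table should be saved as CSV.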
