pyPdf for IndirectObject extraction


Mangs · posted 2022-08-20 19:29:46


Following this example, I can list all the objects in a PDF file:

import pyPdf
# Open in binary mode; a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages)  # Process all the objects.
print(pdf.resolvedObjects)

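Now I need to extract a non-standard object from the PDF file.

My object is the one named MYOBJECT, and it is a string.

The part of the Python script's output that concerns me is: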

{'/MYOBJECT': IndirectObject(584, 0)}

The PDF file looks like this:

558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
  <</ColorSpace <</CS0 563 0 R>>
    /ExtGState <</GS0 568 0 R>>
    /Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
    /ProcSet[/PDF/Text/ImageC]
    /Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
    /XObject<</Im0 578 0 R>>>>
  /Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream

1_22_4_1     --->>>>  this is the string I need to extract from the object

endstream
endobj

How can I follow the reference to object 584 to get at my string (using pyPdf, of course)?

Answers

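Each element of pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

You can try printing it on its own, or poking at it with help and dir at a Python prompt, to learn more about how to get the string you need.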

Edit:

After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and its value can be retrieved with getData().
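To make that key path easier to read, here is a minimal sketch that walks the same sequence of keys through plain nested dictionaries. It does not use pyPdf at all: FakeStream, resolved_objects and walk are hypothetical stand-ins introduced only for illustration, and the nesting (generation number, then object number, then the page dictionary) is assumed from the expression in the edit above.

```python
from functools import reduce


class FakeStream(object):
    """Hypothetical stand-in for a stream object; only getData() is mimicked."""

    def __init__(self, data):
        self._data = data

    def getData(self):
        return self._data


# Plain-dict imitation of the assumed layout:
# {generation number: {object number: object}}
resolved_objects = {
    0: {
        558: {
            "/Type": "/Page",
            "/Resources": {
                "/Properties": {
                    "/MC0": {"/MYOBJECT": FakeStream("1_22_4_1")},
                },
            },
        },
    },
}


def walk(root, path):
    """Follow a sequence of keys through nested dictionaries."""
    return reduce(lambda node, key: node[key], path, root)


path = [0, 558, "/Resources", "/Properties", "/MC0", "/MYOBJECT"]
value = walk(resolved_objects, path).getData()
print(value)
```

With a real file, the data returned for a stream may carry the line breaks that surround the payload, so stripping whitespace from the result is probably a sensible extra step.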

The function below offers a more general approach: it searches recursively for the key in question.

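import pyPdf

# Open in binary mode; a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
pages = list(pdf.pages)  # Force every page object to be resolved.

def findInDict(needle, haystack):
    """Return the first value stored under needle, searching nested dictionaries."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries whose lookup raises.
            continue
        if key == needle:
            return value
        # DictionaryObject is checked explicitly, as in the original version.
        if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()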
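PDF dictionaries can point back at each other (a page has a /Parent entry, for example), so a recursive search like findInDict could in principle revisit the same dictionary. The following is a sketch of the same depth-first lookup with a visited set added. It runs on plain Python dictionaries rather than pyPdf objects; find_in_dict, the sample data and the /First key are hypothetical names chosen for the example, not part of the original answer.

```python
def find_in_dict(needle, haystack, _seen=None):
    """Depth-first search for the first value stored under `needle`.

    `_seen` holds the ids of dictionaries already visited, so that
    back-references cannot cause infinite recursion.
    """
    if _seen is None:
        _seen = set()
    if id(haystack) in _seen:
        return None
    _seen.add(id(haystack))
    for key in list(haystack):
        value = haystack[key]  # index lookup, as the original function does
        if key == needle:
            return value
        if isinstance(value, dict):
            found = find_in_dict(needle, value, _seen)
            if found is not None:
                return found
    return None


# Sample data with a cycle: the page and its parent refer to each other.
page = {
    "/Type": "/Page",
    "/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": "1_22_4_1"}}},
}
parent = {"/Type": "/Pages", "/First": page}
page["/Parent"] = parent

objects = {0: {29: parent, 558: page}}
result = find_in_dict("/MYOBJECT", objects)
missing = find_in_dict("/NOPE", objects)
print(result)
print(missing)
```

Returning None for a missing key means the caller should check the result before calling getData() on it.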
