HTML Table Extractor


HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Important links

Installation

pip install 'beautifulsoup4==4.5.3'

pip install html-table-extractor

Usage

Example 1 - Simple

1 2
3 4

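from html_table_extractor.extractor import Extractor

table_doc = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
</table>
"""

# Parse the table, then fetch its rows as a list of lists
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will return: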

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

1 2
3 4

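from html_table_extractor.extractor import Extractor

table_doc = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
</table>
"""

# The transformer converts each cell value, here to int
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

It will return: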

[[1, 2], [3, 4]]
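
Passing `int` works only when every cell is numeric; a cell such as "n/a" would raise a ValueError. The sketch below shows a more tolerant transformer. The helper name `to_int_or_text` is hypothetical, and the sketch assumes, based on Example 2, that the transformer is called once per cell with that cell's text.

```python
def to_int_or_text(cell):
    """Hypothetical transformer: return an int when possible, else the stripped text."""
    text = cell.strip()
    try:
        return int(text)
    except ValueError:
        # Not a whole number, so keep the cleaned-up text instead of failing
        return text


# Applied cell by cell, the way Example 2 applies `int`
rows = [["1", " 2 "], ["n/a", "4"]]
converted = [[to_int_or_text(cell) for cell in row] for row in rows]
print(converted)
```

Under that assumption it could be passed as `Extractor(table_doc, transformer=to_int_or_text)`.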

Example 3 - Pass BS4 Tag

1 2
3 4

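from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup

table_doc = """
<table id="wanted">
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
</table>
<table>
<tr><td>not wanted</td></tr>
</table>
"""

soup = BeautifulSoup(table_doc, 'html.parser')

# id_ selects the table to extract from the parsed document
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

It will return: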

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

Input table: 1 spans two rows, 4 spans two columns, and 5 spans three columns.

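from html_table_extractor.extractor import Extractor

table_doc = """
<table>
<tr><td rowspan="2">1</td><td>2</td><td>3</td></tr>
<tr><td colspan="2">4</td></tr>
<tr><td colspan="3">5</td></tr>
</table>
"""

# Spanning cells are repeated in every position they cover
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will return: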

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

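Input table: 1 spans two rows, 3 spans three rows, 4 spans two columns, and 5 spans three columns, so the spans of 4 and 5 overlap positions already claimed by 3.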

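from html_table_extractor.extractor import Extractor

table_doc = """
<table>
<tr><td rowspan="2">1</td><td>2</td><td rowspan="3">3</td></tr>
<tr><td colspan="2">4</td></tr>
<tr><td colspan="3">5</td></tr>
</table>
"""

# Where spans collide, the cell that claimed the position first is kept
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will return: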

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
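
The outputs of Examples 4 and 5 are consistent with a simple grid-filling rule: each cell is copied into every position its rowspan and colspan cover, and a position that is already taken is never overwritten. The sketch below illustrates that rule in plain Python. It is not the library's implementation; the function name `expand_spans` and the `(text, rowspan, colspan)` input format are assumptions made for illustration.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}  # maps (row, col) to the text occupying that position
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already claimed by a rowspan from an earlier row
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # First writer wins: occupied positions are left untouched
                    grid.setdefault((r + dr, c + dc), text)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions no cell reached are filled with None
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


complex_rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
conflicted_rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
complex_grid = expand_spans(complex_rows)
conflicted_grid = expand_spans(conflicted_rows)
print(complex_grid)
print(conflicted_grid)
```

In the conflicted case, 3 claims the whole third column first, so 4 and 5 fill only the positions that are still free.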

Example 6 - Write to file

1 2
3 4

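from html_table_extractor.extractor import Extractor

table_doc = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
</table>
"""

extractor = Extractor(table_doc).parse()

# path is the directory in which the file is created
extractor.write_to_csv(path='.')

This creates a new CSV file called output.csv in the given path: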

1,2

3,4
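
When a different file name or delimiter is needed, the list from return_list() can also be written with Python's standard csv module. This is a minimal sketch that does not reflect how write_to_csv is implemented; the helper name `rows_to_csv` is hypothetical, and the rows are hard-coded in the shape Example 1 returns.

```python
import csv
import io


def rows_to_csv(rows, stream, delimiter=","):
    """Write a list of row lists to any text stream as CSV."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


rows = [["1", "2"], ["3", "4"]]  # same shape as the list returned in Example 1

# An in-memory buffer keeps the sketch free of side effects
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(filename, "w", newline="")` as the stream.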

Team

Errors/Bugs

If something is not working correctly, or if you have any suggestions for improvement, please report it on the project's issue tracker.

Copyright

Copyright (c) 2017 Justin Li. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Misc

How to upload the package to PyPI (for the owner's reference):

python setup.py bdist_wheel --universal

twine upload dist/* --verbose
