Analysing the LexNLP package for efficiency in NER extraction in legal contracts

Mangs

Mangs · Published 2022-09-07 07:22:20

Over my summer break, I interned at a company named ContractKen, where I was tasked with analysing and determining the efficiency of the open-source package LexNLP.

LexNLP is a library for working with real, unstructured legal text, including contracts, plans, policies, procedures, and other material.

The LexNLP package has an extract module for extraction of legal entities from contracts.

I tested the package on the following entities:

Definitions
Money and currency usages
PII (Personally Identifiable Information)
Company names

Most legal documents are drafted in Word, whereas the LexNLP package mainly processes raw text, so the Word documents must first be converted to text files. For this task I used the docx2txt package.

pip install docx2txt

The docpaths variable holds the system paths of the Word documents to convert, and textpaths holds the paths of the corresponding output text files.
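A minimal sketch of this conversion step, assuming docpaths and textpaths are parallel lists of the same length. The function name convert_documents and the example paths are hypothetical; docx2txt.process() is the package's call that returns a document's text as a string. The extract parameter is only there so the loop can be tried without docx2txt installed.

```python
from pathlib import Path


def convert_documents(docpaths, textpaths, extract=None):
    """Write the text of each Word document to the text file at the same index."""
    if extract is None:
        # Imported lazily so the loop can be exercised without docx2txt installed.
        import docx2txt
        extract = docx2txt.process  # returns the document text as a string

    if len(docpaths) != len(textpaths):
        raise ValueError("docpaths and textpaths must have the same length")

    for docpath, textpath in zip(docpaths, textpaths):
        text = extract(docpath)
        Path(textpath).write_text(text, encoding="utf-8")


# Hypothetical paths; replace them with your own files.
# docpaths = ["contracts/agreement1.docx", "contracts/agreement2.docx"]
# textpaths = ["text/agreement1.txt", "text/agreement2.txt"]
# convert_documents(docpaths, textpaths)
```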

Then, to feed the text from the files to the package functions as strings, I wrote a simple getText() function, which I called while testing each entity.
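The original helper is not reproduced here, so the following is only a guess at what such a getText() function could look like: it reads one text file and returns its contents as a single string. The UTF-8 encoding is an assumption.

```python
def getText(textpath):
    """Return the full contents of a text file as one string."""
    # Assumes the converted files were written as UTF-8.
    with open(textpath, encoding="utf-8") as f:
        return f.read()
```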

Setup

LexNLP requires Python 3.6; I used version 3.6.0, which worked well.

After installing the required version of Python, we can use the pip package manager to install LexNLP:

pip install lexnlp

I then moved on to the testing phase. The contracts I tested the package on are available here: https://github.com/SomneelSaha2004/Internship/tree/main/data

Definitions

The code for this test iterates over the path of each text file and prints a list of all the definitions the package extracted.

The function returns a generator object (https://wiki.python.org/moin/Generators).

To get the output in a readable format, wrap the call in list(): print(list(lexnlp.extract.en.definitions.get_definitions(text)))

Enclosing the call in list() returns a Python list of strings containing the definition names.

Example: ["B Shares", "MAA", "Purchased Shares"]
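The loop described above can be sketched roughly as follows. The function name extract_definitions and the get_definitions parameter are hypothetical, added so the loop can be tried without LexNLP installed. By default it uses lexnlp.extract.en.definitions.get_definitions, the call quoted above.

```python
def extract_definitions(textpaths, get_definitions=None):
    """Map each text file path to the list of definitions found in it."""
    if get_definitions is None:
        # Imported lazily; requires the lexnlp package to be installed.
        import lexnlp.extract.en.definitions
        get_definitions = lexnlp.extract.en.definitions.get_definitions

    results = {}
    for textpath in textpaths:
        with open(textpath, encoding="utf-8") as f:
            text = f.read()
        # The extractor yields definitions lazily; list() consumes the generator.
        results[textpath] = list(get_definitions(text))
    return results


# Hypothetical usage:
# for path, definitions in extract_definitions(textpaths).items():
#     print(path, definitions)
```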

The package is highly accurate for this entity: it successfully extracted 90% of the definitions present in the test contracts.

Looking at the source code, I observed that an actual NER algorithm is used for extraction.
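The post does not spell out how a figure such as 90% is computed. One plausible reading is recall against a hand-labelled list of the entities present in a contract. The sketch below assumes that reading. The function names and the "Closing Date" label are illustrative only, not taken from the test data.

```python
def recall(extracted, expected):
    """Fraction of the hand-labelled entities that the package found."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("expected must contain at least one entity")
    return len(set(extracted) & expected_set) / len(expected_set)


def precision(extracted, expected):
    """Fraction of the extracted entities that are genuinely correct."""
    extracted_set = set(extracted)
    if not extracted_set:
        raise ValueError("extracted must contain at least one entity")
    return len(extracted_set & set(expected)) / len(extracted_set)


# Illustrative labels only.
expected = ["B Shares", "MAA", "Purchased Shares", "Closing Date"]
extracted = ["B Shares", "MAA", "Purchased Shares"]
print(recall(extracted, expected))     # 0.75
print(precision(extracted, expected))  # 1.0
```

Recall drops when entities are missed, while precision drops when the package overpredicts, so reporting both makes the two failure modes visible separately.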

Money and Currency Usages

For money and currency usages the model seems primitive: it overpredicts, treating things such as bullet-point numbers (e.g. "2.") as instances of money or currency usage.

On taking a closer look at the code, I realised that no NER algorithm is implemented; entities are extracted primarily with regex, which lowers the accuracy of the model.

Although regex is used, it does successfully catch all instances of money and currency usage; the only problem is overprediction. Important note: the model only classifies USD or $ occurrences.
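To illustrate why a purely regex-based extractor can overpredict in this way, here is a small self-contained sketch. Neither pattern is taken from LexNLP; both are simplified assumptions written for this example. The first matches any number, so it picks up the clause number and the day count. The second requires a $ or USD prefix.

```python
import re

# Illustrative patterns only; these are NOT the expressions LexNLP uses.
NAIVE_AMOUNT = re.compile(r"\d[\d,]*(?:\.\d+)?")
CURRENCY_AMOUNT = re.compile(r"(?:\$|USD\s?)\d[\d,]*(?:\.\d+)?")

text = "2. The Buyer shall pay $5,000.00 to the Seller within 30 days."

# Any number matches, including the bullet number "2" and the "30" in "30 days".
print(NAIVE_AMOUNT.findall(text))     # ['2', '5,000.00', '30']

# Requiring a currency marker leaves only the actual amount.
print(CURRENCY_AMOUNT.findall(text))  # ['$5,000.00']
```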

PII (Personally Identifiable Information)

The reliance on regex for extraction persists in this module as well, and again results in low accuracy. Although PII covers a wide variety of information, the module only contains methods for US Social Security numbers and phone numbers.

It is proficient at extracting phone numbers, with an accuracy of 85%, and somewhat accurate for Social Security numbers, with an accuracy of 60%.

Company names

The module for company names is probably one of the most important, as identifying which companies are involved in a legal contract is imperative. However, the code sample for this entity does not work, as there is an error in this module. Running it throws the following error:

AttributeError: module 'lexnlp.extract.en.entities.nltk_re' has no attribute 'get_entities'

This issue has been raised on their GitHub: https://github.com/LexPredict/lexpredict-lexnlp/issues/55

Conclusion

The LexNLP package is efficient and accurate only in certain areas. To anyone wishing to test or use it, I recommend first looking at the code for the relevant entity to see whether basic regex or an NER algorithm is being used.

For further details on my findings, see the code and documentation for my internship on GitHub: https://github.com/SomneelSaha2004/Internship

To see the documentation for the package (LexNLP 2.2.1.0): https://lexpredict-lexnlp.readthedocs.io/en/latest/index.html

Thank you for reading
