NLP: Entity Tagging

Code Converter

Entity recognition is a natural language processing (NLP) technique used to identify and classify key information (entities) in text such as names, dates, locations, and more.

The client for this project is a leader in intelligent file management solutions. The use case was to filter unstructured data in PDF documents in order to retrieve and censor personally identifiable information (PII) within the client's own software.

Business Problem

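The company's software used only spaCy for pattern matching, and spaCy's pattern syntax is less widely known than regular expressions written for Python's re module. This created the need for an internal tool that lets people keep writing familiar re patterns, which helps reduce training costs.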

Standard regex (e.g., Python's re module) performs raw pattern matching on character strings using metacharacters and quantifiers. spaCy's rule-based matching, which can also apply regular expressions to individual tokens, is built into the NLP pipeline and allows more structured, token-based matching. Because spaCy operates on tokenized text rather than raw strings, a pattern can refer to token boundaries and token attributes instead of character positions.
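The difference can be sketched with the standard library alone. In the example below, the token list is written by hand as a stand-in for a tokenizer's output, and `find_token_spans` is a hypothetical helper, not part of spaCy. It only illustrates the idea of matching one small pattern per token instead of one pattern over the raw string.

```python
import re

text = "Contact Mr. John Doe before Friday."

# Raw-string matching: a single regex runs over the whole character sequence.
raw_pattern = re.compile(r"Mr\. [A-Z][a-z]+ [A-Z][a-z]+")
raw_match = raw_pattern.search(text)

# Token-level matching: one small regex per token.
# Assumption: this hand-written list stands in for real tokenizer output.
tokens = ["Contact", "Mr.", "John", "Doe", "before", "Friday", "."]
token_patterns = [r"Mr\.", r"[A-Z][a-z]+", r"[A-Z][a-z]+"]


def find_token_spans(tokens, token_patterns):
    """Return (start, end) token index pairs where each pattern fully matches its token."""
    compiled = [re.compile(p) for p in token_patterns]
    width = len(compiled)
    spans = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            spans.append((start, start + width))
    return spans


spans = find_token_spans(tokens, token_patterns)
matched_text = raw_match.group(0) if raw_match else None
```

The raw regex reports a match in terms of characters, while the token-level search reports it in terms of token positions, which is the form the rest of an NLP pipeline usually works with.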

The Tool

The internal converter tool therefore sits between document ingestion and the rest of the classification algorithms.

The tool needs to be dynamic because the patterns to look for are always changing (for example, a pattern for names such as "Mr. John Doe"). The converter therefore accepts a pattern written for re and returns the equivalent pattern for spaCy.
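A minimal sketch of such a converter follows. It is an illustration under stated assumptions, not the tool built for the client: `re_to_spacy_pattern` is a hypothetical name, and the sketch assumes that every space-separated piece of the input regex describes exactly one token. The output uses the token-pattern shape that spaCy's Matcher accepts, a list of dictionaries with a `REGEX` operator on the `TEXT` attribute.

```python
import re


def re_to_spacy_pattern(regex):
    """Convert a space-delimited regex into a spaCy-Matcher-style token pattern.

    Assumptions (hypothetical simplifications for this sketch):
    - a literal space in the regex always separates two tokens;
    - character classes and groups never contain a space themselves.
    """
    pattern = []
    for piece in regex.split(" "):
        if not piece:
            raise ValueError("empty token pattern (doubled or trailing space?)")
        re.compile(piece)  # raises re.error if the piece is not a valid regex
        # Anchor each piece so it has to cover the whole token text.
        pattern.append({"TEXT": {"REGEX": f"^(?:{piece})$"}})
    return pattern


# Example: a title followed by two capitalized words, e.g. "Mr. John Doe".
pattern = re_to_spacy_pattern(r"Mr\. [A-Z][a-z]+ [A-Z][a-z]+")
```

With spaCy v3 installed, a list like this could be registered through `Matcher.add("HONORIFIC_NAME", [pattern])`. Whether "Mr." reaches the matcher as a single token depends on the tokenizer, so converted patterns should be checked against real documents.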