Use zero-shot large language models on Amazon Bedrock for custom named entity recognition
AWS Machine Learning Blog
Named entity recognition (NER) is the process of extracting information of interest, called entities, from structured or unstructured text. Manually identifying all mentions of specific types of information in documents is extremely time-consuming and labor-intensive. Some examples include extracting players and positions in an NFL game summary, products mentioned in an AWS keynote transcript, or key names from an article on a favorite tech company. This process must be repeated for every new document and entity type, making it impractical for processing large volumes of documents at scale. With more access to vast amounts of reports, books, articles, journals, and research papers than ever before, swiftly identifying desired information in large bodies of text is becoming invaluable.
Traditional neural network models for NER, like RNNs and LSTMs, and more modern transformer-based models like BERT, require costly fine-tuning on labeled data for every custom entity type. This makes adopting and scaling these approaches burdensome for many applications. However, new capabilities of large language models (LLMs) enable high-accuracy NER across diverse entity types without the need for entity-specific fine-tuning. By using the model’s broad linguistic understanding, you can perform NER on the fly for any specified entity type. This capability is called zero-shot NER and enables the rapid deployment of NER across documents and many other use cases. This ability to extract specified entity mentions without costly tuning unlocks scalable entity extraction and downstream document understanding.
In this post, we cover the end-to-end process of using LLMs on Amazon Bedrock for the NER use case. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. In particular, we show how to use Amazon Textract to extract text from documents such as PDFs or image files, and use the extracted text along with user-defined custom entities as input to Amazon Bedrock to conduct zero-shot NER. We also touch on the usefulness of text truncation for prompts using Amazon Comprehend, along with the challenges, opportunities, and future work with LLMs and NER.
Solution overview
In this solution, we implement zero-shot NER with LLMs using the following key services:
Amazon Textract – Extracts textual information from the input document.
Amazon Comprehend (optional) – Identifies predefined entities such as names of people, dates, and numeric values. You can use this feature to limit the context over which the entities of interest are detected.
Amazon Bedrock – Calls an LLM to identify entities of interest from the given context.
The following diagram illustrates the solution architecture.
The main inputs are the document image and the target entities. The objective is to find the values of the target entities within the document. If the truncation path is chosen, the pipeline uses Amazon Comprehend to reduce the context. The output of the LLM is postprocessed to generate the output as entity-value pairs.
For example, if given the AWS Wikipedia page as the input document, and the target entities as AWS service names and geographic locations, then the desired output format would be as follows:
AWS service names:
Geographic locations:
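To make the final step concrete before walking through the modules, the following is a minimal sketch of what the Amazon Bedrock call could look like. It assumes the Anthropic Claude 3 Sonnet model with its Messages API request format, and that full_context holds the document text extracted in the next section; the prompt wording and the naive postprocessing are illustrative assumptions, not the exact prompt used in this post:
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# ask the model to extract the two target entity types from the document text
prompt = (
    "Extract all mentions of the following entity types from the text below.\n"
    "Entity types: AWS service names, geographic locations\n"
    "Format each line as '<entity type>: <comma-separated values>'.\n\n"
    f"Text: {full_context}"
)

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model choice
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
completion = json.loads(response["body"].read())["content"][0]["text"]

# naive postprocessing of the completion into entity-value pairs
entity_values = {}
for line in completion.splitlines():
    if ":" in line:
        entity, values = line.split(":", 1)
        entity_values[entity.strip()] = [v.strip() for v in values.split(",") if v.strip()]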
In the following sections, we describe the three main modules to accomplish this task. For this post, we used Amazon SageMaker notebooks with ml.t3.medium instances along with Amazon Textract, Amazon Comprehend, and Amazon Bedrock.
Extract context
Context is the text taken from the document in which the values of the queried entities are found. When consuming a full document (full context), the context significantly increases the input token count to the LLM. We provide the option of using either the entire document or the local context around relevant parts of the document, as defined by the user.
First, we extract context from the entire document using Amazon Textract. The code below uses the amazon-textract-caller and amazon-textract-prettyprinter libraries as wrappers for the Amazon Textract API calls and response parsing. You need to install the libraries first:
python -m pip install amazon-textract-caller amazon-textract-prettyprinter
Then, for a single-page document such as a PNG or JPEG file, use the following code to extract the full context:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

document_name = "sample_data/synthetic_sample_data.png"

# call Textract with the LAYOUT feature enabled
layout_textract_json = call_textract(
    input_document=document_name,
    features=[Textract_Features.LAYOUT]
)

# extract the text for page 1 from the JSON response
full_context = get_text_from_layout_json(textract_json=layout_textract_json)[1]
Note that PDF input documents must be in an S3 bucket when using the call_textract function. For multi-page TIFF files, make sure to set force_async_api=True.
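For example, a multi-page PDF call could look like the following minimal sketch, where the S3 path is a placeholder and the full context is built by joining the per-page strings that get_text_from_layout_json returns:
# multi-page PDF in Amazon S3 (placeholder path); Textract runs asynchronously
layout_textract_json = call_textract(
    input_document="s3://amzn-s3-demo-bucket/sample_data/multi_page_doc.pdf",
    features=[Textract_Features.LAYOUT],
    force_async_api=True,
)

# join the per-page text into one string for the full context
full_context = "\n".join(
    get_text_from_layout_json(textract_json=layout_textract_json).values()
)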
Truncate context (optional)
When the user-defined custom entities to be extracted are sparse compared to the full context, we provide an option to identify relevant local context and then look for the custom entities within that local context. To do so, we use generic entity extraction with Amazon Comprehend. This assumes that the user-defined custom entity is a child of one of the default Amazon Comprehend entities, such as “name”, “location”, “date”, or “organization”. For example, “city” is a child of “location”. We extract the default generic entities through the AWS SDK for Python (Boto3) as follows:
import boto3
import pandas as pd

comprehend_client = boto3.client("comprehend")

# detect generic entities (person, location, date, organization, and so on)
generic_entities = comprehend_client.detect_entities(
    Text=full_context,
    LanguageCode="en"
)
df_entities = pd.DataFrame.from_dict(generic_entities["Entities"])
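Note that detect_entities caps the size of the input text (100 KB at the time of writing), so a long full context may need to be processed in chunks. The following is a minimal sketch; the 50,000-character chunk size is a conservative assumption, and entities straddling a chunk boundary may be missed:
all_entities = []
chunk_size = 50000  # assumption: stays comfortably under the API limit
for start in range(0, len(full_context), chunk_size):
    chunk = full_context[start:start + chunk_size]
    resp = comprehend_client.detect_entities(Text=chunk, LanguageCode="en")
    for entity in resp["Entities"]:
        # shift offsets so they index into full_context rather than the chunk
        entity["BeginOffset"] += start
        entity["EndOffset"] += start
        all_entities.append(entity)
df_entities = pd.DataFrame(all_entities)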
It outputs a list of dictionaries containing the entity as “Type”, the value as “Text”, along with other information such as “Score”, “BeginOffset”, and “EndOffset”. For more details, see DetectEntities. The following is an example output of Amazon Comprehend entity extraction, which provides the extracted generic entity-value pairs and location of the value within the text.
{
    "Entities": [
        {
            "Text": "AWS",
            "Score": 0.98,
            "Type": "ORGANIZATION",
            "BeginOffset": 21,
            "EndOffset": 24
        },
        {
            "Text": "US East",
            "Score": 0.97,
            "Type": "LOCATION",
            "BeginOffset": 1100,
            "EndOffset": 1107
        }
    ],
    "LanguageCode": "en"
}
The extracted list of generic entities may be more exhaustive than the queried entities, so a filtering step is necessary. For example, the queried entity might be “AWS revenue” while the generic entities contain “quantity”, “location”, “person”, and so on. To retain only the relevant generic entities, we define the mapping and apply the filter as follows:
query_entities = ['XX']
user_defined_map = {'XX': 'QUANTITY', 'YY': 'PERSON'}
entities_to_keep = [v for k, v in user_defined_map.items() if k in query_entities]
df_filtered = df_entities.loc[df_entities['Type'].isin(entities_to_keep)]
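For instance, with the “AWS revenue” example above, the mapping could be instantiated as follows (the values are illustrative):
# illustrative instance of the placeholder mapping above
query_entities = ['AWS revenue']
user_defined_map = {'AWS revenue': 'QUANTITY'}
# entities_to_keep evaluates to ['QUANTITY'], so only QUANTITY rows survive the filter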
After we identify a subset of generic entity-value pairs, we want to preserve the local context around each pair and mask out everything else. We do this by applying a buffer to “BeginOffset” and “EndOffset” to add extra context around the offsets identified by Amazon Comprehend:
# characters of extra context to keep before and after each detected value
StrBuff, EndBuff = 20, 10

df_offsets = df_filtered.apply(
    lambda row: pd.Series({
        'BeginOffset': max(0, row['BeginOffset'] - StrBuff),
        'EndOffset': min(row['EndOffset'] + EndBuff, len(full_context))
    }),
    axis=1
).reset_index(drop=True)
We also merge any overlapping offsets to avoid duplicating context:
# sort spans by position, then fold each span into the previous one when they overlap
df_offsets = df_offsets.sort_values('BeginOffset').reset_index(drop=True)
merged_offsets = []
for _, row in df_offsets.iterrows():
    if merged_offsets and row['BeginOffset'] <= merged_offsets[-1]['EndOffset']:
        merged_offsets[-1]['EndOffset'] = max(merged_offsets[-1]['EndOffset'], row['EndOffset'])
    else:
        merged_offsets.append({'BeginOffset': row['BeginOffset'], 'EndOffset': row['EndOffset']})
df_offsets = pd.DataFrame(merged_offsets)
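Finally, the retained spans can be stitched together into the truncated context that is passed to the LLM in place of the full document. The following is a minimal sketch; the “ ... ” separator used to mark the masked-out gaps is an assumption:
# concatenate the retained local contexts, marking the masked-out gaps
truncated_context = " ... ".join(
    full_context[row['BeginOffset']:row['EndOffset']]
    for _, row in df_offsets.iterrows()
)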