Inference Llama 2 models with real-time response streaming using Amazon SageMaker
AWS Machine Learning Blog
With the rapid adoption of generative AI applications, there is a need for these applications to respond in time to reduce the perceived latency with higher throughput. Foundation models (FMs) are often pre-trained on vast corpora of data with parameters ranging in scale of millions to billions and beyond. Large language models (LLMs) are a type of FM that generate text as a response of the user inference. Inferencing these models with varying configurations of inference parameters may lead to inconsistent latencies. The inconsistency could be because of the varying number of response tokens you are expecting from the model or the type of accelerator the model is deployed on.
In either case, rather than waiting for the full response, you can adopt the approach of response streaming for your inferences, which sends back chunks of information as soon as they are generated. This creates an interactive experience by allowing you to see partial responses streamed in real time instead of a delayed full response.
With the official announcement that Amazon SageMaker real-time inference now supports response streaming, you can now continuously stream inference responses back to the client when using Amazon SageMaker real-time inference with response streaming. This solution will help you build interactive experiences for various generative AI applications such as chatbots, virtual assistants, and music generators. This post shows you how to realize faster response times in the form of Time to First Byte (TTFB) and reduce the overall perceived latency while inferencing Llama 2 models.
To implement the solution, we use SageMaker, a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. For more information about the various deployment options SageMaker provides, refer to Amazon SageMaker Model Hosting FAQs. Let’s understand how we can address the latency issues using real-time inference with response streaming.
Solution overview
Because we want to address the aforementioned latencies associated with real-time inference with LLMs, let’s first understand how we can use the response streaming support for real-time inferencing for Llama 2. However, any LLM can take advantage of response streaming support with real-time inferencing.
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 models are autoregressive models with decoder only architecture. When provided with a prompt and inference parameters, Llama 2 models are capable of generating text responses. These models can be used for translation, summarization, question answering, and chat.
For this post, we deploy the Llama 2 Chat model meta-llama/Llama-2-13b-chat-hf on SageMaker for real-time inferencing with response streaming.
When it comes to deploying models on SageMaker endpoints, you can containerize the models using specialized AWS Deep Learning Container (DLC) images available for popular open source libraries. Llama 2 models are text generation models; you can use either the Hugging Face LLM inference containers on SageMaker powered by Hugging Face Text Generation Inference (TGI) or AWS DLCs for Large Model Inference (LMI).
In this post, we deploy the Llama 2 13B Chat model using DLCs on SageMaker Hosting for real-time inference powered by G5 instances. G5 instances are a high-performance GPU-based instances for graphics-intensive applications and ML inference. You can also use supported instance types p4d, p3, g5, and g4dn with appropriate changes as per the instance configuration.
Prerequisites
To implement this solution, you should have the following:
An AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
A Hugging Face account. Sign up with your email if you don’t already have account.
For seamless access of the models available on Hugging Face, especially gated models such as Llama, for fine-tuning and inferencing purposes, you should have a Hugging Face account to obtain a read access token. After you sign up for your Hugging Face account, log in to visit https://huggingface.co/settings/tokens to create a read access token.
Access to Llama 2, using the same email ID that you used to sign up for Hugging Face.
The Llama 2 models available via Hugging Face are gated models. The use of the Llama model is governed by the Meta license. To download the model weights and tokenizer, request access to Llama and accept their license.
After you’re granted access (typically in a couple of days), you will receive an email confirmation. For this example, we use the model Llama-2-13b-chat-hf, but you should be able to access other variants as well.
Approach 1: Hugging Face TGI
In this section, we show you how to deploy the meta-llama/Llama-2-13b-chat-hf model to a SageMaker real-time endpoint with response streaming using Hugging Face TGI. The following table outlines the specifications for this deployment.
Specification
Value
Container
Hugging Face TGI
Model Name
meta-llama/Llama-2-13b-chat-hf
ML Instance
ml.g5.12xlarge
Inference
Real-time with response streaming
Deploy the model
First, you retrieve the base image for the LLM to be deployed. You then build the model on the base image. Finally, you deploy the model to the ML instance for SageMaker Hosting for real-time inference.
Let’s observe how to achieve the deployment programmatically. For brevity, only the code that helps with the deployment steps is discussed in this section. The full source code for deployment is available in the notebook llama-2-hf-tgi/llama-2-13b-chat-hf/1-deploy-llama-2-13b-chat-hf-tgi-sagemaker.ipynb.
Retrieve the latest Hugging Face LLM DLC powered by TGI via pre-built SageMaker DLCs. You use this image to deploy the meta-llama/Llama-2-13b-chat-hf model on SageMaker. See the following code:
from sagemaker.huggingface import get_huggingface_llm_image_uri
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
“huggingface”,
version=”1.0.3″
)
Define the environment for the model with the configuration parameters defined as follows:
instance_type = “ml.g5.12xlarge”
number_of_gpu = 4
config = {
‘HF_MODEL_ID’: “meta-llama/Llama-2-13b-chat-hf”, # model_id from hf.co/models
‘SM_NUM_GPUS’: json.dumps(number_of_gpu), # Number of GPU used per replica
‘MAX_INPUT_LENGTH’: json.dumps(2048), # Max length of input text
‘MAX_TOTAL_TOKENS’: json.dumps(4096), # Max length of the generation (including input text)
‘MAX_BATCH_TOTAL_TOKENS’: json.dumps(8192), # Limits the number of tokens that can be processed in parallel during the generation
‘HUGGING_FACE_HUB_TOKEN’: “”
}
Replace for the config parameter HUGGING_FACE_HUB_TOKEN with the value of the token obtained from your Hugging Face profile as detailed in the prerequisites section of this post. In the configuration, you define the number of GPUs used per replica of a model as 4 for SM_NUM_GPUS. Then you can deploy the meta-llama/Llama-2-13b-chat-hf model on an ml.g5.12xlarge instance that comes with 4 GPUs.
Now you can build the instance of HuggingFaceModel with the aforementioned environment configuration:
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
env=config
)
Finally, deploy the model by providing arguments to the deploy method available on the model with various parameter values such as endpoint_name, initial_instance_count, and instance_type:
llm = llm_model.deploy(
endpoint_name=endpoint_name,
initial_instance_count=1,
instance_type=instance_type,
container_startup_health_check_timeout=health_check_timeout,
)
Perform inference
The Hugging Face TGI DLC comes with the ability to stream responses without any customizations or code changes to the model. You can use invoke_endpoint_with_response_stream if you are using Boto3 or InvokeEndpointWithResponseStream when programming with the SageMaker Python SDK.
The InvokeEndpointWithResponseStream API of SageMaker allows developers to stream responses back from SageMaker models, which can help improve customer satisfaction by reducing the perceived latency. This is especially important for applications built with generative AI models, where immediate processing is more important than waiting for the entire response.
For this example, we use Boto3 to infer the model and use the SageMaker API invoke_endpoint_with_response_stream as follows:
def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
EndpointName=endpoint_name,
Body=json.dumps(payload),
ContentType=”application/json”,
CustomAttributes=’accept_eula=false’
)
return response_stream
The argument CustomAttributes is set to the value accept_eula=false. The accept_eula parameter must be set to true to successfully obtain the response from the Llama 2 models. After the successful invocation using invoke_endpoint_with_response_stream, the method will return a response stream of bytes.
The following diagram illustrates this workflow.
You need an iterator that loops over the stream of bytes and parses them to readable text. The LineIterator implementation can be found at llama-2-hf-tgi/llama-2-13b-chat-hf/utils/LineIterator.py. Now you’re ready to prepare the prompt and instructions to use them as a payload while inferencing the model.
Prepare a prompt and instructions
In this step, you prepare the prompt and instructions for your LLM. To prompt Llama 2, you should have the following prompt template:
[INST]
{{ system_prompt }}
{{ user_message }} [/INST]
You build the prompt template programmatically defined in the method build_llama2_prompt, which aligns with the aforementioned prompt template. You then define the instructions as per the use case. In this case, we’re instructing the model to generate an email for a marketing campaign as covered in the get_instructions method. The code for these methods is in the llama-2-hf-tgi/llama-2-13b-chat-hf/2-sagemaker-realtime-inference-llama-2-13b-chat-hf-tgi-streaming-response.ipynb notebook. Build the instruction combined with the task to be performed as detailed in user_ask_1 as follows:
user_ask_1 = f”’
AnyCompany recently announced new service launch named AnyCloud Internet Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is alice.smith@example.com
Mention the Coupon Code: EARLYB1RD to get 20% for 1st 3 months.
”’
instructions = get_instructions(user_ask_1)
prompt = build_llama2_prompt(instructions)
We pass the instructions to build the prompt as per the prompt template generated by build_llama2_prompt.
inference_params = {
“do_sample”: True,
“top_p”: 0.6,
“temperature”: 0.9,
“top_k”: 50,
“max_new_tokens”: 512,
“repetition_penalty”: 1.03,
“stop”: [“”],
“return_full_text”: False
}
payload = {
“inputs”: prompt,
“parameters”: inference_params,
“stream”: True ##
Go to Source
09/01/2024 – 21:06 /Pavan Kumar Rao Navule
Twitter: @hoffeldtcom
About Admin
As an experienced Human Resources leader, I bring a wealth of expertise in corporate HR, talent management, consulting, and business partnering, spanning diverse industries such as retail, media, marketing, PR, graphic design, NGO, law, assurance, consulting, tax services, investment, medical, app/fintech, and tech/programming. I have primarily worked with service and sales companies at local, regional, and global levels, both in Europe and the Asia-Pacific region. My strengths lie in operations, development, strategy, and growth, and I have a proven track record of tailoring HR solutions to meet unique organizational needs. Whether it's overseeing daily HR tasks or crafting and implementing new processes for organizational efficiency and development, I am skilled in creating innovative human capital management programs and impactful company-wide strategic solutions. I am deeply committed to putting people first and using data-driven insights to drive business value. I believe that building modern and inclusive organizations requires a focus on talent development and daily operations, as well as delivering results. My passion for HRM is driven by a strong sense of empathy, integrity, honesty, humility, and courage, which have enabled me to build and maintain positive relationships with employees at all levels.