AWS Machine Learning Blog

Today, we’re pleased to announce the general availability (GA) of Amazon Bedrock Custom Model Import. This feature empowers customers to import and use their customized models alongside existing foundation models (FMs) through a single, unified API. Whether leveraging fine-tuned models like Meta Llama, Mistral Mixtral, and IBM Granite, or developing proprietary models based on popular open-source architectures, customers can now bring their custom models into Amazon Bedrock without the overhead of managing infrastructure or model lifecycle tasks.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage infrastructure.
With Amazon Bedrock Custom Model Import, customers can access their imported custom models on demand in a serverless manner, freeing them from the complexities of deploying and scaling models themselves. They’re able to accelerate generative AI application development by using native Amazon Bedrock tools and features such as Knowledge Bases, Guardrails, Agents, and more—all through a unified and consistent developer experience.
Benefits of Amazon Bedrock Custom Model Import include:

Flexibility to use existing fine-tuned models: Customers can use their prior investments in model customization by importing existing customized models into Amazon Bedrock without the need to recreate or retrain them. This flexibility maximizes the value of previous efforts and accelerates application development.
Integration with Amazon Bedrock Features: Imported custom models can be seamlessly integrated with the native tools and features of Amazon Bedrock, such as Knowledge Bases, Guardrails, Agents, and Model Evaluation. This unified experience enables developers to use the same tooling and workflows across both base FMs and imported custom models.
Serverless: Customers can access their imported custom models in an on-demand and serverless manner. This eliminates the need to manage or scale underlying infrastructure, as Amazon Bedrock handles all those aspects. Customers can focus on developing generative AI applications without worrying about infrastructure management or scalability issues.
Support for popular model architectures: Amazon Bedrock Custom Model Import supports a variety of popular model architectures, including Meta Llama 3.2, Mistral 7B, Mixtral 8x7B, and more. Customers can import custom weights in formats like Hugging Face Safetensors from Amazon SageMaker and Amazon S3. This broad compatibility allows customers to work with models that best suit their specific needs and use cases, allowing for greater flexibility and choice in model selection.
Support for the Amazon Bedrock Converse API: Amazon Bedrock Custom Model Import allows customers to use their supported fine-tuned models with the Amazon Bedrock Converse API, which simplifies and unifies access to the models.

Getting started with Custom Model Import
One of the critical requirements from our customers is the ability to customize models with their proprietary data while retaining complete ownership and control over the tuned model artifact and its deployment. Customization could be in the form of domain adaptation or instruction fine-tuning. Customers have a wide range of options for fine-tuning models efficiently and cost-effectively. However, hosting models presents its own unique set of challenges. Customers are looking for several key capabilities, namely:

Using their existing customization investment with fine-grained control over customization.
Having a unified developer experience when accessing custom models or base models through Amazon Bedrock’s API.
Ease of deployment through a fully managed, serverless service.
Using pay-as-you-go inference to minimize the costs of their generative AI workloads.
Being backed by enterprise-grade security and privacy tooling.

The Amazon Bedrock Custom Model Import feature addresses these needs. To bring your custom model into the Amazon Bedrock ecosystem, you need to run an import job. The import job can be invoked using the AWS Management Console or through APIs. In this post, we demonstrate the code for running the import process through APIs. After the model is imported, you can invoke the model by using the model’s Amazon Resource Name (ARN).
As of this writing, supported model architectures include Meta Llama (v.2, 3, 3.1, and 3.2), Mistral 7B, Mixtral 8x7B, Flan, and IBM Granite models such as Granite 3B-Code, 8B-Code, 20B-Code, and 34B-Code.
A few points to be aware of when importing your model:

Models must be serialized in Safetensors format.
If you have a different format, you can potentially use Llama convert scripts or Mistral convert scripts to convert your model to a supported format.
The import process expects at least the following files: .safetensors, config.json, tokenizer_config.json, tokenizer.json, and tokenizer.model.
The supported precisions for the model weights are FP32, FP16, and BF16.
For fine-tuning jobs that create adapters such as LoRA-PEFT adapters, the import process expects the adapters to be merged into the main base model weights as described in Model merging; a minimal merge sketch follows this list.
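The following is a minimal sketch (not from the import documentation) of merging a LoRA adapter into its base model with the Hugging Face peft library before import; the base model ID and adapter path are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "meta-llama/Llama-3.2-3B-Instruct"   # placeholder base model
adapter_path = "path/to/lora-adapter"                # placeholder adapter location

base = AutoModelForCausalLM.from_pretrained(base_model_id)
peft_model = PeftModel.from_pretrained(base, adapter_path)

merged = peft_model.merge_and_unload()   # fold the LoRA weights into the base weights
merged.save_pretrained("merged-model")   # recent transformers versions save Safetensors by default
AutoTokenizer.from_pretrained(base_model_id).save_pretrained("merged-model")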

Importing a model using the Amazon Bedrock console

Go to the Amazon Bedrock console and choose Foundation models and then Imported models in the navigation pane on the left side to get to the Models page.
Choose Import model to configure the import process.
Configure the model.

Enter the location of your model weights. These can be in Amazon S3, or you can point to a SageMaker model ARN.
Enter a Job name. We recommend this be suffixed with the version of the model. As of now, you need to manage the generative AI operations aspects outside of this feature.
Configure your AWS Key Management Service (AWS KMS) key for encryption. By default, a key owned and managed by AWS is used.
Service access role: You can create a new role or use an existing role that has the necessary permissions to run the import process. The permissions must include access to your Amazon S3 bucket if you’re specifying model weights through S3.

After the Import Model job is complete, you will see the model and the model ARN. Make a note of the ARN to use later.
Test the model using the on-demand feature in the Text playground as you would for any base foundation model.

The import process validates that the model configuration complies with the specified architecture for that model by reading the config.json file and validates the model architecture values such as the maximum sequence length and other relevant details. It also checks that the model weights are in the Safetensors format. This validation verifies that the imported model meets the necessary requirements and is compatible with the system.
Fine-tuning a Meta Llama model on SageMaker
Meta Llama 3.2 offers multi-modal vision and lightweight models, representing Meta’s latest advances in large language models (LLMs). These new models provide enhanced capabilities and broader applicability across various use cases. With a focus on responsible innovation and system-level safety, the Llama 3.2 models demonstrate state-of-the-art performance on a wide range of industry benchmarks and introduce features to help you build a new generation of AI experiences.
SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the SageMaker Python SDK. This gives you multiple options to discover and use hundreds of models for your use case.
In this section, we’ll show you how to fine-tune the Llama 3.2 3B Instruct model using SageMaker JumpStart. We’ll also share the supported instance types and context for the Llama 3.2 models available in SageMaker JumpStart. Although not highlighted in this post, you can also find other Llama 3.2 Model variants that can be fine-tuned using SageMaker JumpStart.
Instruction fine-tuning
The text generation model can be instruction fine-tuned on any text data, provided that the data is in the expected format. The instruction fine-tuned model can be further deployed for inference. The training data must be formatted in a JSON Lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but can be saved in multiple JSON Lines files. The training folder can also contain a template.json file describing the input and output formats.
Synthetic dataset
For this use case, we’ll use a synthetically generated dataset named amazon10Ksynth.jsonl in an instruction-tuning format. This dataset contains approximately 200 entries designed for training and fine-tuning LLMs in the finance domain.
The following is an example of the data format:

instruction_sample = {
    "question": "What is Amazon's plan for expanding their physical store footprint and how will that impact their overall revenue?",
    "context": "The 10-K report mentions that Amazon is continuing to expand their physical store network, including 611 North America stores and 32 International stores as of the end of 2022. This physical store expansion is expected to contribute to increased product sales and overall revenue growth.",
    "answer": "Amazon is expanding their physical store footprint, with 611 North America stores and 32 International stores as of the end of 2022. This physical store expansion is expected to contribute to increased product sales and overall revenue growth."
}

print(instruction_sample)

Prompt template
Next, we create a prompt template for using the data in an instruction input format for the training job (because we are instruction fine-tuning the model in this example) and for inference against the deployed endpoint.

import json

prompt_template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}"
}

with open("prompt_template.json", "w") as f:
    json.dump(prompt_template, f)

After the prompt template is created, upload the prepared dataset that will be used for fine-tuning to Amazon S3.

from sagemaker.s3 import S3Uploader
import sagemaker

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "amazon10Ksynth.jsonl"
train_data_location = f"s3://{output_bucket}/amazon10Ksynth_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("prompt_template.json", train_data_location)
print(f"Training data: {train_data_location}")

Fine-tuning the Meta Llama 3.2 3B model
Now, we’ll fine-tune the Llama 3.2 3B model on the financial dataset. The fine-tuning scripts are based on the scripts provided by the Llama fine-tuning repository.

from sagemaker.jumpstart.estimator import JumpStartEstimator

# SageMaker JumpStart model ID and version for the Llama 3.2 3B model
# (ID assumed here; check the JumpStart model catalog for the exact identifier)
model_id = "meta-textgeneration-llama-3-2-3b"
model_version = "*"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
    instance_type="ml.g5.12xlarge",
)

# Set the hyperparameters for instruction tuning
estimator.set_hyperparameters(
    instruction_tuned="True", epoch="5", max_input_length="1024"
)

# Fit the model on the training data
estimator.fit({"training": train_data_location})

Importing a custom model from SageMaker to Amazon Bedrock
In this section, we use the Python SDK to create a model import job, get the imported model ID, and finally generate inferences. You can refer to the console screenshots in the earlier section for how to import a model using the Amazon Bedrock console.
Parameter and helper function set up
First, we’ll create a few helper functions and set up our parameters to create the import job. The import job is responsible for collecting and deploying the model from SageMaker to Amazon Bedrock. This is done by using the create_model_import_job function.
The stored Safetensors files need to be organized so that the Amazon S3 location points to the top-level folder containing them. The configuration files and Safetensors files will be stored as shown in the following figure.

import json
import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client('bedrock', region_name='us-east-1')

job_name = 'fine-tuned-model-import-demo'
sagemaker_model_name = 'meta-textgeneration-llama-3-2-3b-2024-10-12-23-29-57-373'
model_url = {
    's3DataSource': {
        's3Uri': "s3://sagemaker-{REGION}-{AWS_ACCOUNT}/meta-textgeneration-llama-3-2-3b-2024-10-12-23-19-53-906/output/model/"
    }
}
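With the parameters in place, a minimal sketch of submitting the import job could look like the following; the service role ARN is a placeholder and must grant access to the S3 location of the weights.

role_arn = "arn:aws:iam::{AWS_ACCOUNT}:role/BedrockCustomModelImportRole"  # placeholder role

response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=sagemaker_model_name,
    roleArn=role_arn,
    modelDataSource=model_url,
)
print(response["jobArn"])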

Check the status and get job ARN from the response:
After a few minutes, the model will be imported, and the status of the job can be checked using get_model_import_job. The job ARN is then used to get the imported model ARN, which we will use to generate inferences.

def get_import_model_from_job(job_name):
    response = bedrock.get_model_import_job(jobIdentifier=job_name)
    return response['importedModelArn']

job_arn = response['jobArn']
import_model_arn = get_import_model_from_job(job_arn)

Generating inferences using the imported custom model
The model can be invoked by using the invoke_model and converse APIs. The following is a support function that will be used to invoke and extract the generated text from the overall output.

from botocore.exceptions import ClientError

client = boto3.client('bedrock-runtime', region_name='us-east-1')

def generate_conversation_with_imported_model(native_request, model_id):
    request = json.dumps(native_request)
    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
        model_response = json.loads(response["body"].read())

        response_text = model_response["outputs"][0]["text"]
        print(response_text)
    except (ClientError, Exception) as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)
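Because imported models with a supported architecture also work with the Converse API (as noted earlier), a minimal sketch of the equivalent call looks like the following; the message structure follows the standard Converse request shape.

def converse_with_imported_model(text, model_id):
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
        inferenceConfig={"maxTokens": 100, "temperature": 0.01},
    )
    # The generated text is returned in the first content block of the output message
    return response["output"]["message"]["content"][0]["text"]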

Context set up and model response
Finally, we can use the custom model. First, we format our inquiry to match the fine-tuned prompt structure. This makes sure that the responses generated closely resemble the format used in the fine-tuning phase and are more aligned to our needs. To do this, we use the same template that we used to format the fine-tuning data. In a real application, the context would come from your RAG solution, such as Amazon Bedrock Knowledge Bases. For this example, we take a sample context and add it to demonstrate the concept:

# Load the prompt template created earlier (assumes prompt_template.json is in the working directory)
with open("prompt_template.json") as f:
    template = json.load(f)

input_output_demarkation_key = "\n\n### Response:\n"
question = "Tell me what was the improved inflow value of cash?"

context = "Amazon's free cash flow less principal repayments of finance leases and financing obligations improved to an inflow of $46.1 billion for the trailing twelve months, compared with an outflow of $10.1 billion for the trailing twelve months ended March 31, 2023."

payload = {
    "prompt": template["prompt"].format(
        question=question,                               # user query
        context=context + input_output_demarkation_key   # RAG context
    ),
    "max_tokens": 100,
    "temperature": 0.01
}
generate_conversation_with_imported_model(payload, import_model_arn)

The output will look similar to:

After the model has been fine-tuned and imported into Amazon Bedrock, you can experiment by sending different sets of input questions and context to the model to generate a response, as shown in the following example:

question: """How did Amazon's international segment operating income change
            in Q4 2022 compared to the prior year?"""
context: """Amazon's international segment reported an operating loss of
            $1.1 billion in Q4 2022, an improvement from a $1.7 billion
            operating loss in Q4 2021."""
response:

Some points to note
The examples in this post demonstrate Custom Model Import and aren’t designed to be used in production. Because the model has been trained on only 200 samples of synthetically generated data, it’s only useful for testing purposes. You would ideally have more diverse datasets and additional samples, with continuous experimentation conducted using hyperparameter tuning for your respective use case, thereby steering the model to create a more desirable output. For this post, ensure that the model temperature parameter is set to 0 and the max_tokens runtime parameter is set to a lower value, such as 100–150 tokens, so that a succinct response is generated. You can experiment with other parameters to generate a desirable outcome. See Amazon Bedrock Recipes and GitHub for more examples.
Best practices to consider:
This feature brings significant advantages for hosting your fine-tuned models efficiently. As we continue to develop this feature to meet our customers’ needs, there are a few points to be aware of:

Define your test suite and acceptance metrics before starting the journey. Automating this will help to save time and effort.
Currently, the model weights need to be all-inclusive, including the adapter weights. There are multiple methods for merging the models and we recommend experimenting to determine the right methodology. The Custom Model Import feature lets you test your model on demand.
When creating your import jobs, add versioning to the job name to help quickly track your models. Currently, we’re not offering model versioning, and each import is a unique job and creates a unique model.
The precision supported for the model weights is FP32, FP16, and BF16. Run tests to validate that these will work for your use case.
The maximum concurrency that you can expect for each model will be 16 per account. Higher concurrency requests will cause the service to scale and increase the number of model copies.
The number of model copies active at any point in time will be available through Amazon CloudWatch. See Import a customized model to Amazon Bedrock for more information.
As of writing this post, we are releasing this feature in the US-EAST-1 and US-WEST-2 AWS Regions only. We will continue to release to other Regions. Follow Model support by AWS Region for updates.
The default import quota for each account is three models. If you need more for your use cases, work with your account teams to increase your account quota.
The default throttling limits for this feature for each account will be 100 invocations per second.
You can use this sample notebook to performance test models imported through this feature. The notebook is a reference only and isn’t designed to be exhaustive. We always recommend that you run your own full performance testing, along with end-to-end testing that includes functional and evaluation testing.

Now available
Amazon Bedrock Custom Model Import is generally available today in Amazon Bedrock in the US-East-1 (N. Virginia) and US-West-2 (Oregon) AWS Regions. See the full Region list for future updates. To learn more, see the Custom Model Import product page and pricing page.
Give Custom Model Import a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

About the authors
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon SageMaker Training and Processing. In his spare time, Paras enjoys spending time with his family and road biking around the Bay Area.
Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he functions as the Lead Architect, helping partners ideate, build, and launch Partner Solutions. As an Information Technology Leader, Jay specializes in artificial intelligence, generative AI, data integration, business intelligence, and user interface domains. He holds 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.
Shikhar Kwatra is a Sr. Partner Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports GSI partners in building strategic industry solutions on AWS.
Claudio Mazzoni is a Sr GenAI Specialist Solutions Architect at AWS, working on world-class applications and guiding customers through their implementation of generative AI to reach their goals and improve their business outcomes. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers leverage GenAI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a Ph.D. degree in Electrical Engineering. Outside of work, she loves traveling, working out and exploring new things.
Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

AWS Machine Learning Blog

Generative AI adoption among various industries is revolutionizing different types of applications, including image editing. Image editing is used in various sectors, such as graphic designing, marketing, and social media. Users rely on specialized tools for editing images. Building a custom solution for this task can be complex. However, by using various AWS services, you can quickly deploy a serverless solution to edit images. This approach can give your teams access to image editing foundation models (FMs) using Amazon Bedrock.
Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that’s best suited for your use case. Amazon Bedrock is serverless, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage infrastructure.
Amazon Titan Image Generator G1 is an AI FM available with Amazon Bedrock that allows you to generate an image from text, or upload and edit your own image. Some of the key features we focus on include inpainting and outpainting.
This post introduces a solution that simplifies the deployment of a web application for image editing using AWS serverless services. We use AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon Bedrock with the Amazon Titan Image Generator G1 model to build an application to edit images using prompts. We cover the inner workings of the solution to help you understand the function of each service and how they are connected to give you a complete solution. At the time of writing this post, Amazon Titan Image Generator G1 comes in two versions; for this post, we use version 2.
Solution overview
The following diagram provides an overview and highlights the key components. The architecture uses Amazon Cognito for user authentication and Amplify as the hosting environment for our frontend application. A combination of API Gateway and a Lambda function is used for our backend services, and Amazon Bedrock provides the FM, enabling users to edit images using prompts.

Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Amazon Titan Image Generator G1 v2 in the same AWS Region where you will deploy this solution
The accompanying AWS CloudFormation template downloaded from the aws-samples GitHub repo.

Deploy solution resources using AWS CloudFormation
When you run the AWS CloudFormation template, the following resources are deployed:

Amazon Cognito resources:

User pool: CognitoUserPoolforImageEditApp
App client: ImageEditApp

Lambda resources:

Function: -ImageEditBackend-

AWS Identity and Access Management (IAM) resources:

IAM role: -ImageEditBackendRole-
IAM inline policy: AmazonBedrockAccess (this policy allows Lambda to invoke Amazon Bedrock FM amazon.titan-image-generator-v2:0)

API Gateway resources:

Rest API: ImageEditingAppBackendAPI
Methods:

OPTIONS – Added header mapping for CORS
POST – Lambda integration

Authorization: Through Amazon Cognito using CognitoAuthorizer

After you deploy the CloudFormation template, copy the following from the Outputs tab to be used during the deployment of Amplify:

userPoolId
userPoolClientId
invokeUrl

Deploy the Amplify application
You have to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

Download the frontend code from the GitHub repo.
Unzip the downloaded file and navigate to the folder.
In the js folder, find the config.js file and replace the values of XYZ for userPoolId, userPoolClientId, and invokeUrl with the values you collected from the CloudFormation stack outputs. Set the region value based on the Region where you’re deploying the solution.

The following is an example config.js file:

window._config = {
    cognito: {
        userPoolId: 'XYZ', // e.g. us-west-2_uXboG5pAb
        userPoolClientId: 'XYZ', // e.g. 25ddkmj4v6hfsfvruhpfi7n4hv
        region: 'XYZ' // e.g. us-west-2
    },
    api: {
        invokeUrl: 'XYZ' // e.g. https://rc7nyt4tql.execute-api.us-west-2.amazonaws.com/prod
    }
};

Select all the files and compress them as shown in the following screenshot.

Make sure you zip the contents and not the top-level folder. For example, if your build output generates a folder named AWS-Amplify-Code, navigate into that folder and select all the contents, and then zip the contents.

Use the new .zip file to manually deploy the application in Amplify.

After it’s deployed, you will receive a domain that you can use in later steps to access the application.

Create a test user in the Amazon Cognito user pool.

An email address is required for this user because you will need to mark the email address as verified.

Return to the Amplify page and use the domain it automatically generated to access the application.

Use Amazon Cognito for user authentication
Amazon Cognito is an identity platform that you can use to authenticate and authorize users. We use Amazon Cognito in our solution to verify the user before they can use the image editing application.
Upon accessing the Image Editing Tool URL, you will be prompted to sign in with a previously created test user. For first-time sign-ins, users will be asked to update their password. After this process, the user’s credentials are validated against the records stored in the user pool. If the credentials match, Amazon Cognito will issue a JSON Web Token (JWT). In the API payload to be sent section of the page, you will notice that the Authorization field has been updated with the newly issued JWT.
Use Lambda for backend code and Amazon Bedrock for generative AI function
The backend code is hosted on Lambda and is invoked by user requests routed through API Gateway. The Lambda function processes the request payload and forwards it to Amazon Bedrock. The reply from Amazon Bedrock follows the same route as the initial request.
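The following is a minimal sketch of what such a Lambda handler could look like for the inpainting path; the incoming payload field names (prompt, base_image, mask) are assumptions for illustration, and the request shape follows the Amazon Titan Image Generator inpainting format.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    body = json.loads(event["body"])          # payload forwarded by API Gateway (field names assumed)
    request = {
        "taskType": "INPAINTING",             # outpainting would use "OUTPAINTING" with outPaintingParams
        "inPaintingParams": {
            "text": body["prompt"],           # the editing prompt
            "image": body["base_image"],      # base64-encoded source image
            "maskImage": body["mask"],        # base64-encoded mask image
        },
        "imageGenerationConfig": {"numberOfImages": 2},
    }
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-image-generator-v2:0",
        body=json.dumps(request),
    )
    images = json.loads(response["body"].read())["images"]   # list of base64-encoded results
    return {"statusCode": 200, "body": json.dumps({"images": images})}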
Use API Gateway for API management
API Gateway streamlines API management, allowing developers to deploy, maintain, monitor, secure, and scale their APIs effortlessly. In our use case, API Gateway serves as the orchestrator for the application logic and provides throttling to manage the load to the backend. Without API Gateway, you would need to use the JavaScript SDK in the frontend to interact directly with the Amazon Bedrock API, bringing more work to the frontend.
Use Amplify for frontend code
Amplify offers a development environment for building secure, scalable mobile and web applications. It allows developers to focus on their code rather than worrying about the underlying infrastructure. Amplify also integrates with many Git providers. For this solution, we manually upload our frontend code using the method outlined earlier in this post.
Image editing tool walkthrough
Navigate to the URL provided after you created the application in Amplify and sign in. On your first login attempt, you’ll be asked to reset your password.

As you follow the steps for this tool, you will notice the API Payload to be Sent section on the right side updating dynamically, reflecting the details mentioned in the corresponding steps that follow.
Step 1: Create a mask on your image
To create a mask on your image, choose a file (JPEG, JPG, or PNG).
After the image is loaded, the frontend converts the file into base64, and the base_image value is updated.
As you select the portion of the image you want to edit, a mask will be created, and the mask value is updated with a new base64 value. You can also use the stroke size option to adjust the area you are selecting.
You now have the original image and the mask image encoded in base64. (The Amazon Titan Image Generator G1 model requires the inputs to be in base64 encoding.)

Step 2: Write a prompt and set your options
Write a prompt that describes what you want to do with the image. For this example, we enter Make the driveway clear and empty. This is reflected in the prompt on the right.
You can choose from the following image editing options: inpainting and outpainting. The value for mode is updated depending on your selection.

Use inpainting to remove masked elements and replace them with background pixels
Use outpainting to extend the pixels of the masked image to the image boundaries

Choose Send to API to send the payload to API Gateway. This action invokes the Lambda function, which validates the received payload. If the payload is validated successfully, the Lambda function proceeds to invoke the Amazon Bedrock API for further processing.
The Amazon Bedrock API generates two image outputs in base64 format, which are transmitted back to the frontend application and rendered as visual images.

Step 3: View and download the result
The following screenshot shows the results of our test. You can download the results or provide an updated prompt to get a new output.

Testing and troubleshooting
When you initiate the Send to API action, the system performs a validation check. If required information is missing or incorrect, it will display an error notification. For instance, if you attempt to send an image to the API without providing a prompt, an error message will appear on the right side of the interface, alerting you to the missing input, as shown in the following screenshot.

Clean up
If you decide to discontinue using the Image Editing Tool, you can follow these steps to remove the Image Editing Tool, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process (you assigned a name to it).
Select the stack and choose Delete.

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion
In this post, we explored a sample solution that you can use to deploy an image editing application by using AWS serverless services and generative AI services. We used Amazon Bedrock and an Amazon Titan FM that allows you to edit images by using prompts. By adopting this solution, you gain the advantage of using AWS managed services, so you don’t have to maintain the underlying infrastructure. Get started today by deploying this sample solution.
Additional resources
To learn more about Amazon Bedrock, see the following resources:

GitHub repo: Amazon Bedrock Workshop
Amazon Bedrock User Guide
Amazon Bedrock InvokeModel API
Workshop: Using generative AI on AWS for diverse content types

To learn more about the Amazon Titan Image Generator G1 model, see the following resources:

Amazon Titan Image Generator G1 models
Amazon Titan Image Generator Demo

About the Authors
Salman Ahmed is a Senior Technical Account Manager in AWS Enterprise Support. He enjoys helping customers in the travel and hospitality industry to design, implement, and support cloud infrastructure. With a passion for networking services and years of experience, he helps customers adopt various AWS networking services. Outside of work, Salman enjoys photography, traveling, and watching his favorite sports teams.
Sergio Barraza is a Senior Enterprise Support Lead at AWS, helping energy customers design and optimize cloud solutions. With a passion for software development, he guides energy customers through AWS service adoption. Outside work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.
Ravi Kumar is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.
Ankush Goyal is an Enterprise Support Lead in AWS Enterprise Support who helps customers streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience.

AWS Machine Learning Blog

Many organizations are building generative AI applications powered by large language models (LLMs) to boost productivity and build differentiated experiences. These LLMs are large and complex and deploying them requires powerful computing resources and results in high inference costs. For businesses and researchers with limited resources, the high inference costs of generative AI models can be a barrier to enter the market, so more efficient and cost-effective solutions are needed. Most generative AI use cases involve human interaction, which requires AI accelerators that can deliver real time response rates with low latency. At the same time, the pace of innovation in generative AI is increasing, and it’s becoming more challenging for developers and researchers to quickly evaluate and adopt new models to keep pace with the market.
One way to get started with LLMs such as Llama and Mistral is by using Amazon Bedrock. However, customers who want to deploy LLMs in their own self-managed workflows for greater control and flexibility of underlying resources can use these LLMs optimized on top of AWS Inferentia2-powered Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. In this blog post, we introduce how to use an Amazon EC2 Inf2 instance to cost-effectively deploy multiple industry-leading LLMs on AWS Inferentia2, a purpose-built AWS AI chip, helping customers quickly test the models and expose an API interface that facilitates performance benchmarking and downstream application calls.
Model introduction
There are many popular open source LLMs to choose from, and for this blog post, we will review three different use cases based on model expertise using Meta-Llama-3-8B-Instruct, Mistral-7B-instruct-v0.2, and CodeLlama-7b-instruct-hf.

Model name | Release company | Number of parameters | Release time | Model capabilities
Meta-Llama-3-8B-Instruct | Meta | 8 billion | April 2024 | Language understanding, translation, code generation, inference, chat
Mistral-7B-Instruct-v0.2 | Mistral AI | 7.3 billion | March 2024 | Language understanding, translation, code generation, inference, chat
CodeLlama-7b-Instruct-hf | Meta | 7 billion | August 2023 | Code generation, code completion, chat

Meta-Llama-3-8B-Instruct is a popular language model, released by Meta AI in April 2024. The Llama 3 model has improved pre-training, instant comprehension, output generation, coding, inference, and math skills. The Meta AI team says that Llama 3 has the potential to be the initiator of a new wave of innovation in AI. The Llama 3 model is available in two publicly released versions, 8B and 70B. At the time of writing, Llama 3.1 instruction-tuned models are available in 8B, 70B, and 405B versions. In this blog post, we will use the Meta-Llama-3-8B-Instruct model, but the same process can be followed for Llama 3.1 models.
Mistral-7B-instruct-v0.2, released by Mistral AI in March 2024, marks a major milestone in the development of the publicly available foundation model. With its impressive performance, efficient architecture, and wide range of features, Mistral 7B v0.2 sets a new standard for user-friendly and powerful AI tools. The model excels at tasks ranging from natural language processing to coding, making it an invaluable resource for researchers, developers, and businesses. In this blog post, we will use the Mistral-7B-instruct-v0.2 model, but the same process can be followed for the Mistral-7B-instruct-v0.3 model.
CodeLlama-7b-instruct-hf is a collection of models published by Meta AI. It is an LLM that uses text prompts to generate code. Code Llama is aimed at code tasks, making developers’ workflow faster and more efficient and lowering the learning threshold for coders. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more powerful and well-documented software.
Solution architecture
The solution uses a client-server architecture, and the client uses the HuggingFace Chat UI to provide a chat page that can be accessed on a PC or mobile device. Server-side model inference uses Hugging Face’s Text Generation Inference, an efficient LLM inference framework that runs in a Docker container. We pre-compiled the models using Hugging Face’s Optimum Neuron and uploaded the compilation results to the Hugging Face Hub. We have also added a model switching mechanism to the HuggingFace Chat UI to control the loading of different models in the Text Generation Inference container through a scheduler.
Solution highlights

All components are deployed on a single-chip Inf2 instance (inf2.xl or inf2.8xl), and users can experience the effects of multiple models on one instance.
With the client-server architecture, users can flexibly replace either the client or the server side according to their actual needs. For example, the model can be deployed in Amazon SageMaker, and the frontend Chat UI can be deployed on the Node server. To facilitate the demonstration, we deployed both the front and back ends on the same Inf2 server.
Using a publicly available framework, users can customize frontend pages or models according to their own needs.
Using an API interface for Text Generation Inference facilitates quick access for users using the API.
Deployment using AWS CloudFormation, suitable for all types of businesses and developers within the enterprise.

Main components
The following are the main components of the solution.
Hugging Face Optimum Neuron
Optimum Neuron is an interface between the HuggingFace Transformers library and the AWS Neuron SDK. It provides a set of tools for model loading, training, and inference for single- and multiple-accelerator setups across different downstream tasks. In this post, we mainly used Optimum Neuron’s export interface. To deploy a HuggingFace Transformers model on Neuron devices, the model needs to be compiled and exported to a serialized format before inference is performed. The export interface pre-compiles the model ahead of time (AOT compilation) using the Neuron compiler (neuronx-cc), and the model is converted into a serialized and optimized TorchScript module. This is shown in the following figure.

During the compilation process, we introduced a tensor parallelism mechanism to split the weights, data, and computations between the two NeuronCores. For more compilation parameters, see Export a model to Inferentia.
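As an illustration, a hedged sketch of the export step with Optimum Neuron's Python API follows; the compilation arguments (batch size, sequence length, number of cores, and cast type) are illustrative and should be tuned for your instance size.

from optimum.neuron import NeuronModelForCausalLM

compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "sequence_length": 4096}

# export=True triggers ahead-of-time compilation with neuronx-cc
neuron_model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,
    **compiler_args,
    **input_shapes,
)
neuron_model.save_pretrained("llama3-8b-neuron")   # serialized artifacts that can be uploaded to the Hub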
Hugging Face’s Text Generation Inference (TGI)
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. TGI provides high performance text generation services for the most popular publicly available foundation LLMs. Its main features are:

Simple launcher that provides inference services for many LLMs
Supports both generate and stream interfaces
Token stream using server-sent events (SSE)
Supports AWS Inferentia, Trainium, NVIDIA GPUs and other accelerators

HuggingFace Chat UI
HuggingFace Chat UI is an open source chat tool built with SvelteKit that can be deployed to Cloudflare, Netlify, Node, and so on. It has the following main features:

Page can be customized
Conversation records can be stored; chat records are stored in MongoDB
Supports operation on PC and mobile terminals
The backend can connect to Text Generation Inference and supports API interfaces such as Anthropic, Amazon SageMaker, and Cohere
Compatible with various publicly available foundation models (Llama series, Mistral/Mixtral series, Falcon, and so on)

Thanks to the page customization capabilities of the Hugging Chat UI, we’ve added a model switching function, so users can switch between different models on the same EC2 Inf2 instance.
Solution deployment

Before deploying the solution, make sure you have an inf2.xl or inf2.8xl usage quota in the us-east-1 (Virginia) or us-west-2 (Oregon) AWS Region. See the reference link for how to apply for a quota.
Sign in to the AWS Management Console and switch the Region to us-east-1 (Virginia) or us-west-2 (Oregon) in the upper right corner of the console page.
Enter CloudFormation in the service search box and choose Create stack.
Select Choose an existing template, and then select Amazon S3 URL.
If you plan to use an existing virtual private cloud (VPC), use the steps in a; if you plan to create a new VPC to deploy, use the steps in b.

Use an existing VPC.

Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_default_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
Stack name: Enter the stack name.
InstanceType: select inf2.xl (lower cost) or inf2.8xl (better performance).
KeyPairName (optional): if you want to sign in to the Inf2 instance, enter the KeyPairName name.
VpcId: Select VPC.
PublicSubnetId: Select a public subnet.
VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
Choose Next, then Next again. Choose Submit.

Create a new VPC.

Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_new_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
Stack name: Enter the stack name.
InstanceType: Select inf2.xl or inf2.8xl.
KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the KeyPairName name.
VpcId: Leave as New.
PublicSubnetId: Leave as New.
VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.

Choose Next, and then Next again. Then choose Submit. After creating the stack, wait for the resources to be created and started (about 15 minutes). After the stack status is displayed as CREATE_COMPLETE, choose Outputs. Choose the URL value whose key is Public endpoint for the web server (close all VPN connections and firewall programs).

User interface
After the solution is deployed, users can access the preceding URL on the PC or mobile phone. On the page, the Llama3-8B model will be loaded by default. Users can switch models in the menu settings, select the model name to be activated in the model list, and choose Activate to switch models. Switching models requires reloading the new model into the Inferentia 2 accelerator memory. This process takes about 1 minute. During this process, users can check the loading status of the new model by choosing Retrieve model status. If the status is Available, it indicates that the new model has been successfully loaded.

The effects of the different models are shown in the following figure:

The following figure shows the solution in a browser on a PC:

API interface and performance testing
The solution uses a Text Generation Inference server, which supports the /generate and /generate_stream interfaces and uses port 8080 by default. You can make API calls by replacing <server_ip> in the following examples with the IP address of the instance deployed previously.
The /generate interface is used to return all responses to the client at once after generating all tokens on the server side.

curl <server_ip>:8080/generate \
    -X POST \
    -d '{"inputs": "Calculate the distance from Beijing to Shanghai"}' \
    -H 'Content-Type: application/json'

/generate_stream is used to reduce waiting delays and enhance the user experience by receiving tokens one by one when the model output length is relatively large.

curl <server_ip>:8080/generate_stream \
    -X POST \
    -d '{"inputs": "Write an essay on the mental health of elementary school students with no more than 300 words."}' \
    -H 'Content-Type: application/json'

Here is sample code that uses the requests interface in Python.

import requests

url = "http://<server_ip>:8080/generate"   # replace <server_ip> with the instance IP address
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Calculate the distance from Beijing to Shanghai",
    "parameters": {
        "max_new_tokens": 200
    }
}
response = requests.post(url, headers=headers, json=data)
print(response.text)
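For the streaming interface, a hedged sketch of consuming /generate_stream from Python follows; the exact server-sent event payload shape (a token object per data: line) is assumed from the TGI documentation.

import json
import requests

stream_url = "http://<server_ip>:8080/generate_stream"   # replace <server_ip> with the instance IP address
data = {
    "inputs": "Write an essay on the mental health of elementary school students with no more than 300 words.",
    "parameters": {"max_new_tokens": 300},
}

with requests.post(stream_url, json=data, stream=True) as response:
    for line in response.iter_lines():
        if line and line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)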

Summary
In this blog post, we introduced methods and examples of deploying popular LLMs on AWS AI chips, so that users can quickly experience the productivity improvements provided by LLMs. The models deployed on the Inf2 instance have been validated across multiple users and scenarios, showing strong performance and wide applicability. AWS is continuously expanding its application scenarios and features to provide users with efficient and economical computing capabilities. See Inf2 Inference Performance to check the types and list of models supported on the Inferentia2 chip. Contact us to give feedback on your needs or ask questions about deploying LLMs on AWS AI chips.
References

Amazon EC2 Inf2 Instances
Optimum Neuron
Optimum Neuron on GitHub
Text Generation Inference
Huggingface chat-ui
Introducing Meta Llama 3
Hugging Face Meta-Llama-3-8B
Hugging Face Mistral-7B-Instruct-v0.2
Hugging Face CodeLlama-7b-Instruct-hf
llama3-8b-inf2 Demo
AWS service quotas
Locust

About the authors
Zheng Zhang is a technical expert for Amazon Web Services machine learning products, focusing on Amazon Web Services-based accelerated computing and GPU instances. He has rich experience in large-scale model training and inference acceleration in machine learning.
Bingyang Huang is a Go-To-Market Specialist of Accelerated Computing at the GCR SSO GenAI team. She has experience deploying AI accelerators in customers’ production environments. Outside of work, she enjoys watching films and exploring good food.
Tian Shi is Senior Solution Architect at Amazon Web Services. He has rich experience in cloud computing, data analysis, and machine learning and is currently dedicated to research and practice in the fields of data science, machine learning, and serverless. His translations include Machine Learning as a Service, DevOps Practices Based on Kubernetes, Practical Kubernetes Microservices, Prometheus Monitoring Practice, and CoreDNS Study Guide in the Cloud Native Era.
Chuan Xie is a Senior Solution Architect at Amazon Web Services Generative AI, responsible for the design, implementation, and optimization of generative artificial intelligence solutions based on the Amazon Cloud. River has many years of production and research experience in the communications, ecommerce, internet and other industries, and rich practical experience in data science, recommendation systems, LLM RAG, and others. He has multiple AI-related product technology invention patents.

AWS Machine Learning Blog

In Part 1 of this series, we explored best practices for creating accurate and reliable agents using Amazon Bedrock Agents. Amazon Bedrock Agents help you accelerate generative AI application development by orchestrating multistep tasks. Agents use the reasoning capability of foundation models (FMs) to create a plan that decomposes the problem into multiple steps. The model is augmented with the developer-provided instruction to create an orchestration plan and then carry out the plan. The agent can use company APIs and external knowledge through Retrieval Augmented Generation (RAG).
In this second part, we dive into the architectural considerations and development lifecycle practices that can help you build robust, scalable, and secure intelligent agents. Whether you are just starting to explore the world of conversational AI or looking to optimize your existing agent deployments, this comprehensive guide can provide valuable long-term insights and practical tips to help you achieve your goals.
Enable comprehensive logging and observability
From the outset of your agent development journey, you should implement thorough logging and observability practices. This is crucial for debugging, auditing, and troubleshooting your agents. The first step to achieve comprehensive logging is to enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account.
Amazon Bedrock Agents also provides you with traces, a detailed overview of the steps being orchestrated by the agents, the underlying prompts invoking the FM, the references being returned from the knowledge bases, and code being generated by the agent. Trace events are streamed in real time, which allows you to customize UX cues to keep the end-user informed about the progress of their request. You can log your agent’s traces and use them to track and troubleshoot your agents.
When moving agent applications to production, it’s a best practice to set up a monitoring workflow to continuously analyze your logs. You can do so by either creating a custom solution or using an open source solution such as Bedrock-ICYM.
Use infrastructure as code
Just as you would with any other software development project, you should use infrastructure as code (IaC) frameworks to facilitate iterative and reliable deployment. This lets you create repeatable and production-ready agents that can be readily reproduced, tested, and monitored. Amazon Bedrock Agents allows you to write IaC code with AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform. We also recommend that you get started using our Agent Blueprints construct. We provide blueprint templates of the most common capabilities of Amazon Bedrock Agents, which can be deployed and updated with a single AWS CDK command.
When creating agents that use action groups, you can specify your function definitions as a JSON object to the agent or provide an API schema in the OpenAPI schema format. If you already have an OpenAPI schema for your application, the best practice is to start with it. Make sure the functions have proper natural language descriptions, because your agent will use them to understand when to use each function. If you’re starting with no existing schema, the simplest way to provide tool metadata for your agent is to use simple JSON function definitions. Either way, you can use the Amazon Bedrock console to quickly create a default AWS Lambda function to get started implementing your actions or tools.
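For illustration, the following is a hedged sketch of creating an action group from simple JSON function definitions with the bedrock-agent SDK; the agent ID, Lambda ARN, function name, and parameters are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

bedrock_agent.create_agent_action_group(
    agentId="<agent-id>",                                      # placeholder
    agentVersion="DRAFT",
    actionGroupName="time-off-actions",
    actionGroupExecutor={"lambda": "<lambda-function-arn>"},   # placeholder
    functionSchema={
        "functions": [
            {
                "name": "get_available_time_off",
                "description": "Returns the number of vacation days the employee has left.",
                "parameters": {
                    "employee_id": {
                        "type": "string",
                        "description": "The unique identifier of the employee.",
                        "required": True,
                    }
                },
            }
        ]
    },
)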
After you start to scale the development of agents, you should consider the reusability of the agent’s components. Using IaC will allow you to have predefined guardrails using Amazon Bedrock Guardrails, knowledge bases using Amazon Bedrock Knowledge Bases, and action groups that are reused over multiple agents.
Building agents that run tasks requires function definitions and Lambda functions. Another best practice is to use generative AI to accelerate the development and maintenance of this code. You can do so directly with the invoke model functionality in Amazon Bedrock, using the Amazon Q Developer support or even by creating an AWS PartyRock application that creates a framework of your Lambda function based on your action group metadata. You can directly generate the IaC required for creating your agents with function definitions and Lambda connections using generative AI. Independently of the approach selected, creating a test pipeline that validates and runs the IaC will help you optimize your agent solutions.
Use SessionState for additional agent context
You can use SessionState to provide additional context to your agent. You can pass information that is only available to the Lambda function in the action groups using SessionAttribute and information that should be available to your prompt as SessionPromptAttribute. For example, if you want to pass a user authentication token for your action to use, it’s best placed as a SessionAttribute. If you want to pass information that the large language model (LLM) needs to reason about, such as the current date and timestamp to define relative dates, it’s best placed as a SessionPromptAttribute. This lets your agent infer things like the number of days before your next payment due date or how many hours it has been since you placed your order using the reasoning capabilities of the underlying LLM model.
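A minimal sketch of this pattern when invoking the agent follows; the attribute names and values are illustrative.

from datetime import datetime, timezone

import boto3

agents_runtime = boto3.client("bedrock-agent-runtime")

response = agents_runtime.invoke_agent(
    agentId="<agent-id>",              # placeholder
    agentAliasId="<agent-alias-id>",   # placeholder
    sessionId="session-001",
    inputText="How many days until my next payment due date?",
    sessionState={
        # Available only to the action group Lambda functions
        "sessionAttributes": {"authToken": "<user-auth-token>"},
        # Injected into the prompt so the LLM can reason about relative dates
        "promptSessionAttributes": {"currentDateTime": datetime.now(timezone.utc).isoformat()},
    },
)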
Optimize model selection for cost and performance
A key part of the agent building process is to select the underlying FM for your agent (or for each sub-agent). Experiment with available FMs to select the best one for your application based on cost, latency, and accuracy requirements. Implement automated testing pipelines to collect evaluation metrics, enabling data-driven decisions on model selection. This approach allows you to use faster, cheaper models like Anthropic’s Claude 3 Haiku on Amazon Bedrock for simple agents, and more complex applications can use more advanced models like Anthropic’s Claude 3.5 Sonnet or Anthropic’s Claude 3 Opus.
Implement robust testing frameworks
Automating the evaluation of your agent, or any generative AI-powered system, can accelerate the development process and make sure you provide your customers with the best possible solution. You should evaluate on multiple dimensions, including cost, latency, and accuracy of your agents. Use frameworks like Agent Evaluation to assess agent behavior against predefined criteria. By using the Amazon Bedrock agent versioning and alias features, you can unlock A/B testing as part of your deployment stages. You should define different aspects of agent behavior, such as formal or informal HR assistant tone, that can be tested with a subset of your user group. You can then make different agent versions available for each group during initial deployments and evaluate the agent behavior for each group. Amazon Bedrock Agents has built-in versioning capabilities to help you with this key part of testing. The following figure shows how the HR agent can be updated after a testing and evaluation phase to create a new alias pointing to the selected version of the agent for the model invocation.

Use LLMs for test case generation
You can use LLMs to generate test cases based on expected use cases for your agent. As a best practice, you should select a different LLM to generate data than the one that is powering your agent. This approach can significantly accelerate the building of comprehensive test suites, providing thorough coverage of potential scenarios. For example, you could use the following prompt to create test cases for an HR assistant agent that helps employees booking holidays:

Generate the conversation back and forth between an employee and an employee
assistant agent. The employee is trying to reserve time off.
The agent has access to functions for checking the available employee’s time off,
booking and updating time off, and sending notifications that a new time off booking
has been completed. Here’s a sample conversation between an employee and an employee
assistant agent for booking time off. Your conversation should have at least 3
interactions between the agent and the employee. The employee starts by saying hello.

Design robust confirmation and security mechanisms
Implement robust confirmation mechanisms for critical actions in your agent’s workflow. Clearly state in your instructions that the agent should ask for user confirmation before running certain functions, especially those that modify data or perform sensitive operations. This step helps move beyond proof of concept or prototype stages, verifying that your agent operates reliably in production environments. For instance, the following instruction tells your agent to confirm that a vacation request action should be run before updating the database for the user:

You are an HR agent, helping employees … [other instructions removed for brevity]

Before creating, editing or deleting a time-off request, ask for user confirmation
for your actions. Include sufficient information with that ask to be clear about
the action that will be taken. DO NOT provide the function name itself but rather focus
on the actions being executed using natural language.

You can also use the requireConfirmation field in a function schema definition, or the x-requireConfirmation field in an API schema definition, when creating a new action to enable the built-in Amazon Bedrock Agents functionality that requests user confirmation before invoking an action in an action group.
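The following sketch shows one way this might look when defining a function schema with boto3; the agent ID, Lambda ARN, function name, and parameters are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Hypothetical identifiers, function name, and parameters for illustration only
response = bedrock_agent.create_agent_action_group(
    agentId="AGENT_ID",
    agentVersion="DRAFT",
    actionGroupName="time-off-actions",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-east-1:111122223333:function:time-off-handler"
    },
    functionSchema={
        "functions": [
            {
                "name": "create_time_off_request",
                "description": "Creates a new time-off request for the employee",
                "parameters": {
                    "start_date": {
                        "type": "string",
                        "required": True,
                        "description": "First day of the requested time off",
                    },
                    "number_of_days": {
                        "type": "integer",
                        "required": True,
                        "description": "Length of the requested time off",
                    },
                },
                # Ask the user to confirm before this function is invoked
                "requireConfirmation": "ENABLED",
            }
        ]
    },
)
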
Implement flexible authorization and encryption
You should provide customer managed keys to encrypt your agent’s resources, and confirm that your AWS Identity and Access Management (IAM) permissions follow the least privilege approach, limiting your agent to only have access to required resources and actions. When implementing action groups, take advantage of the sessionAttributes parameter of your sessionState to provide information about your user roles and permissions so that your action can implement fine-grained permissions (see the following sample code). Another best practice is to use the knowledgeBaseConfigurations parameter of the sessionState to provide extra configurations to your knowledge base, such as the user group defining the documents that a user should have access to through knowledge base metadata filtering.
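The following is a minimal sketch of both techniques, assuming your knowledge base documents are tagged with a user_group metadata attribute; all identifiers and attribute names are placeholders.

import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")

# Hypothetical IDs, role claim, and metadata attribute names for illustration only
response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="Summarize the latest HR policy updates",
    sessionState={
        # Role information that the action group Lambda can use for its own checks
        "sessionAttributes": {"userRole": "hr-manager"},
        # Restrict knowledge base retrieval to documents tagged for this user group
        "knowledgeBaseConfigurations": [
            {
                "knowledgeBaseId": "KB_ID",
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "filter": {"equals": {"key": "user_group", "value": "hr"}}
                    }
                },
            }
        ],
    },
)
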
Integrate responsible AI practices
When developing generative AI applications, you should apply responsible AI practices to create systems in an ethical, transparent, and accountable manner. Amazon Bedrock features help you develop your responsible AI practices in a scalable manner. When creating agents, you should implement Amazon Bedrock Guardrails to avoid sensitive topics, filter user input and agent output from harmful content, and redact sensitive information to protect user privacy. You can create organization-level guardrails that can be reused across multiple generative AI applications, thereby preserving consistent responsible AI practices. After you create a guardrail, you can associate it with your agent using the Amazon Bedrock Agents built-in guardrails connection (see the following sample code).
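The following sketch shows how an existing guardrail might be associated with an agent at creation time using boto3; the model ID, role ARN, and guardrail identifiers are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Hypothetical names, ARNs, and guardrail identifiers for illustration only
response = bedrock_agent.create_agent(
    agentName="hr-assistant",
    foundationModel="anthropic.claude-3-haiku-20240307-v1:0",
    agentResourceRoleArn="arn:aws:iam::111122223333:role/BedrockAgentRole",
    instruction="You are an HR agent, helping employees manage time-off requests.",
    guardrailConfiguration={
        "guardrailIdentifier": "GUARDRAIL_ID",
        "guardrailVersion": "1",
    },
)
print(response["agent"]["agentId"])
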
Build a reusable actions catalog and scale gradually
After the successful deployment of your first agent, you can plan to reuse common functionalities, such as action groups, knowledge bases, and guardrails, for other applications. Amazon Bedrock Agents supports creating agents manually using the AWS Management Console, using code with the SDKs available for the agent API, or using IaC with AWS CloudFormation templates, the AWS CDK, or Terraform. To reuse functionality, the best practice is to create and deploy these components using IaC and share them across applications. The following figure shows an example of the reusability of a utilities action group across two agents: an HR assistant and a banking assistant.

Follow a crawl-walk-run methodology when scaling agent usage
The final best practice that we would like to highlight is to follow the crawl-walk-run methodology. Start with an internal application (crawl), follow with applications made available to a smaller, controlled set of external users (walk), and finally scale your applications to all customers (run), eventually adopting multi-agent collaboration. This approach helps you build reliable agents that support mission-critical business operations, while minimizing the risks associated with rolling out new technology. The following figure illustrates this process.

Conclusion
By following these architectural and development lifecycle best practices, you’ll be well-equipped to create robust, scalable, and secure agents that can effectively serve your users and integrate seamlessly with your existing systems.
For examples to get started, check out the Amazon Bedrock samples repository. To learn more about Amazon Bedrock Agents, get started with the Amazon Bedrock Workshop and the standalone Amazon Bedrock Agents Workshop, which provides a deeper dive. Additionally, check out the service introduction video from AWS re:Invent 2023.

About the Authors
Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.
Navneet Sabbineni is a Software Development Manager at AWS Bedrock. With over 9 years of industry experience as a software developer and manager, he has worked on building and maintaining scalable distributed services for AWS, including generative AI services like Amazon Bedrock Agents and conversational AI services like Amazon Lex. Outside of work, he enjoys traveling and exploring the Pacific Northwest with his family and friends.
Monica Sunkara is a Senior Applied Scientist at AWS, where she works on Amazon Bedrock Agents. With over 10 years of industry experience, including 6 years at AWS, Monica has contributed to various AI and ML initiatives such as Alexa Speech Recognition, Amazon Transcribe, and Amazon Lex ASR. Her work spans speech recognition, natural language processing, and large language models. Recently, she worked on adding function calling capabilities to Amazon Titan text models. Monica holds a degree from Cornell University, where she conducted research on object localization under the supervision of Prof. Andrew Gordon Wilson before joining Amazon in 2018.

AWS Machine Learning Blog

This post is co-written by Rodrigo Amaral, Ashwin Murthy and Meghan Stronach from Qualcomm.
In this post, we introduce an innovative solution for end-to-end model customization and deployment at the edge using Amazon SageMaker and Qualcomm AI Hub. This seamless cloud-to-edge AI development experience enables developers to create optimized, highly performant, and custom managed machine learning solutions where they can bring their own model (BYOM) and bring their own data (BYOD) to meet varied business requirements across industries. From real-time analytics and predictive maintenance to personalized customer experiences and autonomous systems, this approach caters to diverse needs.
We demonstrate this solution by walking you through a comprehensive step-by-step guide on how to fine-tune YOLOv8, a real-time object detection model, on Amazon Web Services (AWS) using a custom dataset. The process uses a single ml.g5.2xlarge instance (providing one NVIDIA A10G Tensor Core GPU) with SageMaker for fine-tuning. After fine-tuning, we show you how to optimize the model with Qualcomm AI Hub so that it’s ready for deployment across edge devices powered by Snapdragon and Qualcomm platforms.
Business challenge
Today, many developers use AI and machine learning (ML) models to tackle a variety of business cases, from smart identification and natural language processing (NLP) to AI assistants. While open source models offer a good starting point, they often don’t meet the specific needs of the applications being developed. This is where model customization becomes essential, allowing developers to tailor models to their unique requirements and ensure optimal performance for specific use cases.
In addition, on-device AI deployment is a game-changer for developers crafting use cases that demand immediacy, privacy, and reliability. By processing data locally, edge AI minimizes latency, ensures sensitive information stays on-device, and guarantees functionality even in poor connectivity. Developers are therefore looking for an end-to-end solution where they can not only customize the model but also optimize the model to target on-device deployment. This enables them to offer responsive, secure, and robust AI applications, delivering exceptional user experiences.
How can Amazon SageMaker and Qualcomm AI Hub help?
BYOM and BYOD offer exciting opportunities for you to customize the model of your choice, use your own dataset, and deploy it on your target edge device. Through this solution, we propose using SageMaker for model fine-tuning and Qualcomm AI Hub for edge deployments, creating a comprehensive end-to-end model deployment pipeline. This opens new possibilities for model customization and deployment, enabling developers to tailor their AI solutions to specific use cases and datasets.
SageMaker is an excellent choice for model training, because it reduces the time and cost to train and tune ML models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can scale infrastructure from one to thousands of GPUs. Because you pay only for what you use, you can manage your training costs more effectively. SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries, such as DeepSpeed, Horovod, Fully Sharded Data Parallel (FSDP), or Megatron. You can train foundation models (FMs) for weeks and months without disruption by automatically monitoring and repairing training clusters.
After the model is trained, you can use Qualcomm AI Hub to optimize, validate, and deploy these customized models on hosted devices with Snapdragon and Qualcomm Technologies within minutes. Qualcomm AI Hub is a developer-centric platform designed to streamline on-device AI development and deployment. AI Hub offers automatic conversion and optimization of PyTorch or ONNX models for efficient on-device deployment using TensorFlow Lite, ONNX Runtime, or Qualcomm AI Engine Direct SDK. It also has an existing library of over 100 pre-optimized models for Qualcomm and Snapdragon platforms.
Qualcomm AI Hub has served more than 800 companies and continues to expand its offerings in terms of models available, platforms supported, and more.
Using SageMaker and Qualcomm AI Hub together can create new opportunities for rapid iteration on model customization, providing access to powerful development tools and enabling a smooth workflow from cloud training to on-device deployment.
Solution architecture
The following diagram illustrates the solution architecture. Developers working in their local environment initiate the following steps:

Select an open source model and a dataset for model customization from the Hugging Face repository.
Pre-process the data into the format required by your model for training, then upload the processed data to Amazon Simple Storage Service (Amazon S3). Amazon S3 provides a highly scalable, durable, and secure object storage solution for your machine learning use case.
Call the SageMaker control plane API using the SageMaker Python SDK for model training. In response, SageMaker provisions a resilient distributed training cluster with the requested number and type of compute instances to run the model training. SageMaker also handles orchestration and monitors the infrastructure for any faults.
After the training is complete, SageMaker spins down the cluster, and you’re billed for the net training time in seconds. The final model artifact is saved to an S3 bucket.
Pull the fine-tuned model artifact from Amazon S3 to the local development environment and validate the model accuracy.
Use Qualcomm AI Hub to compile and profile the model, running it on cloud-hosted devices to deliver performance metrics ahead of downloading for deployment across edge devices.

Use case walk through
Imagine a leading electronics manufacturer aiming to enhance its quality control process for printed circuit boards (PCBs) by implementing an automated visual inspection system. Initially, using an open source vision model, the manufacturer collects and annotates a large dataset of PCB images, including both defective and non-defective samples.
This dataset, similar to the keremberke/pcb-defect-segmentation dataset from Hugging Face, contains annotations for common defect classes such as dry joints, incorrect installations, PCB damage, and short circuits. With SageMaker, the manufacturer trains a custom YOLOv8 (You Only Look Once) model, developed by Ultralytics, to recognize these specific PCB defects. The model is then optimized for deployment at the edge using Qualcomm AI Hub, providing efficient performance on chosen platforms such as industrial cameras or handheld devices used in the production line.
This customized model significantly improves the quality control process by accurately detecting PCB defects in real-time. It reduces the need for manual inspections and minimizes the risk of defective PCBs progressing through the manufacturing process. This leads to improved product quality, increased efficiency, and substantial cost savings.
Let’s walk through this scenario with an implementation example.
Prerequisites
For this walkthrough, you should have the following:

Jupyter Notebook – The example has been tested in Visual Studio Code with Jupyter Notebook using the Python 3.11.7 environment.
An AWS account.
Create an AWS Identity and Access Management (IAM) user with the AmazonSageMakerFullAccess policy to enable you to run SageMaker APIs. Set up your security credentials for CLI.
Install AWS Command Line Interface (AWS CLI) and use aws configure to set up your IAM credentials securely.
Create a role with the name sagemakerrole to be assumed by SageMaker. Attach the AmazonS3FullAccess managed policy to give SageMaker access to your S3 buckets.
Make sure your account has the SageMaker Training resource type limit for ml.g5.2xlarge increased to 1 using the Service Quotas console.
Follow the get started instructions to install the necessary Qualcomm AI Hub library and set up your unique API token for Qualcomm AI Hub.
Use the following command to clone the GitHub repository with the assets for this use case. This repository consists of a notebook that references training assets.

$ git clone https://github.com/aws-samples/sm-qai-hub-examples.git
$ cd sm-qai-hub-examples/yolo

The sm-qai-hub-examples/yolo directory contains all the training scripts that you might need to deploy this sample.
Next, you will run the sagemaker_qai_hub_finetuning.ipynb notebook to fine-tune the YOLOv8 model on SageMaker and deploy it on the edge using AI Hub. See the notebook for more details on each step. In the following sections, we walk you through the key components of fine-tuning the model.
Step 1: Access the model and data

Begin by installing the necessary packages in your Python environment. At the top of the notebook, include the following code snippet, which uses Python’s pip package manager to install the required packages in your local runtime environment.

%pip install -Uq sagemaker==2.232.0 ultralytics==8.2.100 datasets==2.18.0

Import the necessary libraries for the project. Specifically, import the Dataset class from the Hugging Face datasets library and the YOLO class from the ultralytics library. These libraries are crucial for your work, because they provide the tools you need to access and manipulate the dataset and work with the YOLO object detection model.

from datasets import Dataset

from ultralytics import YOLO

Step 2: Pre-process and upload data to S3
To fine-tune your YOLOv8 model for detecting PCB defects, you will use the keremberke/pcb-defect-segmentation dataset from Hugging Face. This dataset includes 189 images of chip defects (train: 128 images, validation: 25 images and test: 36 images). These defects are annotated in COCO format.
YOLOv8 doesn’t recognize these classes out of the box, so you will map YOLOv8’s logits to identify these classes during model fine-tuning, as shown in the following image.

Begin by downloading the dataset from Hugging Face to the local disk and converting it to the required YOLO dataset structure using the utility function CreateYoloHFDataset. This structure ensures that the YOLO API correctly loads and processes the images and labels during the training phase.

dataset_name = "keremberke/pcb-defect-segmentation"
dataset_labels = [
    "dry_joint",
    "incorrect_installation",
    "pcb_damage",
    "short_circuit",
]

data = CreateYoloHFDataset(
    hf_dataset_name=dataset_name,
    labels_names=dataset_labels,
)

Upload the dataset to Amazon S3. This step is crucial because the dataset stored in S3 will serve as the input data channel for the SageMaker training job. SageMaker will efficiently manage the process of distributing this data across the training cluster, allowing each node to access the necessary information for model training.

uploaded_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=data_path,
    desired_s3_uri=f"s3://{s3_bucket}/qualcomm-aihub…",
)

Alternatively, you can use your own custom dataset (non-Hugging Face) to fine-tune the YOLOv8 model, as long as the dataset complies with the YOLOv8 dataset format.
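For reference, a YOLOv8-format dataset is described by a small YAML file that points to the image splits and lists the class names. The following sketch writes such a file for the defect classes used in this example; the directory layout is hypothetical.

from pathlib import Path

# Hypothetical local layout: images and labels split into train/val/test folders
dataset_yaml = """\
path: ./pcb-defect-dataset
train: images/train
val: images/val
test: images/test

names:
  0: dry_joint
  1: incorrect_installation
  2: pcb_damage
  3: short_circuit
"""

Path("pcb-defect-dataset.yaml").write_text(dataset_yaml)
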
Step 3: Fine-tune your YOLOv8 model
3.1: Review the training script
You’re now prepared to fine-tune the model using the model.train method from the Ultralytics YOLO library.
We’ve prepared a script called train_yolov8.py that will perform the following tasks. Let’s quickly review the key points in this script before you launch the training job.

The training script first loads a YOLOv8 model from the Ultralytics library:

model = YOLO(args.yolov8_model)

It then uses the train method to run fine-tuning, which uses the dataset, adjusts the model’s parameters, and optimizes its ability to accurately predict object classes and locations in images:

tuned_model = model.train(
    data=dataset_yaml,
    batch=args.batch_size,
    imgsz=args.img_size,
    epochs=args.epochs,
)
After the model is trained, the script runs inference to test the model output and saves the model artifacts to a local folder that is mapped to Amazon S3:

results = model.predict(
    data=dataset_yaml,
    imgsz=args.img_size,
    batch=args.batch_size
)

model.save(".pt")

3.2: Launch the training
You’re now ready to launch the training. You will use the SageMaker PyTorch training estimator to initiate training. The estimator simplifies the training process by automating several of the key tasks in this example:

The SageMaker estimator spins up a training cluster of one ml.g5.2xlarge instance. SageMaker handles the setup and management of these compute instances, which reduces the total cost of ownership.
The estimator also uses one of the pre-built containers managed by SageMaker, in this case the PyTorch container, which includes an optimized, compiled version of the PyTorch framework along with its required dependencies and GPU-specific libraries for accelerated computations.

The estimator.fit() method initiates the training process with the specified input data channels. Following is the code used to launch the training job along with the necessary parameters.

estimator = PyTorch(
    entry_point="train_yolov8.py",
    source_dir="scripts",
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    image_uri=training_image_uri,
    hyperparameters=hyperparameters,
    base_job_name="yolov8-finetuning",
    output_path=f"s3://{s3_bucket}/…",
)

estimator.fit(
    {
        "training": sagemaker.inputs.TrainingInput(
            s3_data=uploaded_s3_uri,
            distribution="FullyReplicated",
            s3_data_type="S3Prefix",
        )
    }
)

You can track a SageMaker training job by monitoring its status using the AWS Management Console, AWS CLI, or AWS SDKs. To determine when the job is completed, check for the Completed status or set up Amazon CloudWatch alarms to notify you when the job transitions to the Completed state.
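For example, the following sketch polls the job status with boto3 after the estimator has been launched; it assumes the estimator object from the previous step is still in scope.

import boto3

sm_client = boto3.client("sagemaker")

# The job name is available on the estimator after fit() has been called
job_name = estimator.latest_training_job.job_name

status = sm_client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
print(f"Training job {job_name} is {status}")  # InProgress, Completed, Failed, ...
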
Step 4 & 5: Save, download and validate the trained model
The training process generates model artifacts that are saved to the S3 bucket location specified in output_path. This example uses the download_tar_and_untar utility to download the model to a local drive.
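download_tar_and_untar is a utility from the sample repository; a roughly equivalent generic sketch using the SageMaker SDK and the tarfile module is shown below, with a hypothetical local output directory.

import tarfile

import sagemaker

# Location of model.tar.gz reported by the completed training job
model_s3_uri = estimator.model_data

# Download the archive and extract it into a hypothetical local working directory
local_archive = sagemaker.s3.S3Downloader.download(
    s3_uri=model_s3_uri,
    local_path="./model_artifacts",
)[0]

with tarfile.open(local_archive) as tar:
    tar.extractall(path="./model_artifacts")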

Run inference on this model and visually validate how close ground truth and model predictions bounding boxes align on test images. The following code shows how to generate an image mosaic using a custom utility function—draw_bounding_boxes—that overlays an image with ground truth and model classification along with a confidence value for class prediction.

image_mosaics = []
for _key in image_label_pairs:
    img_path, lbl_path = image_label_pairs[_key]["image_path"], image_label_pairs[_key]["label_path"]
    result = model([img_path], save=False)
    image_with_boxes = draw_bounding_boxes(
        yolo_result=result[0],
        ground_truth=open(lbl_path).read().splitlines(),
        confidence_threshold=0.2,
    )
    image_mosaics.append(np.array(image_with_boxes))

From the preceding image mosaic, you can observe two distinct sets of bounding boxes: the cyan boxes indicate human annotations of defects on the PCB image, while the red boxes represent the model’s predictions of defects. Along with the predicted class, you can also see the confidence value for each prediction, which reflects the quality of the YOLOv8 model’s output.
After fine-tuning, YOLOv8 begins to accurately predict the PCB defect classes present in the custom dataset, even though it hadn’t encountered these classes during model pretraining. Additionally, the predicted bounding boxes are closely aligned with the ground truth, with confidence scores of greater than or equal to 0.5 in most cases. You can further improve the model’s performance without the need for hyperparameter guesswork by using a SageMaker hyperparameter tuning job.
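The following is a minimal sketch of such a tuning job; it assumes the training script accepts learning_rate and epochs hyperparameters and prints a validation mAP metric that the illustrative regex can capture.

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Illustrative search ranges; adjust to your training budget
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-2),
    "epochs": IntegerParameter(10, 50),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:mAP50",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    # Assumes the training script prints a line like "validation mAP50: 0.87"
    metric_definitions=[
        {"Name": "validation:mAP50", "Regex": "validation mAP50: ([0-9\\.]+)"}
    ],
    max_jobs=8,
    max_parallel_jobs=2,
)

tuner.fit({"training": uploaded_s3_uri})
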
Step 6: Run the model on a real device with Qualcomm AI Hub
Now that you’ve validated the fine-tuned model on PyTorch, you want to run the model on a real device.
Qualcomm AI Hub enables you to do the following:

Compile and optimize the PyTorch model into a format that can be run on a device
Run the compiled model on a device with a Snapdragon processor hosted in AWS device farm
Verify on-device model accuracy
Measure on-device model latency

To run the model:

Compile the model.

The first step is converting the PyTorch model into a format that can run on the device.
This example uses a Windows laptop powered by the Snapdragon X Elite processor. This device uses the ONNX model format, which you will configure during compilation.
As you get started, you can see a list of all the devices supported on Qualcomm AI Hub by running qai-hub list-devices.
See Compiling Models to learn more about compilation on Qualcomm AI Hub.

compile_job = hub.submit_compile_job(
    model=traced_model,
    input_specs={"image": (model_input.shape, "float32")},
    device=target_device,
    name=model_name,
    options="--target_runtime onnx",
)

Run inference on a real device.

Run the compiled model on a real cloud-hosted device with Snapdragon using the same model input you verified locally with PyTorch.
See Running Inference to learn more about on-device inference on Qualcomm AI Hub.

inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    inputs={"image": [model_input.numpy()]},
    device=target_device,
    name=model_name,
)

Profile the model on a real device.

Profiling measures the latency of the model when run on a device. It reports the minimum value over 100 invocations of the model to best isolate model inference time from other processes on the device.
See Profiling Models to learn more about profiling on Qualcomm AI Hub.

profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=target_device,
    name=model_name,
)

Deploy the compiled model to your device

Run the command below to download the compiled model.
The compiled model can be used in conjunction with the AI Hub sample application hosted here. This application uses the model to run object detection on a Windows laptop powered by Snapdragon that you have available locally.

compile_job.download_target_model()

Conclusion
Model customization with your own data through Amazon SageMaker—with over 250 models available on SageMaker JumpStart—is an addition to the existing features of Qualcomm AI Hub, which include BYOM and access to a growing library of over 100 pre-optimized models. Together, these features create a rich environment for developers aiming to build and deploy customized on-device AI models across Snapdragon and Qualcomm platforms.
The collaboration between Amazon SageMaker and Qualcomm AI Hub will help enhance the user experience and streamline machine learning workflows, enabling more efficient model development and deployment across any application at the edge. With this effort, Qualcomm Technologies and AWS are empowering their users to create more personalized, context-aware, and privacy-focused AI experiences.
To learn more, visit Qualcomm AI Hub and Amazon SageMaker. For queries and updates, join the Qualcomm AI Hub community on Slack.
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. or its subsidiaries.

About the authors
Rodrigo Amaral currently serves as the Lead for Qualcomm AI Hub Marketing at Qualcomm Technologies, Inc. In this role, he spearheads go-to-market strategies, product marketing, and developer activities, with a focus on AI and ML for edge devices. He brings almost a decade of experience in AI, complemented by a strong background in business. Rodrigo holds a BA in Business and a Master’s degree in International Management.
Ashwin Murthy is a Machine Learning Engineer working on Qualcomm AI Hub. He works on adding new models to the public AI Hub Models collection, with a special focus on quantized models. He previously worked on machine learning at Meta and Groq.
Meghan Stronach is a PM on Qualcomm AI Hub. She works to support our external community and customers, delivering new features across Qualcomm AI Hub and enabling adoption of ML on device. Born and raised in the Toronto area, she graduated from the University of Waterloo in Management Engineering and has spent her time at companies of various sizes.
Kanwaljit Khurmi is a Principal Generative AI/ML Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.
Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.

NIMH News Feed


During this webinar, experts in graduate education and systemic-change management will discuss evidence-based practices and case studies of successful holistic admissions programs. The webinar will provide faculty, admission officers, and other higher education professionals with a roadmap for implementing mission-driven systemic change in graduate admissions.
