How to work with the ChatGPT and GPT-4 models - Azure OpenAI Service (2023)

The ChatGPT and GPT-4 models are language models that are optimized for conversational interfaces. The models behave differently than the older GPT-3 models. Previous models were text-in and text-out, meaning they accepted a prompt string and returned a completion to append to the prompt. However, the ChatGPT and GPT-4 models are conversation-in and message-out. The models expect input formatted in a specific chat-like transcript format, and return a completion that represents a model-written message in the chat. While this format was designed specifically for multi-turn conversations, it also works well for non-chat scenarios.

In Azure OpenAI, there are two different options for interacting with these types of models:

  • Chat Completion API.
  • Completion API with Chat Markup Language (ChatML).

The Chat Completion API is a new dedicated API for interacting with the ChatGPT and GPT-4 models. This API is the preferred method for accessing these models. It is also the only way to access the new GPT-4 models.

ChatML uses the same Completion API that you use for other models like text-davinci-002, but it requires a unique token-based prompt format known as Chat Markup Language (ChatML). This provides lower-level access than the dedicated Chat Completion API, but it also requires additional input validation, only supports ChatGPT (gpt-35-turbo) models, and uses an underlying format that is more likely to change over time.

This article walks you through getting started with the new ChatGPT and GPT-4 models. It's important to use the techniques described here to get the best results. If you try to interact with the models the same way you did with the older model series, the models will often be verbose and provide less useful responses.

Working with the ChatGPT and GPT-4 models

The following code snippet shows the most basic way to use the ChatGPT and GPT-4 models with the Chat Completion API. If this is your first time using these models programmatically, we recommend starting with our .

GPT-4 models are currently only available by request. Existing Azure OpenAI customers can apply for access by filling out this form.

import os
import openai

openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_BASE")  # Your Azure OpenAI resource's endpoint value.
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # The deployment name you chose when you deployed the ChatGPT or GPT-4 model.
    messages=[
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
        {"role": "user", "content": "Who were the founders of Microsoft?"}
    ]
)

print(response)
print(response['choices'][0]['message']['content'])

Output

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The founders of Microsoft are Bill Gates and Paul Allen. They co-founded the company in 1975.",
        "role": "assistant"
      }
    }
  ],
  "created": 1679014551,
  "id": "chatcmpl-6usfn2yyjkbmESe3G4jaQR6bsScO1",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 86,
    "prompt_tokens": 37,
    "total_tokens": 123
  }
}

Note

The following parameters aren't available with the new ChatGPT and GPT-4 models: logprobs, best_of, and echo. If you set any of these parameters, you'll get an error.
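One way to avoid that error is to check the request arguments before sending them. The following is a minimal sketch, not part of the `openai` library: `find_unsupported_params` is a hypothetical helper that only compares a dictionary of keyword arguments against the three parameter names listed in the note.

```python
# Parameters the note above lists as unavailable for ChatGPT and GPT-4 models.
UNSUPPORTED_CHAT_PARAMS = {"logprobs", "best_of", "echo"}


def find_unsupported_params(request_kwargs):
    """Return the sorted names of unsupported parameters present in the request.

    Hypothetical helper for illustration; it is not part of the openai library.
    """
    return sorted(UNSUPPORTED_CHAT_PARAMS.intersection(request_kwargs))


request = {"engine": "gpt-35-turbo", "max_tokens": 300, "echo": True, "best_of": 2}
problems = find_unsupported_params(request)
if problems:
    print("Remove these parameters before calling the API:", ", ".join(problems))
```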

Every response includes a finish_reason. The possible values for finish_reason are:

  • stop: API returned complete model output.
  • length: Incomplete model output due to max_tokens parameter or token limit.
  • content_filter: Omitted content due to a flag from our content filters.
  • null: API response still in progress or incomplete.
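As a rough illustration of how an application might branch on these values, the sketch below maps the finish_reason of the first choice to a short description. `explain_finish_reason` is a hypothetical helper, and the sample dictionary is hand-written to mirror the shape of the output shown earlier; a JSON null arrives in Python as None.

```python
def explain_finish_reason(response):
    """Map the finish_reason of the first choice to a short description.

    Hypothetical helper; `response` is a dict shaped like the sample output.
    """
    reason = response["choices"][0].get("finish_reason")
    if reason == "stop":
        return "complete"
    if reason == "length":
        return "truncated: raise max_tokens or shorten the prompt"
    if reason == "content_filter":
        return "content omitted by the content filter"
    if reason is None:
        return "response still in progress or incomplete"
    return f"unrecognized finish_reason: {reason}"


# Hand-written sample for illustration, not real service output.
sample = {
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {"role": "assistant", "content": "The founders of"},
        }
    ]
}
print(explain_finish_reason(sample))
```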

Consider setting max_tokens to a slightly higher value than normal such as 300 or 500. This ensures that the model doesn't stop generating text before it reaches the end of the message.

Model versioning

Note

gpt-35-turbo is equivalent to the gpt-3.5-turbo model from OpenAI.

Unlike previous GPT-3 and GPT-3.5 models, the gpt-35-turbo model as well as the gpt-4 and gpt-4-32k models will continue to be updated. When creating a deployment of these models, you'll also need to specify a model version.

Currently, only version 0301 is available for ChatGPT and 0314 for GPT-4 models. We'll continue to make updated versions available in the future. You can find model deprecation times on our models page.

Working with the Chat Completion API

OpenAI trained the ChatGPT and GPT-4 models to accept input formatted as a conversation. The messages parameter takes an array of dictionaries with a conversation organized by role.

The format of a basic Chat Completion is as follows:

{"role": "system", "content": "Provide some context and/or instructions to the model"},
{"role": "user", "content": "The user's message goes here"}

A conversation with one example answer followed by a question would look like:

{"role": "system", "content": "Provide some context and/or instructions to the model."},
{"role": "user", "content": "Example question goes here."},
{"role": "assistant", "content": "Example answer goes here."},
{"role": "user", "content": "First question/message for the model to actually respond to."}

System role

The system role, also known as the system message, is included at the beginning of the array. This message provides the initial instructions to the model. You can provide various information in the system role including:

  • A brief description of the assistant
  • Personality traits of the assistant
  • Instructions or rules you would like the assistant to follow
  • Data or information needed for the model, such as relevant questions from an FAQ

You can customize the system role for your use case or just include basic instructions. The system role/message is optional, but it's recommended to at least include a basic one to get the best results.

Messages

After the system role, you can include a series of messages between the user and the assistant.

 {"role": "user", "content": "What is thermodynamics?"}

To trigger a response from the model, you should end with a user message indicating that it's the assistant's turn to respond. You can also include a series of example messages between the user and the assistant as a way to do few shot learning.
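The conventions above can be checked locally before a request is sent. The sketch below is an assumption about how such a check might look: `check_messages` is a hypothetical helper that encodes only the conventions described in this article (known roles, system message first, final message from the user). The service performs its own validation, which may differ.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_messages(messages):
    """Return a list of problems found in a messages array (empty if none).

    Hypothetical helper that encodes the conventions described in this article.
    """
    if not messages:
        return ["messages is empty"]
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: content must be a string")
        if role == "system" and index != 0:
            problems.append(f"message {index}: system message should come first")
    if messages[-1].get("role") != "user":
        problems.append("last message should have the user role")
    return problems


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is thermodynamics?"},
]
print(check_messages(messages))
```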

Message prompt examples

The following section shows examples of different styles of prompts that you could use with the ChatGPT and GPT-4 models. These examples are just a starting point, and you can experiment with different prompts to customize the behavior for your own use cases.

Basic example

If you want the ChatGPT model to behave similarly to chat.openai.com, you can use a basic system message like "Assistant is a large language model trained by OpenAI."

{"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
{"role": "user", "content": "Who were the founders of Microsoft?"}

Example with instructions

For some scenarios, you may want to give additional instructions to the model to define guardrails for what the model is able to do.

{"role": "system", "content": "Assistant is an intelligent chatbot designed to help users answer their tax related questions.
Instructions:
- Only answer questions related to taxes.
- If you're unsure of an answer, you can say \"I don't know\" or \"I'm not sure\" and recommend users go to the IRS website for more information. "},
{"role": "user", "content": "When are my taxes due?"}

Using data for grounding

You can also include relevant data or information in the system message to give the model extra context for the conversation. If you only need to include a small amount of information, you can hard code it in the system message. If you have a large amount of data that the model should be aware of, you can use embeddings or a product like Azure Cognitive Search to retrieve the most relevant information at query time.

{"role": "system", "content": "Assistant is an intelligent chatbot designed to help users answer technical questions about Azure OpenAI Service. Only answer questions using the context below and if you're not sure of an answer, you can say 'I don't know'.

Context:
- Azure OpenAI Service provides REST API access to OpenAI's powerful language models including the GPT-3, Codex and Embeddings model series.
- Azure OpenAI Service gives customers advanced language AI with OpenAI GPT-3, Codex, and DALL-E models with the security and enterprise promise of Azure. Azure OpenAI co-develops the APIs with OpenAI, ensuring compatibility and a smooth transition from one to the other.
- At Microsoft, we're committed to the advancement of AI driven by principles that put people first. Microsoft has made significant investments to help guard against abuse and unintended harm, which includes requiring applicants to show well-defined use cases, incorporating Microsoft’s principles for responsible AI use."},
{"role": "user", "content": "What is Azure OpenAI Service?"}
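When the context is retrieved at query time, the system message has to be assembled in code. The following is a minimal sketch under stated assumptions: `build_grounded_system_message` is a hypothetical helper, and the list of context strings stands in for whatever a retrieval step returns. It only formats text and does not call any search service.

```python
def build_grounded_system_message(description, context_items):
    """Join a description and retrieved context into one system message dict.

    Hypothetical helper; `context_items` stands in for retrieval results.
    """
    context = "\n".join(f"- {item}" for item in context_items)
    content = f"{description}\n\nContext:\n{context}"
    return {"role": "system", "content": content}


# Placeholder snippets; in practice these would come from a retrieval step.
retrieved = [
    "Azure OpenAI Service provides REST API access to OpenAI's language models.",
    "Azure OpenAI co-develops the APIs with OpenAI.",
]
system_message = build_grounded_system_message(
    "Only answer questions using the context below.", retrieved
)
print(system_message["content"])
```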

Few shot learning with Chat Completion

You can also give few shot examples to the model. The approach for few shot learning has changed slightly because of the new prompt format. You can now include a series of messages between the user and the assistant in the prompt as few shot examples. These examples can be used to seed answers to common questions to prime the model or teach particular behaviors to the model.

This is only one example of how you can use few shot learning with ChatGPT and GPT-4. You can experiment with different approaches to see what works best for your use case.

{"role": "system", "content": "Assistant is an intelligent chatbot designed to help users answer their tax related questions. "},
{"role": "user", "content": "When do I need to file my taxes by?"},
{"role": "assistant", "content": "In 2023, you will need to file your taxes by April 18th. The date falls after the usual April 15th deadline because April 15th falls on a Saturday in 2023. For more details, see https://www.irs.gov/filing/individuals/when-to-file."},
{"role": "user", "content": "How can I check the status of my tax refund?"},
{"role": "assistant", "content": "You can check the status of your tax refund by visiting https://www.irs.gov/refunds"}
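If the few shot examples are stored as question and answer pairs, they can be expanded into alternating user and assistant messages. This is a sketch only: `build_few_shot_messages` is a hypothetical helper, and it appends the real question as the final user message so the array ends on the user's turn.

```python
def build_few_shot_messages(system_content, examples, question):
    """Expand (question, answer) pairs into a messages array.

    Hypothetical helper; the final user message is the one to be answered.
    """
    messages = [{"role": "system", "content": system_content}]
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    ("How can I check the status of my tax refund?",
     "You can check the status of your tax refund by visiting https://www.irs.gov/refunds"),
]
messages = build_few_shot_messages(
    "Assistant is an intelligent chatbot designed to help users answer their tax related questions.",
    examples,
    "When are my taxes due?",
)
for message in messages:
    print(message["role"])
```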

Using Chat Completion for non-chat scenarios

The Chat Completion API is designed to work with multi-turn conversations, but it also works well for non-chat scenarios.

For example, for an entity extraction scenario, you might use the following prompt:

{"role": "system", "content": "You are an assistant designed to extract entities from text. Users will paste in a string of text and you will respond with entities you've extracted from the text as a JSON object. Here's an example of your output format:
{
   \"name\": \"\",
   \"company\": \"\",
   \"phone_number\": \"\"
}"},
{"role": "user", "content": "Hello. My name is Robert Smith. I'm calling from Contoso Insurance, Delaware. My colleague mentioned that you are interested in learning about our comprehensive benefits policy. Could you give me a call back at (555) 346-9322 when you get a chance so we can go over the benefits?"}
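Because the model's reply is plain text, an application would typically parse it before use and handle the case where the reply is not valid JSON. The sketch below assumes the three keys from the prompt above; `parse_entities` is a hypothetical helper, and the reply string is written by hand for illustration rather than taken from the model.

```python
import json

EXPECTED_KEYS = ("name", "company", "phone_number")


def parse_entities(reply_text):
    """Parse the model's reply as JSON and keep only the expected keys.

    Returns None when the reply is not a JSON object. Hypothetical helper.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {key: data.get(key, "") for key in EXPECTED_KEYS}


# Hand-written reply for illustration, not real model output.
reply = '{"name": "Robert Smith", "company": "Contoso Insurance", "phone_number": "(555) 346-9322"}'
print(parse_entities(reply))
```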

Creating a basic conversation loop

The examples so far have shown you the basic mechanics of interacting with the Chat Completion API. This example shows you how to create a conversation loop that performs the following actions:

  • Continuously takes console input, and properly formats it as part of the messages array as user role content.
  • Outputs responses that are printed to the console and formatted and added to the messages array as assistant role content.

This means that every time a new question is asked, a running transcript of the conversation so far is sent along with the latest question. Since the model has no memory, you need to send an updated transcript with each new question or the model will lose context of the previous questions and answers.

import os
import openai

openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_BASE")  # Your Azure OpenAI resource's endpoint value.
openai.api_key = os.getenv("OPENAI_API_KEY")

conversation = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input()
    conversation.append({"role": "user", "content": user_input})

    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # The deployment name you chose when you deployed the ChatGPT or GPT-4 model.
        messages=conversation
    )

    # Store the reply so the next request includes the full transcript.
    conversation.append({"role": "assistant", "content": response['choices'][0]['message']['content']})
    print("\n" + response['choices'][0]['message']['content'] + "\n")

When you run the code above you will get a blank console window. Enter your first question in the window and then hit enter. Once the response is returned, you can repeat the process and keep asking questions.
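To exercise the transcript handling without a deployed model, the API call can be passed in as a function. The sketch below is one possible arrangement, not the article's method: `chat_turn` and `fake_send` are hypothetical names, and `fake_send` is a stand-in that returns a dictionary shaped like the Chat Completion output.

```python
def chat_turn(conversation, user_input, send):
    """Append the user message, call `send`, then append and return the reply.

    `send` is any callable that takes the messages list and returns a response
    dict shaped like the Chat Completion output. Hypothetical helper.
    """
    conversation.append({"role": "user", "content": user_input})
    response = send(conversation)
    reply = response["choices"][0]["message"]["content"]
    conversation.append({"role": "assistant", "content": reply})
    return reply


def fake_send(messages):
    """Stand-in for the API call: reports how many messages it received."""
    text = f"I received {len(messages)} messages."
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}


conversation = [{"role": "system", "content": "You are a helpful assistant."}]
print(chat_turn(conversation, "Hello", fake_send))
print(chat_turn(conversation, "Still there?", fake_send))
```

In a real application, `send` could wrap the `openai.ChatCompletion.create` call shown above. The growing message count in the stand-in's replies reflects the running transcript being resent on every turn.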

Managing conversations

The previous example will run until you hit the model's token limit. With each question asked, and answer received, the messages array grows in size. The token limit for gpt-35-turbo is 4096 tokens, whereas the token limits for gpt-4 and gpt-4-32k are 8192 and 32768 respectively. These limits include the token count from both the message array sent and the model response. The number of tokens in the messages array combined with the value of the max_tokens parameter must stay under these limits or you'll receive an error.
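The constraint can be expressed as a simple check. The sketch below uses the limits stated in this article and follows its wording ("stay under") by treating the limit as exclusive; `fits_token_limit` is a hypothetical helper, and the prompt token count is assumed to come from a tokenizer such as tiktoken.

```python
# Token limits as stated in this article.
TOKEN_LIMITS = {"gpt-35-turbo": 4096, "gpt-4": 8192, "gpt-4-32k": 32768}


def fits_token_limit(model, prompt_tokens, max_tokens):
    """Return True if prompt tokens plus max_tokens stay under the model's limit.

    Hypothetical helper; the comparison is strict to match the article's wording.
    """
    return prompt_tokens + max_tokens < TOKEN_LIMITS[model]


print(fits_token_limit("gpt-35-turbo", 3900, 250))
print(fits_token_limit("gpt-4", 3900, 250))
```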

It's your responsibility to ensure the prompt and completion fall within the token limit. This means that for longer conversations, you need to keep track of the token count and only send the model a prompt that falls within the limit.

The following code sample shows a simple chat loop example with a technique for handling a 4096 token count using OpenAI's tiktoken library.

The code requires tiktoken 0.3.0. If you have an older version, run pip install tiktoken --upgrade.

import tiktoken
import openai
import os

openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_BASE")  # Your Azure OpenAI resource's endpoint value.
openai.api_key = os.getenv("OPENAI_API_KEY")

system_message = {"role": "system", "content": "You are a helpful assistant."}
max_response_tokens = 250
token_limit = 4096
conversation = []
conversation.append(system_message)

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":  # if there's a name, the role is omitted
                num_tokens += -1  # role is always required and always 1 token
    num_tokens += 2  # every reply is primed with <im_start>assistant
    return num_tokens

while True:
    user_input = input("")
    conversation.append({"role": "user", "content": user_input})
    conv_history_tokens = num_tokens_from_messages(conversation)

    # Drop the oldest non-system message until the request fits.
    while conv_history_tokens + max_response_tokens >= token_limit:
        del conversation[1]
        conv_history_tokens = num_tokens_from_messages(conversation)

    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # The deployment name you chose when you deployed the ChatGPT or GPT-4 model.
        messages=conversation,
        temperature=.7,
        max_tokens=max_response_tokens,
    )

    conversation.append({"role": "assistant", "content": response['choices'][0]['message']['content']})
    print("\n" + response['choices'][0]['message']['content'] + "\n")

In this example, once the token count is reached, the oldest messages in the conversation transcript are removed. del is used instead of pop() for efficiency, and we start at index 1 so as to always preserve the system message and only remove user/assistant messages. Over time, this method of managing the conversation can cause the conversation quality to degrade, as the model gradually loses context of the earlier portions of the conversation.

An alternative approach is to limit the conversation duration to the max token length or a certain number of turns. Once the max token limit is reached and the model would lose context if you were to allow the conversation to continue, you can prompt the user that they need to begin a new conversation and clear the messages array to start a brand new conversation with the full token limit available.
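The turn-limited variant of this approach might be sketched as follows. `TurnLimitedConversation` is a hypothetical class, and the turn cap is an arbitrary value chosen for illustration; the replies in the usage lines are hand-written placeholders rather than model output.

```python
class TurnLimitedConversation:
    """Keeps a messages array and starts over after a fixed number of turns.

    Hypothetical class for illustration; one turn is a user/assistant pair.
    """

    def __init__(self, system_message, max_turns):
        self.system_message = system_message
        self.max_turns = max_turns
        self.reset()

    def reset(self):
        """Clear the transcript, keeping only a copy of the system message."""
        self.messages = [dict(self.system_message)]
        self.turns = 0

    def is_full(self):
        return self.turns >= self.max_turns

    def record_turn(self, user_input, assistant_reply):
        if self.is_full():
            raise RuntimeError("Turn limit reached; call reset() to start over.")
        self.messages.append({"role": "user", "content": user_input})
        self.messages.append({"role": "assistant", "content": assistant_reply})
        self.turns += 1


session = TurnLimitedConversation(
    {"role": "system", "content": "You are a helpful assistant."}, max_turns=2
)
session.record_turn("Hi", "Hello! How can I help?")
session.record_turn("Thanks", "You're welcome.")
if session.is_full():
    print("This conversation has reached its limit. Starting a new one.")
    session.reset()
```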

The token counting portion of the code demonstrated previously is a simplified version of one of OpenAI's cookbook examples.

Working with the ChatGPT models

Important

Using GPT-35-Turbo models with the Completion endpoint remains in preview. Because the underlying ChatML syntax may change, we strongly recommend using the Chat Completion API instead. It is the recommended method of interacting with the ChatGPT (gpt-35-turbo) models and the only way to access the GPT-4 models.

The following code snippet shows the most basic way to use the ChatGPT models with ChatML. If this is your first time using these models programmatically we recommend starting with our .

import os
import openai

openai.api_type = "azure"
openai.api_base = "https://{your-resource-name}.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
    engine="gpt-35-turbo",  # The deployment name you chose when you deployed the ChatGPT model
    prompt="<|im_start|>system\nAssistant is a large language model trained by OpenAI.\n<|im_end|>\n<|im_start|>user\nWho were the founders of Microsoft?\n<|im_end|>\n<|im_start|>assistant\n",
    temperature=0,
    max_tokens=500,
    top_p=0.5,
    stop=["<|im_end|>"]
)

print(response['choices'][0]['text'])

Note

The following parameters aren't available with the gpt-35-turbo model: logprobs, best_of, and echo. If you set any of these parameters, you'll get an error.

The <|im_end|> token indicates the end of a message. We recommend including the <|im_end|> token as a stop sequence to ensure that the model stops generating text when it reaches the end of the message. You can read more about the special tokens in the Chat Markup Language (ChatML) section.

Consider setting max_tokens to a slightly higher value than normal such as 300 or 500. This ensures that the model doesn't stop generating text before it reaches the end of the message.

Model versioning

Note

gpt-35-turbo is equivalent to the gpt-3.5-turbo model from OpenAI.

Unlike previous GPT-3 and GPT-3.5 models, the gpt-35-turbo model as well as the gpt-4 and gpt-4-32k models will continue to be updated. When creating a deployment of these models, you'll also need to specify a model version.

Currently, only version 0301 is available for ChatGPT. We'll continue to make updated versions available in the future. You can find model deprecation times on our models page.

Working with Chat Markup Language (ChatML)

Note

OpenAI continues to improve ChatGPT, and the Chat Markup Language used with the models will continue to evolve in the future. We'll keep this document updated with the latest information.

OpenAI trained ChatGPT on special tokens that delineate the different parts of the prompt. The prompt starts with a system message that is used to prime the model, followed by a series of messages between the user and the assistant.

The format of a basic ChatML prompt is as follows:

<|im_start|>system
Provide some context and/or instructions to the model.
<|im_end|>
<|im_start|>user
The user’s message goes here
<|im_end|>
<|im_start|>assistant

System message

The system message is included at the beginning of the prompt between the <|im_start|>system and <|im_end|> tokens. This message provides the initial instructions to the model. You can provide various information in the system message including:

  • A brief description of the assistant
  • Personality traits of the assistant
  • Instructions or rules you would like the assistant to follow
  • Data or information needed for the model, such as relevant questions from an FAQ

You can customize the system message for your use case or just include a basic system message. The system message is optional, but it's recommended to at least include a basic one to get the best results.

Messages

After the system message, you can include a series of messages between the user and the assistant. Each message should begin with the <|im_start|> token followed by the role (user or assistant) and end with the <|im_end|> token.

<|im_start|>user
What is thermodynamics?
<|im_end|>

To trigger a response from the model, the prompt should end with the <|im_start|>assistant token, indicating that it's the assistant's turn to respond. You can also include messages between the user and the assistant in the prompt as a way to do few shot learning.
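A conversation stored as role and content pairs can be rendered into this format with a few lines of code. The sketch below is an assumption about one way to do it: `to_chatml` is a hypothetical helper that follows the newline placement used in the code sample earlier in this section and always ends with the assistant cue.

```python
def to_chatml(messages):
    """Render role/content dicts as a ChatML prompt ending with the assistant cue.

    Hypothetical helper; each message becomes <|im_start|>role, content, <|im_end|>.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
    {"role": "user", "content": "What is thermodynamics?"},
])
print(prompt)
```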

Prompt examples

The following section shows examples of different styles of prompts that you could use with the ChatGPT models. These examples are just a starting point, and you can experiment with different prompts to customize the behavior for your own use cases.

Basic example

If you want the ChatGPT models to behave similarly to chat.openai.com, you can use a basic system message like "Assistant is a large language model trained by OpenAI."

<|im_start|>system
Assistant is a large language model trained by OpenAI.
<|im_end|>
<|im_start|>user
Who were the founders of Microsoft?
<|im_end|>
<|im_start|>assistant

Example with instructions

For some scenarios, you may want to give additional instructions to the model to define guardrails for what the model is able to do.

<|im_start|>system
Assistant is an intelligent chatbot designed to help users answer their tax related questions.

Instructions:
- Only answer questions related to taxes.
- If you're unsure of an answer, you can say "I don't know" or "I'm not sure" and recommend users go to the IRS website for more information.
<|im_end|>
<|im_start|>user
When are my taxes due?
<|im_end|>
<|im_start|>assistant

Using data for grounding

You can also include relevant data or information in the system message to give the model extra context for the conversation. If you only need to include a small amount of information, you can hard code it in the system message. If you have a large amount of data that the model should be aware of, you can use embeddings or a product like Azure Cognitive Search to retrieve the most relevant information at query time.

<|im_start|>system
Assistant is an intelligent chatbot designed to help users answer technical questions about Azure OpenAI Service. Only answer questions using the context below and if you're not sure of an answer, you can say "I don't know".

Context:
- Azure OpenAI Service provides REST API access to OpenAI's powerful language models including the GPT-3, Codex and Embeddings model series.
- Azure OpenAI Service gives customers advanced language AI with OpenAI GPT-3, Codex, and DALL-E models with the security and enterprise promise of Azure. Azure OpenAI co-develops the APIs with OpenAI, ensuring compatibility and a smooth transition from one to the other.
- At Microsoft, we're committed to the advancement of AI driven by principles that put people first. Microsoft has made significant investments to help guard against abuse and unintended harm, which includes requiring applicants to show well-defined use cases, incorporating Microsoft’s principles for responsible AI use
<|im_end|>
<|im_start|>user
What is Azure OpenAI Service?
<|im_end|>
<|im_start|>assistant

Few shot learning with ChatML

You can also give few shot examples to the model. The approach for few shot learning has changed slightly because of the new prompt format. You can now include a series of messages between the user and the assistant in the prompt as few shot examples. These examples can be used to seed answers to common questions to prime the model or teach particular behaviors to the model.

This is only one example of how you can use few shot learning with ChatGPT. You can experiment with different approaches to see what works best for your use case.

<|im_start|>system
Assistant is an intelligent chatbot designed to help users answer their tax related questions.
<|im_end|>
<|im_start|>user
When do I need to file my taxes by?
<|im_end|>
<|im_start|>assistant
In 2023, you will need to file your taxes by April 18th. The date falls after the usual April 15th deadline because April 15th falls on a Saturday in 2023. For more details, see https://www.irs.gov/filing/individuals/when-to-file
<|im_end|>
<|im_start|>user
How can I check the status of my tax refund?
<|im_end|>
<|im_start|>assistant
You can check the status of your tax refund by visiting https://www.irs.gov/refunds
<|im_end|>

Using Chat Markup Language for non-chat scenarios

ChatML is designed to make multi-turn conversations easier to manage, but it also works well for non-chat scenarios.

For example, for an entity extraction scenario, you might use the following prompt:

<|im_start|>system
You are an assistant designed to extract entities from text. Users will paste in a string of text and you will respond with entities you've extracted from the text as a JSON object. Here's an example of your output format:
{
   "name": "",
   "company": "",
   "phone_number": ""
}
<|im_end|>
<|im_start|>user
Hello. My name is Robert Smith. I’m calling from Contoso Insurance, Delaware. My colleague mentioned that you are interested in learning about our comprehensive benefits policy. Could you give me a call back at (555) 346-9322 when you get a chance so we can go over the benefits?
<|im_end|>
<|im_start|>assistant

Preventing unsafe user inputs

It's important to add mitigations into your application to ensure safe use of the Chat Markup Language.

We recommend that you prevent end-users from being able to include special tokens in their input such as <|im_start|> and <|im_end|>. We also recommend that you include additional validation to ensure the prompts you're sending to the model are well formed and follow the Chat Markup Language format as described in this document.

You can also provide instructions in the system message to guide the model on how to respond to certain types of user inputs. For example, you can instruct the model to only reply to messages about a certain subject. You can also reinforce this behavior with few shot examples.

Managing conversations

The token limit for gpt-35-turbo is 4096 tokens. This limit includes the token count from both the prompt and completion. The number of tokens in the prompt combined with the value of the max_tokens parameter must stay under 4096 or you'll receive an error.

It’s your responsibility to ensure the prompt and completion fall within the token limit. This means that for longer conversations, you need to keep track of the token count and only send the model a prompt that falls within the token limit.

The following code sample shows a simple example of how you could keep track of the separate messages in the conversation.

import os
import openai

openai.api_type = "azure"
openai.api_base = "https://{your-resource-name}.openai.azure.com/"  # This corresponds to your Azure OpenAI resource's endpoint value
openai.api_version = "2023-05-15"
openai.api_key = os.getenv("OPENAI_API_KEY")

# defining a function to create the prompt from the system message and the conversation messages
def create_prompt(system_message, messages):
    prompt = system_message
    for message in messages:
        prompt += f"\n<|im_start|>{message['sender']}\n{message['text']}\n<|im_end|>"
    prompt += "\n<|im_start|>assistant\n"
    return prompt

# defining the user input and the system message
user_input = "<your user input>"
system_message = f"<|im_start|>system\n{'<your system message>'}\n<|im_end|>"

# creating a list of messages to track the conversation
messages = [{"sender": "user", "text": user_input}]

response = openai.Completion.create(
    engine="gpt-35-turbo",  # The deployment name you chose when you deployed the ChatGPT model.
    prompt=create_prompt(system_message, messages),
    temperature=0.5,
    max_tokens=250,
    top_p=0.9,
    frequency_penalty=0,
    presence_penalty=0,
    stop=['<|im_end|>']
)

messages.append({"sender": "assistant", "text": response['choices'][0]['text']})
print(response['choices'][0]['text'])

Staying under the token limit

The simplest approach to staying under the token limit is to remove the oldest messages in the conversation when you reach the token limit.

You can choose to always include as many tokens as possible while staying under the limit or you could always include a set number of previous messages assuming those messages stay within the limit. It's important to keep in mind that longer prompts take longer to generate a response and incur a higher cost than shorter prompts.
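The two strategies described above might look like the following sketch. It assumes messages are dictionaries with sender and text keys, as in the earlier code sample, and that the system message is handled separately. `keep_last_n`, `keep_within_budget`, and `rough_count` are hypothetical names, and `rough_count` is a crude word-count stand-in for a real tokenizer.

```python
def keep_last_n(messages, n):
    """Keep only the n most recent messages."""
    return messages[-n:] if n > 0 else []


def keep_within_budget(messages, count_tokens, budget):
    """Keep as many of the newest messages as fit within the token budget.

    `count_tokens` is any callable that returns a token count for one message.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


def rough_count(message):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(message["text"].split())


history = [
    {"sender": "user", "text": "one two three"},
    {"sender": "assistant", "text": "four five"},
    {"sender": "user", "text": "six"},
]
print(keep_last_n(history, 1))
print(keep_within_budget(history, rough_count, budget=3))
```

The budget passed to `keep_within_budget` would normally be the model's token limit minus the system message, the max_tokens value, and any per-message overhead.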

You can estimate the number of tokens in a string by using the tiktoken Python library as shown below.

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    name="gpt-35-turbo",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265
    }
)

tokens = enc.encode(
    "<|im_start|>user\nHello<|im_end|><|im_start|>assistant",
    allowed_special={"<|im_start|>", "<|im_end|>"}
)

assert len(tokens) == 7
assert tokens == [100264, 882, 198, 9906, 100265, 100264, 78191]

Article information

Author: Wyatt Volkman LLD

Last Updated: 05/29/2023
