Format ChatGPT results with PydanticOutputParser

Tackling the variability in ChatGPT outputs requires a methodical and structured approach. Pydantic, a Python library, steps up to this challenge, offering a robust framework for defining and validating structured data models. This ensures that the parsing of ChatGPT's outputs is both consistent and reliable.

Pairing ChatGPT's free-form outputs with Pydantic's strict data models makes AI-generated content more reliable and easier to use in code.

This article shows how Pydantic's data models, combined with LangChain's output parsers, streamline the interpretation and use of ChatGPT's outputs in applications.

Use objects in Python with Pydantic

Pydantic, created by Samuel Colvin, is a widely used Python library for data validation and error reporting. With over 10 million downloads, it's particularly useful in web development and complex data processing, where data structures must be rigorously maintained.

Pydantic's BaseModel serves as a fundamental building block for creating data models, offering a suite of features crucial for robust data handling. It ensures automatic data validation and enforces type annotations, which are vital for maintaining data integrity and consistency.

BaseModel also adeptly handles default values and optional fields, adding flexibility and resilience in data processing. Its capability to parse data from various formats like JSON and support for custom validation rules make it indispensable.

Furthermore, BaseModel promotes code reusability and maintenance, key in complex applications that integrate dynamic AI outputs, such as those from ChatGPT, ensuring accurate and efficient data management.
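
As a minimal sketch of the features just described (type enforcement, default values, optional fields and validation errors), consider the following. The PricePoint model and its fields are hypothetical and exist only for this illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class PricePoint(BaseModel):
    """Hypothetical model, used only for this illustration."""

    name: str
    price: float                # required; must be a float or coercible to one
    currency: str = "USD"       # default value
    note: Optional[str] = None  # optional field


# Compatible input is coerced to the annotated type: "47000" becomes 47000.0.
point = PricePoint(name="Bitcoin", price="47000")
print(point.price, point.currency, point.note)

# Incompatible input raises ValidationError instead of passing through silently.
try:
    PricePoint(name="Bitcoin", price="not a number")
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)
```

The error message lists each offending field, which is what makes a failed parse easy to diagnose.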

To illustrate Pydantic's utility, we'll create a cryptocurrency summary. We're not going to dwell on the veracity of the information returned by the LLM, but rather on the format of the results. The models in Pydantic are defined as follows:

from typing import List

from pydantic import BaseModel

class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float

class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]
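
To see what these two models do with nested data, here is a small sketch that builds a Summary from a plain dictionary. The sample values are taken from the result shown later in this article; the variable names are only illustrative.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float


class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]


raw = {
    "date": "2023-04-05",
    "crypto_currencies": [
        {"name": "Bitcoin", "high": 47000.00, "low": 45000.00},
        {"name": "Ethereum", "high": 3200.00, "low": 3100.00},
    ],
}

# Nested dictionaries are validated and converted into CryptoCurrencySummary objects.
summary = Summary(**raw)
first = summary.crypto_currencies[0]
print(type(first).__name__, first.name, first.high)

# A nested entry with a missing required field ("low") is rejected.
try:
    Summary(date="2023-04-05", crypto_currencies=[{"name": "XRP", "high": 0.80}])
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
```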

The aim here is not to go into detail about how Pydantic works, which could be the subject of an entire article.

Models - Pydantic (data validation using Python type hints)

What we are interested in here is the integration of the Pydantic parser into LangChain.

Parse results with PydanticOutputParser

PydanticOutputParser is a LangChain output parser that converts the text returned by a language model into an instance of a given Pydantic model.

This parser plays a pivotal role in translating outputs from language models like ChatGPT into structured Pydantic data models. It essentially acts as a bridge between the dynamic, often unstructured output of language models and the strictly typed, validated structure expected by Pydantic models.
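
Conceptually, the bridge consists of two steps: extract the JSON from the model's reply, then validate it against the Pydantic model. The sketch below illustrates that idea with the standard library and Pydantic only. It is not LangChain's actual implementation, and parse_reply is a hypothetical helper.

```python
import json
import re
from typing import List

from pydantic import BaseModel


class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float


class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]


def parse_reply(text: str) -> Summary:
    """Hypothetical helper: pull the JSON object out of a reply and validate it."""
    # Replies may surround the JSON with extra text; keep the outermost {...}.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    data = json.loads(match.group(0))
    # Validation happens here: wrong types or missing fields raise an error.
    return Summary(**data)


reply = (
    "Here is the summary:\n"
    '{"date": "2023-04-05", "crypto_currencies": '
    '[{"name": "Bitcoin", "high": 47000.0, "low": 45000.0}]}\n'
    "Hope this helps."
)
parsed = parse_reply(reply)
print(parsed.date, parsed.crypto_currencies[0].name)

# A reply without any JSON object is reported explicitly.
try:
    parse_reply("no json here")
    no_json_detected = False
except ValueError:
    no_json_detected = True
```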

from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

parser = PydanticOutputParser(pydantic_object=Summary)

prompt_template = """
        You're an expert about cryptocurrency.
        Your role is to extract yesterday's high and low about the 10 best cryptocurrencies.
        The date format should be in the format YYYY-MM-DD
                                
        {format_instructions}
    """

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=[],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

The parser's get_format_instructions() method generates a block of instructions, injected into the prompt, that asks ChatGPT to return JSON conforming to our model's schema. For the Summary model, it produces the following text:

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"CryptoCurrencySummary": {"properties": {"name": {"title": "Name", "type": "string"}, "high": {"title": "High", "type": "number"}, "low": {"title": "Low", "type": "number"}}, "required": ["name", "high", "low"], "title": "CryptoCurrencySummary", "type": "object"}}, "properties": {"date": {"title": "Date", "type": "string"}, "crypto_currencies": {"items": {"$ref": "#/$defs/CryptoCurrencySummary"}, "title": "Crypto Currencies", "type": "array"}}, "required": ["date", "crypto_currencies"]}
```
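
The schema embedded in these instructions appears to be derived from the Pydantic model itself. As a hedged sketch, you can inspect a model's JSON schema directly with Pydantic. The fallback to schema() is an assumption made so that the example also runs on Pydantic 1.x, where key names differ slightly (for instance "definitions" instead of "$defs").

```python
import json
from typing import List

from pydantic import BaseModel


class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float


class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]


# Pydantic 2.x exposes model_json_schema(); Pydantic 1.x used schema().
if hasattr(Summary, "model_json_schema"):
    schema = Summary.model_json_schema()
else:
    schema = Summary.schema()

print(json.dumps(schema, indent=2))
```

Compared with this raw schema, the version shown in the instructions above omits the top-level title and type entries.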

All that remains is to send the prompt to the chosen LLM, in this case ChatGPT.

from langchain.chat_models import ChatOpenAI

chain = prompt | ChatOpenAI(temperature=0.1, model="gpt-4-1106-preview")
json = chain.invoke({})

⚠️ I don't know enough about the impact of temperature on the consistency of the result format, so change it with care.

😀 NB: The model has no access to live market data, so the date and values in the example come from its training data, not from yesterday's market. If you really want the latest values, you'll have to go further!

A call to ChatOpenAI returns a message object whose content property holds the response text, which in our case contains the expected structure:

{
  "date": "2023-04-05",
  "crypto_currencies": [
    {
      "name": "Bitcoin",
      "high": 47000.00,
      "low": 45000.00
    },
    {
      "name": "Ethereum",
      "high": 3200.00,
      "low": 3100.00
    },
    {
      "name": "Binance Coin",
      "high": 420.00,
      "low": 400.00
    },
    {
      "name": "XRP",
      "high": 0.80,
      "low": 0.75
    },
    {
      "name": "Cardano",
      "high": 1.20,
      "low": 1.10
    },
    {
      "name": "Solana",
      "high": 110.00,
      "low": 100.00
    },
    {
      "name": "Avalanche",
      "high": 90.00,
      "low": 85.00
    },
    {
      "name": "Polkadot",
      "high": 25.00,
      "low": 23.00
    },
    {
      "name": "Dogecoin",
      "high": 0.15,
      "low": 0.14
    },
    {
      "name": "Shiba Inu",
      "high": 0.000030,
      "low": 0.000028
    }
  ]
}

All that remains is to parse the result with result = parser.parse(json.content), which returns a populated Summary object:

date='2023-04-05' 
crypto_currencies=[CryptoCurrencySummary(name='Bitcoin', high=47000.0, low=45000.0), CryptoCurrencySummary(name='Ethereum', high=3200.0, low=3100.0), CryptoCurrencySummary(name='Binance Coin', high=420.0, low=400.0), CryptoCurrencySummary(name='XRP', high=0.8, low=0.75), CryptoCurrencySummary(name='Cardano', high=1.2, low=1.1), CryptoCurrencySummary(name='Solana', high=110.0, low=100.0), CryptoCurrencySummary(name='Avalanche', high=90.0, low=85.0), CryptoCurrencySummary(name='Polkadot', high=25.0, low=23.0), CryptoCurrencySummary(name='Dogecoin', high=0.15, low=0.14), CryptoCurrencySummary(name='Shiba Inu', high=3e-05, low=2.8e-05)]
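
Once the result is a typed object, its fields can be used like any other Python attributes. The sketch below rebuilds a shortened Summary by hand (two entries from the output above) and computes the gap between high and low for each currency. The names spreads and widest are purely illustrative.

```python
from typing import List

from pydantic import BaseModel


class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float


class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]


# Stand-in for the object returned by the parser, shortened to two entries.
result = Summary(
    date="2023-04-05",
    crypto_currencies=[
        CryptoCurrencySummary(name="Bitcoin", high=47000.0, low=45000.0),
        CryptoCurrencySummary(name="Ethereum", high=3200.0, low=3100.0),
    ],
)

# Attribute access is typed: high and low are floats, so arithmetic just works.
spreads = {coin.name: coin.high - coin.low for coin in result.crypto_currencies}
widest = max(spreads, key=spreads.get)
print(result.date, spreads, widest)
```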

The final result is the following script:

from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from typing import List

class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float

class Summary(BaseModel):
    date: str = Field(description="date of yesterday with the format YYYY-MM-DD")
    crypto_currencies: List[CryptoCurrencySummary] = Field(description="list of 10 best cryptocurrencies summary of yesterday")

parser = PydanticOutputParser(pydantic_object=Summary)

prompt_template = """
        You're an expert about cryptocurrency.
        Your role is to extract yesterday's high and low about the 10 best cryptocurrencies.
        The date format should be in the format YYYY-MM-DD
                                
        {format_instructions}
    """

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=[],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

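# Pipe the prompt into the chat model and run it (the prompt has no input variables).
chain = prompt | ChatOpenAI(temperature=0.1, model="gpt-4-1106-preview")
json = chain.invoke({})
# Validate the JSON text of the response and convert it into a Summary object.
result = parser.parse(json.content)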
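
Parsing only guarantees that the types match the model. It does not check that there really are 10 entries or that the date follows the requested format, because the Field descriptions are hints for the LLM, not constraints. A minimal sketch of additional checks, assuming a hypothetical check_summary helper, could look like this:

```python
from datetime import datetime
from typing import List

from pydantic import BaseModel


class CryptoCurrencySummary(BaseModel):
    name: str
    high: float
    low: float


class Summary(BaseModel):
    date: str
    crypto_currencies: List[CryptoCurrencySummary]


def check_summary(summary: Summary, expected_count: int = 10) -> List[str]:
    """Hypothetical post-parse checks; returns a list of problems (empty if none)."""
    problems = []
    try:
        datetime.strptime(summary.date, "%Y-%m-%d")
    except ValueError:
        problems.append(f"date {summary.date!r} could not be parsed as YYYY-MM-DD")
    if len(summary.crypto_currencies) != expected_count:
        problems.append(
            f"expected {expected_count} entries, got {len(summary.crypto_currencies)}"
        )
    for coin in summary.crypto_currencies:
        if coin.high < coin.low:
            problems.append(f"{coin.name}: high is below low")
    return problems


good = Summary(
    date="2023-04-05",
    crypto_currencies=[CryptoCurrencySummary(name="Bitcoin", high=47000.0, low=45000.0)],
)
bad = Summary(
    date="05/04/2023",
    crypto_currencies=[CryptoCurrencySummary(name="XRP", high=0.75, low=0.80)],
)

good_problems = check_summary(good, expected_count=1)
bad_problems = check_summary(bad)
print(good_problems)
print(bad_problems)
```

Such checks could also be expressed as Pydantic validators on the models themselves; plain functions are used here only to keep the sketch independent of the Pydantic version.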

Conclusion

The combination of Pydantic with LangChain and ChatGPT represents a significant advance in the exploitation of language models for concrete, structured applications.

Pydantic, with its ability to rigorously define and validate data models, provides a reliable framework for dealing with the dynamic and sometimes unpredictable output of ChatGPT.

This marriage of AI and data validation paves the way for more robust, reliable and efficient applications, marking an important step towards the seamless integration of artificial intelligence into complex data processing systems.

To go further:

Pydantic (JSON) parser | 🦜️🔗 Langchain

This output parser allows users to specify an arbitrary JSON schema and