
Developing an AI Vision Application with the OpenAI Vision API: The Ippon BinSmart Use Case


Introduction

Artificial Intelligence (AI) has seen remarkable breakthroughs in recent years, transforming from a complex and costly technology into an accessible and increasingly common tool in corporate life. At Ippon Australia, we've harnessed this power to tackle a common problem: determining the correct recycling bin for a given item. With the introduction of new waste-sorting bins at our office, we developed BinSmart, an AI-integrated application that uses OpenAI's vision capabilities to help team members and guests identify the proper waste classification.

Background

At the Ippon Collaboration Hub, we take our recycling seriously. These efforts were catalyzed by the building management, which improved its waste management credentials and provided bins designated for specific waste types. While this initiative is environmentally beneficial, it has caused some confusion. To alleviate this, we created the Ippon BinSmart app, an AI-based solution that analyzes waste images and advises on the most appropriate bin.

The Journey to AI Integration

As AI enthusiasts, the team at Ippon is always looking for useful applications of this rapidly emerging technology, although practical use cases are sometimes hard to identify. While we have team members with AI certifications from Microsoft and AWS, we ultimately turned to OpenAI's latest flagship model, GPT-4o, for its impressive capabilities. This state-of-the-art model reportedly surpasses existing models in vision tasks, offering competitive pricing and exceptional performance in text understanding and response correctness. As a result, it significantly streamlined and accelerated our development process.

Experimenting with a Custom GPT in ChatGPT

To start, we looked at using a custom GPT in OpenAI's ChatGPT. ChatGPT is a chatbot developed by OpenAI that has attracted millions of users worldwide. Based on large language models, it lets users get answers in conversational form and refine responses with follow-up prompts. Within ChatGPT, you can configure your own custom GPT that responds according to your configuration, with no coding required. Here is the basis of what we created:

We began with simple tests, such as identifying common waste items, and gradually moved to more complex scenarios. The results were very promising; GPT-4o could not only identify the item but also provide context about its recyclability.

A Simple Test Case

[Screenshot of a simple test case]

A Complex Test Case

[Screenshot of a complex test case]

We ran more test cases, and our confidence grew in the model's ability to analyze the condition of items; for example, it could determine whether an oily container is recyclable, which was particularly impressive.

Development Stage

Using the OpenAI Vision API

It was disappointing to discover that custom GPTs are not accessible via the API. The Assistants API, which serves as an alternative to custom GPTs, doesn't integrate with the Vision API in the current OpenAI offerings. As a result, we focused on using the Vision API directly to complete this task.

According to OpenAI’s documentation, OpenAI’s latest model, GPT-4o, has vision capabilities, meaning it can analyze image input and respond based on your prompt. To access the Vision API, we needed to sign up for OpenAI's pay-as-you-go plan, which is as simple as providing payment card information to activate the service.

Within the OpenAI platform, we created a project and generated an API key in the project dashboard. After setting up our account, we needed to install the necessary libraries.

For Python:

    pip install openai

For Node:

    npm install openai

In this blog, we will only use the Node version of the library. To use the Vision API, we implemented a request along the lines sketched below.

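A minimal sketch of that request, assuming the openai Node library's chat.completions.create method; the prompt and base64Image values here are illustrative placeholders:

    import OpenAI from "openai";

    // The client reads OPENAI_API_KEY from the environment by default
    const openai = new OpenAI();

    // Illustrative placeholders: the real values come from the app
    const prompt = "Please analyze the following image..."; // classification prompt, shown below
    const base64Image = "..."; // base64-encoded frame from the device camera

    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: prompt },
            {
              type: "image_url",
              image_url: { url: `data:image/jpeg;base64,${base64Image}` },
            },
          ],
        },
      ],
    });

    console.log(response.choices[0].message.content);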

In the code above, we added the prompt under the ‘text’ type and the image under the ‘image_url’ type, where the image URL carries a base64-encoded string captured from the device's camera in the browser.
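For reference, a frame can be captured from the live video element and encoded as a base64 data URL via a canvas. A sketch, where the video element selector is an assumption:

    // Grab the current frame from a <video> element as a base64 data URL
    const captureFrame = (video) => {
      const canvas = document.createElement("canvas");
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      canvas.getContext("2d").drawImage(video, 0, 0);
      return canvas.toDataURL("image/jpeg"); // "data:image/jpeg;base64,..."
    };

    // Strip the data-URL prefix if the API call expects only the raw base64 string
    const base64Image = captureFrame(document.querySelector("video")).split(",")[1];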

Defining the Categories

We chose GPT-4o and crafted a prompt to include our five bin categories:

  1. Recycling Bin: Tin-coated steel cans, milk bottles, wine bottles, juice bottles, and clean plastic takeaway containers without the 10-cent mark.
  2. Rubbish Bin: General rubbish, food waste, and other kinds of rubbish that are not suitable for recycling.
  3. CDS Bin: Aluminium cans, small glass bottles, plastic bottles, juice cartons, and containers that come with the 10-cent mark.
  4. Paper and Cardboard Recycling Bin: Clean paper and clean cardboard.
  5. Soft Plastics Bin: Clean soft plastic.

We included this context in the prompt and defined the desired response format:

            {
                type: "text",
                text: `
                Please analyze the following image and classify the object into one of the specified bin categories. Based on the categories provided, determine which bin the item should be placed in. Categories are:
                - Recycling Bin: Tin-coated steel cans, Milk bottles, Wine bottles, Juice bottles, Clean plastic takeaway containers without the 10-cent mark
                - Rubbish Bin: General rubbish, food waste, and other kinds of rubbish that are not suitable for recycling.
                - CDS Bin: Aluminium Cans, Glass Bottles, Plastic Bottles, Cartons, Containers that come with the 10-cent mark
                - Paper and Cardboard Recycling Bin: Clean paper and Clean Cardboard
                - Soft Plastics Bin: Clean soft plastic

                Respond with a JSON object in the following format:
                \`\`\`json
                {
                  "binCategory": "category_name",
                  "confidenceRate": "confidence_rate%",
                  "explanation": "detailed_explanation"
                }
                \`\`\`
                `,
              },

In the above, ‘binCategory’ represents the bin the item should go in, ‘confidenceRate’ is the model's confidence in the selected bin category, and ‘explanation’ is how the model explains its answer. This information is presented to our users on the client side by our software.
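For example, a well-formed response for an empty aluminium can might look like this (a hypothetical illustration, not actual model output):

                {
                  "binCategory": "CDS Bin",
                  "confidenceRate": "95%",
                  "explanation": "The image shows an aluminium drink can, which carries the 10-cent mark and belongs in the CDS Bin."
                }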

During performance testing, we noticed that the response was unstable and often contained invalid JSON objects, even though it did provide the required information. This suggested that we needed the model to establish a stronger context, both for well-formed JSON and for the specific parameters our application needs. Therefore, we extended the prompt as below:

                Respond with a JSON object in the following format:
                \`\`\`json
                {
                  "binCategory": "category_name",
                  "confidenceRate": "confidence_rate%",
                  "explanation": "detailed_explanation"
                }
                \`\`\`
                Make sure the response is a valid JSON string and only contains the JSON object.

This modified prompt did improve the response, reducing the frequency of invalid JSON; however, the problem persisted. Therefore, we had to process the object returned from the API call to ensure the correct data format.

Handling Invalid JSON Response

To ensure the expected data format, we needed to further process the data from the API response. As a result, we included a function to parse the response, as shown below:

const sanitizeJsonString = (str) => {
    // Trim surrounding whitespace (spaces, tabs, and newline characters)
    let sanitized = str.trim();

    // Strip a leading ```json (or bare ```) fence and any trailing backticks
    sanitized = sanitized.replace(/^`{0,3}(?:json)?`{0,3}|`{1,3}$/g, "").trim();

    // Remove escaped newline sequences and literal newlines, which can break JSON.parse
    sanitized = sanitized.replace(/\\n/g, "");
    sanitized = sanitized.replace(/(\r\n|\n|\r)/gm, "");
    return sanitized;
};

This function weeds out non-JSON content and common formatting errors from the response, producing cleaner and more dependable output. As a result, we can provide accurate and usable data to the front end, enhancing overall application performance and user experience.
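For illustration, the sanitized string can then be parsed defensively (the rawResponse name here is hypothetical):

    let result;
    try {
      result = JSON.parse(sanitizeJsonString(rawResponse));
    } catch (err) {
      // Fall back gracefully, e.g. ask the user to retry the scan
      console.error("Could not parse model response:", err);
    }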

Additional Features and Enhancements

Handling User Inactivity

As this is an image-based application, the device camera is critical; however, it is desirable to turn the camera off when not in use. To achieve this, we added a timer on the client page that starts counting down once the page loads and resets whenever the user takes an action, as sketched below. After a period of inactivity, the user is directed to an idle page, optimizing resource usage and enhancing the user experience. To design this page, we enlisted some help from ChatGPT.
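A minimal sketch of such an inactivity timer, assuming a videoStream variable holding the active camera stream and an /idle route (both hypothetical):

    const IDLE_TIMEOUT_MS = 60 * 1000; // one minute of inactivity

    const goIdle = () => {
      // Stop all camera tracks before navigating to the idle page
      videoStream?.getTracks().forEach((track) => track.stop());
      window.location.href = "/idle";
    };

    let idleTimer = setTimeout(goIdle, IDLE_TIMEOUT_MS);

    const resetIdleTimer = () => {
      clearTimeout(idleTimer);
      idleTimer = setTimeout(goIdle, IDLE_TIMEOUT_MS);
    };

    // Any user interaction resets the countdown
    ["click", "touchstart", "keydown"].forEach((event) =>
      document.addEventListener(event, resetIdleTimer)
    );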

Generating Hero Images with ChatGPT

ChatGPT's image generation function has proven quite effective, generating images from prompts within a minute, though there is still plenty of room for improvement. It helped us create visually appealing and functional hero images, which can be customized to align with brand identity and user preferences. Please find below the imagery generated by ChatGPT:

[Hero images generated by ChatGPT]

Final Product

The above images demonstrate a typical use of BinSmart: users are taken to the check-item page once they click the ‘Try BinSmart Now’ button. Their camera then turns on automatically, provided they grant the browser permission to access it. Users can switch between the front and back cameras, if both are available, using the ‘Switch Camera’ button. Once an image is captured, BinSmart starts processing the request, and a result table appears below the image, detailing the bin category for the item. If no action is taken for more than one minute, the check-item page automatically redirects to the idle page, turning off the camera.
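Camera switching of this kind is typically done with the browser's MediaDevices API. A sketch, where the video element selector and variable names are assumptions:

    let videoStream;
    let facingMode = "environment"; // back camera by default

    const startCamera = async () => {
      videoStream = await navigator.mediaDevices.getUserMedia({
        video: { facingMode },
      });
      document.querySelector("video").srcObject = videoStream;
    };

    const switchCamera = async () => {
      facingMode = facingMode === "environment" ? "user" : "environment";
      videoStream?.getTracks().forEach((track) => track.stop()); // release the current camera
      await startCamera();
    };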

Conclusion

Creating BinSmart was an educational journey into AI integration. By leveraging OpenAI's powerful Vision API and implementing error handling, we developed a practical solution to support the recycling efforts at the Ippon Australia Collaboration Hub. OpenAI APIs turned the development of core components of the AI application into a process of prompt engineering. While the initial results were promising, continuous testing and prompt refinement were necessary to achieve reliable performance. 

Utilizing OpenAI APIs greatly reduces the time required to develop such an application, from years to just weeks: it saves us from training and hosting models, and it significantly reduces ongoing running costs, since we are charged by API consumption instead of running time. Nonetheless, such a development process requires patience and persistence from first-timers.

Beyond BinSmart, we look forward to exploring other promising fields and building partnerships with clients to harness the power of AI to improve customer experience and drive sustainable development.

Post by Patrick Li
Aug 28, 2024 7:44:46 AM
