Most of us have heard about the advances of GPT (Generative Pre-trained Transformer) models that have finally delivered on the promise that AI evangelists like me have been preaching for around 10 years now. If I look at my own way of working, almost 70% of my daily work now involves some form of generative AI. When it comes to “soft” things like writing marketing texts, sales e-mails, or product descriptions, I can’t really be bothered to write them on my own anymore. I just feed it bullet points and make sure to validate the outcome. I’m gonna come out and say this: this text WAS NOT written by ChatGPT. I enjoy writing about things I’m passionate about, which means you can enjoy all of my grammar mistakes that ChatGPT would never make.
Recently, however, I was talking to a data scientist working on computer vision object detection, and she was telling me how hard it still is to collect data for any random object. Sure, if you’re detecting people, there are hundreds of datasets available, but for super-specific things you still have to suffer through the data collection process.
This got me thinking that I have not felt the pain of data collection for the last 3–4 months, and how alien this problem felt to me. You see, ever since the release of GPT-3, DALL-E 2, Midjourney, and the first wave of image generation tools, I’ve been running a semi-secret project inside of Fyma. Let’s call it ObjectX as it doesn’t have a name… also, I’ve been calling it ObjectX.
The goal of ObjectX is to utilize Fyma’s high-performance computer vision pipeline (read more about this at Nvidia) to allow our customers to detect ANY object by just uploading 10 images of that object. Essentially, this is how it works:
- Fyma gets 10 images of the object to be detected.
- We automatically label them using our own algorithms (with future plans to hopefully use SAM, https://segment-anything.com/).
- We generate around 2000 images using generative AI models.
- We automatically train a computer vision object detection model.
- We can immediately deploy that model with Fyma’s computer vision pipeline.
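To make the numbers in the steps above a bit more concrete, here is a minimal sketch of the bookkeeping behind the generation step: 10 uploaded images and a target of around 2000 generated ones works out to about 200 variants per upload. This is not Fyma’s actual code. The names `GenerationJob` and `plan_generation` and the file names are made up for illustration, and the even split is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GenerationJob:
    seed_image: str    # path of an uploaded, auto-labeled image (hypothetical)
    num_variants: int  # how many generated images to request for it


def plan_generation(seed_images, target_total=2000):
    """Split a target number of generated images evenly across seed images."""
    if not seed_images:
        raise ValueError("at least one seed image is required")
    base, extra = divmod(target_total, len(seed_images))
    # The first `extra` seeds get one more variant so the totals add up exactly.
    return [
        GenerationJob(path, base + (1 if i < extra else 0))
        for i, path in enumerate(seed_images)
    ]


# Hypothetical upload of 10 images, as in the steps above.
seeds = [f"upload_{i:02d}.jpg" for i in range(10)]
jobs = plan_generation(seeds)
print(sum(job.num_variants for job in jobs))  # 2000
print(jobs[0].num_variants)                   # 200
```

Each job would then be handed to whatever image generator is in use; that part is deliberately left out here.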
What does this mean for our customers? Essentially, if anyone wants to detect any object and create any automation around it, all they need to do is:
- Buy an IP camera and add the video stream to Fyma (an example of the platform can be seen here).
- Take 10 images of the object they want to detect.
- Upload those 10 images to Fyma.
- Generate their own computer vision model.
- Define their automation rules.
- Profit.
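What an “automation rule” from the list above might look like is easiest to show with a toy example. The sketch below is an assumption about the general shape of such a rule, not Fyma’s rule engine: `Detection` and `Rule` are hypothetical names, and the rule simply fires when enough confident detections of one label show up in a frame.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # object class reported by the model
    confidence: float  # model score between 0 and 1


@dataclass
class Rule:
    label: str
    min_confidence: float
    min_count: int

    def triggered(self, detections):
        """True when enough confident detections of `label` are in the frame."""
        hits = [
            d for d in detections
            if d.label == self.label and d.confidence >= self.min_confidence
        ]
        return len(hits) >= self.min_count


# One made-up frame with two bird detections (one weak) and a cat.
frame = [Detection("bird", 0.91), Detection("bird", 0.42), Detection("cat", 0.88)]
rule = Rule(label="bird", min_confidence=0.6, min_count=1)
print(rule.triggered(frame))  # True
```

In a real setup the rule would be evaluated per frame or over a time window, and a trigger would kick off whatever the customer wants automated.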
When I first started working on this, our first goal was to test how AI-generated images can be used for model training. The first object I tested was my breakfast, a bottle of yogurt I had lying on my desk:
I chose this object because it had a distinct shape and unique graphics.
I took 10 images of the bottle, generated 200, and pushed them into our training pipeline. The results? Well, see for yourself:
Just because AI-generated images are always a crowd-pleaser, here are the input images I used to train our model. Oh, and I didn’t use a single real-world image in my training dataset.
That example was around 3 months ago, so where are we now? Well, we’ve gone from detecting bottles to testing this with real-world customers such as cities and airports, and from single-object models to multi-object models. How’s the accuracy? Well, it’s far below what a real image dataset can provide, but at the same time the model generalizes a lot better. We are combining AI-generated datasets with synthetic datasets built from 3D models and also real-world images, and, most importantly, all of this happens automatically.
An example of this is a recent model that we trained to detect various airport equipment (things like luggage belt vehicles, stair vehicles, etc.). The “airport model”, as we call it, is a good example of where real-world images are difficult to get hold of because of privacy and security concerns. If we generate these images, however, we are all good.
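To give a feel for what mixing AI-generated, 3D-rendered, and real-world images means in practice, here is a small sketch that turns a mixing ratio into image counts for one training set. The 6:3:1 ratio and the name `mix_counts` are purely illustrative assumptions; I’m not sharing our actual proportions here, and for something like the airport model the real-world share can be very small.

```python
def mix_counts(ratios, total):
    """Turn integer mixing ratios into whole image counts that sum to `total`."""
    ratio_sum = sum(ratios.values())
    counts = {name: total * r // ratio_sum for name, r in ratios.items()}
    # Integer division can leave a remainder; hand it to the largest source.
    largest = max(ratios, key=ratios.get)
    counts[largest] += total - sum(counts.values())
    return counts


# Hypothetical 6:3:1 mix of AI-generated, 3D-rendered and real images.
ratios = {"ai_generated": 6, "rendered_3d": 3, "real": 1}
print(mix_counts(ratios, 2000))
# {'ai_generated': 1200, 'rendered_3d': 600, 'real': 200}
```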
I guess the only thing left to show you is all of this working in the real world. Obviously, I can’t show a real-world airport from our product, but I can show a stock video, none of which was used to train the detection algorithm, running in our Fyma platform:
Yeah… so basically now you can detect any object by just taking 10 images of it. I was on an Estonian startup podcast around two years ago where one of the hosts asked if he could detect birds in his garden with Fyma. The answer was no back then. Now, all you have to do is catch those birds with your camera and you’re good to go.
This article is written by Taavi Tammiste, the CTO and Co-Founder of Fyma. It was originally posted on Medium.