You Only Look Once: Unified, Real-Time Object Detection


“My name is Joe, and I'm going to be presenting YOLO, a real-time object detector. This is joint work with Ross Girshick, Santosh Divvala, and Ali Farhadi.

So as you've already heard in the previous presentations, object detection is simply this: you have an image, you want to put boxes around the things in that image, and you want to say what those objects are. Humans do object detection really easily, but this is something that computers have struggled with historically for a long time. There have been a lot of advances recently in making detection more accurate.

However, object detection is still pretty slow. For a long time, deformable parts models were the gold standard in object detection, and they took about 14 seconds to process an image. A couple of years ago, R-CNN came out and offered a huge boost in terms of accuracy, but an even longer processing time: about 20 seconds per image. Now imagine using one of these detectors in an application, say a self-driving vehicle. If you have a car going down the highway at 60 miles per hour, in 20 seconds your car is going to go about a third of a mile, and this is the time between when your car sees an object and when your car realizes that object is in front of it. Obviously a third of a mile isn't going to cut it.

So we need something a little bit faster, and a lot of recent work has focused on making R-CNN faster. Fast R-CNN offered some improvement in accuracy and also brought processing time down to two seconds per image. This is a lot better, but our car is still going about half a football field in that time.

Faster R-CNN, which was developed concurrently with YOLO, brings detection speed to 140 milliseconds per image. This is also better: now we're at 7 frames per second, and our car is only going to go 12 feet in this amount of processing time. That's obviously better than a third of a mile, but still enough time to plow through a similarly sized vehicle in front of it. So I'm just going to do a little demo here of what these detection speeds actually look like in practice.
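The distances quoted above follow from simple arithmetic. The following is a minimal sketch of that calculation; the function name `feet_travelled` and the dictionary layout are illustrative assumptions, while the speed and per-image latencies are the figures quoted in the talk.

```python
# Distance a car covers while a detector is still processing one image.
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def feet_travelled(speed_mph, latency_s):
    """Return the feet covered at speed_mph during latency_s seconds."""
    feet_per_second = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return feet_per_second * latency_s


# Per-image processing times quoted in the talk, in seconds.
latencies = {"R-CNN": 20.0, "Fast R-CNN": 2.0, "Faster R-CNN": 0.14}

for name, latency in latencies.items():
    print(f"{name}: {feet_travelled(60, latency):.1f} ft")
```

At 60 miles per hour the car moves 88 feet every second, so 20 seconds is 1760 feet (a third of a mile), 2 seconds is 176 feet, and 140 milliseconds is a little over 12 feet.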

It's important to get a feel for them. So this is a detector running at 2 seconds per frame, the same speed as Fast R-CNN. Obviously there's a lot of delay between frames; you can see, as I move around, there's a lot of latency. If we increase the speed to Faster R-CNN, this is running at about seven frames per second. There's still a little bit of lag, but it's a little bit more continuous, which is nice.

But there's still some latency between frames and the video is pretty choppy. What we really wanted was something smoother. So this is YOLO running in real time on my laptop, and you can see it automatically tracks me as I move around the frame.

And it's running at real-time speeds just on this laptop. So YOLO actually runs at more than real time, at 45 frames per second, or 22 milliseconds per image. This speed comes at a little bit of a price in terms of accuracy. We get about 63 mean AP on Pascal VOC, as opposed to Fast and Faster R-CNN, which are in the 70s, and even higher than that, as you recently heard. However, since acceptance, we've actually gotten this number up a little bit further, to 69 mean average precision.

And we expect that there are some more advances that we could make. To get detection speeds to be this fast, we actually had to rethink the standard object detection system. When we started this project, there were two main object detection frameworks: deformable parts models and R-CNN. With deformable parts models, as people are probably familiar with, you run a classifier as a sliding window over an image, and high classification scores correspond to detections. R-CNN, instead of using a sliding window, first extracts region proposals using selective search and then classifies those regions with a more powerful, CNN-based classifier.

Our big insight was that both of these methods use region-based classifiers, and they look at an image thousands or hundreds of thousands of times to perform detection. This involves evaluating these classifiers over and over again on different parts of the image. What we really wanted was one neural network that you could just give the full image and get detections out of in a single pass.

The advantage of this is that instead of looking at an image thousands or hundreds of thousands of times to do detection, with YOLO you only look once at an image to perform the full detection pipeline. To be able to train this neural network, we had to come up with a new parameterization for object detection. So suppose you have an image that you want to perform detection on.

First, we imagine overlaying a grid on top of that image, and each cell in this grid is going to be responsible for predicting a few different things. The first thing is that each cell is going to predict some number of bounding boxes.

So for example, this cell in the upper right is going to predict some bounding boxes and also a confidence value for each of those bounding boxes, which is the probability that the box contains an object. And there may be some grid cells that don't have any objects near them.

So this one in the bottom right is going to predict a few bounding boxes. We don't care what they are; we just want their confidence values to be very low, since they don't contain any objects. When you visualize all of these predictions together, you basically have a map of all the objects in the image and a bunch of boxes ranked by their confidence values. So now you know where the objects are in the image, but you don't necessarily know what they are.

So the next thing we're going to do is have each grid cell predict some class probabilities. This is going to look like a coarse segmentation map of the image, so you're going to have things like bicycle and car.

One important feature of this mask is that it's not saying that a grid cell contains that object; it's really a conditional probability. If a grid cell predicts car, it's not saying that there is a car in this grid cell. It's just saying that if there is an object in this grid cell, then that object is a car. So if you take the conditional probabilities and multiply them by the confidence values that we computed earlier, you basically get all of the bounding boxes weighted by their actual probabilities of containing that object. Now we have a bunch of detections. We have a lot of boxes.
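As a minimal sketch of the scoring step just described, together with the thresholding and non-max suppression that the talk turns to next: the function names, the corner box format `(x1, y1, x2, y2)`, the threshold values 0.2 and 0.5, and the choice to suppress only boxes of the same class are all assumptions for illustration, not details given in the talk.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_scores(confidence, class_probs):
    """Weight conditional class probabilities by the box confidence."""
    return {name: p * confidence for name, p in class_probs.items()}


def detect(predictions, score_threshold=0.2, iou_threshold=0.5):
    """predictions: list of (box, confidence, class_probs) tuples.

    Thresholds are assumed values chosen only for this example.
    """
    candidates = []
    for box, confidence, probs in predictions:
        for name, score in class_scores(confidence, probs).items():
            if score >= score_threshold:  # drop low-scoring boxes
                candidates.append((score, name, box))

    # Greedy non-max suppression: keep the best box, then skip any
    # same-class box that overlaps an already kept box too much.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, name, box in candidates:
        if all(k[1] != name or iou(k[2], box) < iou_threshold for k in kept):
            kept.append((score, name, box))
    return kept


predictions = [
    ((10, 10, 50, 50), 0.9, {"car": 0.8, "bicycle": 0.2}),
    ((12, 12, 52, 52), 0.6, {"car": 0.9, "bicycle": 0.1}),
    ((200, 200, 240, 240), 0.05, {"car": 0.5, "bicycle": 0.5}),
]
detections = detect(predictions)
print(detections)
```

In this made-up example the two overlapping car boxes collapse into the higher-scoring one, and the low-confidence box in the empty region never passes the threshold.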

A lot of them have a very low confidence value for any class, so we simply threshold the predictions and perform non-max suppression to get rid of duplicate detections, and we have our full detections for that image.

This parameterization fixes the output size for detection. So now we have this tensor that we're basically trying to predict: we have a grid, and each grid cell is going to predict some bounding boxes and some class probabilities. For example, for Pascal VOC we used a 7×7 grid and two bounding boxes per cell, and there are 20 classes, so this gives us a 7 by 7 by 30 output tensor, or about 1500 outputs. This isn't that many parameters for a neural network to predict; ImageNet is a thousand classes.
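The output size quoted here can be checked with a few lines of arithmetic. This is a sketch under one assumption: that each predicted box contributes five values (four coordinates plus one confidence), which is what makes two boxes and 20 classes add up to a depth of 30. The function name `output_shape` is illustrative.

```python
def output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the prediction tensor for a square grid."""
    # Assumption: each box is 4 coordinates plus 1 confidence value.
    values_per_box = 5
    depth = boxes_per_cell * values_per_box + num_classes
    return (grid_size, grid_size, depth)


shape = output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
total_outputs = shape[0] * shape[1] * shape[2]
print(shape, total_outputs)  # (7, 7, 30) 1470
```

The exact count is 1470, which matches the "about 1500 outputs" mentioned in the talk.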

So we have this output tensor, and all we want to do is train a neural network to predict it. Now, in one pass, we can go from an image straight to this output tensor, which corresponds to the detections for the image. This is very powerful because we are good at evaluating one pass through a neural network. We've essentially sped up the detection pipeline to the point that it can be the same speed as a classification pipeline. It's just one evaluation of this neural network.
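To make the link between the output tensor and the detections concrete, here is a small sketch of how one grid cell's 30 values might be split apart. The ordering used (box values first, class probabilities last) and the name `split_cell` are hypothetical; the talk does not specify how the values are laid out.

```python
def split_cell(cell_values, boxes_per_cell=2, num_classes=20):
    """Split one cell's flat prediction vector into boxes and class probabilities.

    Assumed (hypothetical) ordering:
    [x, y, w, h, confidence] for each box, followed by the class probabilities.
    """
    expected = boxes_per_cell * 5 + num_classes
    if len(cell_values) != expected:
        raise ValueError(f"expected {expected} values, got {len(cell_values)}")

    boxes = []
    for b in range(boxes_per_cell):
        x, y, w, h, confidence = cell_values[b * 5:(b + 1) * 5]
        boxes.append({"coords": (x, y, w, h), "confidence": confidence})

    class_probs = cell_values[boxes_per_cell * 5:]
    return boxes, class_probs


# Dummy vector 0..29 so each position is easy to trace.
boxes, class_probs = split_cell(list(range(30)))
print(boxes)
print(len(class_probs))
```

Running this over all 49 cells of a 7 by 7 grid would yield 98 candidate boxes, each paired with its cell's class probabilities.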

This also means we're predicting all of these detections simultaneously, so our model implicitly incorporates global context in the detection process. It can learn things about which objects tend to co-occur, the relative size and location of objects, and things like that. We want to predict full detections from a single image, which means we also have to train on full images. Let's see how we do that. We're going to get an image, and we're going to get some ground truth labels for that image.

The first thing we want to do is match each ground truth label with the appropriate grid cell, the one that we want to predict that detection at test time. All we do is take the center of the bounding box, and whatever grid cell that center falls into, that grid cell is going to be responsible for predicting that detection. So the first thing we do is adjust that cell's class predictions; in this case, we want it to predict dog. We also have to adjust that cell's bounding box proposals, so we look at the cell's predicted boxes and figure out which one overlaps most with our ground truth label. We adjust that one: we want to increase its confidence and we also want to adjust its coordinates.

We also want to look at the other bounding boxes predicted by that cell and decrease their confidence, since they don't overlap the object. And we're going to have a lot of cells in this image that don't have any ground truth detections overlapping with them, so we just look at all of the bounding boxes for those cells and decrease their confidence as well, since they don't contain any objects. One important thing to note is that we don't want to adjust the class probabilities or coordinates for those bounding boxes, since there aren't any actual ground truth objects in that region.
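The matching described above can be sketched in two small functions: one finds the grid cell containing the center of a ground truth box, the other picks the predicted box with the highest overlap. The function names, the corner box format `(x1, y1, x2, y2)`, and the 448 by 448 image size in the example are assumptions for illustration; only the 7×7 grid comes from the talk.

```python
def responsible_cell(box, image_w, image_h, grid_size=7):
    """Return (row, col) of the grid cell containing the center of box."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so a center exactly on the right or bottom edge stays in the grid.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_box(predicted_boxes, truth):
    """Index of the predicted box that overlaps the ground truth most."""
    return max(range(len(predicted_boxes)),
               key=lambda i: iou(predicted_boxes[i], truth))


# Hypothetical ground truth box for a dog in a 448x448 image.
truth = (100, 150, 300, 400)
cell = responsible_cell(truth, image_w=448, image_h=448)

# Two hypothetical boxes predicted by that cell.
predicted = [(90, 140, 310, 390), (0, 0, 50, 50)]
best = responsible_box(predicted, truth)
print(cell, best)
```

During training, the box at index `best` would have its confidence raised and its coordinates adjusted, while the cell's other box would have its confidence lowered.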

There are no ground truth labels that we want to predict there. So our training of this network is actually pretty straightforward; it corresponds to a lot of standards in the vision community. We pretrain on ImageNet, we use stochastic gradient descent, and we use lots of data augmentation. You can find a lot of details about our training methodology in the paper.

In practice the system works pretty well. It works across a variety of natural images. You can see it does make some mistakes: it thinks the person in the right-hand image is an airplane. We also found some interesting properties of YOLO.

It generalizes really well to new domains. We trained it on natural images and then ran it on artwork, and it still manages, despite changes in texture and things like that, to detect all of the objects that you would normally expect, even with abstract representations of people like The Scream or cubist artwork. We tested YOLO on a couple of standard datasets, training on natural images and testing on artwork, and found it outperformed a lot of other detection methods, like deformable parts models and R-CNN, in this generalization process.

We have also trained YOLO on new datasets. Microsoft COCO has 80 object categories instead of the 20 in Pascal VOC, so we can just check that out quickly. Microsoft COCO has, like I said, 80 classes.

I have to turn on my light, since my webcam can't quite keep up in low-light situations. So this is our model trained on COCO, and I don't know why, but it really likes saying that white shiny things are toilets. And now it recognizes, when I put on my tie, that I have a tie on.

We also have a variety of fun things to play with over here. Let's see, we have some dogs and things like bicycles, sort of standard stuff. We can do things like zebras and giraffes. Now we have this bird, some potted plants with their corresponding vases, and there's a person riding a horse. Oh, and you can also point it at itself and it knows that it's a laptop, but bad things start happening if you go too deep.

So I also just finally want to say that all of this code is available. Our training, testing, and demo code is online and has been for the last year.

So definitely download it and play with it. We're working on a few things in terms of future directions, but one exciting thing is we've been combining this with some work on XNOR networks, the binary version of networks, to make it faster and try to get it to run on smaller things like CPUs or embedded devices. So thank you, and I can ...”

