Conf42 Machine Learning 2021 - Online

Object Detection using Transformers and CNNs - A Drone Case Study


Drones with mounted cameras provide significant advantages over fixed cameras for object detection and visual tracking scenarios. Given their growing adoption in the wild and recent advances in computer vision models, many aerial datasets have been introduced.

In this talk, we’ll explore recent advances in object detection, comparing the challenges of natural images with those recorded by drones. Given the successes achieved by pretraining image classifiers on large datasets, and transferring the learned representations, a set of object detectors fine-tuned on publicly available aerial datasets will be presented and explained. We’ll highlight existing libraries that mitigate the cost of training large models from scratch, by including pretrained model weights and model variants found in the literature. Both Convolutional Neural Networks and the newly developed Transformers applied to vision will be covered and compared, outlining the main features of each architecture. The presentation will be accompanied by code snippets for aiding understanding and delivering practical examples.

This is aimed at a general audience familiar with Python. Knowledge of Computer Vision is a plus but not a requirement as we’ll introduce the necessary concepts. We’ll ground the presented model architectures and libraries on the task of object detection applied to aerial datasets and demonstrate that state-of-the-art methods are within everyone’s reach.


  • Eduardo Dixo is a senior data scientist at Continental. He talks about object detectors using CNNs and transformers applied to images recorded by drones. He covers common CNN-based architectures like Faster R-CNN and RetinaNet.
  • The transformer was originally proposed as a sequence-to-sequence model for machine translation, but it has also found its way into computer vision and other tasks. It lacks the inductive biases of CNNs, but given data at a large enough scale it can learn them and perform on par with CNNs or even surpass them.
  • The detection transformer is based on a CNN backbone. It has a fairly good average precision for large objects, but a very low one for small and medium objects. Approaches similar to feature pyramid networks could help improve the detection transformer further.


This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Eduardo Dixo. I'm a senior data scientist at Continental, and today I'm going to talk to you about object detectors using CNNs and transformers applied to images recorded by drones. First I'm going to introduce the task of object detection and also the dataset that we'll be using. Next we'll see some common CNN-based architectures like Faster R-CNN and RetinaNet, before discussing the transformer and seeing how the detection transformer performs on our dataset.
So let's begin by introducing the task of object detection. Object detection can be regarded as follows: given an input image, we want to find all the objects that are present in that image. So we need to spatially locate them using bounding boxes, and we also need to classify them into a set of predefined classes. If we compare object detection with image classification, in image classification we usually have a single main target. In object detection we may have a different number of objects present in each image, with different poses and different scales. This makes the task very challenging, more challenging than image classification.
The dataset that we are going to use is the VisDrone dataset, which contains nearly 6000 training images and 500 validation images. It also contains ten categories, of which we are only interested in cars. We are going to build an object detector for only one class, which will be cars. This dataset is interesting because it records images under different conditions: different weather, different lighting, different object density in the scenes, different scales of the objects. We have some fast motion artifacts because of the movement of the cars or the movement of the drone during flight. The bounding boxes are also annotated for occlusion and truncation.
Training such an object detector could be interesting for applications like road safety, traffic monitoring, or even driving assistance, such as finding free parking slots.
First, we make a distinction between one-stage and two-stage object detectors. Two-stage object detectors contain a region proposal network that outputs high-confidence region proposals that should contain an object. It's not concerned with the class of the object, only with whether there is an object or not. The object detector head, which typically does bounding box regression for finding the position of the object and object classification for finding its class, can then attend to these proposed regions. By doing so it has a much smaller set of candidate regions that might contain an object, and this eliminates many of the false positives that we would have otherwise. A one-stage detector, on the other hand, generates a dense sampling of possible object locations. It generates lots of candidate object locations with different shapes and different aspect ratios, and it processes them directly to learn the class labels and bounding boxes.
The first model that we are going to discuss is Faster R-CNN. Faster R-CNN is a two-stage object detector that employs two modules: a region proposal network, and the classifier head that does the bounding box regression and object classification. We will follow the typical data flow of the image as it goes through the architecture. First, the image goes through the backbone. The goal of the backbone is to extract high-level semantic feature maps from the image that will be useful later for the region proposal network and for the classifier. This can typically be achieved by any off-the-shelf convolutional architecture like ResNet or VGG.
As the image goes through these convolutional layers, it gets downsampled, so it will have smaller width and smaller height but much more depth, meaning that the feature map of the last stage of the backbone will have many channels.
Next we have the region proposal network. The region proposal network predicts the object bounds as well as the objectness scores, meaning whether there is an object or not. It's a fully convolutional network that receives as input the feature maps from the backbone. It slides a window over these feature maps, and at each position of the sliding window it generates k anchor boxes. The number of anchor boxes is parameterized by this k. It has two sibling layers for the outputs: one with two times the number of anchor boxes, for the foreground and background classification scores, and the other with four times the number of anchor boxes, for the bounding box coordinates.
Now we have a set of regions proposed by the region proposal network, and in a very naive way we could simply crop the image using these proposals and feed each crop into another classifier just to get the object class. However, we want to make this end to end and to reuse the feature maps that we have already computed with the backbone. To do so, we map the proposals of the region proposal network onto the feature maps using a region of interest pooling layer, which extracts fixed-size feature maps for each of these proposals. The reason these are fixed size is that we are going to use fully connected layers, which expect a fixed-size input. Then we have the classifier that predicts the object class as well as the bounding box coordinates.
We are going to use the Detectron2 library, which is a PyTorch-based deep learning framework for object detection and also semantic segmentation.
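The anchor mechanism of the region proposal network can be made concrete with a small, library-free sketch. The scales, aspect ratios and feature map size below are illustrative assumptions, not values from the talk; the helper names `make_anchors` and `rpn_output_sizes` are hypothetical.

```python
from itertools import product


def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred on (cx, cy).

    Each anchor is (x1, y1, x2, y2). `ratio` is height / width, and the
    anchor area is scale ** 2 for every ratio.
    """
    anchors = []
    for scale, ratio in product(scales, ratios):
        w = scale / ratio ** 0.5
        h = scale * ratio ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def rpn_output_sizes(feat_h, feat_w, k):
    """Count the values each sibling output predicts over the feature map."""
    positions = feat_h * feat_w  # one sliding-window position per cell
    return {
        "anchors": positions * k,
        "objectness_scores": positions * 2 * k,  # foreground / background
        "box_coordinates": positions * 4 * k,    # 4 offsets per anchor
    }


anchors = make_anchors(cx=100, cy=100)
sizes = rpn_output_sizes(feat_h=38, feat_w=50, k=len(anchors))
print(len(anchors), sizes)
```

With three scales and three ratios, k is 9, so every feature map position contributes 18 objectness scores and 36 box coordinates.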
And we are going to use Faster R-CNN with a ResNet-50 backbone and feature pyramid networks. The reason we are going to use feature pyramid networks is that the objects in our dataset vary a lot in scale: we have small cars and also large cars that we want to detect, depending on the altitude at which the drone is flying. By using feature pyramid networks we can improve multiscale object detection, because the goal of the feature pyramid network is to build high-level semantic feature maps across all the pyramid levels from a single image at a single resolution. This is done by taking the bottom-up pathway, which is the feature maps from our CNN backbone, upsampling them through the top-down pathway, and merging the two through lateral connections.
For training Faster R-CNN, the first step is to register our dataset. We do this so that Detectron2 knows how to obtain it. If we already have the annotations in the COCO JSON format, we can use register_coco_instances directly. In this case we have prepared the annotations in this format, so we can use register_coco_instances, and we also pass the base path of the images so it knows where to fetch the images from.
Next, Detectron2 uses a key-value config system based on YAML files that already provides some common functionality and operations. If we require more advanced features, we can drop down to the Python API, or derive from a base config file and implement the attributes. What we are going to do here is first load the default configuration. We then inherit from the configuration file of the model that we want to fine-tune. We specify the training and test datasets that we registered previously. We specify the number of workers for the data loading, and we load the pretrained model weights from the Detectron2 model zoo.
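A minimal sketch of these registration and configuration steps is shown below. The dataset names, file paths, worker count and the choice of the `3x` model zoo config are assumptions for illustration and are not taken from the talk. The Detectron2 imports are placed inside the functions so that the snippet can be read and loaded without the library installed; calling the functions requires Detectron2.

```python
# Hypothetical dataset names, paths and model choice; replace with your own.
TRAIN_NAME = "visdrone_cars_train"
VAL_NAME = "visdrone_cars_val"
MODEL_YAML = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def register_datasets():
    """Register COCO-format annotations so Detectron2 can find the data."""
    from detectron2.data.datasets import register_coco_instances

    # Arguments: name, metadata, annotation JSON file, image root folder.
    register_coco_instances(TRAIN_NAME, {}, "annotations/train.json", "images/train")
    register_coco_instances(VAL_NAME, {}, "annotations/val.json", "images/val")


def build_cfg():
    """Start from the defaults, inherit the model config, point to our data."""
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()                                             # default configuration
    cfg.merge_from_file(model_zoo.get_config_file(MODEL_YAML))  # inherit from the model
    cfg.DATASETS.TRAIN = (TRAIN_NAME,)
    cfg.DATASETS.TEST = (VAL_NAME,)
    cfg.DATALOADER.NUM_WORKERS = 2                              # assumed value
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(MODEL_YAML)  # pretrained weights
    return cfg
```

In use, `register_datasets()` would be called once before `build_cfg()`, and the returned config then carries any further training settings.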
Then we have the learning rate, the maximum number of iterations, the batch size, and the steps at which to decay the learning rate. All of these are very important parameters that we should tune to get the best metrics, but also to squeeze the best performance out of the GPU. And then we specify the number of classes for this particular architecture, which is one because we are only interested in detecting cars. Finally, we can launch the training using the DefaultTrainer class, which provides standard training logic out of the box. If required, we could also implement our own Python training loop or subclass this DefaultTrainer. Since we are not resuming from a checkpoint, we pass resume=False.
Now we take a look at a one-stage detector. RetinaNet is a powerful one-stage detector that employs the feature pyramid network that we have seen before, which helps with multiscale detection of the objects, and also two sibling subnetworks, one for classification and the other for bounding box regression. One-stage detectors were typically regarded as being faster than two-stage detectors, but they were lagging behind them in accuracy. The authors of RetinaNet attributed this to the high class imbalance between foreground and background. The reason is that, if you remember, these one-stage detectors sample a large set of candidate regions. Many of them will be background, easy negatives, which do not contribute a useful learning signal to the network and can even overwhelm the training loss. So what they propose is a novel loss called the focal loss, which adds a modulating factor to the standard cross entropy. It down-weights the well-classified examples so that the model can focus more on the other examples.
For RetinaNet we are also going to use a ResNet-50 backbone, for comparison with Faster R-CNN. We also use the Detectron2 library for doing so.
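To see how the modulating factor changes the loss, here is a small sketch of the binary focal loss next to the standard cross entropy. The values gamma = 2 and alpha = 0.25 are the defaults reported in the RetinaNet paper, not necessarily the settings used in the talk, and the two example predictions are made up for illustration.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


examples = {
    "easy negative": (0.01, 0),  # background, confidently predicted as background
    "hard positive": (0.10, 1),  # a car predicted with low confidence
}
for name, (p, y) in examples.items():
    ce, fl = cross_entropy(p, y), focal_loss(p, y)
    print(f"{name}: cross entropy={ce:.5f} focal loss={fl:.7f} ratio={fl / ce:.5f}")
```

The easy negative keeps only a tiny fraction of its cross-entropy loss, while the hard positive keeps a much larger share, which is the down-weighting effect described above.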
Registering the dataset requires no changes. Launching the training also requires no changes, but we need to change the configuration. In this case we need to inherit from the appropriate model config, and we also need to load the appropriate weights from the model zoo. For setting the number of classes we need to access a different attribute of the config, which is under the RetinaNet section, the number of classes.
After training both models, we see that they both have good COCO evaluation metrics. We are using the average precision, which basically penalizes missed detections and also having too many duplicate detections for the same object. We see that the average precision is very similar for the two models. In this case RetinaNet is better at detecting larger objects but worse at smaller objects. But if we look at the overall average precision they are very evenly matched, and so are the inference results.
Another thing that is commonly employed in computer vision is data augmentation, for aiding the generalization of the network. The reason is that we want our object detector to work under different lighting, viewpoints, scales, et cetera. So we can define an augmentation policy that bakes in these transformations, and we pass our dataset through this augmentation policy, enriching the dataset that we will then use for training our model. In this example, we can see on the left the augmentation policy used: a horizontal flip and also some random brightness, random saturation and random contrast. To use this augmentation policy, we use a dataset mapper that takes a dataset in the Detectron2 format and maps it into the format that will be used by the model, which is a dictionary with the keys image and instances. So we read the image and we transform it using the augmentation policy that we have defined.
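When an augmentation changes the geometry of the image, the bounding boxes have to be transformed together with the pixels. The following library-independent sketch shows this for a horizontal flip; it is not the Detectron2 API, and the toy image and the helper names `hflip_image` and `hflip_boxes` are assumptions for illustration.

```python
def hflip_image(image):
    """Flip an image, stored as a list of pixel rows, from left to right."""
    return [list(reversed(row)) for row in image]


def hflip_boxes(boxes, image_width):
    """Flip (x1, y1, x2, y2) boxes so they keep covering the same objects."""
    return [(image_width - x2, y1, image_width - x1, y2) for x1, y1, x2, y2 in boxes]


# A 2x4 toy image with one "object" in the last column and a box around it.
image = [[0, 0, 0, 9],
         [0, 0, 0, 9]]
boxes = [(3, 0, 4, 2)]

flipped_image = hflip_image(image)
flipped_boxes = hflip_boxes(boxes, image_width=4)
print(flipped_image, flipped_boxes)
```

Photometric changes such as brightness, saturation or contrast leave the boxes untouched, so only geometric transforms need this extra step.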
We also need to be careful to transform the bounding boxes as well, and then to generate the data in the format the model expects. But we are not limited to using transformations only from Detectron2. We can also integrate external libraries like Albumentations or Kornia. These libraries have a very large collection of transformations that are not readily available in Detectron2, like this random sun flare, which we can also use. One comment is that we used data augmentation for training Faster R-CNN and RetinaNet, but we didn't see improvements, even when training for more iteration steps.
Now we will discuss the transformer. The transformer was originally proposed as a sequence-to-sequence model for machine translation, and it is now a standard in natural language processing. But it has also found its way into computer vision and other tasks. It's a very general-purpose architecture that lacks the inductive biases of CNNs, for example locality and translation invariance. But given data at a large enough scale, it can learn these from the data and perform on par with CNNs or even surpass them.
The vanilla transformer uses an encoder and a decoder. The encoder has two modules, as does the decoder: the multi-head self-attention and the feed-forward network. Around each module we employ a residual connection and also layer normalization. The decoder also uses cross-attention. In cross-attention, the keys and values come from the encoder and the queries come from the decoder.
When we talk about the differences between applying transformers to NLP and to vision, we have differences in scale and in resolution. Scale, because in NLP the words serve as the basic elements of processing, but when we're talking about object detection the objects may vary in scale, so they may be composed of a different number of pixels. And then there is resolution.
Images, for example, are composed of a very large number of pixels.
Since self-attention is so central to the transformer, let's see what makes it appealing when compared to other layers. In the table on the bottom left, T stands for the sequence length and D stands for the representation dimensionality of each element of the sequence. We see that self-attention is more parameter efficient than fully connected layers, and also better at handling variable input sizes. If we compare it to recurrent layers, it's also more efficient if the sequence length is smaller than the representation dimensionality. As for convolutional layers, to achieve a global receptive field, meaning that every pixel interacts with every other pixel, we typically need to stack many of these convolutional layers on top of each other. In self-attention, all parts of the sequence interact with each other within a single layer.
Let's take a look at how self-attention works. Self-attention relates different positions of a sequence to compute a new representation of that sequence. We feed it as input a sequence z, in this case of length T and dimensionality D, and we compute three matrices: the queries, keys and values. We do so by multiplying the input by the matrix U_qkv, whose last dimension is three times the dimension of the head, and slicing along that last dimension. This generates the queries, keys and values for us. Next, we compute the dot product between the queries and keys, so the queries and keys must have the same dimension, and we divide by a scaling factor to alleviate vanishing gradient problems. We apply a softmax in a row-wise manner, and this gives our attention matrix, which has size T by T. So it's quadratic in the length of the input sequence, which is one of the bottlenecks of the transformer.
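The steps of single-head self-attention can be written out in a few lines of plain Python. This is a sketch under simplifying assumptions: it uses three separate projection matrices instead of one combined matrix that is sliced, identity projections, and a made-up input of length T = 3 and dimensionality D = 2. It also includes the final multiplication of the attention matrix by the values.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, w_q, w_k, w_v):
    """Return (output, attention) for an input z of shape T x D."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose of the keys
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    attention = [softmax(row) for row in scores]   # T x T, each row sums to 1
    return matmul(attention, v), attention


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 tokens, D = 2
identity = [[1.0, 0.0], [0.0, 1.0]]       # toy projection weights
output, attention = self_attention(z, identity, identity, identity)
print(len(attention), len(attention[0]), len(output), len(output[0]))
```

The attention matrix has one row and one column per input position, which is where the quadratic cost in the sequence length comes from.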
Then we multiply this by V, our value matrix, to obtain the final result. However, the transformer doesn't use the regular self-attention. It uses a generalization of it, which is called multi-head self-attention. Multi-head self-attention is an extension of self-attention in which we run k self-attention operations in parallel. So we run many self-attentions in parallel, we concatenate them, and then we do a linear projection back to the dimension D, so as not to explode the dimensionality.
Let's now revisit the transformer after having seen how self-attention works. In the original transformer we had an encoder and a decoder, but we can also use only a part of the architecture. For example, architectures that only use the encoder part, like BERT, are important when we only want a global representation of the sequence and we want to build a classifier on top of it, for example for performing sentiment analysis. Architectures that only use a decoder, like GPT-2, are used for language modeling. And we also have architectures that use both the encoder and the decoder, like the detection transformer that we'll see next. Another important fact is that self-attention is invariant to the position of the tokens. So it's very common to add position encodings to the input, so that the model can reason about the positions of the parts of the sequence during the self-attention in the encoder and decoder blocks.
And now we are going to talk about the detection transformer. The detection transformer is a very simple architecture that is based on a CNN backbone and a transformer encoder-decoder. We feed it an image, and this image goes through the CNN backbone and generates a feature map with lower width and lower height, but with a much larger number of channels. Now we have this tensor of width, height and channels, but we want to feed it into the transformer encoder.
But the transformer encoder is expecting a sequence. The way we can do this is by flattening the spatial dimensions of the feature map, collapsing the height and width into a single dimension of size height times width, and then we can feed it into the transformer encoder. Then we have the transformer decoder, which takes as input the object queries, which are learned by the model. The number of object queries is the maximum number of objects that we can detect in an image, so it must be set larger than the largest number of objects that we have in an image, to provide some slack. The queries will learn to attend to specific areas and specific bounding box sizes in an image. The decoder is also conditioned on the encoder output, and we predict the object class and the bounding box through parallel decoding. So it's not done in an autoregressive way. We output them in parallel, and we are treating the object detection problem as a direct set prediction problem, so we need an appropriate loss for that. The authors use a bipartite matching loss based on the Hungarian algorithm, which is permutation invariant and also forces a unique assignment between the ground truth and the predicted objects.
We are going to use the Hugging Face library, which contains many transformers. They recently added vision transformers to the library, like the Vision Transformer (ViT) for image classification and also the detection transformer for object detection. We are going to use the detection transformer with a ResNet-50 backbone with dilated convolutions. The reason we use the dilated convolution is that it increases the resolution by a factor of two at the expense of more computation, but it helps with detecting small-scale objects. Hugging Face provides very comprehensive documentation that also explains the internals of the model, and we also have example notebooks by Niels Rogge that are linked on that page and at the bottom of this slide.
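The unique assignment between predictions and ground truth can be illustrated with a toy matcher. This sketch swaps in a brute-force search over permutations for the Hungarian algorithm, which is only practical for tiny inputs, and it uses a simplified cost (negative class probability plus L1 box distance) rather than the full matching cost of the detection transformer. The boxes, probabilities and the helper names `box_l1` and `match` are made up for illustration.

```python
from itertools import permutations


def box_l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(predictions, targets):
    """Find the one-to-one assignment with the lowest total cost.

    predictions: list of (car_probability, box); targets: list of boxes.
    Returns a tuple where entry j is the index of the prediction matched
    to targets[j]. Assumes len(predictions) >= len(targets).
    """
    def cost(pred, target):
        prob, box = pred
        return -prob + box_l1(box, target)

    best, best_cost = None, float("inf")
    for perm in permutations(range(len(predictions)), len(targets)):
        total = sum(cost(predictions[i], targets[j]) for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return best


predictions = [
    (0.9, (0.5, 0.5, 0.2, 0.2)),
    (0.8, (0.1, 0.1, 0.1, 0.1)),
    (0.1, (0.9, 0.9, 0.1, 0.1)),  # left unmatched, so treated as "no object"
]
targets = [(0.1, 0.1, 0.1, 0.1), (0.5, 0.5, 0.2, 0.2)]
assignment = match(predictions, targets)
print(assignment)
```

Because each prediction can be assigned to at most one ground-truth box, duplicate predictions for the same object are penalized during training instead of being rewarded.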
These explain how we can fine-tune the detection transformer. We have the feature extractor, which is used for preprocessing the input for the model or for postprocessing the output of the model into the COCO annotation format, for example for running the COCO evaluation metrics. We also have the DetrForObjectDetection model, which exposes the logits and the predicted boxes, and we have the DetrConfig, which can be used for instantiating the DetrForObjectDetection model through this configuration.
The modifications that we make compared to the notebook are that we use the ResNet-50 with dilated convolutions instead of the plain ResNet-50. We also set the maximum size of the image to 1100 so as not to hit GPU out-of-memory errors, and we use a smaller batch size of two instead of four, because on a V100 GPU we would get out-of-memory errors otherwise.
After training the detection transformer on our dataset, we see that the average precision is very poor compared to the object detectors based on CNNs that we have seen previously. The model is able to detect large objects. It has a fairly good average precision for large objects, but it is very low for small and medium objects, which can be attributed to the detection transformer not being well suited to these small-scale object detection problems. And as feature pyramid networks did for CNNs in helping to address the multiscale object detection problem, similar approaches could also help improve the detection transformer further. In the inference results, we see some duplicate detections that could probably be removed by using non-maximum suppression, and we also have some missed detections.
So how can we improve these results further? We can, for example, scale the backbone. In all of these experiments we used the ResNet-50, but we could use a larger backbone like a ResNet-101.
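Non-maximum suppression, mentioned above as a way to remove duplicate detections, can be sketched without any library. The boxes, scores and the IoU threshold of 0.5 are illustrative assumptions, and boxes are assumed to have positive area.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first almost entirely and has a lower score, so it is dropped, while the distant third box is kept.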
The data augmentation that we tried didn't improve our results, but we could tune the probabilities or change the augmentation transformations to see if we could get better results. We now also have more publicly available datasets recorded by drones, like Miva, UAVDT and so on, and we could use these to build a larger dataset and see if we can get better results out of it. Also, we only used static images for the object detection part. But if we think about video object detection, we can exploit the temporal cues between frames to reduce the number of false positives. We also have different transformer architectures, for example the Swin Transformer or the Focal Transformer, that could be used and tested to see if they provide better results.
To conclude, we see that CNNs make for very powerful baselines. We used off-the-shelf pretrained CNN architectures, Faster R-CNN and RetinaNet, and got very good average precision results on VisDrone for detecting cars. The transformer architectures are being increasingly used in research and practice, and we can see that they are being added to mainstream libraries like Hugging Face. The detection transformer is better suited for medium to large objects, but developments similar to the feature pyramid network, as it was used for CNNs, could also help the detection transformer. Transformers will continue to be used in downstream tasks like object detection, image classification and image representation learning. We can see many research papers coming from these areas. And last but not least, transformers make for a unifying framework for different fields. Before, we encoded all of these inductive biases into the CNNs and the LSTMs.
The transformer, on the other hand, makes for a very general-purpose architecture that lacks these inductive biases but can learn them from large-scale data. It has given very good results for natural language processing, and it's now also giving some state-of-the-art results on images. So it can maybe unify both fields, and also unify the practitioners and researchers from both areas. This concludes my presentation. I want to thank you for listening.

Eduardo Dixo

Senior Data Scientist @ Continental
