AirSim

Since building and testing a physical drone is expensive, we use Microsoft AirSim instead. In short, AirSim is a simulator created by Microsoft for developing autonomous vehicles. It is built on Unreal Engine and ships with premade environments containing moving actors. It also provides both a Python and a C++ API, which allow users to control the drone programmatically.

To create the dataset, AirSim’s cinematography mode was used in combination with the Python API to teleport the camera around the environment and capture 672×672 pixel images. For each shot, the camera was teleported to face the desired object at a random distance and angle, with a random offset so that the object is not center frame. The script generates a fixed number of images of each specified actor: given a list of deer, for example, it randomly selects a deer and teleports the camera accordingly. An example image is shown below, followed by a sketch of the capture loop. To diversify the dataset, some images contain no actors of interest; with a user-configurable probability (30% in the actual script), the script instead takes a picture of a miscellaneous object in the environment. This procedure is repeated across multiple environments to generate data for the following seven classes: deer, car, bench, pedestrian, truck, bus, and ambulance.

Example image from the dataset
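Below is a minimal sketch of how such a capture loop might look with the AirSim Python API, assuming the simulator is running in ComputerVision (camera-only) mode. The actor names, image count, and randomization ranges are illustrative placeholders, the miscellaneous-object branch is omitted for brevity, and the 672×672 resolution is configured in AirSim’s settings.json rather than in code.

import math
import random
import airsim

# Connect to the running AirSim instance.
client = airsim.VehicleClient()
client.confirmConnection()

DEER_ACTORS = ["Deer_1", "Deer_2", "Deer_3"]  # hypothetical actor names
NUM_IMAGES = 100  # images to capture for this class

for i in range(NUM_IMAGES):
    # Pick a random deer and look up its position in the scene.
    target = client.simGetObjectPose(random.choice(DEER_ACTORS)).position

    # Randomize the viewing distance and angle around the actor.
    distance = random.uniform(10.0, 40.0)
    yaw = random.uniform(0.0, 2.0 * math.pi)
    cam_pos = airsim.Vector3r(
        target.x_val - distance * math.cos(yaw),
        target.y_val - distance * math.sin(yaw),
        target.z_val - random.uniform(2.0, 10.0),  # NED frame: negative z is up
    )
    # Face the actor, with a small random yaw offset so it is not center frame.
    orientation = airsim.to_quaternion(
        pitch=random.uniform(-0.3, 0.0),
        roll=0.0,
        yaw=yaw + random.uniform(-0.2, 0.2),
    )

    # In ComputerVision mode the "vehicle" is just the camera rig,
    # so setting the vehicle pose teleports the camera.
    client.simSetVehiclePose(airsim.Pose(cam_pos, orientation), True)

    # Request a compressed (PNG) scene image and write it to disk.
    response = client.simGetImages(
        [airsim.ImageRequest("0", airsim.ImageType.Scene, False, True)]
    )[0]
    airsim.write_file(f"deer_{i:04d}.png", response.image_data_uint8)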

LabelImg

Once the images had been generated, LabelImg [2] was used to create the annotation files in the YOLO format. A screenshot of the labeling process is shown below.

Example of LabelImg annotation process. Source: https://github.com/tzutalin/labelImg

The YOLO format is used to train both YOLO models (YOLOv4 and YOLOv4-tiny). It consists of one plain-text file per image, where each line describes a single bounding box with the following fields:

class_id x_center y_center width height

The values x_center, y_center, width, and height are normalized to the range [0, 1]: x_center and width are divided by the image width, and y_center and height by the image height, as the short example below illustrates.
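As an illustration, the snippet below converts a pixel-space bounding box into a YOLO annotation line for a 672×672 image. The function name and the example box are hypothetical, not taken from the labeling tool.

def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w=672, img_h=672):
    # Convert pixel corner coordinates to normalized center/size values.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# e.g. a deer (class 0) occupying pixels (100, 150) to (300, 400):
print(to_yolo_line(0, 100, 150, 300, 400))
# -> "0 0.297619 0.409226 0.297619 0.372024"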

The repository used to train the MobileNet model was not compatible with the YOLO format, so the data had to be converted to either the VOC or the OpenImages format. The OpenImages format was chosen; it consists of a single comma-separated (CSV) file containing every annotation. Each line of the CSV has the following fields:

image_id,x_min,y_min,x_max,y_max,class_name

The values x_min, y_min, x_max, and y_max are normalized to [0, 1] by the image width and height, just as in the YOLO format. The image_id field is simply the image file name without its extension. A sketch of the conversion is shown below.
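Below is a sketch of what this YOLO-to-OpenImages conversion might look like, assuming the YOLO .txt files live in a labels/ directory and using the class order listed earlier. The file paths, header row, and function name are assumptions rather than the project’s actual script.

import csv
from pathlib import Path

CLASSES = ["deer", "car", "bench", "pedestrian", "truck", "bus", "ambulance"]

def yolo_to_openimages(label_dir, out_csv):
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "x_min", "y_min", "x_max", "y_max", "class_name"])
        for txt in Path(label_dir).glob("*.txt"):
            image_id = txt.stem  # file name without the extension
            for line in txt.read_text().splitlines():
                class_id, xc, yc, w, h = line.split()
                xc, yc, w, h = map(float, (xc, yc, w, h))
                # Convert normalized center/size back to corner coordinates.
                writer.writerow([
                    image_id,
                    round(xc - w / 2, 6), round(yc - h / 2, 6),
                    round(xc + w / 2, 6), round(yc + h / 2, 6),
                    CLASSES[int(class_id)],
                ])

yolo_to_openimages("labels/", "annotations.csv")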