Image classification: The most typical CNN approaches to road damage detection and classification design and train a neural network consisting of convolutional and fully connected (FC) layers. For example, An et al. classified images into two types, with or without potholes, by swapping the backbone feature extraction network of the CNN and comparing the accuracy of the different backbones on colour and grayscale frames. Bhatia et al. developed a method to predict whether an input thermal image contains a pothole, demonstrating that using a residual network as the backbone can improve the detection rate at night and in foggy weather. Fan et al. experimentally evaluated 30 CNNs for road crack image classification, among which the progressive neural architecture search network (PNASNet) achieved the best balance between speed and accuracy. However, image classification only assigns a label to the image as a whole; it does not locate or describe the individual instances of road damage within the image.
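To make the structure of such classifiers concrete, the following is a minimal sketch of a forward pass through one convolutional layer followed by one FC layer, written in plain Python. It is not the architecture of any of the cited works: the class `TinyCNNClassifier`, the label names, the kernel count, and the use of global average pooling before the FC layer are all assumptions chosen for brevity, and the weights are random rather than trained. The point it illustrates is that the output is a single label per image, with no information about where the damage lies.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D grayscale image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def global_average_pool(feature_map):
    """Collapse a feature map to one scalar (assumed pooling choice)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyCNNClassifier:
    """Hypothetical conv + FC classifier with random, untrained weights."""

    def __init__(self, num_kernels=4, kernel_size=3, num_classes=2, seed=0):
        rng = random.Random(seed)
        self.kernels = [
            [[rng.uniform(-1, 1) for _ in range(kernel_size)]
             for _ in range(kernel_size)]
            for _ in range(num_kernels)
        ]
        self.fc_weights = [
            [rng.uniform(-1, 1) for _ in range(num_kernels)]
            for _ in range(num_classes)
        ]
        self.fc_bias = [0.0] * num_classes

    def forward(self, image):
        # Convolutional stage: one pooled feature per kernel.
        features = [
            global_average_pool(relu(conv2d_valid(image, k)))
            for k in self.kernels
        ]
        # FC stage: one logit per class, then class probabilities.
        logits = [
            sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(self.fc_weights, self.fc_bias)
        ]
        return softmax(logits)


CLASS_NAMES = ["no_pothole", "pothole"]  # assumed label names


def classify(model, image):
    """Return one label and the class probabilities for the whole image."""
    probs = model.forward(image)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs


if __name__ == "__main__":
    rng = random.Random(1)
    image = [[rng.random() for _ in range(8)] for _ in range(8)]  # fake frame
    label, probs = classify(TinyCNNClassifier(), image)
    print(label, probs)  # image-level label only; no location is returned
```

In practice the convolutional stage would be a deep pretrained backbone and the weights would be learned from labelled road images; replacing `self.kernels` with a different feature extractor corresponds to the backbone comparisons described above.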