Semantic segmentation of citrus-orchard using deep neural networks and multispectral UAV-based imagery

Accurately mapping farmlands is important for precision agriculture practices. Unmanned aerial vehicles (UAVs) embedded with multispectral cameras are commonly used to map plants in agricultural landscapes. However, separating plantation fields from the remaining objects in a multispectral scene is a difficult task for traditional algorithms. In this connection, deep learning methods that perform semantic segmentation could help improve the overall outcome. In this study, state-of-the-art deep learning methods for the semantic segmentation of citrus-trees in multispectral images were evaluated. For this purpose, a multispectral camera that operates in the green (530–570 nm), red (640–680 nm), red-edge (730–740 nm) and near-infrared (770–810 nm) spectral regions was used. The performance of the following five state-of-the-art pixelwise methods was evaluated: fully convolutional network (FCN), U-Net, SegNet, dynamic dilated convolution network (DDCN) and DeepLabV3+. The results indicated that the evaluated methods performed similarly in the proposed task, returning F1-scores between 94.00% (FCN and U-Net) and 94.42% (DDCN). The inference time needed per area was also determined and, although the DDCN method was slower, a qualitative analysis showed that it performed better in highly shadow-affected areas. This study demonstrated that the semantic segmentation of citrus orchards is highly achievable with deep neural networks. The state-of-the-art deep learning methods investigated here proved to be equally suitable for this task, providing fast solutions with inference times varying from 0.98 to 4.36 min per hectare. This approach could be incorporated into similar research and contribute to decision-making and accurate mapping of plantation fields.


INTRODUCTION
Effective farming decisions require accurate mapping of agricultural fields. Many techniques have been employed for this task over the past years, with the majority of them associated with remote sensing approaches (Hunt and Daughtry, 2018; Weiss et al., 2020). In the agricultural context, remote sensing data is important to monitor nutrient content (Delloye et al., 2018; Osco et al., 2019a), detect water-stress effects (Krishna et al., 2019; Osco et al., 2019b), identify leaf damage (Safonova et al., 2019), predict chlorophyll content (Kalacska et al., 2015; Shah et al., 2019), estimate yield (Chen et al., 2017; Hunt et al., 2019; Jin et al., 2019; Sun et al., 2019), among others. Most of these tasks were performed at leaf and/or canopy level, with data collected by different types of sensors on proximal, terrestrial, aerial and orbital platforms (Surový et al., 2018; Ozdarici-Ok, 2015; Paoletti et al., 2018; Osco et al., 2020a). Unmanned Aerial Vehicle (UAV) platforms have lately gained attention in many application areas, including precision agriculture, mainly because of their relatively low cost and high capacity to map areas at very high spatial resolution. UAV-based images are largely used to replace the visual inspection of agricultural landscapes, since this practice is often labeled as labor-intensive, biased, and time-consuming (Leiva et al., 2017).
The aforementioned examples were achieved with a combination of different methodologies, including regression/correlation analysis, spectral indices, morphological operations, spectral classification, and others. Another type of technique applied to remote sensing data comes from artificial intelligence. Recently, deep learning has quickly gained momentum as a method for image processing and data analysis (Goodfellow et al., 2016). Deep learning is a type of machine learning technique constructed as a deeper form of artificial neural network that allows hierarchical data representation (LeCun et al., 2015; Ghamisi et al., 2017; Badrinarayanan et al., 2017). A deep neural network can be built with different kinds of layers, which tends to improve its performance and provides a greater learning capability than common networks or other types of learners (LeCun et al., 2015; Ball et al., 2017). Although known for their high demand for computational power and labeled data, deep neural networks have achieved impressive performances in many tasks, such as image classification (Krizhevsky et al., 2012; Nogueira et al., 2017), semantic segmentation (Badrinarayanan et al., 2017; Nogueira et al., 2019a), object detection (Ren et al., 2015; Nogueira et al., 2019b; Santos et al., 2020; Osco et al., 2020b) and others.
Different components constitute the architecture of a deep neural network, and among the most frequently adopted architectures, Convolutional Neural Networks (CNNs) have, in general, presented better performance for image and pattern recognition (Alshehhi et al., 2017). As for agricultural studies, approximately 42% of the implemented deep learning architectures were based on CNNs (Kamilaris and Prenafeta-Boldú, 2018). The most common components of a CNN architecture are convolution and deconvolution layers, pooling and max-pooling layers, fully-connected layers, activation functions, and others (Goodfellow et al., 2016). Regarding remote sensing, data extraction methods consider the spectral (Ghamisi et al., 2017), spatial (Li et al., 2017) and spectral-spatial information (Zhang et al., 2017). Approaches that consider both spectral and spatial information in their model can improve estimates significantly (Zhang et al., 2017). This has been the most common strategy when dealing with vegetation analysis and deep networks in the last few years (Djerriri et al., 2018; Li et al., 2017; Csillik et al., 2018; Safonova et al., 2019; Weinstein et al., 2019; Osco et al., 2020b).
The deep learning method chosen for a specific task depends on the characteristics of the scene, the data and the target. When counting plants in a high-density object detection scenario, we found, in previous research in a citrus orchard (Osco et al., 2020b), that a dense map refined by a multi-stage type of CNN architecture was better than state-of-the-art bounding-box methods. Still, the idea of separating the detected vegetation from the remaining objects in the image is more appropriately handled with a segmentation process. Semantic segmentation is able to assign a class-label to every pixel of an image. Deep learning-based approaches designed to tackle this task receive, as input, an image, which may be composed of a given number of bands, and return, as output, another image, generally of the same size as the input, with each pixel associated with one class. This outcome, commonly called a thematic map, may help in the full understanding of the scene which, in turn, may assist several applications, including disaster relief (Nogueira et al., 2018), urban planning (Vakalopoulou et al., 2015; Nogueira et al., 2019a) and others.
In agriculture-related problems, most semantic segmentation processes with deep neural networks use RGB (Red-Green-Blue) images or combine them with other information to help solve a specific issue. For orange-fruit detection and segmentation, a Mask R-CNN architecture (He et al., 2017) was proposed in combination with RGB and RGB + HSI (Hue-Saturation-Intensity) images (Ganesh et al., 2019). The SegNet architecture (Badrinarayanan et al., 2017) was also compared with the FCN (Long et al., 2015) method on an RGB data-set for rice lodging identification (Yang et al., 2020). Another study, using RGB and near-infrared (NIR) information from a sensor embedded in a ground-robot, was able to transfer the knowledge from a network trained on a different crop to semantically segment weeds with the SegNet-Basic architecture (Bosilj et al., 2019).
SegNet was also used to segment out trunks, branches, and trellis wires in apple-tree canopies in RGB images (Majeed et al., 2020). An FCN (Long et al., 2015), in conjunction with UAV RGB-based imagery, was used for winter-wheat ear segmentation (Ma et al., 2020). Nonetheless, up to the time of writing, no literature regarding the semantic segmentation of images in a non-RGB domain captured with UAV-based systems, specifically related to agricultural datasets, was found.
In arboreous vegetation types, automatic delineation in images often requires information related to spectral heterogeneity, shadow complexity and background effects (Nevalainen et al., 2017). These techniques mostly rely on the spectral divergence between tree and non-tree pixels. Normally, brighter pixels are associated with the vegetation while darker pixels are viewed as their boundary (Özcan et al., 2017). In agricultural fields, an important culture worldwide is citrus, and recent deep learning-based methods have been proposed to assist precision farming in different orchards. A study (Ampatzidis and Partel, 2019) investigated the performance of the YOLOv3 network to detect and count citrus-trees, achieving high precision and accuracy values. A variation of a CNN with a refinement algorithm based on superpixels (Csillik et al., 2018) was used in an object detection approach to also count the trees. As mentioned, one of our previous studies (Osco et al., 2020b) was conducted in a citrus orchard, and we proposed an object detection approach that, differently from the previous methods, performed better in high-density plantations. Although the individual detection of a tree is important for many agricultural practices, such high density is a reality in many citrus orchards.
This may be a problem for most state-of-the-art deep learning methods for object detection (Ampatzidis and Partel, 2019). In this manner, semantic segmentation, while a different type of approach (i.e., not used to count trees), could be used to properly map these highly dense areas, estimate vegetation cover, and serve as an input for plantation-line extraction.
Separating and extracting plantation fields from a remote sensing image provides important information on plantation cover-area and location. The semantic segmentation of plantation fields by deep networks is a novel and improved computational manner to accurately separate the vegetation from the remaining objects in an image scene. As a benefit, it should provide an accurate mapping of the area while demanding low effort from the human counterpart. Novel deep learning-based methods are constantly being proposed, and their robustness should be assessed in several applications. We intend, here, to fill a part of this gap in the agricultural context. To the best of our knowledge, no study has investigated the performance of deep learning-based methods to fulfill this task in UAV-based multispectral imagery, especially for citrus orchards. In this paper, we evaluate the performance of five state-of-the-art deep neural networks for the semantic segmentation of citrus-trees in multispectral images. The rest of this paper is organized as follows: Section 2 provides a detailed description of the methods applied; Section 3 presents the obtained results and discusses their implications. Finally, Section 4 concludes this study.

MATERIALS AND METHOD
Our workflow was divided into five main stages (Figure 1). We began by collecting our data (a) in a citrus-orchard area where the UAV flight was performed. Later, we processed our image data (b) and generated an orthomosaic. We then processed our citrus data by labeling it in a Geographical Information System (GIS) environment and split it into training and testing subsets (c). We proceeded to perform the semantic segmentation process (d) in a computational environment, selecting five state-of-the-art deep neural networks for the proposed task. Finally, we performed the evaluation (e) of the tested methods and compared them against each other. Details regarding these processes are described in the following subsections.

Data Acquisition and Image Processing
To compose our experiment's data-set, we used a Valencia-orange orchard (planted over a Citrumelo Swingle rootstock), located in the interior of the municipality of Ubirajara, São Paulo, Brazil.
During our survey, the citrus trees were in their vegetative stage, around 5 years old, with an average height of 3 meters above the ground. The labeled area (Figure 2) is around 70.4 ha, with high-density trees at 7 x 1.9 meters in-line spacing, resulting in a total of 750 trees per hectare. Aside from trees, the UAV flight also registered dirt-roads between the plantation plots, streets for locomotion, different densities of grasses, buildings, and other objects. Months prior to the flight, the orchard's soil was fertilized with 250 kg/ha of Nitrogen in the form of Urea, 125 kg/ha of Phosphorus, expressed as P2O5, and 167 kg/ha of Potassium Oxide (K2O). The field is predominantly composed of red-yellow podzolic soil, situated in a Cwa Köppen subtropical climate type unit. We conducted two flights with an eBee SenseFly UAV platform equipped with a Parrot Sequoia camera to image the area. The Parrot Sequoia camera records images in the following spectral regions: green (530-570 nm), red (640-680 nm), red-edge (730-740 nm) and near-infrared (770-810 nm). The flights took place at the end of the summer (southern hemisphere), on March 22, 2018, in partially cloudy conditions, at 29 °C with 0 mm precipitation and a light breeze of 1 to 2 m/s from the northeast (Yr, 2018). We adopted a flight altitude of 120 meters in relation to the terrain. At this height, the Parrot Sequoia can generate images with a GSD (Ground Sample Distance) equivalent to 12.9 cm. We programmed two flight routes to register areas beyond the boundaries of the area of interest.
The first flight occurred at 13:30, while the second followed at around 14:15 (local time). For both flights, we registered a total of 9 control points (Figure 2) throughout the area to support the phototriangulation process. Each point was surveyed with a GNSS (Global Navigation Satellite System) Leica Plus GS15 receiver, in RTK (Real-Time Kinematic) mode, remaining in operation between 11:00 and 17:00 (local time). The RTK report registered a per-point positional accuracy of 0.003 m.
From the conducted flights, we processed two different blocks of images (Figure 2), the first formed by 1,183 images and the second by 1,206 images. We used the Pix4Dmapper software for the calibration processes. We optimized the interior and exterior parameters and created a sparse point cloud. The 9 control points were given as reference locations to optimize the Structure-from-Motion (SfM) method. Dense point-clouds were generated based on the MVS (Multi-View Stereo) method. The RMSE (Root Mean Square Error) of this process was about 0.129 meters. We generated the DSM (Digital Surface Model) of both blocks. For the Parrot Sequoia radiometric calibration, we registered, prior to both flights, a calibration plaque specific to this sensor. We converted the Digital Number (DN) values to surface reflectance using the calibration parameters described in the Parrot Sequoia manual. An orthorectified surface reflectance image was generated for each band of each block (I and II). Both image blocks were used to create a single mosaic. The orthomosaic was composed of 2,389 scenes altogether.
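The exact DN-to-reflectance model is given in the Parrot Sequoia manual; as an illustration of the general idea, a simplified panel-based conversion can be sketched as below. The panel values are hypothetical, and the full Sequoia model additionally accounts for exposure settings and sunshine-sensor readings, which are omitted here.

```python
# Simplified panel-based radiometric calibration sketch: digital numbers (DN)
# are scaled so that the calibration plaque's mean DN maps to its known
# reflectance. Illustrative only; not the full Sequoia calibration model.

def dn_to_reflectance(dn, panel_dn_mean, panel_reflectance):
    """Convert a DN value to surface reflectance via an empirical gain."""
    gain = panel_reflectance / panel_dn_mean
    return dn * gain

# Hypothetical values: a 50%-reflectance panel imaged at a mean DN of 20000.
print(round(dn_to_reflectance(10000, panel_dn_mean=20000, panel_reflectance=0.50), 4))  # 0.25
```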

Fully Convolutional Network (FCN)
Fully Convolutional Network (FCN) (Long et al., 2015) was one of the first deep learning-based techniques proposed to perform semantic segmentation. This network extracts features and generates an initial coarse classification map using a set of convolutional layers that, due to their internal configuration, outputs a spatially reduced outcome (when compared to the original input). In order to restore the original resolution and output the thematic map, this approach employs deconvolution layers (Zeiler and Fergus, 2014) that learn how to upsample the initial classification map and produce the final dense prediction.
The FCN architecture experimented with in this work is presented in Figure 3. This network has 6 convolutional layers (of which the first three are followed by a pooling layer) responsible for extracting the features and generating an initial coarse classification. This outcome is further processed by 3 deconvolution layers (Zeiler and Fergus, 2014), which are responsible for spatially upsampling the prediction, producing a final dense output with the same height and width as the input image. The input of the last two deconvolutions is, in fact, the output of the previous layer combined (via element-wise addition) with the prediction generated by an extra convolutional layer that receives, as input, the features extracted by one of the pooling layers (as presented in Figure 3). This concept of combining features from multiple layers to produce the final prediction is advantageous, as it allows the model to exploit low-, mid- and high-level information captured from the input data to generate the final outcome (Long et al., 2015).
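The element-wise fusion of multi-level predictions described above can be sketched with a 1-D toy example. This is an illustration of the concept only: the learned 2-D deconvolution is replaced by nearest-neighbour upsampling, and it is not the actual FCN code.

```python
# Toy sketch of FCN-style skip fusion: a coarse prediction map is upsampled
# and combined, via element-wise addition, with a prediction derived from an
# earlier (finer) layer. Illustrative 1-D example only.

def upsample_nearest(pred, factor):
    """Repeat each value `factor` times (1-D nearest-neighbour upsampling)."""
    return [v for v in pred for _ in range(factor)]

def fuse(coarse, skip, factor):
    """Upsample the coarse map and add the skip prediction element-wise."""
    return [c + s for c, s in zip(upsample_nearest(coarse, factor), skip)]

coarse = [0.2, 0.8]              # prediction at half resolution
skip = [0.1, 0.0, -0.1, 0.3]     # prediction from an earlier, finer layer
fused = fuse(coarse, skip, 2)
print([round(v, 2) for v in fused])  # [0.3, 0.2, 0.7, 1.1]
```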

U-Net
U-Net (Ronneberger et al., 2015) was one of the first networks to propose an encoder-decoder architecture to perform semantic segmentation. In this design, the encoder is usually composed of several convolution and pooling layers, and is responsible for extracting the features and generating an initial coarse prediction map. The decoder, commonly composed of convolution, deconvolution (Zeiler and Fergus, 2014) and/or unpooling layers (Goodfellow et al., 2016), is responsible for further processing the initial prediction map, gradually increasing its spatial resolution and generating the final prediction.
Note that, normally, the decoder can be seen as a mirrored/symmetrical version of the encoder, with the same number of layers but replacing some of the operations with their counterparts (i.e., convolution with deconvolution, pooling with unpooling, etc.).
The U-Net architecture exploited in this work is presented in Figure 4. The encoder of this network is composed of two blocks, each with two convolutions and one pooling layer, plus a final convolutional layer. It receives the input image and outputs coarse feature maps four times smaller. The decoder is composed of a single convolutional layer followed by two blocks, each consisting of a deconvolution and two convolution layers. This part receives the coarse feature map (produced by the encoder) and outputs the final fine prediction image. Two interesting aspects of this architecture should be highlighted. The first is that the downsampling performed by the pooling operations in the encoder is reverted using deconvolutional layers in the decoder, i.e., the pooling layers in the encoder are replaced by deconvolutions in the decoder phase. These are the only layers capable of changing the spatial resolution of the data. The second is that after each deconvolutional layer, there is an operation that concatenates the features produced by this layer with the ones extracted from the convolution before a pooling layer (as presented in Figure 4). Similarly to the FCN (Long et al., 2015), this is performed so that the model is able to exploit multi-level features to improve the final prediction.
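The concatenation-based skip connection that distinguishes U-Net from the FCN's additive fusion can be sketched as follows; this 1-D toy example is illustrative only, not the network's actual code.

```python
# Toy sketch of a U-Net skip connection: decoder features are concatenated
# channel-wise with encoder features of the same spatial size (rather than
# added element-wise, as in the FCN). Each feature map is a list of channels.

def concat_channels(decoder_feats, encoder_feats):
    """Stack two feature maps along the channel axis; spatial sizes must match."""
    assert all(len(c) == len(decoder_feats[0]) for c in encoder_feats)
    return decoder_feats + encoder_feats

dec = [[0.5, 0.1, 0.3, 0.9]]   # 1 channel produced by a deconvolution
enc = [[1.0, 0.0, 0.0, 1.0]]   # 1 channel saved before the matching pooling
merged = concat_channels(dec, enc)
print(len(merged), len(merged[0]))  # 2 channels, spatial size 4 preserved
```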

SegNet
SegNet (Badrinarayanan et al., 2017) is another type of encoder-decoder network proposed specifically for semantic segmentation. However, differently from the previous model, this network employs unpooling operations, instead of deconvolution layers, in the decoder to increase the spatial resolution of the coarse map generated by the encoder.
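The index-based unpooling that SegNet employs can be illustrated with a 1-D toy example: the encoder's max-pooling stores the position of each maximum, and the decoder restores values to those positions. This sketch is illustrative only, not the actual network code.

```python
# Toy sketch of SegNet-style pooling/unpooling on a 1-D signal. Max-pooling
# returns both the pooled values and the argmax positions; unpooling places
# the values back at those positions and fills the rest with zeros.

def max_pool(x, k=2):
    pooled, indices = [], []
    for i in range(0, len(x), k):
        window = x[i:i + k]
        j = max(range(len(window)), key=window.__getitem__)
        pooled.append(window[j])
        indices.append(i + j)        # remember where each max came from
    return pooled, indices

def max_unpool(pooled, indices, size):
    out = [0.0] * size
    for v, i in zip(pooled, indices):
        out[i] = v                   # restore values at the stored positions
    return out

x = [0.1, 0.9, 0.4, 0.2]
p, idx = max_pool(x)
print(p, idx)                        # [0.9, 0.4] [1, 2]
print(max_unpool(p, idx, len(x)))    # [0.0, 0.9, 0.4, 0.0]
```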

DeepLabV3+
More recently, researchers observed that smoother predictions could be produced if the input image was not considerably downsampled. However, conserving the data resolution throughout the network would imply the model's inability to efficiently exploit the receptive field concept (i.e., the input area of influence on the output) (Goodfellow et al., 2016). This is an important drawback given that, when outputting dense predictions, it is critical for each output pixel to have a large receptive field, such that no important information is left out when making the prediction (Luo et al., 2016). To overcome this, dilated convolutions (Yu and Koltun, 2015) were introduced. Such layers are capable of increasing the receptive field without downsampling the input. The DeepLab networks (Chen et al., 2014; Chen et al., 2017a; Chen et al., 2017b) were among the first to exploit the benefits of dilated convolutional layers (Yu and Koltun, 2015). Such models use some initial convolutional layers to moderately reduce the input resolution, which is then kept constant by the final (dilated and standard) convolutions. Specifically, in this work, we evaluated the latest version of the DeepLab networks, i.e., DeepLabV3+ (Chen et al., 2018), whose architecture is presented in Figure 6. This network starts with three blocks, each composed of two convolutions followed by one pooling layer, responsible for learning an initial representation. Such features are further processed by a special module, called Atrous Spatial Pyramid Pooling, composed of several parallel dilated convolution layers (Yu and Koltun, 2015) that process the same input features using distinct dilation rates, thus allowing the model to capture multi-scale information. This representation is concatenated with low-level features extracted from the first pooling and then processed by one extra convolutional layer. Finally, three convolutional layers further process the concatenated features, which are then upsampled by bilinear interpolation to produce the final dense prediction map.
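The effect of dilation on the receptive field can be seen in a 1-D toy example: spacing the kernel taps `dilation` positions apart widens the input span each output value depends on, without any downsampling. This sketch is illustrative only.

```python
# Toy 1-D dilated convolution (no padding): kernel taps are spaced `dilation`
# apart, so the receptive field grows while the signal is never downsampled.

def dilated_conv1d(x, kernel, dilation):
    span = (len(kernel) - 1) * dilation   # receptive field is span + 1
    return [sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6]
k = [1, 0, -1]
print(dilated_conv1d(x, k, dilation=1))  # receptive field 3: [-2, -2, -2, -2]
print(dilated_conv1d(x, k, dilation=2))  # receptive field 5: [-4, -4]
```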

Dynamic Dilated Convolutional Network (DDCN)
Dynamic Dilated Convolutional Network (DDCN) (Nogueira et al., 2019a) takes the previous concept of preserving the input image resolution to the extreme. Specifically, this approach proposes a novel multi-scale training strategy that uses dynamically-generated input images to converge a dilated model that never downsamples the input data. Technically, this technique receives as input the original images and a probability distribution over the possible input sizes, i.e., over the sizes that might be used to generate the input patches. In each iteration of the training procedure, a size is randomly selected from this distribution and is then used to create an entirely new batch. By processing these batches, each composed of several images with one specific pre-selected size, the model is capable of capturing multi-scale information. Furthermore, in the prediction step, the algorithm selects the best resolution based on scores accumulated during the training phase for each evaluated input size. Then, the technique processes the testing images using batches composed of images with the best-evaluated size.
Although several models (including FCNs, U-Nets, and SegNet) could be used with the aforementioned training strategy, they might face complications if the input size is too small to allow the creation of the coarse map and, consequently, of the final thematic map. To overcome this, Nogueira et al. (2019a) proposed a network composed entirely of dilated convolutions (Yu and Koltun, 2015) that never reduces the input image. This fully dilated model fits perfectly into the proposed multi-scale training strategy, as it is capable of processing inputs of any size. Following this concept, Figure 7 presents the fully dilated network architecture tested in this work. It has a total of 8 dilated blocks, each composed of a dilated convolution and a pooling layer, followed by a standard convolutional layer responsible for the final prediction. It is important to observe that, although pooling layers are employed, they do not reduce the spatial resolution of the input due to a specific configuration of stride and padding.
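The dynamic patch-size selection at the heart of this training strategy can be sketched as follows. The two sizes and their uniform distribution mirror the protocol used later in this paper (32x32 and 64x64); the sampling loop and score bookkeeping are an illustrative simplification, not the authors' implementation.

```python
# Sketch of DDCN-style multi-scale training: at each iteration one patch size
# is drawn from a distribution and used for the whole batch; a score per size
# is accumulated so the best-performing size can be chosen for inference.
import random

rng = random.Random(42)
sizes, weights = [32, 64], [0.5, 0.5]     # uniform over the two patch sizes
scores = {s: 0.0 for s in sizes}

for iteration in range(100):
    size = rng.choices(sizes, weights=weights, k=1)[0]  # size for this batch
    batch_accuracy = rng.random()        # placeholder for a real evaluation
    scores[size] += batch_accuracy       # accumulate the score for this size

best_size = max(scores, key=scores.get)  # resolution used at prediction time
print(best_size in (32, 64))  # True
```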

Protocol
All models were trained using the same training/test protocol. Precisely, the original data, collected over the citrus orchard (Figure 2), was divided into training and test sets, as presented in Figure 8. The former set, consisting of approximately ⅔ of the total amount of labeled pixels, was employed to converge the networks, whereas the latter, composed of the remaining labeled pixels (⅓), was used to evaluate the models. Obtained results are reported in terms of accuracy, averaged accuracy, F1-score, precision and recall values based on the performance on the test set. Aside from this, it is important to emphasize that all aforementioned networks were trained from scratch. All of them employed input patches of 32x32 pixels, except the DDCN (Nogueira et al., 2019a), which used a uniform distribution that allows the method to select an input resolution from 2 possibilities: 32x32 and 64x64. Note that other input patch sizes were evaluated, but they did not produce significant improvements in the results, only increasing the training time. During training, all the approaches used the same set of hyperparameters, which was defined based on previous convergence analyses. Specifically, the learning rate, weight decay, momentum, and number of iterations were 0.01, 0.005, 0.9, and 200,000, respectively. After every 50,000 iterations, the learning rate was reduced following an exponential decay with parameter 0.5. All deep learning-based models exploited in this work were implemented using TensorFlow, a Python framework conceived to allow efficient exploitation of deep learning with Graphics Processing Units (GPUs). The code will be made publicly available after acceptance of the work. All experiments were performed on a 64-bit Intel i7-8700K @ 3.70 GHz workstation, with 64 GB of memory and an NVIDIA® GTX 1080 GPU with 12 GB of memory, under CUDA version 10.0. Debian (kernel 4.19.98-1) was used as the operating system.
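The learning-rate schedule described above (base rate 0.01, halved after every 50,000 iterations) can be written as a small helper; this sketch reproduces the stated schedule, not the authors' training code.

```python
# Exponential step decay as described in the protocol: the learning rate
# starts at 0.01 and is multiplied by 0.5 after every 50,000 iterations.

def learning_rate(iteration, base_lr=0.01, gamma=0.5, step=50_000):
    return base_lr * gamma ** (iteration // step)

for it in (0, 50_000, 100_000, 150_000):
    print(it, learning_rate(it))
# 0 0.01 | 50000 0.005 | 100000 0.0025 | 150000 0.00125
```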

RESULTS AND DISCUSSION
The evaluated deep learning methods returned similar accuracies in the proposed approach, ranging from 94.88% to 95.46% (Table 1). Although a slight difference indicates that the DDCN performed better from a quantitative point of view, it is accurate to say that all five state-of-the-art networks are capable of segmenting citrus-trees satisfactorily in the multispectral imagery data-set. This information is important since, until now, no agricultural field segmentation had been evaluated with this kind of spatial-spectral data. The fact that these deep neural networks are able to accurately separate the vegetation-covered area from other targets while maintaining the original resolution of the input image is an important characteristic. Multispectral images are largely used for monitoring vegetation health, as in multiple precision farming applications (Citation, Year). By accurately mapping the plantation area with few false-positives, it is possible to extract more accurate information solely from the culture itself. Another piece of information obtained from our experiment was the processing time needed to perform the training and inference (Table 2). Most of the evaluated methods returned similar inference times for both GPU and CPU tests, with the exception of the DDCN method (Nogueira et al., 2019a), which took around four times longer to perform the same task. However, an estimation of this inference time per area demonstrates how rapidly these neural networks can segment trees in the given data-set once they are trained. This information is crucial for precision agriculture, since this response could be incorporated into decision-making regarding area-size and priority. Regardless, it should be noted that the times informed here consider the system used to train these methods (see Section 2.3.6). Despite that, this information is rarely considered when performing this task, and future research could benefit from the information presented here. Our tested area was composed of ⅓ of the total experimental field site (Figure 2). As previously stated, this experimental area is around 70.4 ha, of which our test data covers around 23.5 ha.
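The pixelwise metrics reported in Table 1 follow the standard confusion-matrix definitions, sketched below for the binary (citrus vs. background) case. The counts are made-up toy values, not results from this study.

```python
# Precision, recall and F1-score from binary pixel counts: tp = citrus pixels
# correctly segmented, fp = background labeled as citrus, fn = missed citrus.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=940, fp=60, fn=60)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.94 0.94 0.94
```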
During the image labeling process, we verified that our ground-truth data (i.e., citrus-trees labeled as polygon features) occupied approximately 9.2 ha and 5.5 ha in the training and testing data-sets, respectively. The data-sets were composed of different plantation-line orientations/directions (Figure 8) and the same targets, objects, and challenges. Figure 9 displays the resulting segmentation of the five state-of-the-art methods evaluated in this study: FCN (Long et al., 2015), U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), DeepLabV3+ (Chen et al., 2018), and DDCN (Nogueira et al., 2019a).
Given the size of the area evaluated, flight time and atmospheric conditions, the survey produced a heterogeneous data-set with different illumination conditions (see Section 2.2). Figure 10 presents a comparison between a false-color combination and the segmented result from the quantitatively best overall method (DDCN). In the top row of Figure 10, it is noticeable that the segmentation performance was satisfactory, although differences in illumination geometry and in the orientation of the plantation-line could pose a hindrance for the network to handle. However, in the bottom row of Figure 10, it is notable that the segmentation worsened. This condition could be explained by two major factors in the data-set: highly shadow-affected areas and highly dense grassland areas. As stated, a qualitative evaluation of the results indicated that most of the problems faced by the investigated deep networks are related to shadowy areas. In the previous comparison (Figure 10), we observed only the DDCN result. As an example, Figure 11 highlights all five methods in one particular highly shadow-affected area. All networks, with the exception of the DDCN (Nogueira et al., 2019a), performed quite poorly here. The DDCN method captures multi-scale information and selects the best resolution based on scores accumulated during the training phase. This difference may have helped this network map these shadowy areas more accurately than the other methods, even though it returned only a slight accuracy gain in quantitative terms (Table 1). However, when recommending the application of this architecture over the others, the processing and inference times should be taken into consideration (Table 2).
The spectral similarity among objects in a scene is known to be a potential problem for most image processes (Citation, Year). In the orchard data-set, most of the plantation-streets are protected by a living cover (i.e., grassland). This practice has become customary in commercial Brazilian citrus-plantation sites, since it assists in water infiltration, environmental control, and soil-erosion protection. However, for remote sensing imagery, this type of land protection produces high spectral similarity with the citrus culture. This characteristic was mostly discernible in the red-edge and near-infrared regions (Figure 12). Interestingly, this was not particularly a problem in plantation gaps, where a line was interrupted by a cut tree. This is because not many of these areas were filled with highly dense grassland, as single trees had previously been chopped because of greening infestations. Even so, all of the evaluated networks returned high metric values (above 94%). The remaining percentage not classified in our test data-set could be explained by the qualitative analysis presented. Indeed, CNNs are also commonly known for producing errors at image boundaries (Citation, Year), which could also explain the performance cap reached here. The importance of this approach, as stated, comes with the robustness assessment of these methods in a challenging multispectral data-set, in the agricultural context with UAV-based imagery. From a practical view, the significance of segmenting almost pixel-exclusively the plantation itself, while knowing the inference time needed per area, is a valuable advantage. For precision agriculture practices, the deep learning methods demonstrated here and their subsequent results could aid decision-making.

CONCLUSION
We conducted experiments to evaluate five deep neural networks for the semantic segmentation of citrus-trees in UAV-based multispectral images. Our study demonstrated that semantic segmentation is highly appropriate for separating and extracting plantation fields from remote sensing imagery under the evaluated conditions. Our data indicate that the investigated methods performed similarly in the proposed task, returning accuracies between 94.88% (FCN) and 95.46% (DDCN).
Based on a qualitative analysis, the DDCN method performed better in highly shadow-affected areas.
We conclude that the semantic segmentation of citrus orchards is highly achievable with state-of-the-art deep networks. The deep learning methods investigated provided fast solutions to segment the plantation cover-area, with inference times varying from 0.98 to 4.36 minutes per hectare. The framework presented here may help other studies by providing primary information for exploring these methods in this context. Our approach could also be incorporated into similar agricultural areas and contribute to decision-making and accurate mapping of plantation fields.

Figure 1. Workflow summarizing the fundamental steps of the conducted approach.

Figure 2. Citrus-orchard area used to evaluate the state-of-the-art deep neural network methods.
Figure 5 presents the SegNet architecture exploited in this work. The encoder of this network is composed of three blocks, each with two convolutions and one pooling layer. It receives the input image and outputs a coarse feature map six times smaller. The decoder also has three blocks, each composed of an unpooling and two convolution layers. This part receives the coarse feature map (produced by the encoder) and outputs the final fine prediction image. In the exploited architecture, each pooling layer of the encoder is directly replaced by an unpooling layer in the decoder. These are the only operations able to change the spatial resolution of the input, with the former reducing it and the latter restoring it.

Figure 9. Ground-truth and visual results of the evaluated methods. From top-to-left: ground-truth, FCN, U-Net, SegNet, DeepLabV3+, and DDCN.

Figure 10. Challenges faced by the investigated approach (false-color combination compared with the DDCN method).

Figure 11. Highly shadow-affected area and segmentation results from the tested methods.

Figure 12. Spectral behavior of the objects most commonly present in the scene. This graphic was constructed by selecting multiple pixels, with a minimum of one hundred samples per object.

Table 1. Evaluation metrics obtained for the experimented methods.

Table 2. Training and inference time information obtained for the experimented methods.