D-STOR: A Novel Framework of Deep-Semantic Traffic Object Recognition

Deep learning techniques such as Convolutional Neural Networks (CNNs) have proven efficient in recognizing objects in images. This recognition work has since been extended to discovering relations among detected objects. Although this line of research on mining semantic information in images has become more attractive, it has not been investigated thoroughly. This paper introduces D-STOR, a deep-semantic traffic object recognition framework based on a knowledge model that reveals relations among detected objects. To confirm the efficiency of the D-STOR framework, an experiment was conducted on a dataset of traffic images in Vietnam, and it yielded promising results.


Introduction
Machine learning has been applied to image analysis in many applications, such as face detection, object counting, and traffic monitoring [1].
Although various machine learning classifiers had been used to detect objects in images, their performance was not as high as expected. Since the advent of CNNs in deep learning [2], this limitation has been overcome.
With deep learning, not only can images be classified efficiently, but the visual objects inside them can also be recognized correctly [3]. Furthermore, the problem of image recognition has been broadened to recognizing relations between or among detected objects. In other words, mining semantic information has become an attractive research topic in image analysis. For this purpose, Semantic Web technology has shown its efficiency in describing image contents and finding semantic relations, thanks to its reasoning capability [4]. This study presents a novel approach that integrates a CNN with a semantic reasoning engine to support mining semantic information in traffic images.
The contributions of this study are twofold. Firstly, a tailored model of InceptionNet-v3 [5] is deployed to detect visual objects in images, and a SWRL-based reasoning engine is constructed to discover semantic information about the detected objects. Secondly, the research framework named D-STOR (Deep-Semantic Traffic Object Recognition), which explains how these two components operate together, is presented. To evaluate the proposed approach, a prototype of the D-STOR framework was developed, and a number of experiments were conducted to validate the D-STOR knowledge base, to measure the performance of CNN-based models in detecting visual objects, and to evaluate the ability of the reasoning engine to discover semantic information. The experiments showed promising results and supported the efficiency of the proposed approach.
The rest of this paper is structured as follows. Section 2 presents state-of-the-art studies on visual object detection and Semantic Web applications in image analysis. Section 3 elaborates the D-STOR framework. Lastly, Sections 4 and 5 present the experiment and future research, respectively.

Related work
In this section, we summarize recent methods in the deep learning-based approach to object detection and the Semantic Web-based approach to image description. A full literature review is out of the scope of this work; readers can find further information in the surveys [2], [6] and [4].
Since the 2010s, the rise of deep learning has made it possible to detect objects with very high accuracy. Using CNNs, which can automatically learn feature representations of image objects, the literature has seen many architectures applied to object detection, such as ResNet-50 [7], InceptionNet-v3 [5], DenseNet [8] and MobileNet-v2 [9]. Specifically, He et al. [7] presented a residual learning framework that makes deep neural networks easier to train. In this approach, the stacked layers fit a residual mapping, based on the hypothesis that optimizing the residual mapping is easier than optimizing the original, unreferenced mapping. ResNet-50 was evaluated on large-scale image datasets and yielded promising results. Similarly, Szegedy et al. [5] addressed the growing computational cost and model size of CNNs by scaling up networks in ways that use the added computation as efficiently as possible. The key techniques of this research included factorized convolutions and aggressive regularization.
Huang et al. [8] demonstrated that DenseNet alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and reduces the number of parameters. In another effort, Sandler et al. [9] introduced MobileNet-v2, which is tailored to mobile and resource-constrained environments and reduces the number of operations and the memory required by computer vision models while maintaining accuracy. These works have played an important role in recent CNN-based approaches to visual object detection; some typical studies are [10]-[13].
Although the advantages of deep learning in detecting visual objects have been demonstrated, describing images requires more than a deep learning-based approach alone can provide. To meet this requirement, Semantic Web technology has been used to semantically describe relations between or among detected visual objects in images.
Gurevich et al. [14] proposed an early image analysis ontology that provided a fundamental knowledge base for image analysis systems. However, this work was presented many years before the emergence of deep learning models, so its ability to detect visual objects was limited. Nonetheless, it outlined the future cooperation between ontological knowledge bases and deep learning models. In another effort, Othmani et al. [15] combined low-level image analysis functions with high-level ontology reasoning in order to process medical images. In another approach, Rajbhandari et al. [16] used machine learning models to predict threshold values of visual objects, which were then transferred to SWRL rules to implement rule-based classification tasks. Similarly, Li et al. [17] addressed a weakness of data-driven deep learning methods by incorporating ontological reasoning to achieve better segmentation of remote sensing images. For further details on state-of-the-art combinations of deep learning and ontological approaches, readers can consult the reviews [18]-[20].

Deep learning-based object detection
The object detection module uses a CNN to recognize objects and takes color images as its input.
Generally, an image is defined as $I \in \mathbb{R}^{w \times h \times c}$, where $w$, $h$ and $c$ are respectively the width, the height and the number of color channels of the image $I$ ($w, h, c \in \mathbb{N}$).
The CNN-based object detection process is described as a function $f(I_\Omega, \complement_\theta)$, where $I_\Omega$ is the set of images whose color channels are $\Omega$, and $\complement_\theta$ is the CNN with its optimized parameters $\theta$.
To be more specific, the convolutional neural network $\complement_\theta$ is often considered as $\complement_\theta = \langle CV, FC, C_O \rangle$, where $CV$, $FC$ and $C_O$ are the convolutional network layers, the fully connected layers and the classes of detected objects, respectively. The convolutional network layers consist of multiple convolution layers ($C_i$) and pooling layers ($P_i$): $CV = \{(C_i, P_i)\}$, $i = \overline{1, n}$. In the convolution layer, the convolution operator is defined as $(I * K)(x, y) = \sum_i \sum_j K(i, j)\, I(x - i, y - j)$, where $I$ is the image of size $(w \times h \times 1)$ and $K$ is the $(k \times k)$ kernel. The pooling layer slides a fixed-size window over all regions of the image $I$ and applies either a max-pooling or an average-pooling operator to compute a single output for each traversed region. The fully connected neural network takes the output of $CV$ as its input and produces the classification output. We integrated this CNN into the semantic reasoning engine by using the concepts of the domain ontology, which is elaborated in sub-section 3.2, as the vocabulary for labeling detected objects, $C_O$.
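The convolution operator and the pooling step described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in D-STOR: the function names `convolve2d` and `max_pool` are hypothetical, terms whose image index falls outside the image are assumed to be zero, and the pooling window is assumed to move in non-overlapping steps.

```python
def convolve2d(image, kernel):
    """Evaluate (I * K)(x, y) = sum_i sum_j K(i, j) * I(x - i, y - j).

    image and kernel are lists of rows (single channel).
    Assumption: image values outside the image bounds count as zero,
    so the output is larger than the input ("full" convolution).
    """
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = [[0.0] * (n_cols + k_cols - 1) for _ in range(n_rows + k_rows - 1)]
    for x in range(len(out)):
        for y in range(len(out[0])):
            for i in range(k_rows):
                for j in range(k_cols):
                    if 0 <= x - i < n_rows and 0 <= y - j < n_cols:
                        out[x][y] += kernel[i][j] * image[x - i][y - j]
    return out


def max_pool(image, size):
    """Max pooling with a size x size window.

    Assumption: the window moves in non-overlapping steps of `size`;
    rows and columns that do not fill a whole window are dropped.
    """
    rows, cols = len(image) // size, len(image[0]) // size
    return [
        [
            max(
                image[r * size + dr][c * size + dc]
                for dr in range(size)
                for dc in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4 x 4 single-channel image and 2 x 2 kernel.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
kernel = [[1, 0], [0, -1]]
feature_map = convolve2d(image, kernel)
pooled = max_pool(image, 2)
```

An average-pooling variant would replace `max` with the mean of the same window.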

Semantic reasoning engine
The semantic reasoning engine discovers hidden information between or among detected objects in images through the implementation of SWRL rules. These rules are constructed based on a traffic domain ontology. This ontology, which is specified in Definition 1, provides the vocabulary not only for constructing SWRL rules but also for labeling the objects detected by the CNN.
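Definition 1 (D-STOR ontology): Given that $D_T$ is the traffic domain, $C_{D_T}$ is the set of concepts of $D_T$, $R_{D_T}$ is the set of relations of $D_T$, $P_{D_T}$ is the set of data properties of $D_T$, and $I$ is the set of instances of $D_T$, the traffic domain ontology $O_{D_T}$ is defined as $O_{D_T} = \langle C_{D_T}, R_{D_T}, P_{D_T}, I \rangle$.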
To build the D-STOR ontology, the NeOn collaborative methodology [21] is adopted and applied in a three-phase ontology engineering process, summarized as follows. In phase 1, domain experts and ontology engineers are invited to collaborate through working environments including Protégé and GitHub. In phase 2, the specifications of the traffic ontology are worked out through an Ontology Requirements Specification Document; in this phase, the knowledge of the traffic domain is specified. In phase 3, the reuse of existing ontological resources (e.g. FOAF or OWL Time) is clarified.
This three-phase ontology engineering process is repeated until all members reach consensus. Fig. 2 shows an excerpt of this ontology. Based on the D-STOR ontology, the semantic reasoning engine is defined by the following components:
- $r_i$ is the $i$-th rule of the rule set, following the SWRL syntax. Specifically, a SWRL rule is written as antecedent → consequent, where both antecedent and consequent are expressed as conjunctions of atoms $a_1 \wedge a_2 \wedge \ldots \wedge a_n$. Each atom uses a concept and/or relation defined in the D-STOR ontology for its logical expression.
- The set of algorithms that implement the reasoning process to mine semantic information based on the D-STOR ontology and the SWRL rule set.
In this study, the SWRL rule base, which has 67 rules, is constructed and organized into three groups: (i) discovering object-to-object relations; (ii) discovering object groups; and (iii) discovering additional information about detected objects. Examples of these three groups of SWRL rules are presented in Table 1.
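To make the antecedent → consequent form concrete, the following Python sketch applies rules of that shape to a set of facts by naive forward chaining. It is only an illustration under stated assumptions, not the D-STOR reasoning engine: the concept and relation names (`Person`, `Motorbike`, `on`, `rides`, `Rider`) and the individuals `p1` and `m1` are hypothetical and are not taken from the D-STOR ontology or its 67 rules, and a real deployment would rely on an actual SWRL-capable reasoner.

```python
def match(atom, fact, binding):
    """Unify one atom with one fact; return the extended binding or None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    new = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if term.startswith("?"):            # variable, e.g. "?p"
            if new.setdefault(term, value) != value:
                return None                  # clashes with an earlier binding
        elif term != value:                  # constant must match exactly
            return None
    return new


def bindings_for(antecedent, facts, binding=None):
    """Yield every variable binding that satisfies all antecedent atoms."""
    binding = {} if binding is None else binding
    if not antecedent:
        yield binding
        return
    first, rest = antecedent[0], antecedent[1:]
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from bindings_for(rest, facts, extended)


def infer(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            # Collect all bindings first so `known` is not mutated mid-iteration.
            for binding in list(bindings_for(antecedent, known)):
                for atom in consequent:
                    derived = (atom[0],) + tuple(binding.get(t, t) for t in atom[1:])
                    if derived not in known:
                        known.add(derived)
                        changed = True
    return known


# Hypothetical detected objects and one spatial relation between them.
facts = {("Person", "p1"), ("Motorbike", "m1"), ("on", "p1", "m1")}
# Hypothetical rules: Person(?p) ^ Motorbike(?m) ^ on(?p, ?m) -> rides(?p, ?m)
#                     rides(?p, ?m) -> Rider(?p)
rules = [
    ([("Person", "?p"), ("Motorbike", "?m"), ("on", "?p", "?m")],
     [("rides", "?p", "?m")]),
    ([("rides", "?p", "?m")], [("Rider", "?p")]),
]
inferred = infer(facts, rules)
```

The first rule corresponds to the object-to-object group and the second to the additional-information group, again only as hypothetical examples.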

Experiment
The experiment aimed at: (i) validating the D-STOR ontology to confirm its quality through experts' evaluations; and (ii) measuring the performance of D-STOR in discovering objects and relations.

Ontology validation
To validate the D-STOR ontology, the FOCA metric [22], a currently popular method of validating ontologies, was adopted. Basically, this method applies a question-answer process to collect experts' evaluations of the domain ontology. Table 2 shows the four groups of questions used in this study. Each question receives a score from 0 to 100 from each expert, based on his or her opinion. All of the experts' scores were then collected and used to compute the FOCA metric following Equation 2.

Table 2. List of questions
$$\hat{\mu}_i = \frac{\exp\left(-0.44 + 0.03(\bar{G}_1)_i + 0.02(\bar{G}_2)_i + 0.01(\bar{G}_3)_i + 0.02(\bar{G}_4)_i - 0.66\,E_i\right)}{1 + \exp\left(-0.44 + 0.03(\bar{G}_1)_i + 0.02(\bar{G}_2)_i + 0.01(\bar{G}_3)_i + 0.02(\bar{G}_4)_i - 0.66\,E_i\right)} \quad (2)$$
where
- $\bar{G}_1$, $\bar{G}_2$, $\bar{G}_3$, and $\bar{G}_4$ are the means of group 1, 2, 3, and 4, respectively;
- $E_i$ is the weight of the expert's experience.
To serve this purpose, 7 domain experts were invited and agreed to verify the D-STOR knowledge base. They spent 5 days reading the D-STOR ontology documents and 3 days reviewing the model. Finally, these experts evaluated D-STOR by giving a score for each question listed in Table 2. The distributions of the collected scores are visualized in Fig. 3. The calculated results of the FOCA metric and the Kruskal-Wallis analysis are presented in Table 3. As shown in Table 3, the p-value of the Kruskal-Wallis test was 0.176 (> 0.05), which indicated that there were no statistically significant differences among the experts' evaluations. Additionally, the FOCA metric reached 0.993, which indicated that the experts appreciated the quality and structure of the D-STOR ontology. In summary, this promising result showed that the D-STOR ontology received the experts' agreement and could therefore be used in this research.
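The logistic form of the FOCA computation can be sketched in Python as follows. The coefficients are those of Equation 2; the function name `foca_score`, the dictionary layout of the scores, and the sample values are assumptions made for illustration and do not reproduce the experts' actual data.

```python
import math
from statistics import mean


def foca_score(group_scores, experience):
    """Logistic FOCA value for one expert.

    group_scores: dict mapping group number (1-4) to that expert's
                  0-100 scores for the questions in the group (assumed layout).
    experience:   weight of the expert's experience.
    """
    g1, g2, g3, g4 = (mean(group_scores[k]) for k in (1, 2, 3, 4))
    z = -0.44 + 0.03 * g1 + 0.02 * g2 + 0.01 * g3 + 0.02 * g4 - 0.66 * experience
    return math.exp(z) / (1 + math.exp(z))


# Hypothetical scores for the nine questions, grouped as in Table 2.
sample = {1: [90, 85, 80], 2: [75, 90], 3: [100, 95], 4: [90, 80]}
score = foca_score(sample, experience=1)
lowest = foca_score({1: [0], 2: [0], 3: [0], 4: [0]}, experience=0)
```

Because the expression is a logistic function, the result always lies strictly between 0 and 1 and increases with every group mean.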

Evaluation of the D-STOR framework
The D-STOR framework aims at recognizing both image objects and their relations. Hence, the evaluation of this framework focused on measuring the performance of both object recognition and relationship recognition. To serve these two experimental targets, an image dataset annotated with image objects and their relationships was built. Specifically, 2000 traffic images in Vietnam were carefully selected from a set of traffic images crawled from the web over 2 weeks. Then YAT, an image annotation tool, and the vocabulary of the D-STOR ontology were used to label these images. Next, this image set was randomly divided into a training set and a test set at a ratio of 70% to 30%.
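A random 70%/30% split of this kind can be sketched as below. The function name `split_dataset`, the fixed seed, and the file names are assumptions for illustration; the paper does not state how the split was implemented.

```python
import random


def split_dataset(items, train_ratio=0.7, seed=0):
    """Shuffle a copy of the items and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 2000 annotated images.
image_names = [f"img_{i:04d}.jpg" for i in range(2000)]
train_set, test_set = split_dataset(image_names)
```

With 2000 images and a ratio of 0.7, this yields 1400 training images and 600 test images, with no image in both sets.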
To measure object recognition performance, we applied the transfer learning technique to the following deep learning models: (i) ResNet-50 [7]; (ii) InceptionNet-v3 [5]; (iii) DenseNet [8]; and (iv) MobileNet-v2 [9]. The accuracy of these models is visualized in Fig. 4, which shows that InceptionNet-v3 outperformed the others. Therefore, we selected InceptionNet-v3 as the deep learning model for image object recognition in the D-STOR framework. To discover relations among detected objects, the semantic reasoning engine used the detected objects as inputs to its inference process. The number of relations discovered by the semantic reasoning engine was compared to the number annotated by domain experts, and these results are visualized as cumulative lines in Fig. 5.

Conclusion
Ongoing work will focus on improving the D-STOR performance and extending this framework to other domains.

Group | Question
1 | Q1: Were the competency questions defined?
1 | Q2: Were the competency questions answered?
1 | Q3: Did the ontology reuse other ontologies?
2 | Q4: Did the ontology impose a maximum ontological commitment?
2 | Q5: Are the ontology properties coherent with the domain?
3 | Q6: Are there contradictory axioms?
3 | Q7: Are there redundant axioms?
4 | Q8: Does the reasoner bring modelling errors?
4 | Q9: Does the reasoner perform quickly?

Fig. 4. The accuracy performances of selected deep learning models

Fig. 5. Relations discovered by the semantic reasoning engine (D-STOR) and relations annotated by domain experts