With the growing quantity and diversity of publicly available web datasets, most notably Linked Open Data, recommending datasets that meet specific criteria has become an increasingly important, yet challenging problem. This task is of particular interest when addressing issues such as entity retrieval, semantic search and data linking. Here, we focus on the latter. We introduce a dataset recommendation approach that identifies linking candidates based on the schema overlap between datasets. Since an understanding of the content of specific datasets is a crucial prerequisite, we adopt the notion of dataset profiles, in which a dataset is characterized by a set of schema concept labels that best describe it and that can be enriched by retrieving their textual descriptions. We identify schema overlap with the help of a semantico-frequential concept similarity measure and a ranking criterion based on the tf*idf cosine similarity. The experiments, conducted over all available linked datasets on the Linked Open Data cloud, show that our method achieves an average precision of up to \(53\,\%\) for a recall of \(100\,\%\). As an additional contribution, our method returns the mappings between the schema concepts across datasets, a particularly useful input for the data linking step.
With the emergence of the Web of Data, in particular Linked Open Data (LOD) [1], an abundance of data has become available on the web. Dataset recommendation is becoming an increasingly important task to support challenges such as entity interlinking [2], entity retrieval or semantic search [3]. Particularly with respect to interlinking, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only very few well-established knowledge graphs show a high number of inlinks, with DBpedia being the most obvious target [4], while a large number of datasets are largely ignored.
This is due in part to the challenge of identifying suitable linking candidates without prior knowledge of the available datasets and their characteristics. Linked datasets vary significantly with respect to represented resource types, currentness, coverage of topics and domains, size, used languages, coherence, accessibility [5] or general quality aspects [6]. This heterogeneity poses significant challenges for data consumers attempting to find useful datasets. Hence, a long tail of datasets from the LOD cloud Footnote 1 has hardly been reused and adopted, while the majority of data consumption, linking and reuse focuses on established knowledge graphs such as DBpedia [7] or YAGO [8].
In line with [9], we define dataset recommendation as the problem of computing a rank score for each of a set of datasets \(D_T\) (for Target Dataset) so that the rank score indicates the relatedness of \(D_T\) to a given dataset, \(D_S\) (for Source Dataset). The rank scores indicate the likelihood that a dataset \(D_T\) contains linking candidates for \(D_S\).
We adopt the notion of a dataset profile, defined as a set of concept labels that describe the dataset. By retrieving the textual descriptions of each of these labels, we can map the label profiles to larger text documents. This representation provides richer contextual and semantic information and allows us to compute similarities between profiles efficiently and inexpensively.
Although different types of links can be defined across datasets, here we focus on the identity relation given by the statement “owl:sameAs”. Our working hypothesis is simple: datasets that share at least one concept, i.e., at least one pair of semantically similar concept labels, are likely to contain at least one potential pair of instances to be linked by an “owl:sameAs” statement. We base our recommendation procedure on this hypothesis and propose an approach in two steps: (1) for every \(D_S\), we identify a cluster Footnote 2 of datasets that share schema concepts with \(D_S\) and (2) we rank the datasets in each cluster with respect to their relevance to \(D_S\).
In step (1), we identify concept labels that are semantically similar by using a similarity measure based on the frequency of term co-occurrence in a large corpus (the web), combined with a semantic distance based on WordNet, without relying on string matching techniques [10]. For example, this allows us to recommend, for a dataset annotated by “school”, one annotated by “college”. In this way, we form clusters of “comparable datasets” for each source dataset. The intuition is that for a given source dataset, any of the datasets in its cluster is a potential target dataset for interlinking.
Step (2) focuses on ranking the datasets in a \(D_S\)-cluster with respect to their importance to \(D_S\). This allows us to evaluate the results in a more meaningful way and, of course, to provide quality results to the user. The ranking criterion should not be based on the amount of schema overlap, because potential to-link instances can be found in datasets sharing one class just as well as in datasets sharing a hundred. Therefore, we need a similarity measure on the profiles of the comparable datasets. We have proceeded by building a vector model for the document representations of the profiles and computing cosine similarities.
To evaluate the approach, we have used the current topology of the LOD as evaluation data (ED). As mentioned in the beginning, the LOD link graph is far from being complete, which complicates the interpretation of the obtained results—many false positives are in fact missing positives (missing links) from the evaluation data—a problem that we discuss in detail in the sequel. Note that as a result of the recommendation process, the user is not only given candidate datasets for linking, but also pairs of classes where to look for identical instances. This is an important advantage, allowing linking systems such as SILK [11] to be run more easily in order to verify the quality of the recommendation and perform the actual linking. Our experimental tests with SILK confirm the hypothesis on the incompleteness of the ED.
To sum up, the paper contains the following contributions: (1) new definitions of dataset profiles based on schema concepts, (2) a recommendation framework allowing to identify the datasets that share schema concepts with a given source dataset, (3) an efficient ranking criterion for these datasets, (4) an output of additional metadata such as pairs of similar concepts across source and target datasets, (5) a large range of reproducible experiments and in-depth analysis with all of our results made available.
We proceed to present the theoretical grounds of our technique in Sect. 2. Section 3 defines the evaluation framework that has been established and reports on our experimental results. Related approaches are presented and discussed in Sect. 4 before we conclude in Sect. 5.
Our recommendation approach relies on the notion of a dataset profile, which provides comparable representations of the datasets with the help of characteristic features. In this section, we first introduce the definitions of a dataset profile that we use in this study. Afterwards, we describe the profile-based recommendation technique that we apply.
A dataset profile is seen as a set of dataset characteristics that describe a dataset as accurately as possible and separate it maximally from other datasets. A feature-based representation of this kind allows us to compute distances or measure similarities between datasets (or, for that matter, profiles), which unlocks the dataset recommendation procedure. These descriptive characteristics, or features, can be of various kinds (statistical, semantic, extensional, etc.). As we observe in [12], a dataset profile can be defined based on a set of type (schema concept) names that represent the topic of the data and the covered domain. In line with that definition, we are interested here in intensional dataset characteristics in the form of a set of keywords together with their definitions that best describe a dataset.
(Dataset Label Profile). The label profile of a dataset D, denoted by \(\mathcal{P}_l(D)\), is defined as the set of n schema concept labels corresponding to D: \(\mathcal{P}_l(D)=\{L_i\}_{i=1}^{n}\).
Note that the representativity of the labels in \(\mathcal{P}_l(D)\) with respect to D can be improved by filtering out certain types. We rely on two main heuristics: (1) remove overly popular types (very generic ones such as \(\mathrm{Person}\) or \(\mathrm{Thing}\)), (2) remove types with too few instances in a dataset. These two heuristics are based on the intuition that the probability of finding identical instances of very popular or underpopulated classes is low. We support (1) experimentally in Sect. 3, while we leave (2) for future work.
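As a minimal illustration of heuristic (1), assuming the raw label profile is available as a list of strings and that the stop list of overly generic labels is curated by hand (the concrete entries below are hypothetical):

```python
# Hypothetical stop list of overly generic concept labels (heuristic 1);
# in practice such labels come from generic vocabularies such as FOAF or SKOS.
GENERIC_LABELS = {"thing", "agent", "person", "group", "concept"}

def clean_label_profile(labels):
    """Remove labels that are too generic to be discriminative for linking."""
    return [label for label in labels if label.lower() not in GENERIC_LABELS]

# clean_label_profile(["School", "Person", "Course"]) -> ["School", "Course"]
```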
Each of the concept labels in \(\mathcal{P}_l(D)\) can be mapped to a text document consisting of the label itself and a textual description of this label. This textual description can be the definition of the concept in its ontology, or any other external textual description of the terms composing the concept label. We define the document profile of a dataset in the following way.
(Dataset Document Profile). The document profile of a dataset D, \(\mathcal{P}_d(D)\), is defined as a text document constructed by the concatenation of the labels in \(\mathcal{P}_l(D)\) and the textual descriptions of these labels.
Note that there is no substantial difference between the two definitions given above: the document profile is an extended label profile in which additional terms, coming from the label descriptions, are included. This allows us to project the profile similarity problem onto a vector space by indexing the documents and using a term weighting scheme (e.g., tf*idf).
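For illustration, a document profile can be obtained by plain concatenation; `describe` below is a placeholder for a lookup of a label's textual description, e.g., against Linked Open Vocabularies:

```python
def document_profile(label_profile, describe):
    """Build the document profile of Definition 2 by concatenating each concept
    label with its textual description; `describe` is a placeholder callable
    that may return an empty string when no description is found."""
    parts = []
    for label in label_profile:
        parts.append(label)
        parts.append(describe(label) or "")
    return " ".join(p for p in parts if p)
```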
With the help of these two definitions, a profile can be constructed for any given dataset in a simple and inexpensive way, independently of its connectivity on the LOD. In other words, a profile can be computed just as easily for datasets that are already published and linked as for datasets that are yet to be published and linked, allowing the same representation to be used for both kinds of datasets and thus enabling their comparison by feature-based similarity measures.
As stated in the introduction, we rely on the simple intuition that datasets with similar intension have extensional overlap. Therefore, it suffices to identify at least one pair of semantically similar types in the schemas of two datasets in order to select these datasets as potential linking candidates. We are interested in the semantic similarity of concept labels in the dataset label profiles. There are many off-the-shelf similarity measures that can be applied, known from the ontology matching literature. We have focused on the well-known semantic measures of Wu and Palmer [13] and Lin [14], as well as on the UMBC measure [10], which combines semantic distance in WordNet with the frequency of occurrence and co-occurrence of terms in a large external corpus (the web). We provide the definition of that measure, since it is less well known and it proved to perform best in our experiments. For two labels, x and y, we have
$$\begin{aligned} sim_{UMBC}(x,y) = sim_{LSA}(x,y) + 0.5\,e^{-\alpha D(x,y)} \end{aligned}$$
(1)

where \(sim_{LSA}(x,y)\) is the Latent Semantic Analysis (LSA) [15] word similarity, which relies on word co-occurrence in the same contexts, computed over a three billion word corpus Footnote 3 of good quality English. D(x, y) is the minimal WordNet [16] path length between x and y. According to [10], using \(e^{-\alpha D(x,y)}\) to transform the simple shortest path length has been shown to be very effective when the parameter \(\alpha\) is set to 0.25.
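As a sketch of how such a semantico-frequential score can be assembled (in our experiments we call the UMBC web API rather than reimplementing the measure), assuming an LSA word similarity function `lsa_sim` is given and using NLTK's WordNet interface for the path length; the 0.5 weight on the WordNet term is an assumption based on the published description of the measure, not something stated here:

```python
import math
from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be downloaded

ALPHA = 0.25  # parameter value reported in [10]

def wordnet_min_path(x, y):
    """Minimal WordNet path length D(x, y) over all synset pairs (inf if no path)."""
    distances = [s1.shortest_path_distance(s2)
                 for s1 in wn.synsets(x) for s2 in wn.synsets(y)]
    distances = [d for d in distances if d is not None]
    return min(distances) if distances else float("inf")

def umbc_like_sim(x, y, lsa_sim):
    """Combine corpus-based LSA similarity with the transformed WordNet distance,
    in the spirit of Eq. (1); the 0.5 weight is assumed."""
    return lsa_sim(x, y) + 0.5 * math.exp(-ALPHA * wordnet_min_path(x, y))
```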
With a concept label similarity measure at hand, we introduce the notion of dataset comparability, based on the existence of shared intension.
(Comparable Datasets). Two datasets \(D'\) and \(D''\) are comparable if there exist \(L_i\) and \(L_j\) such that \(L_i \in \mathcal{P}_l(D')\), \(L_j \in \mathcal{P}_l(D'')\) and \(sim(L_i, L_j) \ge \theta\), where \(\theta \in [0,1]\).
A dataset recommendation procedure for the linking task returns, for a given source dataset, a set of target datasets ordered by their likelihood to contain instances identical to those in the source dataset.
Let \(D_S\) be a source dataset. We introduce the notion of a cluster of comparable datasets related to \(D_S\) , or \(CCD(D_S)\) for short, defined as the set of target datasets, denoted by \(D_T\) , that are comparable to \(D_S\) according to Definition 3. Thus, \(D_S\) is identified by its CCD and all the linking candidates \(D_T\) for this dataset are found in its cluster, following our working hypothesis.
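A sketch of step (1), assuming `sim` is a pairwise label similarity such as the one sketched above and `label_profiles` maps dataset identifiers to their label profiles (names are ours):

```python
def comparable(profile_a, profile_b, sim, theta=0.7):
    """Definition 3: at least one pair of labels must be similar enough."""
    return any(sim(la, lb) >= theta for la in profile_a for lb in profile_b)

def ccd(source, label_profiles, sim, theta=0.7):
    """Cluster of comparable datasets CCD(D_S): all candidate targets for `source`."""
    src_profile = label_profiles[source]
    return [d for d, profile in label_profiles.items()
            if d != source and comparable(src_profile, profile, sim, theta)]
```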
Finally, we need a ranking function that assigns scores to the datasets in \(CCD(D_S)\) with respect to \(D_S\) expressing the likelihood of a dataset in \(CCD(D_S)\) to contain identical instances with those of \(D_S\) . To this end, we need a similarity measure on the dataset profiles.
We have worked with the document profiles of the datasets (Definition 2). Since datasets are represented as text documents, we can easily build a vector model by indexing the documents in the corpus formed by all datasets of interest – the ones contained in one single CCD. We use a tf*idf weighting scheme, which allows us to compute the cosine similarity between the document vectors and thus assign a ranking score to the datasets in a CCD with respect to a given dataset from the same CCD. Note that this approach allows us to take into account the intensional overlap between datasets prior to ranking and indexing – we are certain to work only with potential linking candidates when we rank, which improves the quality of the ranks. For a given dataset \(D_S\), the procedure returns datasets from \(CCD(D_S)\), ordered by their cosine similarity to \(D_S\).
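A minimal sketch of step (2) using scikit-learn's tf*idf vectorizer and cosine similarity over the document profiles of a single CCD (function and variable names are ours):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_ccd(source_doc, target_docs):
    """Rank the target document profiles of one CCD by tf*idf cosine similarity
    to the source document profile; `target_docs` maps dataset names to profiles."""
    names = list(target_docs)
    corpus = [source_doc] + [target_docs[n] for n in names]
    tfidf = TfidfVectorizer().fit_transform(corpus)          # index only this CCD
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```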
Finally, an important outcome of the recommendation procedure is the fact that, along with an ordered list of linking candidates, the user is provided with the pairs of types of the two datasets—a source and a target—where to look for identical instances. This information considerably facilitates the linking process, to be performed by an instance matching tool, such as SILK.
We illustrate our approach with an example. We consider education-data-gov-uk Footnote 4 as a source dataset (\(D_S\)). The first step consists in retrieving the schema concepts from this dataset and constructing a clean label profile \(\mathcal{P}_l(\textit{education-data-gov-uk})\) (we filter out noisy labels, as discussed above), as well as its corresponding document profile (Definitions 1 and 2, respectively). We then perform a semantic comparison between the labels in \(\mathcal{P}_l(\textit{education-data-gov-uk})\) and all labels in the profiles of the accessible LOD datasets. By fixing \(\theta =0.7\), we generate \(CCD(\textit{education-data-gov-uk})\) containing the set of comparable datasets \(D_T\), as described in Definition 3. The second step consists of ranking the \(D_T\) datasets in \(CCD(\textit{education-data-gov-uk})\) by computing the cosine similarity between their document profiles and \(\mathcal{P}_d(\textit{education-data-gov-uk})\). The top 5 ranked candidate datasets to be linked with education-data-gov-uk are (1) rkb-explorer-courseware Footnote 5, (2) rkb-explorer-courseware Footnote 6, (3) rkb-explorer-southampton Footnote 7, (4) rkb-explorer-darmstadt Footnote 8, and (5) oxpoints Footnote 9.
Finally, for each of these datasets, we retrieve the pairs of shared (similar) schema concepts extracted during the comparison step.
We proceed to report on the experiments conducted in support of the proposed recommendation method.
The quality of the outcome of a recommendation process can be evaluated along a number of dimensions. Ricci et al. [17] provide a large review of recommender systems evaluation techniques and cite three common types of experiments: (i) offline experiments, where recommendation approaches are compared without user interaction, (ii) user studies, where a small group of subjects experiments with the system and reports on the experience, and (iii) online experiments, where real user populations interact with the system.
For the task of dataset recommendation, the system suggests to the user a list of n target dataset candidates to be linked to a given source dataset. Since there is no common evaluation framework for dataset recommendation, we evaluate our method with an offline experiment using a pre-collected set of linked data considered as evaluation data (ED). The most straightforward, although not unproblematic (see below), choice of evaluation data for the data linking recommendation task is the existing link topology of the current version of the LOD cloud.
In our recommendation process, for a given source dataset \(D_S\) , we identify a cluster of target datasets, \(D_T\) , that we rank with respect to \(D_S\) (cf. Sect. 2.2). To evaluate the quality of the recommendation results given the ED of our choice, we compute the common evaluation measures for recommender systems, precision and recall, defined as functions of the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as follows:
$$\begin{aligned} Precision = \frac{TP}{TP+FP}, \qquad Recall = \frac{TP}{TP+FN} \end{aligned}$$

The number of potentially useful results that can be presented to the user has to be limited. Therefore, to assess the effectiveness of our approach, we rely on the measure of precision at rank k, denoted by P@k. Complementarily, we evaluate the precision of our recommendation when the level of recall is \(100\,\%\) by using the mean average precision at \(Recall = 1\), MAP@R, given as:
$$\begin{aligned} MAP@R = \frac{1}{N_{DS}} \sum_{q=1}^{N_{DS}} P@R(q) \end{aligned}$$

where R(q) corresponds to the rank at which recall reaches 1 for the qth source dataset and \(N_{DS}\) is the total number of source datasets in the evaluation.
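The following sketch shows how these measures can be computed from a ranked recommendation list and the set of targets actually linked in the ED; averaging P@R(q) over all source datasets is our reading of MAP@R:

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k recommended datasets that are linked in the ED."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def rank_at_full_recall(ranked, relevant):
    """R(q): smallest rank at which every relevant target has been recommended."""
    seen = set()
    for rank, d in enumerate(ranked, start=1):
        seen.add(d)
        if relevant <= seen:
            return rank
    return len(ranked)  # recall never reaches 1 within the returned list

def map_at_r(runs):
    """MAP@R over all source datasets; `runs` is a list of (ranked_list, relevant_set) pairs."""
    return sum(precision_at_k(ranked, rel, rank_at_full_recall(ranked, rel))
               for ranked, rel in runs) / len(runs)
```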
We started by crawling all datasets available in the LOD cloud group on the Data Hub Footnote 10 in order to extract their profiles. In this crawl, only 90 datasets were accessible via endpoints or via dump files. First, for each accessible dataset, we extracted its implicit and explicit schema concepts and their labels, as described in Definition 1. The explicit schema concepts are provided by resource types, while the implicit schema concepts are derived from the definitions of a resource's properties [18]. As noted in Sect. 2, some labels such as “Group”, “Thing”, “Agent” or “Person” are very generic and are therefore considered as noisy labels. To address this problem, we filter out schema concepts described by generic vocabularies such as VoID Footnote 11, FOAF Footnote 12 and SKOS Footnote 13. The dataset document profiles, as defined in Definition 2, are constructed by extracting the textual descriptions of the labels, querying the Linked Open Vocabularies Footnote 14 (LOV) with each of the concept labels per dataset.
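As an illustration of the extraction of explicit schema concept labels, a query of the following kind can be issued against a dataset's SPARQL endpoint (a sketch using the SPARQLWrapper library; the endpoint URL is whatever the dataset exposes):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def explicit_concept_labels(endpoint_url):
    """Retrieve the labels of the types (classes) explicitly used in a dataset."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?label WHERE {
            ?s a ?type .
            ?type rdfs:label ?label .
            FILTER (lang(?label) = "" || langMatches(lang(?label), "en"))
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["label"]["value"] for b in results["results"]["bindings"]]
```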
To form the clusters of comparable datasets from Definition 3, we compute the semantico-frequential similarity between labels (given in Eq. (1)). We apply this measure via its publicly available web API service Footnote 15. In addition, we tested our system with two more WordNet-based semantic similarity measures: Wu and Palmer's and Lin's. For this purpose, we used the 2013 version of the WS4J Footnote 16 Java API.
The evaluation data (ED) corresponds to the outgoing and incoming links extracted from the generated VoID file using the datahub2void tool Footnote 17 . It is made available on http://www.lirmm.fr/benellefi/void.ttl. We note that out of 90 accessible datasets, only those that are linked to at least one accessible dataset in the ED are evaluated in the experiments.
We started by considering each dataset in the ED as an unlinked (newly published) source dataset \(D_S\). Then, we ran the CCD-CosineRank workflow, as described in Sect. 2.2. The first step is to form \(CCD(D_S)\) for each \(D_S\). The CCD construction process depends on the similarity measure on dataset profiles; we therefore evaluated the CCD clusters in terms of recall for different levels of the threshold \(\theta\) (cf. Definition 3) for each of the three similarity measures that we apply. We observed that the recall value remains \(100\,\%\) in the following threshold intervals per similarity measure: Wu and Palmer: \(\theta \in [0,0.9]\); Lin: \(\theta \in [0,0.8]\); UMBC: \(\theta \in [0,0.7]\).
The CCD construction step ensures a recall of \(100\,\%\) for various threshold values, which we use to evaluate the ranking step of our recommendation process by the Mean Average Precision (MAP@R) at the maximal recall level, as defined above. The results in Fig. 1 show the highest performance for the UMBC measure, with \(MAP@R \cong 53\,\%\) for \(\theta = 0.7\), while the best MAP@R values for the Wu and Palmer and Lin measures are, respectively, \(50\,\%\) for \(\theta = 0.9\) and \(51\,\%\) for \(\theta = 0.8\). Guided by these observations, we evaluated our ranking in terms of precision at four rank cut-offs k, as shown in Table 1. Based on these results, we choose UMBC with a threshold of \(\theta = 0.7\) as the default setting for CCD-CosineRank, since it performs best for three out of the four k-values and is more stable than the other two measures, especially in terms of MAP@R.
To the best of our knowledge, there does not exist a common benchmark for dataset interlinking recommendation. Since our method uses both label profiles and document profiles, we implemented two recommendation approaches to be considered as baselines – one using document profiles only, and another one using label profiles:
We begin with a note on the vocabulary filtering that we perform (Sect. 3.2). We underline that we have identified the types which improve or decrease the performance empirically. As expected, vocabularies that are very generic and widespread have a negative impact, acting like hub nodes which dilute the results. For comparison, the results of the recommendation before removal are made available on http://www.lirmm.fr/benellefi/RankNoFilter.csv.
The different experiments described above show a high performance of the introduced recommendation approach, with an average precision of \(53\,\%\) for a recall of \(100\,\%\). Likewise, it may be observed that this performance is independent of the dataset size (number of triples) and of the schema cardinality (number of schema concepts per dataset). However, we note that better performance was obtained for datasets from the geographic and governmental domains, with precision and recall of \(100\,\%\). Naturally, this is due to the fact that a recommender system in general, and our system in particular, performs better with datasets that have high-quality schema descriptions and that reuse existing vocabularies (the case for the two domains cited above), which is considered linked data modeling best practice. An effort has to be made toward improving the quality of published datasets [19].
We believe that our method would receive a fairer evaluation if better evaluation data, in the form of a ground truth, were used. Indeed, our results are affected by the problem of false positive overestimation. Since the data were not collected using the recommender system under evaluation, we are forced to assume that the false positive items would not have been used even if they had been recommended, i.e., that they are uninteresting or useless to the user. This assumption is, however, generally false, for example when the set of unused items contains interesting items that the user simply did not select. In our case, we use the declared links in the LOD cloud as ED, which are certain but far from complete enough to be considered a ground truth. Thus, in the recommendation process the number of false positives tends to be overestimated; in other words, a significant number of missing positives in the ED translates into false positives in the recommendation process.
To further illustrate the effect of false positive overestimation, we ran SILK as an instance matching tool to discover links between \(D_S\) and the corresponding \(D_T\)s that have been considered as false positives in our ED. SILK takes as input a Link Specification Language file, which contains the instance matching configuration. We recall that our recommendation procedure provides pairs of shared or similar types between \(D_S\) and every \(D_T\) in its corresponding CCD, which are particularly useful for configuring SILK. However, all additional information, such as the datatype properties of interest, has to be given manually. This makes the process very time consuming and tedious to perform over the entire LOD. Therefore, as an illustration, we ran the instance matching tool on two flagship examples of false positive \(D_T\)s:
We provide the set of newly discovered linksets to be added to the LOD topology; the generated linksets and the corresponding SILK configurations are available at http://www.lirmm.fr/benellefi/Silk_Matching.
It should be noted that the recommendation results provided by our approach may contain some broader candidate datasets with respect to the source dataset. For example, two datasets that share schema labels such as books and authors are considered as candidates even when they are from different domains like science vs. literature. This outcome can be useful for predicting links such as “rdfs:seeAlso” (rather than “owl:sameAs”). We have chosen to avoid the inclusion of instance-related information in order to keep the complexity of the system as low as possible and still provide reasonable precision by guaranteeing a \(100\,\%\) recall.
As a conclusion, we outline three directions of work in terms of dataset quality that can considerably facilitate the evaluation of any recommender system in that field: (1) improving descriptions and metadata; (2) improving accessibility; (3) providing a reliable ground truth and benchmark data for evaluation.
With respect to finding relevant datasets on the Web, we cite briefly several studies on discovering relevant datasets for query answering. Based on well-known data mining strategies, [20, 21] present techniques to find relevant datasets, which offer contextual information corresponding to the user queries. A feedback-based approach to incrementally identify new datasets for domain-specific linked data applications is proposed in [22]. User feedback is used as a way to assess the relevance of the candidate datasets.
In the following, we cite approaches that have been devised for the datasets interlinking candidates recommendation task and which are, therefore, directly relevant to our work. Nikolov and d’Aquin [23] propose a keyword-based search approach to identify candidate sources for data linking consisting of two steps: (i) searching for potentially relevant entities in other datasets using as keywords randomly selected instances over the literals in the source dataset, and (ii) filtering out irrelevant datasets by measuring semantic concept similarities obtained by applying ontology matching techniques.
Mehdi et al. [24] propose a method to automatically identify relevant public SPARQL endpoints from a list of candidates. First, the process requires as input a set of domain-specific keywords, which are extracted from a local source or can be provided manually by an expert. Then, using natural language processing and query expansion techniques, the system generates a set of queries that seek exact literal matches between the introduced keywords and the target datasets; for each term supplied to the algorithm, the system runs a set of eight queries, built as the cross-product of two sets of query variants. Finally, the produced output consists of a list of potentially relevant SPARQL endpoints of datasets for linking. In addition, an interesting contribution of this technique is that the bindings returned for the subject and predicate query variables are recorded and logged when a term match is found on some particular SPARQL endpoint. These records are useful in the linking step.
Leme et al. [25] present a ranking method for datasets with respect to their relevance for the interlinking task. The ranking is based on Bayesian criteria and on the popularity of the datasets, which affects the generality of the approach. The authors extend this work and overcome this drawback in [9] by exploring the correlation between different sets of features—properties, classes and vocabularies—and the links to compute new rank score functions for all the available linked datasets.
None of the studies outlined above have evaluated the ranking measure in terms of precision/recall, except for [9] which, according to the authors, achieves a mean average precision of around \(60\,\%\) and an expected recall of \(100\,\%\) with rankings over all LOD datasets. However, a direct comparison to our approach seems unfair, since the authors did not provide the list of the datasets and their rank performance per source dataset.
In comparison to the work discussed above, our approach has the potential to overcome a series of complexity-related problems, namely the complexity of generating the matching in [23], of producing the set of domain-specific keywords as input in [24], and of exploring the feature sets of all the datasets in the network in [9]. Our recommendation results are much easier to obtain, since we only manipulate the schema part of the datasets. They are also easier to interpret and apply, since we automatically recommend the corresponding schema concept mappings together with the candidate datasets.
Following linked data best practices, metadata designers reuse and build on existing RDF schemas and vocabularies instead of replicating them. Motivated by this observation, we propose CCD-CosineRank, an interlinking candidate dataset recommendation approach based on concept label profiles and schema overlap across datasets. Our approach consists of identifying clusters of comparable datasets and then ranking the datasets in each cluster with respect to a given dataset. We discuss three different similarity measures for establishing the relevance of our recommendations. We evaluate our approach on real data from the LOD cloud and compare it to two baseline methods. The results show that our method achieves a mean average precision of around \(53\,\%\) for a recall of \(100\,\%\), which considerably reduces the cost of dataset interlinking. In addition, as a post-processing step, our system returns sets of schema concept mappings between source and target datasets, which considerably decreases the interlinking effort and allows the quality of the recommendation to be verified explicitly.
In the future, we plan to improve the evaluation framework by developing more reliable and complete evaluation data for dataset recommendation. We plan to elaborate a ground truth based on certain parts of the LOD, possibly by using crowdsourcing techniques, in order to deal with the false positive overestimation problem. Further work should go into obtaining high-quality profiles, in particular by considering the population of the schema elements. We also plan to investigate the effectiveness of machine learning techniques, such as classification or clustering, for the recommendation task. One of the conclusions of our study is that the recommendation approach is limited by the lack of accessibility, explicit metadata and quality descriptions of the datasets. While this can be given as advice to data publishers, in the future we will also work on the development of recommendation methods for datasets with noisy and incomplete descriptions.
We note that we use the term “cluster” in its general meaning, referring to a set of datasets grouped together by their similarity and not in a machine learning sense.