OPAL Open Data Hackathon

On 6 April 2020, the OPAL Open Data Hackathon took place as a remote event. It focused on mobility data, for which participants developed ideas and software solutions. To support this, information on possible tasks, data formats and data collections was provided on the Hackathon website, allowing participants to explore the Semantic Web and its description languages on their own. Two winners were chosen from the submitted solutions.

Winner: Spatial visualization of available datasets

Nikit Srivastava developed his solution Show-Geo for displaying datasets on a map. A software component queries datasets and the associated spatial data in DCAT format using SPARQL; the resulting polygons are then made available online through a REST interface. A second component prepares the data for display in web browsers. Users can browse worldwide data on a map, which shows the number of available datasets per geographical region; depending on the zoom level, these counts are aggregated into clusters. Technologies used include Java, Spring Boot and Apache Jena as well as JavaScript, Node, React (Angular in development) and Mapbox.
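Show-Geo's own code is not reproduced in this post; purely as an illustration of the two steps described above, the following Python sketch pairs a DCAT-style SPARQL query for spatial metadata with a zoom-dependent grid clustering of dataset locations. The query, the grid rule and all coordinates are assumptions for illustration, not the actual implementation.

```python
from collections import Counter

# Illustrative query: DCAT datasets with their spatial coverage
# (dcat: and dct: are the standard DCAT/Dublin Core namespaces).
DCAT_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?spatial WHERE {
  ?dataset a dcat:Dataset ;
           dct:spatial ?spatial .
}
"""

def cluster_counts(points, zoom):
    """Count datasets per grid cell; cells shrink by half per zoom step,
    mimicking zoom-dependent clustering on a map."""
    cell = 360.0 / (2 ** zoom)  # degrees of lat/lon per cell at this zoom
    counts = Counter()
    for lat, lon in points:
        counts[(int(lat // cell), int(lon // cell))] += 1
    return counts

# Toy locations: two datasets near Paderborn, one near Munich.
points = [(51.7, 8.75), (51.71, 8.76), (48.1, 11.6)]
# At a coarse zoom, the two Paderborn points share one cluster cell.
clusters = cluster_counts(points, zoom=6)
```

At coarser zoom levels more points fall into the same cell, so the displayed cluster counts grow, which matches the behaviour described for the map view.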

Winner: Classification of datasets for category assignment

Ana Alexandra Morim da Silva contributed Theme-Classify, a method for classifying datasets. In practice, data collections contain datasets that are not assigned to any category. To assign categories automatically, the description texts of the datasets are analyzed, under the assumption that descriptions of datasets in the same category use similar words. Based on this assumption, the system determines which word combinations of already categorized datasets are statistically most similar to those of an uncategorized dataset.

In the contribution, a SPARQL query determines the names, descriptions and categories of datasets. The data is then split into training and test sets. After normalizing the words and removing stopwords, vectors are computed; TF-IDF (term frequency and inverse document frequency) and decision trees are used in the analysis. Users can specify whether J48 or Naive Bayes is used and choose the size of the n-grams. Finally, an evaluation of the correctness of the assignments is provided. The software uses Java, WEKA, Apache Jena, SPARQL and Stanford NLP, among others.
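The submitted solution builds on WEKA with J48 or Naive Bayes; as a stdlib-only illustration of the underlying idea, comparing TF-IDF vectors of description texts to find the statistically most similar categorized dataset, one might sketch the following (toy data, smoothed IDF, and a nearest-neighbour decision instead of the actual WEKA classifiers):

```python
import math
from collections import Counter

def classify(train, query):
    """Assign `query` the category of the most TF-IDF-similar training text.
    train: list of (category, token list); query: token list."""
    cats, docs = zip(*train)
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency

    def vec(doc):
        tf = Counter(doc)
        # Smoothed IDF so terms shared by all documents keep a small weight;
        # terms unseen in training are dropped.
        return {t: (c / len(doc)) * math.log((1 + n) / (1 + df[t]))
                for t, c in tf.items() if t in df}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    tvecs = [vec(d) for d in docs]
    qv = vec(query)
    return cats[max(range(n), key=lambda i: cosine(qv, tvecs[i]))]

# Toy "dataset descriptions" with known categories.
train = [
    ("transport", "bus stop timetable city transport lines".split()),
    ("environment", "air quality sensor measurement environment".split()),
]
```

A description mentioning bus timetables would then be assigned the "transport" category, following the assumption that same-category descriptions share vocabulary.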

Closing of the event

We sincerely thank all participants. A certificate of participation will be provided to all students who submitted a Hackathon result. The two winners will also share a prize, preferably in the form of Hasentalers, to support the local economy in Paderborn. Links to Open Data for the Paderborn region can still be found on the website; perhaps they can be used at a future event. If you have any questions about the event, please contact Adrian Wilke; contact details are available on the DICE website. This event was supported by the German Federal Ministry of Transport and Digital Infrastructure (BMVI) in the project OPAL (no. 19F2028A).

OPAL at conferences in 2019

The results of the OPAL project are mainly published as deliverables, some of which consist of software available on GitHub. In addition, research on underlying concepts in the domain of the Semantic Web is conducted and presented at scientific conferences, through which OPAL becomes known. This text collects abstracts and links for research related to OPAL.

Research articles and abstracts

The 18th International Semantic Web Conference (ISWC 2019)

LimesWebUI – Link Discovery Made Simple

Abstract: In this paper we present LimesWebUI, our web interface of Limes. Limes, the Link Discovery Framework for Metric Spaces, is a framework for discovering links between entities contained in Linked Data sources. LimesWebUI assists the end user during the link discovery process. By representing the link specifications (LS) as interlocking blocks, our interface eases the manual creation of links for users who already know which LS they would like to execute. However, most users do not know which LS suits their linking task best and therefore need help throughout this process. Hence, our interface provides wizards which allow the easy configuration of many link discovery machine learning algorithms without requiring the user to enter a manual LS. We evaluate the usability of the interface by using the standard system usability scale questionnaire. Our overall usability score of 76.5 suggests that the online interface is consistent, easy to use, and the various functions of the system are well integrated.

Sherif, Mohamed Ahmed ; Pestryakova, Svetlana ; Dreßler, Kevin ; Ngonga Ngomo, Axel-Cyrille: LimesWebUI – Link Discovery Made Simple. In: 18th International Semantic Web Conference (ISWC 2019) : CEUR-WS.org, 2019
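To make the notion of a link specification concrete: an LS combines similarity measures with acceptance thresholds. A minimal hand-written Python sketch follows; the measure, attributes and thresholds are invented for illustration, while LimesWebUI composes such specifications graphically from blocks.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity of two labels."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def link_spec(e1, e2):
    """An atomic conjunction, roughly AND(jaccard(label) >= 0.5,
    jaccard(city) >= 1.0) -- the kind of LS a user would assemble
    from interlocking blocks."""
    return (jaccard(e1["label"], e2["label"]) >= 0.5
            and jaccard(e1["city"], e2["city"]) >= 1.0)

# Toy entities from two hypothetical sources.
a = {"label": "Heinz Nixdorf Museum", "city": "Paderborn"}
b = {"label": "Heinz Nixdorf MuseumsForum", "city": "Paderborn"}
c = {"label": "Paderborn Cathedral", "city": "Paderborn"}
```

Here `a` and `b` would be linked, while `a` and `c` would not; the machine learning wizards mentioned in the abstract automate exactly the choice of such measures and thresholds.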

THOTH: Neural Translation and Enrichment of Knowledge Graphs

Abstract: Knowledge Graphs are used in an increasing number of applications. Although considerable human effort has been invested into making knowledge graphs available in multiple languages, most knowledge graphs are in English. Additionally, regional facts are often only available in the language of the corresponding region. This lack of multilingual knowledge availability clearly limits the porting of machine learning models to different languages. In this paper, we aim to alleviate this drawback by proposing THOTH, an approach for translating and enriching knowledge graphs. THOTH extracts bilingual alignments between a source and target knowledge graph and learns how to translate from one to the other by relying on two different recurrent neural network models along with knowledge graph embeddings. We evaluated THOTH extrinsically by comparing the German DBpedia with the German translation of the English DBpedia on two tasks: fact checking and entity linking. In addition, we ran a manual intrinsic evaluation of the translation. Our results show that THOTH is a promising approach which achieves a translation accuracy of 88.56%. Moreover, its enrichment improves the quality of the German DBpedia significantly, as we report +18.4% accuracy for fact validation and +19% F1 for entity linking.

Moussallem, Diego ; Soru, Tommaso ; Ngonga Ngomo, Axel-Cyrille: THOTH: Neural Translation and Enrichment of Knowledge Graphs. In: International Semantic Web Conference, 2019, S. 505–522

Semantic Web for Machine Translation: Challenges and Directions

Abstract: A large number of machine translation approaches have recently been developed to facilitate the fluid migration of content across languages. However, the literature suggests that many obstacles must still be dealt with to achieve better automatic translations. One of these obstacles is lexical and syntactic ambiguity. A promising way of overcoming this problem is using Semantic Web technologies. This article is an extended abstract of our systematic review on machine translation approaches that rely on Semantic Web technologies for improving the translation of texts. Overall, we present the challenges and opportunities in the use of Semantic Web technologies in Machine Translation. Moreover, our research suggests that while Semantic Web technologies can enhance the quality of machine translation outputs for various problems, the combination of both is still in its infancy.

Moussallem, D., Wauer, M., & Ngonga Ngomo, A.-C. (2019). Semantic Web for Machine Translation: Challenges and Directions. In International Semantic Web Conference (pp. 8).

Towards More Intelligent SPARQL Querying Interfaces

Abstract: Over the years, the Web of Data has grown significantly. Various interfaces such as SPARQL endpoints, data dumps, and Triple Pattern Fragments (TPF) have been proposed to provide access to this data. Studies show that many SPARQL endpoints have availability issues, while data dumps do not provide live querying capabilities. The TPF solution aims to provide a trade-off between availability and performance by dividing the workload among TPF servers and clients: the TPF server only executes the triple patterns of a given SPARQL query, while the TPF client performs the joins between the triple patterns to compute the final result set. High availability is achieved in TPF, but the increased network bandwidth and query execution time lower the performance. We want to propose a more intelligent SPARQL querying server that keeps the high availability along with high query execution performance while minimizing the network bandwidth. The proposed server will offer query execution services (single triple patterns or even join execution) according to its current workload. If a server is free, it should be able to execute the complete SPARQL query. Thus, the server will offer execution services while avoiding going beyond the maximum query processing limit, i.e. the point after which the performance starts decreasing or the service even shuts down. Furthermore, we want to develop a more intelligent client, which keeps track of a server's processing capabilities and therefore avoids DoS attacks and crashes.

Khan, H. (2019). Towards More Intelligent SPARQL Querying Interfaces. International Semantic Web Conference.
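The server/client split described in the abstract can be pictured with a toy in-memory "server" that answers single triple patterns while the client performs the join; the data and the query are invented for illustration.

```python
# Toy TPF-style split: the "server" evaluates single triple patterns,
# the "client" joins their results.
TRIPLES = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "livesIn", "berlin"),
}

def server_match(pattern):
    """Server side: evaluate one triple pattern (None marks a variable)."""
    return [t for t in TRIPLES
            if all(p is None or p == v for p, v in zip(pattern, t))]

def client_query():
    """Client side: join '?x knows ?y . ?y livesIn ?z' pattern by pattern."""
    results = []
    for x, _, y in server_match((None, "knows", None)):
        for _, _, z in server_match((y, "livesIn", None)):  # ?y is now bound
            results.append((x, y, z))
    return results
```

Each `server_match` call corresponds to one request to a TPF server; the proposal above is about letting a free server take over whole joins (here, the body of `client_query`) instead of only single patterns.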

Unsupervised Discovery of Corroborative Paths for Fact Validation

Abstract: Any data publisher can make RDF knowledge graphs available for consumption on the Web. This is a direct consequence of the decentralized publishing paradigm underlying the Data Web, which has led to more than 150 billion facts on more than 3 billion things being published on the Web in more than 10,000 RDF knowledge graphs over the last decade. However, the success of this publishing paradigm also means that the validation of the facts contained in RDF knowledge graphs has become more important than ever before. Several families of fact validation algorithms have been developed over the last years to address several settings of the fact validation problems. In this paper, we consider the following fact validation setting: Given an RDF knowledge graph, compute the likelihood that a given (novel) fact is true. None of the current solutions to this problem exploits RDFS semantics—especially domain, range and class subsumption information. We address this research gap by presenting an unsupervised approach dubbed COPAAL, which extracts paths from knowledge graphs to corroborate (novel) input facts. Our approach relies on a mutual information measure that takes the RDFS semantics underlying the knowledge graph into consideration. In particular, we use the information shared by predicates and paths within the knowledge graph to compute the likelihood of a fact being corroborated by the knowledge graph. We evaluate our approach extensively using 17 publicly available datasets. Our results indicate that our approach outperforms state-of-the-art unsupervised approaches significantly by up to 0.15 AUC-ROC. We even outperform supervised approaches by up to 0.07 AUC-ROC. The source code of COPAAL is open-source and is available at https://github.com/dice-group/COPAAL.

Syed, Z. H., Röder, M. & Ngomo, A.-C. N. (2019). Unsupervised Discovery of Corroborative Paths for Fact Validation. In C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. Cruz, A. Hogan, J. Song, M. Lefrançois & F. Gandon (eds.), The Semantic Web — ISWC 2019 (p./pp. 630–646), Cham: Springer International Publishing. ISBN: 978-3-030-30793-6
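COPAAL's actual measure additionally exploits RDFS semantics; reduced to its statistical core, the mutual information between a predicate and a corroborating path can be estimated from co-occurrence counts over subject-object pairs, as in this sketch with invented numbers.

```python
import math

def pmi(n_both, n_pred, n_path, n_total):
    """Pointwise mutual information between a predicate and a path,
    estimated from co-occurrence counts over entity pairs."""
    p_both = n_both / n_total
    p_pred = n_pred / n_total
    p_path = n_path / n_total
    return math.log(p_both / (p_pred * p_path))

# Toy counts: of 1000 entity pairs, 50 carry the predicate, 60 are
# connected by the path, and 40 have both -- the path strongly
# corroborates the predicate (PMI > 0).
score = pmi(40, 50, 60, 1000)
```

A PMI near zero would mean predicate and path co-occur no more than chance predicts, so the path offers no evidence for the fact.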

International Conference on Web Engineering (ICWE 2019)

Dragon: Decision Tree Learning for Link Discovery

Abstract: The provision of links across RDF knowledge bases is regarded as fundamental to ensure that knowledge bases can be used jointly to address real-world needs of applications. The growth of knowledge bases both with respect to their number and size demands the development of time-efficient and accurate approaches for the computation of such links. This is generally done with the aid of machine learning approaches, such as, e.g., decision trees. While decision trees are known to be fast, they are generally outperformed in the link discovery task by the state of the art in terms of quality, i.e. F-measure. In this work, we present Dragon, a fast decision-tree-based approach that is both efficient and accurate. Our approach was evaluated by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with respect to its F-measure while being 18 times faster on average than existing algorithms for link discovery on RDF knowledge bases. Furthermore, we investigate why Dragon significantly outperforms J48 in terms of link accuracy. We provide an open-source implementation of our algorithm in the LIMES framework.

Obraczka, Daniel ; Ngonga Ngomo, Axel-Cyrille: Dragon: Decision Tree Learning for Link Discovery. In: Bakaev, Maxim ; Frasincar, Flavius ; Ko, In-Young (eds.): ICWE : Springer, 2019 (Lecture Notes in Computer Science 11496). – ISBN 978-3-030-19274-7, S. 441–456

30th ACM Conference on Hypertext and Social Media

Ranking on Very Large Knowledge Graphs

Abstract: Ranking plays a central role in a large number of applications driven by RDF knowledge graphs. Over the last years, many popular RDF knowledge graphs have grown so large that rankings for the facts they contain cannot be computed directly using the currently common 64-bit platforms. In this paper, we tackle two problems: Computing ranks on such large knowledge bases efficiently and incrementally. First, we present D-HARE, a distributed approach for computing ranks on very large knowledge graphs. D-HARE assumes the random surfer model and relies on data partitioning to compute matrix multiplications and transpositions on disk for matrices of arbitrary size. Moreover, the data partitioning underlying D-HARE allows the execution of most of its steps in parallel. As very large knowledge graphs are often updated periodically, we tackle the incremental computation of ranks on large knowledge bases as a second problem. We address this problem by presenting I-HARE, an approximation technique for calculating the overall ranking scores of a knowledge graph without the need to recalculate the ranking from scratch at each new revision. We evaluate our approaches by calculating ranks on the 3 × 10⁹ and 2.4 × 10⁹ triples from Wikidata and LinkedGeoData, respectively. Our evaluation demonstrates that D-HARE is the first holistic approach for computing ranks on very large RDF knowledge graphs. In addition, our incremental approach achieves a root mean squared error of less than 10⁻⁷ in the best case. Both D-HARE and I-HARE are open-source and are available at: https://github.com/dice-group/incrementalHARE.

Desouki, Abdelmoneim Amer ; Röder, Michael ; Ngonga Ngomo, Axel-Cyrille: Ranking on Very Large Knowledge Graphs. In: Proceedings of the 30th ACM Conference on Hypertext and Social Media, 2019, S. 163–171
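D-HARE's contribution lies in computing this at a scale where the matrices no longer fit into memory; the random-surfer ranking itself is ordinary power iteration, sketched here on a three-node toy graph (damping factor and graph are illustrative, not from the paper).

```python
def surfer_ranks(edges, n, d=0.85, iters=60):
    """Power iteration under the random-surfer model:
    with probability d follow an out-edge, else teleport uniformly."""
    out = [0] * n
    for s, _ in edges:
        out[s] += 1
    r = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) / n] * n          # teleport mass
        for s, t in edges:
            nxt[t] += d * r[s] / out[s]  # mass along out-edges
        dangling = sum(r[i] for i in range(n) if out[i] == 0)
        for i in range(n):
            nxt[i] += d * dangling / n   # dangling mass spread uniformly
        r = nxt
    return r

# Node 1 receives links from both other nodes, so it ranks highest.
ranks = surfer_ranks([(0, 1), (1, 2), (2, 1)], n=3)
```

D-HARE performs the same multiplication of the rank vector with the (transposed) transition matrix, but partitioned on disk so the matrix size is no longer bounded by main memory.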

International Conference on Knowledge Capture (K-Cap 2019)

Jointly Learning from Social Media and Environmental Data for Typhoon Intensity Prediction

Abstract: Existing technologies employ different machine learning approaches to predict disasters from historical environmental data. However, for short-term disasters (e.g., earthquakes), historical data alone has a limited prediction capability. In this work, we consider social media as a supplementary source of knowledge in addition to historical environmental data. Further, we build a joint model that learns from disaster-related tweets and environmental data to improve prediction. We propose the combination of semantically enriched word embeddings, which represent the entities in tweets, with their semantic representations computed with the traditional word2vec. Our experiments show that our proposed approach outperforms state-of-the-art models in disaster prediction accuracy.

Hamada M. Zahera, Mohamed Ahmed Sherif, & Axel-Cyrille Ngonga Ngomo (2019). Jointly Learning from Social Media and Environmental Data for Typhoon Intensity Prediction. In K-CAP 2019: Knowledge Capture Conference (pp. 4).

Do your Resources Sound Similar? On the Impact of Using Phonetic Similarity in Link Discovery

Abstract: An increasing number of heterogeneous datasets abiding by the Linked Data paradigm is published every day. Discovering links between these datasets is thus central to achieving the vision behind the Data Web. Declarative Link Discovery (LD) frameworks rely on complex Link Specifications (LS) to express the conditions under which two resources should be linked. Complex LS combine similarity measures with thresholds to determine whether a given predicate holds between two resources. State-of-the-art LD frameworks rely mostly on string-based similarity measures such as Levenshtein and Jaccard. However, string-based similarity measures often fail to catch the similarity of resources with phonetically similar property values when these property values are represented using different string representations (e.g., names and street labels). In this paper, we evaluate the impact of using phonetics-based similarities in the process of LD. Moreover, we evaluate the impact of phonetic-based similarity measures on a state-of-the-art machine learning approach used to generate LS. Our experiments suggest that the combination of string-based and phonetic-based measures can improve the F-measures achieved by LD frameworks on most datasets.

Abdullah Fathi Ahmed, Mohamed Ahmed Sherif, & Axel-Cyrille Ngonga Ngomo (2019). Do your Resources Sound Similar? On the Impact of Using Phonetic Similarity in Link Discovery. In K-CAP 2019: Knowledge Capture Conference (pp. 8).
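The paper evaluates phonetic measures as implemented in LIMES; to illustrate the idea independently of that framework, here is a stdlib-only sketch of American Soundex, a classic phonetic algorithm that maps phonetically similar names to the same four-character code where edit-distance measures would see them as different strings.

```python
def soundex(name):
    """American Soundex: encode a name as a letter plus three digits,
    so phonetically similar names collide on the same code."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:    # skip repeats of the same code
            out += code
        if ch not in "HW":           # H and W do not separate codes
            prev = code
    return (out + "000")[:4]         # pad/truncate to 4 characters
```

For example, "Meyer" and "Meier" share the code M600 although their Levenshtein similarity is modest, which is exactly the effect the paper exploits for linking resources with differently spelled but phonetically identical labels.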

International Conference Recent Advances in Natural Language Processing

A Holistic Natural Language Generation Framework for the Semantic Web

Abstract: With the ever-growing generation of data for the Semantic Web comes an increasing demand for this data to be made available to non-Semantic-Web experts. One way of achieving this goal is to translate the languages of the Semantic Web into natural language. We present LD2NL, a framework for verbalizing the three key languages of the Semantic Web, i.e., RDF, OWL, and SPARQL. Our framework is based on a bottom-up approach to verbalization. We evaluated LD2NL in an open survey with 86 persons. Our results suggest that our framework can generate verbalizations that are close to natural language and that can be easily understood by non-experts. Therewith, it enables non-domain experts to interpret Semantic Web data with more than 91% of the accuracy of domain experts.

Ngonga Ngomo, A.-C., Moussallem, D. & Bühmann, L. (2019). A Holistic Natural Language Generation Framework for the Semantic Web. Proceedings of the International Conference Recent Advances in Natural Language Processing (pp. 8).
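LD2NL itself is grammar-driven and works bottom-up over RDF, OWL, and SPARQL; the elementary step, turning one triple into a sentence, can at least be illustrated with a naive template (the camelCase splitting rule, the triple, and the phrasing are invented for illustration and are not LD2NL's actual rules).

```python
import re

def verbalize(subject, predicate, obj):
    """Naive single-triple verbalizer: split the camelCase predicate
    into words and fill a fixed sentence template."""
    words = re.sub(r"(?<!^)(?=[A-Z])", " ", predicate).lower()
    name = subject.replace("_", " ")
    return f"{name}'s {words} is {obj.replace('_', ' ')}."

sentence = verbalize("Albert_Einstein", "birthPlace", "Ulm")
```

A real verbalizer additionally needs aggregation across triples, pronouns, and grammar rules, which is where the bottom-up approach of the paper comes in.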

International Workshop on Chatbot Research (CONVERSATIONS 2019)

An Approach for Ex-Post-Facto Analysis of Knowledge Graph-Driven Chatbots – the DBpedia Chatbot

Abstract: As chatbots are gaining popularity for simplifying access to information and community interaction, it is essential to examine whether these agents are serving their intended purpose and catering to the needs of their users. Therefore, we present an approach to perform an ex-post-facto analysis over the logs of knowledge base-driven dialogue systems. Using the DBpedia Chatbot as our case study, we inspect three aspects of the interactions: (i) user queries and feedback, (ii) the bot’s response to these queries, and (iii) the overall flow of the conversations. We discuss key implications based on our findings. All the source code used for the analysis can be found at https://github.com/dicegroup/DBpedia-Chatlog-Analysis.

Rricha Jalota, Priyansh Trivedi, Gaurav Maheshwari, Axel-Cyrille Ngonga Ngomo, Ricardo Usbeck: An Approach for Ex-Post-Facto Analysis of Knowledge Graph-Driven Chatbots – the DBpedia Chatbot. Pre-print of full paper presented at CONVERSATIONS 2019 – an international workshop on chatbot research, November 19-20, Amsterdam, the Netherlands. The final version of the paper will be published in the post-workshop proceedings as part of Springer LNCS.



The researchers at DICE (@DiceResearch, @CompScience_UPB, @unipb) regularly write about news in their research areas:

@AbdelmonemMAmer @Abdullah_Fathi_ @DiegoMoussallem @hamadazahera @hashimkhanwazi4 @kvndrsslr @NgongaAxel @MAhmedSherif @MatthiasWauer @mommi84 @Ricardo_Usbeck @RrichaJalota @zafarhabeeb

OPAL at the mFUND conference 2018

Second mFUND conference

“Data as the engine for mobility 4.0” was the slogan of the second mFUND conference, which took place at the WECC in Berlin on 16 and 17 October 2018. Funded projects discussed their results in 20 forums and 6 workshops. In addition, the event was a great networking opportunity for the roughly 400 participants.

Dr. Matthias Wauer discussing OPAL with Dr. Roland Goetzke at the mFUND conference 2018 © Dirk Michael Deckbar

Presentation of OPAL

Right in the first forum, “data platforms and standardisation”, Dr. Matthias Wauer presented the OPAL project and its preliminary results: the system architecture, quality criteria, a first crawler prototype, and the vocabularies to be used to describe metadata. In the following discussion, the audience supported the view that high-quality metadata is crucial for finding suitable open data. Additionally, the recent release of Google Dataset Search shows that OPAL targets a highly relevant research area: while Google Research focuses on already semi-structured metadata from schema.org and CKAN annotations, OPAL attempts to extract less structured, implicit metadata from Web pages.

Further related projects

Besides OPAL, the first forum featured related projects such as LIMBO and WEKOVI. Both use semantic representations of open data; however, they focus on the actual datasets, whereas OPAL explicitly deals with metadata. Unfortunately, the last presentation, “MetaOpenData”, was cancelled.

In the majority of the forums, projects focused on the use of specific datasets in particular applications, such as cycling infrastructure, traffic safety, and environmental issues like air pollution. In conclusion, the work of OPAL will make open data more accessible, which will also benefit such application-oriented projects.

OPAL at the mFUND conference

The OPAL project was introduced at the mFUND conference in Berlin. Dr. Matthias Wauer presented OPAL on behalf of Prof. Axel-Cyrille Ngonga Ngomo (University of Paderborn) in workshop 4, “data think tank”.

Dr. Matthias Wauer presenting OPAL at the mFUND conference
Source: Jan Kobel / bmvi.de