A semantics-aware storage framework for scalable processing of knowledge graphs on Hadoop

2017 IEEE International Conference on Big Data (Big Data) (2017)

Abstract
Knowledge graphs are graph-based data models that employ named nodes and edges to capture distinctions among entities and relationships in richly diverse data collections, such as those in the biomedical domain. The flexibility of knowledge graphs allows heterogeneous collections to be linked and integrated in precise ways. However, the resulting data models often have irregular structures that are not easy to manage using platforms for structured, schema-first data models like the relational model. To facilitate exchange, interoperability and reuse of data, standards such as the Resource Description Framework (RDF) have been increasingly adopted for representation. Domains such as biomedicine now have large collections of publicly available RDF graphs as well as benchmark workloads. To achieve scalability in data processing, some efforts are being made to build on distributed processing platforms such as Hadoop and Spark. However, while some distributed graph platforms have emerged for certain classes of mining workloads for non-semantic graphs (without typed edges and nodes), knowledge graph processing, which often involves ontological inferencing, continues to be plagued by scalability and efficiency challenges. In this paper, we present the design of a Hadoop-based storage architecture for knowledge graphs that overcomes some of the challenges of big RDF data processing. The rationale of the design strategy is to go beyond the traditional approach of exploiting only the structural properties of graphs during storage and to also exploit the semantic properties of knowledge graphs. Our system, SemStorm, is a Hadoop-based indexed, polymorphic, signatured file organization that supports efficient storage of data collections with significant data heterogeneity. Naive storage models for such data place more demands on meta-data management than traditional systems can support.
The polymorphic file organization is further coupled with a nested, column-oriented file format to enable discriminatory data access based on queries. A major hallmark of SemStorm is that it enables semantic awareness in the storage framework. The idea is to exploit the knowledge represented in the ontologies that accompany data to optimize data storage models, for example by identifying and managing (sometimes implicit) data redundancies. Another major advantage of SemStorm is that it derives optimized storage models for data autonomically, i.e., without user input. Extensive experiments conducted on real-world and synthetic benchmark datasets show that SemStorm is up to 10X faster than existing approaches.
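The idea of using an ontology to find implicit redundancies can be illustrated with a small sketch. This is not SemStorm's algorithm, which the abstract does not describe; it assumes one simple kind of ontological knowledge, an `rdfs:domain` axiom, under which a stored `rdf:type` triple is redundant whenever the same subject already uses a property whose domain is that class. The function name `prune_implied_types`, the `domain_of` mapping, and the `ex:` example data are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
RDF_TYPE = "rdf:type"


def prune_implied_types(triples: Iterable[Triple],
                        domain_of: Dict[str, str]) -> List[Triple]:
    """Drop rdf:type triples that an rdfs:domain axiom already entails.

    domain_of maps a property to its declared domain class (an assumed,
    simplified stand-in for the ontology that accompanies the data).
    """
    triples = list(triples)

    # Collect (subject, class) pairs entailed by the properties each subject uses.
    implied: Set[Tuple[str, str]] = set()
    for subject, predicate, _ in triples:
        if predicate in domain_of:
            implied.add((subject, domain_of[predicate]))

    # Keep everything except type assertions that can be re-derived on demand.
    return [
        (s, p, o) for (s, p, o) in triples
        if not (p == RDF_TYPE and (s, o) in implied)
    ]


# Hypothetical biomedical-style data.
example_triples = [
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:headache", "rdf:type", "ex:Symptom"),
]
example_domains = {"ex:treats": "ex:Drug"}

pruned = prune_implied_types(example_triples, example_domains)
```

In this sketch the `ex:Drug` type assertion is dropped because it can be re-derived from the `ex:treats` triple, while the `ex:Symptom` assertion is kept because no axiom entails it.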
Keywords
semantics-aware storage framework,publicly available RDF graphs,distributed processing platforms,distributed graph platforms,nonsemantic graphs,big RDF data processing,naive storage models,meta-data management,discriminatory data access,data storage models,data redundancies,optimized storage models,data heterogeneity,knowledge graph scalable processing,graph-based data models,heterogeneous collections,irregular structures,schema-first data models,interoperability,data reuse,benchmark workloads,mining workloads,knowledge graph processing,ontological inferencing,Hadoop-based storage architecture,structural properties,knowledge graph semantic properties,SemStorm,Hadoop-based indexed file organization,polymorphic file organization,signatured file organization,data collection storage,nested file format,column-oriented file format,queries,semantic-awareness,knowledge representation,ontologies,real-world benchmark datasets,synthetic benchmark datasets