
Integrate and query local datasets and distant RDF data with AskOmics using Semantic Web technologies



PURL: gxy.io/GTN:S00097

Requirements

Before diving into this slide deck, we recommend that you have a look at:


Questions

  • What is the Semantic Web and how can it help integrate and query data?

  • How can AskOmics help users benefit from Semantic Web technologies without having to write RDF or SPARQL code?


Objectives

  • Understand the basics of RDF and SPARQL

  • Learn how input data must be structured to be integrated with AskOmics

  • Learn how to connect distant SPARQL endpoints to local data with AskOmics


How to explore data


Requirements

Studying biological mechanisms requires us to:

  • integrate multiple data sources (differential expression results, genome annotation, remote protein database)
  • query them to answer a biological question (which genes are over-expressed in a condition, and what are the proteins coded by these genes)

Data integration combines local data tables with remote data about genes and proteins. The result is a graph in which a differential expression result points to a gene, which in turn points to a protein. The graph is then queried to produce results.


What is the Semantic Web?


Semantic Web

A set of W3C recommendations for integrating data, integrating domain knowledge, and performing queries and reasoning.

  • Resource Description Framework (RDF): annotate data
  • RDFS + OWL: represent knowledge (data description)
  • SPARQL Protocol and RDF Query Language (SPARQL): query data

RDF

  • RDF is for describing resources (the R in RDF)

    • resources are identified by URIs (nextprot:P01137, taxon:9606)
  • describing (the D in RDF) a resource means stating explicitly

    • its attributes (nextprot:P01137 :hasSequence "MPPSGLRLLL...")
    • its relations to other entities (nextprot:P01137 :hasTaxon taxon:9606)
    • its descriptions (aka classes) (nextprot:P01137 :is nextprot:Protein)

RDF: Set of triples

  • RDF data is represented as triples (subject, predicate, object)
    • Subject: the resource being described
    • Predicate: the relation (from subject to object)
    • Object: a value of the predicate for the subject

A small graph: the subject points to the object with an arrow labelled with the predicate.

nextprot:P01137 :hasTaxon taxon:9606 .
nextprot:P01137 :hasSequence "MPPSGLRLLL" .

RDF: triples form a labeled directed graph

# Description
nextprot:P01137 rdf:type nextprot:Protein .
taxon:9606 rdf:type nextprot:Organism .
# Data
nextprot:P01137 :hasTaxon taxon:9606 .
nextprot:P01137 :hasSequence "MPPSGLRLLL" .

A graphic with two regions, data description and data. In the data region, a circle labelled nextprot:P01137 points to a sequence via a hasSequence arrow and to taxon:9606 via a hasTaxon arrow. taxon:9606 points to nextprot:Organism in the data description region, and nextprot:P01137 points to nextprot:Protein in the same region via an rdf:type arrow.
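
To make the "labeled directed graph" idea concrete, here is a minimal Python sketch that stores the four triples above as plain tuples and groups them into outgoing edges per node. It is only an illustration: a real application would use an RDF library and a triplestore, and the `build_graph` helper is a name invented for this example.

```python
from collections import defaultdict

# The four triples from the slide, written as (subject, predicate, object) tuples.
triples = [
    ("nextprot:P01137", "rdf:type", "nextprot:Protein"),
    ("taxon:9606", "rdf:type", "nextprot:Organism"),
    ("nextprot:P01137", ":hasTaxon", "taxon:9606"),
    ("nextprot:P01137", ":hasSequence", '"MPPSGLRLLL"'),
]


def build_graph(triples):
    """Group edges by subject: node -> list of (edge label, target node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)

# Print every edge as "source --label--> target".
for node, edges in graph.items():
    for label, target in edges:
        print(f"{node} --{label}--> {target}")
```

Each subject becomes a node, each predicate becomes the label of an arrow, and each object is either another node or a literal value.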


SPARQL

  • A SPARQL query is a set of triple patterns with variables (?variable_name)
SELECT ?gene
WHERE {
?gene rdf:type :Gene .
?gene :hasTaxon taxon:9606 .
}
  • The query matches every ?gene that has rdf:type :Gene and :hasTaxon taxon:9606
  • In other words, the query returns all the human genes
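
The way triple patterns bind variables can be sketched in a few lines of pure Python. This is a toy matcher, not how a SPARQL engine is implemented, and the gene identifiers (`:gene1`, `:gene2`) and the second taxon are hypothetical data made up for the example.

```python
def is_variable(term):
    """SPARQL variables start with a question mark."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on failure."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # Bind the variable, or check it against its existing value.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical data: one human gene, one gene from another taxon, one protein.
triples = [
    (":gene1", "rdf:type", ":Gene"),
    (":gene1", ":hasTaxon", "taxon:9606"),
    (":gene2", "rdf:type", ":Gene"),
    (":gene2", ":hasTaxon", "taxon:10090"),
    ("nextprot:P01137", "rdf:type", "nextprot:Protein"),
    ("nextprot:P01137", ":hasTaxon", "taxon:9606"),
]

# The two triple patterns of the query above.
query = [
    ("?gene", "rdf:type", ":Gene"),
    ("?gene", ":hasTaxon", "taxon:9606"),
]

results = [binding["?gene"] for binding in evaluate(query, triples)]
print(results)
```

Because both patterns share the variable `?gene`, only resources that satisfy both are kept: the protein is excluded by the first pattern and the non-human gene by the second.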

SPARQL: entity matching allows federated queries

  • Using the same identifier for the same entity (entity matching) across multiple datasets allows federated queries

  • Federated SPARQL queries provide unified querying capabilities over multiple datasets as if they were a single virtual graph

Dataset 1 and 2 are shown as two silos, each with different small graphs. Each has a red node. Those nodes are connected via a dashed line. A picture of a cloud points at the two datasets, and their individual graphs collapsed into one larger graph. A query is sent to this cloud which comes out as a result table.
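
The role of shared identifiers can be illustrated with a small Python sketch. It simply takes the union of two in-memory triple sets, whereas a real federated query leaves the data in place and sends sub-queries to each endpoint (for example with the SPARQL 1.1 `SERVICE` keyword). All identifiers and predicates below (`:de_1`, `:geneA`, `:concerns`, `:codesFor`) are hypothetical.

```python
# Dataset 1 (local): a differential expression result that concerns a gene.
local_triples = {
    (":de_1", ":concerns", ":geneA"),
    (":de_1", ":logFC", '"2.3"'),
}

# Dataset 2 (remote): the same gene identifier, linked to a protein.
remote_triples = {
    (":geneA", ":codesFor", "nextprot:P01137"),
}

# Because both datasets use ":geneA" for the same entity, their union
# behaves like a single virtual graph.
virtual_graph = local_triples | remote_triples


def objects(graph, subject, predicate):
    """All objects reachable from `subject` through `predicate`."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def proteins_for_result(graph, de_result):
    """Follow result -> gene -> protein across the graph."""
    return [
        protein
        for gene in objects(graph, de_result, ":concerns")
        for protein in objects(graph, gene, ":codesFor")
    ]


print(proteins_for_result(local_triples, ":de_1"))   # local data alone
print(proteins_for_result(virtual_graph, ":de_1"))   # both datasets together
```

The path from result to protein only exists once both datasets are considered together, and only because they agree on the gene identifier.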


What is AskOmics?


AskOmics

Web software for data integration and querying using Semantic Web technologies. The main functionalities are:

  • Convert multiple data formats into RDF triples and store them in a local triplestore
  • Generate complex SPARQL queries using a user-friendly interface
  • Support external SPARQL endpoints to cross-reference integrated data with remote data

AskOmics can be used as standalone software or with Galaxy.


Data integration with AskOmics


Local data integration

  • RDF and SPARQL are well suited to describing and querying biological datasets, but most of these datasets are still stored in flat files such as CSV/TSV and GFF.

  • AskOmics converts multiple structured data formats into RDF triples

    • CSV/TSV
    • GFF
    • BED

AskOmics generates the graph of data and the abstraction

AskOmics uses the file structure (e.g. header of TSV files) to generate the graph of data description: the abstraction

Two tables are provided. They point to the RDF abstraction, a small graph of DE, Gene, and their attributes, and to the RDF data, which has the same shape as the abstraction but with real identifiers.

The rest of each file is converted to RDF triples that correspond to the data.
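
The split between abstraction and data can be sketched in Python as follows. This is not the AskOmics converter: the rule "first column is the entity, other columns are attributes", the vocabulary used for the abstraction (`owl:Class`, `rdfs:domain`), and the sample table are all assumptions chosen to keep the example short.

```python
import csv
import io

# Hypothetical TSV content: the header describes the data, the rows are the data.
TSV = "Gene\tsymbol\tchromosome\nG1\tTGFB1\t19\nG2\tTP53\t17\n"


def tsv_to_triples(text):
    """Return (abstraction, data) triples derived from a TSV string."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, records = rows[0], rows[1:]
    entity, attributes = header[0], header[1:]

    # Abstraction: one class for the file, one attribute per remaining column.
    abstraction = [(f":{entity}", "rdf:type", "owl:Class")]
    for attribute in attributes:
        abstraction.append((f":{attribute}", "rdfs:domain", f":{entity}"))

    # Data: one typed resource per row, one triple per non-empty cell.
    data = []
    for record in records:
        subject = f":{record[0]}"
        data.append((subject, "rdf:type", f":{entity}"))
        for attribute, value in zip(attributes, record[1:]):
            if value:
                data.append((subject, f":{attribute}", f'"{value}"'))
    return abstraction, data


abstraction, data = tsv_to_triples(TSV)
for triple in abstraction + data:
    print(" ".join(triple), ".")
```

Only the header contributes to the abstraction, so its size depends on the number of columns, while the data grows with the number of rows.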


Distant RDF data integration

  • Some public databases (e.g. neXtProt) provide RDF data through a SPARQL endpoint (public access for RDF data)
  • To connect with a remote SPARQL endpoint, AskOmics needs its RDF abstraction
  • The abstraction can be generated with abstractor
pip3 install abstractor
abstractor -s https://sparql.nextprot.org/sparql -o nextprot_abstraction.ttl -f turtle
  • This abstraction can then be uploaded into AskOmics as a standard file

Query multiple data sources with AskOmics


Query composition

  • Users navigate through the abstraction of local and remote data and create a path that represents a query

A picture of an RDF graph with many nodes. On the right is a query interface of some sort.

  • The query is converted to SPARQL code that is executed on the local and remote RDF data
  • Results are returned and can be downloaded
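
How a path through the abstraction can become SPARQL text is sketched below in Python. This is an illustrative generator, not the one AskOmics uses: the function name, the `?nodeN` variable naming, and the class and predicate names in the usage example are assumptions, and no PREFIX declarations are emitted.

```python
def path_to_sparql(start_class, steps):
    """Build a SELECT query from a start class and a list of
    (predicate, target_class) hops along the abstraction."""
    variables = ["?node0"]
    lines = [f"  ?node0 rdf:type {start_class} ."]
    for index, (predicate, target_class) in enumerate(steps, start=1):
        variable = f"?node{index}"
        # Link the previous node to the new one, then constrain its type.
        lines.append(f"  {variables[-1]} {predicate} {variable} .")
        lines.append(f"  {variable} rdf:type {target_class} .")
        variables.append(variable)
    select = "SELECT " + " ".join(variables)
    return select + "\nWHERE {\n" + "\n".join(lines) + "\n}"


# Hypothetical path: differential expression result -> gene -> protein.
query = path_to_sparql(
    ":DifferentialExpression",
    [(":concerns", ":Gene"), (":codesFor", "nextprot:Protein")],
)
print(query)
```

Each hop in the path adds one linking triple pattern and one type constraint, so longer paths in the interface translate into longer WHERE clauses.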

Key points

  • RDF and SPARQL are Semantic Web technologies that are useful for data integration and querying, but their technical aspects may deter end users

  • AskOmics is a web platform for data integration and querying that makes Semantic Web technologies user-friendly


Thank You!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Galaxy Training Network

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.
