RDF export & SPARQL queries#
SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.
import warnings
warnings.filterwarnings("ignore")
!lamin load laminlabs/cellxgene
💡 connected lamindb: laminlabs/cellxgene
import bionty as bt
from rdflib import Graph, Literal, RDF, URIRef
💡 connected lamindb: laminlabs/cellxgene
To query data with SPARQL, we first need to build a directed RDF graph composed of triple statements. Each statement is represented by:
a node for the subject
an arc for the predicate, pointing from the subject to the object
a node for the object.
Each of the three parts can be identified by a URI; the object may alternatively be a literal value such as a string.
We can use the DataFrame representation of lamindb registries to build an RDF graph.
Building an RDF graph#
diseases = bt.Disease.df()
diseases.head()
id | uid | name | ontology_id | abbr | synonyms | description | public_source_id | created_at | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|
689 | Me1FU1fo | breast tumor luminal A or B | MONDO:0004990 | None | breast tumor luminal\|luminal breast cancer | Subsets Of Breast Carcinoma Defined By Express... | 49 | 2024-01-15 07:18:58.847937+00:00 | 2024-01-15 07:18:58.847956+00:00 | 1 |
688 | rpYSjunF | breast carcinoma by gene expression profile | MONDO:0006116 | None | breast carcinoma by gene expression profile | A Header Term That Includes The Following Brea... | 49 | 2024-01-15 07:18:57.073025+00:00 | 2024-01-15 07:18:57.073042+00:00 | 1 |
687 | 1UsnNL28 | Her2-receptor negative breast cancer | MONDO:0000618 | None | None | None | 49 | 2024-01-15 07:18:55.811848+00:00 | 2024-01-15 07:18:55.811853+00:00 | 1 |
686 | 1FdMycA0 | estrogen-receptor negative breast cancer | MONDO:0006513 | None | ER- breast cancer | A Subtype Of Breast Cancer That Is Estrogen-Re... | 49 | 2024-01-15 07:18:55.811781+00:00 | 2024-01-15 07:18:55.811787+00:00 | 1 |
685 | 2OGAtYpX | progesterone-receptor negative breast cancer | MONDO:0000616 | None | None | None | 49 | 2024-01-15 07:18:55.811699+00:00 | 2024-01-15 07:18:55.811715+00:00 | 1 |
We convert the DataFrame to RDF by generating triples.
rdf_graph = Graph()
namespace = URIRef("http://sparql-example.org/")
for _, row in diseases.iterrows():
    # Use the ontology ID as the subject URI of each disease
    subject = URIRef(namespace + str(row['ontology_id']))
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row['name'])))
    rdf_graph.add((subject, URIRef(namespace + "description"), Literal(row['description'])))
rdf_graph
<Graph identifier=N7b79a64a52a44c1f8a52a4ca2266daf8 (<class 'rdflib.graph.Graph'>)>
Now we can query the RDF graph using SPARQL for the name and associated description:
query = """
SELECT ?name ?description
WHERE {
?disease a <http://sparql-example.org/Disease> .
?disease <http://sparql-example.org/name> ?name .
?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""
for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: breast tumor luminal A or B, Description: Subsets Of Breast Carcinoma Defined By Expression Of Genes Characteristic Of Luminal Epithelial Cells.
Name: breast carcinoma by gene expression profile, Description: A Header Term That Includes The Following Breast Carcinoma Subtypes Determined By Gene Expression Profiling: Luminal A Breast Carcinoma, Luminal B Breast Carcinoma, Her2 Positive Breast Carcinoma, Basal-Like Breast Carcinoma, Triple-Negative Breast Carcinoma, And Normal Breast-Like Subtype Of Breast Carcinoma.
Name: Her2-receptor negative breast cancer, Description: None
Name: estrogen-receptor negative breast cancer, Description: A Subtype Of Breast Cancer That Is Estrogen-Receptor Negative
Name: progesterone-receptor negative breast cancer, Description: None