Datapath Example 3¶
This notebook gives an example of how to build relatively simple data paths. It assumes that you understand the concepts presented in the example 2 notebook.
Example Data Model¶
The examples require that you understand a little bit about the example catalog data model, which is based on the FaceBase project.
Key tables¶
'dataset'
: represents a unit of data, usually a ‘study’ or ‘collection’
'experiment'
: a bioassay (typically RNA-seq or ChIP-seq assays)
'replicate'
: a record of a replicate (bio or technical) related to an experiment
Relationships¶
dataset <- experiment
: A dataset may have one to many experiments. I.e., there is a foreign key reference from experiment to dataset.
experiment <- replicate
: An experiment may have one to many replicates. I.e., there is a foreign key reference from replicate to experiment.
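To make the direction of these references concrete, here is a minimal in-memory sketch. The rows and the column names (id, dataset, experiment) are hypothetical and are not taken from the FaceBase schema; the point is only that the foreign key lives on the "many" side of each relationship.

```python
# Toy stand-ins for the three key tables. Column names are hypothetical.
datasets = [{"id": 1}, {"id": 2}]
experiments = [
    {"id": 10, "dataset": 1},  # foreign key: experiment -> dataset
    {"id": 11, "dataset": 1},
    {"id": 12, "dataset": 2},
]
replicates = [
    {"id": 100, "experiment": 10},  # foreign key: replicate -> experiment
    {"id": 101, "experiment": 10},
    {"id": 102, "experiment": 12},
]


def children_of(parent_id, child_rows, fk_column):
    """Return the child rows whose foreign key column points at parent_id."""
    return [row for row in child_rows if row[fk_column] == parent_id]


experiments_of_dataset_1 = children_of(1, experiments, "dataset")
replicates_of_experiment_10 = children_of(10, replicates, "experiment")
print(len(experiments_of_dataset_1), len(replicates_of_experiment_10))
```

In this toy data, one dataset has two experiments and one experiment has two replicates, which is the "one to many" shape described above.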
[1]:
# Import deriva modules
from deriva.core import ErmrestCatalog, get_credential
[2]:
# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
# If you need to authenticate, use Deriva Auth agent and get the credential
credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)
[3]:
# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()
Building a DataPath¶
Build a data path by linking together tables that are related. To make things a little easier, we will use Python variables to reference the tables. This is not necessary, but it simplifies the examples.
[4]:
dataset = pb.isa.dataset
experiment = pb.isa.experiment
replicate = pb.isa.replicate
Initiate a path from a table object¶
As in the example 2 notebook, begin by initiating a path instance from a Table object. This path will be “rooted” at the table it was initiated from, in this case, the dataset table. DataPaths have URIs that identify the resource in the catalog.
[5]:
path = dataset.path
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset
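The shape of the printed URI can be reproduced with a few lines of string handling. The helper below, entity_uri, is a hypothetical function written for this illustration; it is not part of deriva-py and only mirrors the URI layout seen in this notebook's output (server, catalog number, the entity resource, then one alias:=schema:table segment per table instance).

```python
def entity_uri(server, catalog_id, instances):
    """Compose an entity URI from (alias, schema, table) triples.

    Illustration only: this mirrors the URIs printed in this notebook and
    is not the deriva-py implementation.
    """
    segments = [f"{alias}:={schema}:{table}" for alias, schema, table in instances]
    return f"{server}/ermrest/catalog/{catalog_id}/entity/" + "/".join(segments)


uri = entity_uri("https://www.facebase.org", 1, [("dataset", "isa", "dataset")])
print(uri)

# Each linked table instance adds one more segment to the path.
linked_uri = entity_uri(
    "https://www.facebase.org",
    1,
    [
        ("dataset", "isa", "dataset"),
        ("experiment", "isa", "experiment"),
        ("replicate", "isa", "replicate"),
    ],
)
print(linked_uri)
```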
Path context¶
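By default, DataPath objects return entities for the last linked entity set in the path. Once experiment and replicate have been linked to the path from the prior step (e.g., path.link(experiment).link(replicate)), the path ends in replicate, which is therefore the context for this path.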
[7]:
path.context
[7]:
Get entities for the current context¶
The following DataPath will fetch replicate entities, not dataset entities.
[8]:
entities = path.entities()
len(entities)
[8]:
15274
Get entities for a different path context¶
Let’s say we want to fetch the entities for the dataset table rather than for the current context, which is the replicate table. We can do that by referencing the table as a property of the path object. Note that these are known as “table instances” rather than tables when used within a path expression. We will discuss table instances later in this notebook.
[9]:
path.table_instances['dataset']
# or
path.dataset
[9]:
From that table instance we can fetch entities, add a filter specific to that table instance, or even link another table. Here we will get the dataset entities from the path.
[10]:
entities = path.dataset.entities()
len(entities)
[10]:
351
Notice that we fetched fewer entities this time: the count is the number of dataset entities rather than the number of replicate entities that we fetched previously.
Filtering a DataPath¶
Building off of the path, a filter can be added. Like fetching entities, linking and filtering are performed relative to the current context. In this filter, the replicate’s attributes are referenced in the expression.
Currently, binary comparisons and logical operators are supported. Unary operators have not yet been implemented. In binary comparisons, the left operand must be an attribute (column name) while the right operand must be a literal value.
[11]:
path.filter(replicate.bioreplicate_number == 1)
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1
[12]:
entities = path.entities()
len(entities)
[12]:
3766
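One way to see why an expression like replicate.bioreplicate_number == 1 can be passed to filter(...) is that a comparison on a column object may return a predicate object rather than a boolean. The sketch below shows that general technique with two hypothetical classes, Column and Comparison; they are stand-ins written for this illustration and do not reproduce deriva-py's internals.

```python
from urllib.parse import quote


class Comparison:
    """Hypothetical predicate: a column compared against a literal value."""

    def __init__(self, column, operator, literal):
        if not isinstance(column, Column):
            raise TypeError("left operand must be a column")
        if isinstance(literal, Column):
            raise TypeError("right operand must be a literal value")
        self.column = column
        self.operator = operator
        self.literal = literal

    def to_uri_fragment(self):
        # URL-encode both sides so the fragment is safe to append to a path.
        return f"{quote(self.column.name)}{self.operator}{quote(str(self.literal))}"


class Column:
    """Hypothetical stand-in for a table column."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, literal):
        # Return a predicate object instead of a bool so the comparison
        # can be captured and later rendered as part of a query.
        return Comparison(self, "=", literal)


bioreplicate_number = Column("bioreplicate_number")
predicate = bioreplicate_number == 1
print(predicate.to_uri_fragment())
```

The rendered fragment has the same form as the final segment of the filtered URI shown above.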
Table Instances¶
So far we have discussed base tables. A base table is a representation of the table as it is stored in the ERMrest catalog. A table instance is a usage or reference of a table within the context of a data path. As demonstrated above, we may link together multiple tables and thus create multiple table instances within a data path.
For example, in path.link(dataset).link(experiment).link(replicate) the table instance experiment is no longer the same as the original base table experiment because within the context of this data path the experiment entities must satisfy the constraints of the data path. The experiment entities must reference a dataset entity, and they must be referenced by a replicate entity. Thus within this path, the entity set for experiment may be quite different than the entity set for the base table on its own.
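The difference between a base table and a table instance can be illustrated with plain Python over a few hypothetical rows (the column names and values are made up for this sketch). The experiment "table instance" keeps only the experiments that satisfy both path constraints: they reference a dataset and are referenced by a replicate.

```python
# Hypothetical rows; the column names are illustrative only.
datasets = [{"id": 1}]
experiments = [
    {"id": 10, "dataset": 1},     # has a dataset and a replicate
    {"id": 11, "dataset": 1},     # has a dataset but no replicate
    {"id": 12, "dataset": None},  # has no dataset
]
replicates = [{"id": 100, "experiment": 10}]

dataset_ids = {d["id"] for d in datasets}
experiment_ids_with_replicates = {r["experiment"] for r in replicates}

# The experiment "table instance": rows that satisfy both path constraints.
experiment_instance = [
    e
    for e in experiments
    if e["dataset"] in dataset_ids and e["id"] in experiment_ids_with_replicates
]

# Base table size versus table instance size within the path.
print(len(experiments), len(experiment_instance))
```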
Table instances are bound to the path¶
Whenever you initiate a data path (e.g., table.path) or link a table to a path (e.g., path.link(table)), a table instance is created and bound to the DataPath object (e.g., path). These table instances can be referenced via the DataPath’s table_instances container or directly as a property of the DataPath object itself.
[13]:
dataset_instance = path.table_instances['dataset']
# or
dataset_instance = path.dataset
Aliases for table instances¶
Whenever a table instance is created and bound to a path, it is given a name. If no name is specified for it, it will be named after its base table. For example, a table named “My Table” will result in a table instance also named “My Table”. Tables may appear more than once in a path (as table instances), and if the table name is taken, the instance will be given the name “‘base name’ + number” (e.g., “My Table2”).
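The default naming rule can be sketched as a small function. The function name instance_name is hypothetical, and the sketch assumes that numbering starts at 2 and counts up until a free name is found; the text above only confirms the “My Table2” case.

```python
def instance_name(base_name, taken_names):
    """Pick a table instance name: the base name if free, else base name + number.

    Assumption: numbering starts at 2 and counts up until a free name is found.
    """
    if base_name not in taken_names:
        return base_name
    suffix = 2
    while f"{base_name}{suffix}" in taken_names:
        suffix += 1
    return f"{base_name}{suffix}"


taken = set()
names = []
for _ in range(3):
    name = instance_name("My Table", taken)
    taken.add(name)
    names.append(name)
print(names)
```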
You may wish to specify the name of your table instance. In conventional database terms, an alternate name is called an “alias”. Here we give the dataset table instance an alias of ‘D’, though longer strings are also valid as long as they do not contain special characters.
[14]:
path.link(dataset.alias('D'))
[14]:
<deriva.core.datapath.DataPath at 0x103c22400>
[15]:
path.uri
[15]:
'https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/D:=isa:dataset'
You’ll notice that in this path we added an additional instance of the dataset table from our catalog model. In addition, we linked it to the isa.replicate table. This was possible because in this model, there is a foreign key reference from the base table replicate to the base table dataset. The entities for the table instance named dataset and the instance named D will likely differ because the constraints for each are different.
Selecting Attributes From Linked Entities¶
Returning to the initial example, if we want to include additional attributes from other table instances in the path, we need to be able to reference the table instances at any point in the path. First, we will build our original path.
[16]:
path = dataset.path.link(experiment).link(replicate).filter(replicate.bioreplicate_number == 1)
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1
Now let’s fetch an entity set with attributes pulled from each of the table instances in the path.
[17]:
results = path.attributes(path.dataset.accession,
                          path.experiment.experiment_type.alias('type_of_experiment'),
                          path.replicate.technical_replicate_number.alias('technical_replicate_num'))
print(results.uri)
https://www.facebase.org/ermrest/catalog/1/attribute/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/dataset:accession,type_of_experiment:=experiment:experiment_type,technical_replicate_num:=replicate:technical_replicate_number
Notice that the ResultSet also has a uri property. This URI may differ from the original path URI because the attribute projection does not get appended to the path URI.
[18]:
path.uri != results.uri
[18]:
True
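Comparing the two URIs printed above suggests how the projection is encoded: the entity resource becomes attribute, and a comma-separated list of instance:column references (optionally prefixed with alias:=) is appended. The helper below, attribute_uri, is a hypothetical function that reproduces that layout as a string operation; it is a sketch of the URI shape, not deriva-py's implementation.

```python
def attribute_uri(entity_uri, projections):
    """Build an attribute URI from an entity URI and projected columns.

    Each projection is (table_instance, column, alias); alias may be None.
    Illustration only, based on the URIs printed in this notebook.
    """
    path = entity_uri.replace("/entity/", "/attribute/", 1)
    parts = []
    for instance, column, alias in projections:
        ref = f"{instance}:{column}"
        parts.append(f"{alias}:={ref}" if alias else ref)
    return path + "/" + ",".join(parts)


path_uri = (
    "https://www.facebase.org/ermrest/catalog/1/entity/"
    "dataset:=isa:dataset/experiment:=isa:experiment/"
    "replicate:=isa:replicate/bioreplicate_number=1"
)
projected_uri = attribute_uri(
    path_uri,
    [
        ("dataset", "accession", None),
        ("experiment", "experiment_type", "type_of_experiment"),
        ("replicate", "technical_replicate_number", "technical_replicate_num"),
    ],
)
print(projected_uri)
```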
As usual, fetch(...) the entities from the catalog.
[19]:
results.fetch(limit=5)
for result in results:
    print(result)
{'accession': 'FB00000975', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000976', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000977', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000978', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000985', 'type_of_experiment': 'OBI:0001271', 'technical_replicate_num': 1}