Datapath Example 3

This notebook gives an example of how to build relatively simple data paths. It assumes that you understand the concepts presented in the example 2 notebook.

Example Data Model

The examples require that you understand a little bit about the example catalog data model, which is based on the FaceBase project.

Key tables

  • 'dataset' : represents a unit of data, usually a ‘study’ or ‘collection’
  • 'experiment' : a bioassay (typically RNA-seq or ChIP-seq assays)
  • 'replicate' : a record of a replicate (bio or technical) related to an experiment

Relationships

  • dataset <- experiment: A dataset may have one to many experiments. I.e., there is a foreign key reference from experiment to dataset.
  • experiment <- replicate: An experiment may have one to many replicates. I.e., there is a foreign key reference from replicate to experiment.
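
To make the direction of these references concrete, here is a minimal in-memory sketch. It does not touch the catalog: the rows and RID values are made up, and the foreign key column names ('dataset' on experiment, 'experiment' on replicate) are assumptions chosen to mirror the relationships listed above.

```python
# Toy rows standing in for the three key tables (all values are made up).
datasets = [{"RID": "D1"}, {"RID": "D2"}]
experiments = [
    {"RID": "E1", "dataset": "D1"},  # experiment -> dataset foreign key
    {"RID": "E2", "dataset": "D1"},
    {"RID": "E3", "dataset": "D2"},
]
replicates = [
    {"RID": "R1", "experiment": "E1"},  # replicate -> experiment foreign key
    {"RID": "R2", "experiment": "E1"},
    {"RID": "R3", "experiment": "E3"},
]


def children(rows, fk_column, parent_rid):
    """Return the rows whose foreign key column references parent_rid."""
    return [row for row in rows if row[fk_column] == parent_rid]


# One dataset may have many experiments; one experiment may have many replicates.
d1_experiments = children(experiments, "dataset", "D1")
e1_replicates = children(replicates, "experiment", "E1")
print([row["RID"] for row in d1_experiments])
print([row["RID"] for row in e1_replicates])
```

The foreign key always lives on the "many" side, which is why the arrows above point from experiment to dataset and from replicate to experiment.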
[1]:
# Import deriva modules
from deriva.core import ErmrestCatalog, get_credential
[2]:
# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
# If you need to authenticate, use Deriva Auth agent and get the credential
credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)
[3]:
# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()

Building a DataPath

Build a data path by linking together tables that are related. To make things a little easier we will use python variables to reference the tables. This is not necessary, but simplifies the examples.

[4]:
dataset = pb.isa.dataset
experiment = pb.isa.experiment
replicate = pb.isa.replicate

Initiate a path from a table object

As in the example 2 notebook, begin by initiating a path instance from a Table object. This path will be “rooted” at the table it was initiated from, in this case, the dataset table. DataPath objects have URIs that identify the resource in the catalog.

[5]:
path = dataset.path
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset

Link related tables to the path

Tables that are related by foreign key references may be linked together in a DataPath. Here we link experiment and then replicate to the path, following the relationships dataset <- experiment <- replicate.

[6]:
path.link(experiment).link(replicate)
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate

Path context

By default, DataPath objects return entities for the last linked entity set in the path. The path from the prior step ended in the replicate table, which is therefore the context for this path.

[7]:
path.context
[7]:
_TableWrapper name: 'replicate' List of columns: RID dataset biosample bioreplicate_number technical_replicate_number RCB RMB RCT RMT experiment
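
As a rough mental model of how the context follows the last linked table, consider the toy class below. It is a hypothetical sketch, not deriva's DataPath implementation: the class name ToyPath and its methods are made up for illustration, and the URI layout simply imitates the URIs printed in this notebook.

```python
class ToyPath:
    """Hypothetical stand-in for a data path; NOT the deriva DataPath class."""

    def __init__(self, base_url, schema, table):
        self._base_url = base_url
        # Each segment has the form alias:=schema:table, as in the printed URIs.
        self._segments = [f"{table}:={schema}:{table}"]
        self.context = table  # a new path is rooted at its first table

    def link(self, schema, table):
        """Append a table to the path and make it the new context."""
        self._segments.append(f"{table}:={schema}:{table}")
        self.context = table
        return self

    @property
    def uri(self):
        return f"{self._base_url}/entity/" + "/".join(self._segments)


toy = ToyPath("https://www.facebase.org/ermrest/catalog/1", "isa", "dataset")
root_uri = toy.uri
toy.link("isa", "experiment").link("isa", "replicate")
print(toy.context)
print(toy.uri)
```

In this sketch, each call to link moves the context to the newly linked table, which matches the behavior described above.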

Get entities for the current context

The following DataPath will fetch replicate entities, not dataset entities.

[8]:
entities = path.entities()
len(entities)
[8]:
15274

Get entities for a different path context

Let’s say we wanted to fetch the entities for the dataset table rather than for the current context, which is the replicate table. We can do that by referencing the table as a property of the path object. Note that these are known as “table instances” rather than tables when used within a path expression. We will discuss table instances later in this notebook.

[9]:
path.table_instances['dataset']
# or
path.dataset
[9]:
_TableWrapper name: 'dataset' List of columns: id accession title project funding summary description mouse_genetic human_anatomic study_design release_date show_in_jbrowse _keywords RID RCB RMB RCT RMT released Requires_DOI? DOI protected_human_subjects cellbrowser_uri

From that table instance we can fetch entities, add a filter specific to that table instance, or even link another table. Here we will get the dataset entities from the path.

[10]:
entities = path.dataset.entities()
len(entities)
[10]:
351

Notice that we fetched fewer entities this time. The count is the number of dataset entities in the path rather than the number of replicate entities that we previously fetched.

Filtering a DataPath

Building off of the path, a filter can be added. Like fetching entities, linking and filtering are performed relative to the current context. In this filter, the replicate’s attributes are referenced in the expression.

Currently, binary comparisons and logical operators are supported. Unary operators have not yet been implemented. In binary comparisons, the left operand must be an attribute (column name) while the right operand must be a literal value.

[11]:
path.filter(replicate.bioreplicate_number == 1)
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1
[12]:
entities = path.entities()
len(entities)
[12]:
3766
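
The rule that the left operand is a column and the right operand is a literal can be pictured with a small operator-overloading sketch. This is only an illustration under assumptions: ToyColumn and ToyComparison are hypothetical names, not deriva classes, and the sketch handles equality only, rendering it in the column=value form seen in the URI above.

```python
from urllib.parse import quote


class ToyComparison:
    """Hypothetical record of 'column == literal'; NOT a deriva class."""

    def __init__(self, column, literal):
        self.column = column
        self.literal = literal

    def to_uri_fragment(self):
        # Mirrors the 'bioreplicate_number=1' suffix in the printed URI.
        return f"{quote(self.column)}={quote(str(self.literal))}"

    def matches(self, row):
        """Evaluate the comparison against an in-memory row (a dict)."""
        return row.get(self.column) == self.literal


class ToyColumn:
    """Hypothetical column object whose == builds a comparison, not a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, literal):
        return ToyComparison(self.name, literal)


bioreplicate_number = ToyColumn("bioreplicate_number")
comparison = bioreplicate_number == 1  # column on the left, literal on the right

rows = [
    {"RID": "R1", "bioreplicate_number": 1},
    {"RID": "R2", "bioreplicate_number": 2},
]
kept = [row for row in rows if comparison.matches(row)]
print(comparison.to_uri_fragment())
print(kept)
```

Because == returns an expression object here rather than True or False, the comparison can be translated into a URI fragment and evaluated later. That design is what makes the operand order matter in this sketch.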

Table Instances

So far we have discussed base tables. A base table is a representation of the table as it is stored in the ERMrest catalog. A table instance is a usage or reference of a table within the context of a data path. As demonstrated above, we may link together multiple tables and thus create multiple table instances within a data path.

For example, in dataset.path.link(experiment).link(replicate) the table instance experiment is no longer the same as the original base table experiment because within the context of this data path the experiment entities must satisfy the constraints of the data path. The experiment entities must reference a dataset entity, and they must be referenced by a replicate entity. Thus within this path, the entity set for experiment may be quite different from the entity set for the base table on its own.
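
The difference between a base table and a table instance can be sketched with in-memory rows. Everything below is made up for illustration (the rows, the RID values, and the foreign key column names), and the filtering is a plain Python stand-in for what the catalog computes, not how the catalog computes it.

```python
# Base tables as plain lists of dicts (all values are made up).
datasets = [{"RID": "D1"}]
experiments = [
    {"RID": "E1", "dataset": "D1"},  # references a dataset, has a replicate
    {"RID": "E2", "dataset": "D1"},  # references a dataset, but no replicate
    {"RID": "E3", "dataset": None},  # has a replicate, but no dataset
]
replicates = [
    {"RID": "R1", "experiment": "E1"},
    {"RID": "R2", "experiment": "E3"},
]

dataset_rids = {row["RID"] for row in datasets}
replicated_experiments = {row["experiment"] for row in replicates}

# The experiment table instance within dataset <- experiment <- replicate:
# each entity must reference a dataset AND be referenced by a replicate.
experiment_instance = [
    row
    for row in experiments
    if row["dataset"] in dataset_rids and row["RID"] in replicated_experiments
]

print(len(experiments))          # size of the base table
print(len(experiment_instance))  # size of the table instance in the path
```

Only E1 satisfies both constraints, so the table instance holds one entity while the base table holds three.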

Table instances are bound to the path

Whenever you initiate a data path (e.g., table.path) or link a table to a path (e.g., path.link(table)) a table instance is created and bound to the DataPath object (e.g., path). These table instances can be referenced via the DataPath’s table_instances container or directly as a property of the DataPath object itself.

[13]:
dataset_instance = path.table_instances['dataset']
# or
dataset_instance = path.dataset

Aliases for table instances

Whenever a table instance is created and bound to a path, it is given a name. If no name is specified for it, it will be named after its base table. For example, a table named “My Table” will result in a table instance also named “My Table”. Tables may appear more than once in a path (as table instances), and if the table name is already taken, the instance will be given a name of the form “base name” + number (e.g., “My Table2”).
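
The default naming rule can be sketched as a small function. This is an assumption-laden illustration: toy_instance_name is a hypothetical helper, and the choice to start numbering at 2 and count upward is inferred only from the “My Table2” example above, not from the library's source.

```python
def toy_instance_name(base_name, taken):
    """Pick a table instance name that is not already in `taken`.

    Hypothetical helper: numbering from 2 upward is an assumption based on
    the "My Table2" example, not a statement about deriva's actual code.
    """
    if base_name not in taken:
        return base_name
    number = 2
    while f"{base_name}{number}" in taken:
        number += 1
    return f"{base_name}{number}"


names = set()
for _ in range(3):
    name = toy_instance_name("My Table", names)
    names.add(name)
    print(name)
```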

You may wish to specify the name of your table instance. In conventional database terms, an alternate name is called an “alias”. Here we give the dataset table instance an alias of ‘D’, though longer strings are also valid as long as they do not contain special characters.

[14]:
path.link(dataset.alias('D'))
[14]:
<deriva.core.datapath.DataPath at 0x103c22400>
[15]:
path.uri
[15]:
'https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/D:=isa:dataset'

You’ll notice that in this path we added an additional instance of the dataset table from our catalog model. In addition, we linked it to the isa.replicate table. This was possible because in this model, there is a foreign key reference from the base table replicate to the base table dataset. The table instance named dataset and the table instance named D will likely consist of different entities because the constraints for each are different.

Selecting Attributes From Linked Entities

Returning to the initial example, if we want to include additional attributes from other table instances in the path, we need to be able to reference the table instances at any point in the path. First, we will build our original path.

[16]:
path = dataset.path.link(experiment).link(replicate).filter(replicate.bioreplicate_number == 1)
print(path.uri)
https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1

Now let’s fetch an entity set with attributes pulled from each of the table instances in the path.

[17]:
results = path.attributes(path.dataset.accession,
                          path.experiment.experiment_type.alias('type_of_experiment'),
                          path.replicate.technical_replicate_number.alias('technical_replicate_num'))
print(results.uri)
https://www.facebase.org/ermrest/catalog/1/attribute/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/dataset:accession,type_of_experiment:=experiment:experiment_type,technical_replicate_num:=replicate:technical_replicate_number

Notice that the ResultSet also has a uri property. This URI may differ from the original path URI because the attribute projection is part of the ResultSet URI only; it is not appended to the path URI.

[18]:
path.uri != results.uri
[18]:
True
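
To see where the two URIs diverge, the sketch below rebuilds the attribute URI shown above from the path URI and a list of projections. The helper toy_attribute_uri is hypothetical and only imitates the printed URI layout; it is not how deriva composes its URIs internally.

```python
def toy_attribute_uri(path_uri, projections):
    """Build an attribute URI from an entity path URI (illustrative only).

    `projections` is a list of (table_instance, column, alias_or_None) tuples.
    """
    parts = []
    for instance, column, alias in projections:
        reference = f"{instance}:{column}"
        # An aliased projection is written as alias:=instance:column.
        parts.append(f"{alias}:={reference}" if alias else reference)
    attribute_base = path_uri.replace("/entity/", "/attribute/", 1)
    return attribute_base + "/" + ",".join(parts)


path_uri = (
    "https://www.facebase.org/ermrest/catalog/1/entity/"
    "dataset:=isa:dataset/experiment:=isa:experiment/"
    "replicate:=isa:replicate/bioreplicate_number=1"
)
results_uri = toy_attribute_uri(
    path_uri,
    [
        ("dataset", "accession", None),
        ("experiment", "experiment_type", "type_of_experiment"),
        ("replicate", "technical_replicate_number", "technical_replicate_num"),
    ],
)
print(results_uri)
```

The path URI is left unchanged by the helper. Only the derived URI switches from entity to attribute and carries the projection list.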

As usual, fetch(...) the entities from the catalog.

[19]:
results.fetch(limit=5)
for result in results:
    print(result)
{'accession': 'FB00000975', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000976', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000977', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000978', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}
{'accession': 'FB00000985', 'type_of_experiment': 'OBI:0001271', 'technical_replicate_num': 1}
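
Each fetched result prints as a dictionary keyed by the projected names, so ordinary Python tools can summarize it. The sketch below copies the five rows printed above into a plain list (an assumption made so that the example runs without a catalog connection) and counts them by experiment type.

```python
from collections import Counter

# The five rows printed above, copied into a plain list of dicts.
rows = [
    {"accession": "FB00000975", "type_of_experiment": "OBI:0002083", "technical_replicate_num": 1},
    {"accession": "FB00000976", "type_of_experiment": "OBI:0002083", "technical_replicate_num": 1},
    {"accession": "FB00000977", "type_of_experiment": "OBI:0002083", "technical_replicate_num": 1},
    {"accession": "FB00000978", "type_of_experiment": "OBI:0002083", "technical_replicate_num": 1},
    {"accession": "FB00000985", "type_of_experiment": "OBI:0001271", "technical_replicate_num": 1},
]

# Count rows per aliased column value.
counts = Counter(row["type_of_experiment"] for row in rows)
for experiment_type, count in counts.most_common():
    print(experiment_type, count)
```

Note that the keys are the aliases given in the attributes(...) call, not the original column names.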