{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Datapath Example 3\n", "\n", "This notebook gives an example of how to build relatively simple data paths.\n", "It assumes that you understand the concepts presented in the example 2\n", "notebook.\n", "\n", "## Exampe Data Model\n", "The examples require that you understand a little bit about the example\n", "catalog data model, which is based on the FaceBase project.\n", "\n", "### Key tables\n", "- `'dataset'` : represents a unit of data usually a 'study' or 'collection'\n", "- `'experiment'` : a bioassay (typically RNA-seq or ChIP-seq assays)\n", "- `'replicate'` : a record of a replicate (bio or technical) related to an experiment\n", "\n", "### Relationships\n", "- `dataset <- experiment`: A dataset may have one to many experiments. I.e., there \n", " is a foreign key reference from experiment to dataset.\n", "- `experiment <- replicate`: An experiment may have one to many replicates. I.e., there is a\n", " foreign key reference from replicate to experiment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import deriva modules\n", "from deriva.core import ErmrestCatalog, get_credential" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Connect with the deriva catalog\n", "protocol = 'https'\n", "hostname = 'www.facebase.org'\n", "catalog_number = 1\n", "# If you need to authenticate, use Deriva Auth agent and get the credential\n", "credential = get_credential(hostname)\n", "catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Get the path builder interface for this catalog\n", "pb = catalog.getPathBuilder()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a DataPath\n", "Build a data path by linking together tables that are related. To make things a little easier we will use python variables to reference the tables. This is not necessary, but simplifies the examples." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "dataset = pb.isa.dataset\n", "experiment = pb.isa.experiment\n", "replicate = pb.isa.replicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initiate a path from a table object\n", "Like the example 2 notebook, begin by initiating a `path` instance from a `Table` object. This path will be \"rooted\" at the table it was initiated from, in this case, the `dataset` table. `DataPath`'s have URIs that identify the resource in the catalog." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset\n" ] } ], "source": [ "path = dataset.path\n", "print(path.uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Link other related tables to the path\n", "In the catalog's model, tables are _related_ by foreign key references. Related tables may be linked together in a `DataPath`. Here we link the following tables based on their foreign key references (i.e., `dataset <- experiment <- replicate`)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate\n" ] } ], "source": [ "path.link(experiment).link(replicate)\n", "print(path.uri)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Path context\n", "By default, `DataPath` objects return entities for the _last_ linked entity set in the path. The `path` from the prior step ended in `replicate` which is therefore the `context` for this path." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "_TableWrapper name: 'replicate'\n", "List of columns:\n", " RID\n", " dataset\n", " biosample\n", " bioreplicate_number\n", " technical_replicate_number\n", " RCB\n", " RMB\n", " RCT\n", " RMT\n", " experiment" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get entities for the current context\n", "The following DataPath will fetch `replicate` entities not `dataset`s." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15274" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entities = path.entities()\n", "len(entities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get entities for a different path context\n", "Let's say we wanted to fetch the entities for the `dataset` table rather than the current context which is the `replicate` table. We can do that by referencing the table as a property of the path object. **Note** that these are known as \"table instances\" rather than tables when used within a path expression. We will discuss table instances later in this notebook." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "_TableWrapper name: 'dataset'\n", "List of columns:\n", " id\n", " accession\n", " title\n", " project\n", " funding\n", " summary\n", " description\n", " mouse_genetic\n", " human_anatomic\n", " study_design\n", " release_date\n", " show_in_jbrowse\n", " _keywords\n", " RID\n", " RCB\n", " RMB\n", " RCT\n", " RMT\n", " released\n", " Requires_DOI?\n", " DOI\n", " protected_human_subjects\n", " cellbrowser_uri" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.table_instances['dataset']\n", "# or\n", "path.dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From that table instance we can fetch entities, add a filter specific to that table instance, or even link another table. Here we will get the `dataset` entities from the path." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "351" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entities = path.dataset.entities()\n", "len(entities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we fetched fewer entities this time which is the number of `dataset` entities rather than the `replicate` entities that we previously fetched." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering a DataPath\n", "\n", "Building off of the path, a filter can be added. Like fetching entities, linking and filtering are performed _relative to the current context_. In this filter, the assay's attriburtes are referenced in the expression.\n", "\n", "Currently, _binary comparisons_ and _logical operators_ are supported. _Unary opertors_ have not yet been implemented. In binary comparisons, the left operand must be an attribute (column name) while the right operand must be a literal\n", "value." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1\n" ] } ], "source": [ "path.filter(replicate.bioreplicate_number == 1)\n", "print(path.uri)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3766" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entities = path.entities()\n", "len(entities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table Instances\n", "So far we have discussed _base_ tables. A _base_ table is a representation of the table as it is stored in the ERMrest catalog. A table _instance_ is a usage or reference of a table _within the context_ of a data path. As demonstrated above, we may link together multiple tables and thus create multiple table instances within a data path.\n", "\n", "For example, in `path.link(dataset).link(experiment).link(replicate)` the table instance `experiment` is no longer the same as the original base table `experiment` because _within the context_ of this data path the `experiment` entities must satisfy the constraints of the data path. The `experiment` entities must reference a `dataset` entity, and they must be referenced by a `replicate` entity. Thus within this path, the entity set for `experiment` may be quite different than the entity set for the base table on its own." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table instances are bound to the path\n", "Whenever you initiate a data path (e.g., `table.path`) or link a table to a path (e.g., `path.link(table)`) a table instance is created and bound to the DataPath object (e.g., `path`). These table instances can be referenced via the `DataPath`'s `table_instances` container or directly as a property of the `DataPath` object itself." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "dataset_instance = path.table_instances['dataset']\n", "# or\n", "dataset_instance = path.dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aliases for table instances\n", "Whenever a table instance is created and bound to a path, it is given a name. If no name is specified for it, it will be named after the name of its base table. For example, a table named \"My Table\" will result in a table instance also named \"My Table\". Tables may appear _more than once_ in a path (as table instances), and if the table name is taken, the instance will be given the \"'base name' + `number`\" (e.g., \"My Table2\").\n", "\n", "You may wish to specify the name of your table instance. In conventional database terms, an alternate name is called an \"alias\". Here we give the `dataset` table instance an alias of 'D' though longer strings are also valid as long as they do not contain special characters in them." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.link(dataset.alias('D'))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/D:=isa:dataset'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that in this path we added an additional _instance_ of the `dataset` table from our catalog model. In addition, we linked it to the `isa.replicate` table. This was possible because in this model, there is a foriegn key reference from the base table `replicate` to the base table `dataset`. The entities for the table instance named `dataset` and the instance name `D` will likely consist of different entities because the constraints for each are different." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Selecting Attributes From Linked Entities\n", "\n", "Returning to the initial example, if we want to include additional attributes\n", "from other table instances in the path, we need to be able to reference the\n", "table instances at any point in the path. First, we will build our original path." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1\n" ] } ], "source": [ "path = dataset.path.link(experiment).link(replicate).filter(replicate.bioreplicate_number == 1)\n", "print(path.uri) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's fetch an entity set with attributes pulled from each of the table instances in the path." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.facebase.org/ermrest/catalog/1/attribute/dataset:=isa:dataset/experiment:=isa:experiment/replicate:=isa:replicate/bioreplicate_number=1/dataset:accession,type_of_experiment:=experiment:experiment_type,technical_replicate_num:=replicate:technical_replicate_number\n" ] } ], "source": [ "results = path.attributes(path.dataset.accession, \n", " path.experiment.experiment_type.alias('type_of_experiment'), \n", " path.replicate.technical_replicate_number.alias('technical_replicate_num'))\n", "print(results.uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Notice** that the `ResultSet` also has a `uri` property. This URI may differ from the origin path URI because the attribute projection does not get appended to the path URI." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path.uri != results.uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, `fetch(...)` the entities from the catalog." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'accession': 'FB00000975', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}\n", "{'accession': 'FB00000976', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}\n", "{'accession': 'FB00000977', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}\n", "{'accession': 'FB00000978', 'type_of_experiment': 'OBI:0002083', 'technical_replicate_num': 1}\n", "{'accession': 'FB00000985', 'type_of_experiment': 'OBI:0001271', 'technical_replicate_num': 1}\n" ] } ], "source": [ "results.fetch(limit=5)\n", "for result in results:\n", " print(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 2 }