DataPath Example 1

This notebook gives an example of how to access the model elements (e.g., schemas, tables, columns) that are used when building data paths.

[1]:

# Import deriva modules
from deriva.core import ErmrestCatalog, get_credential

[2]:

# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)

The PathBuilder Interface

The path builder interface gives you access to a representation of the catalog’s data model, beginning with the catalog’s schemas. The path builder does not record any state about your paths; it is just an entry point for building them.

[3]:

# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()

Access to Schemas

The .schemas property acts like a Python dictionary (or map) object. Use its keys() method to list the schema names.

[4]:

pb.schemas.keys()

[4]:

dict_keys(['public', '_acl_admin', 'isa', 'viz', 'vocab', 'Imaging'])

Here we get a handle to the isa schema.

[5]:

isa = pb.schemas['isa']

PROTIP: Jupyter Notebook supports <tab> completion. Press the <tab> key after typing the opening bracket of a dictionary lookup to see its keys. For example, typing pb.schemas[<tab>] gives you a dropdown of schema names. Note that the notebook interpreter knows nothing about your objects until you have executed a step that creates them. So instantiate an object in one step, and then use tab completion in the following steps.

Alternative access method for schemas and other model objects

An alternative way to get a handle to the same schema object is directly as a property of the path builder object itself. However, this only works for schema names that are valid Python identifiers.

A valid Python identifier starts with ‘_’ or a letter, and every remaining character is ‘_’, a letter, or a digit.

Valid Python identifiers: 'dataset', 'assay', 'Molecule_Type', etc.
Invalid Python identifiers: 'Sample 1 Type', 'Control?', '# of reads', etc.
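
The rule above can be checked programmatically. The following is a minimal sketch using only the Python standard library; the helper name usable_as_attribute is hypothetical and not part of deriva. It also rejects Python reserved words (such as class), which satisfy the identifier rule but still cannot be written after a dot. Whether a catalog ever uses such names is not something this notebook establishes.

```python
import keyword


def usable_as_attribute(name: str) -> bool:
    """Return True if `name` can be written after a dot, e.g. `pb.<name>`.

    Hypothetical helper for illustration; not part of deriva.
    """
    # str.isidentifier() applies Python's identifier rules. Reserved words
    # such as 'class' pass that check but cannot be used in dot notation.
    return name.isidentifier() and not keyword.iskeyword(name)


# The example names listed above
names = ['dataset', 'assay', 'Molecule_Type',
         'Sample 1 Type', 'Control?', '# of reads']

results = {name: usable_as_attribute(name) for name in names}
for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Names for which the helper returns False can still be reached with the dictionary-style access shown earlier, e.g. pb.schemas['...'].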

IMPORTANT: Similar access methods are demonstrated for tables and columns below. Because not all catalog model names are valid Python identifiers, you may not see your catalog’s complete data model when you use this method. However, the notation is more compact and is ideal when your model uses (all or mostly) valid Python identifiers for its model element names.

[6]:

isa = pb.isa  # same schema object we got from the previous step
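
To make the relationship between the two access styles concrete, here is a small stand-in class. It is purely illustrative and is not how deriva implements the path builder; the class name ModelNamespace and its elements attribute are assumptions made up for this sketch. It shows the general Python pattern in which attribute access falls back to a dictionary lookup, so both styles return the same object, while names that are not valid identifiers are written with bracket notation.

```python
class ModelNamespace:
    """Hypothetical stand-in for a container of named model elements.

    Illustrates the access pattern only; not deriva's implementation.
    """

    def __init__(self, elements):
        self.elements = dict(elements)

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails, so it
        # serves as a fallback that looks the name up in the dictionary.
        try:
            return self.__dict__['elements'][name]
        except KeyError:
            raise AttributeError(name) from None


# Placeholder objects stand in for real schema objects
schemas = ModelNamespace({'isa': object(), 'Sample 1 Type': object()})

# Both styles return the very same object
same_object = schemas.isa is schemas.elements['isa']
print(same_object)

# 'Sample 1 Type' cannot be written after a dot, so use the key instead
by_key = schemas.elements['Sample 1 Type']
```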

Access to Tables

Similarly, a schema object has a tables property that gives you access to a representation of the catalog schema’s tables. Again, use its keys() method to list the table names in the schema.

[7]:

isa.tables.keys()

[7]:

dict_keys(['enhancer', 'project_member', 'dataset_sex', 'dataset_data_type', 'icon', 'biosample_cell_characterization', 'file', 'project_publication', 'track_data', 'dataset_syndrome', 'imaging_data', 'clinical_assay', 'dataset_phenotype', 'dataset_enhancer', 'array_data', 'dataset_contributor', 'sample_replicate_group', 'dataset_human_age', 'sample', 'dataset_somite_count', 'dataset_mouse_genetic_background', 'dataset_chromosome', 'data_access_request', 'replicate', 'library', 'external_reference', 'dataset_geo', 'dataset_anatomy', 'protocol', 'clinical_assay_syndrome', 'processed_data', 'mesh_data', 'dataset_cell_source', 'dataset_gene', 'alignment', 'human_subjects_classification', 'biosample', 'related_dataset', 'experiment', 'dataset_cell_characterization', 'public_key', 'experiment_protocol', 'dataset_dar', 'person', 'publication', 'dar_status', 'project_investigator', 'biosample_cell_source', 'pipeline', 'dataset_experiment_type', 'sequencing_data', 'tracks', 'project', 'dataset_strain', 'dataset_mutation', 'dataset_data_use_limitation', 'dataset_stage', 'thumbnail', 'track_data_visibility', 'dataset_qc_issue', 'dataset_genotype', 'previews', 'dataset', 'dataset_species', 'dataset_instrument'])
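
Because the listing is long, it can help to filter it with ordinary Python. The sketch below assumes only that the object behaves like a dictionary with a keys() method, as described above; the helper name names_with_prefix is hypothetical. To keep the example self-contained, a plain dict holding a few of the table names from the listing stands in for isa.tables.

```python
def names_with_prefix(mapping, prefix):
    """Return the sorted keys of `mapping` that start with `prefix`.

    Hypothetical helper; works on any dictionary-like object.
    """
    return sorted(name for name in mapping.keys() if name.startswith(prefix))


# Stand-in for isa.tables, using a few names from the listing above
tables = {name: None for name in
          ['dataset', 'dataset_gene', 'sample', 'dataset_species', 'project']}

dataset_tables = names_with_prefix(tables, 'dataset_')
print(dataset_tables)
```

With a live catalog, names_with_prefix(isa.tables, 'dataset_') should behave the same way, provided tables supports keys() as shown in this notebook.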

Similarly, we can get a table from the schema’s tables property using either of the methods demonstrated above.

[8]:

dataset = isa.tables['dataset']
# or
dataset = isa.dataset

Access to Columns

A table has a column_definitions property that acts like a dictionary of its columns. We can list the column names as usual.

[9]:

dataset.column_definitions.keys()

[9]:

dict_keys(['id', 'accession', 'title', 'project', 'funding', 'summary', 'description', 'mouse_genetic', 'human_anatomic', 'study_design', 'release_date', 'show_in_jbrowse', '_keywords', 'RID', 'RCB', 'RMB', 'RCT', 'RMT', 'released', 'Requires_DOI?', 'DOI', 'protected_human_subjects', 'cellbrowser_uri'])

Again, either of the following methods gets a handle to one of the table’s column objects.

[10]:

accession = dataset.column_definitions['accession']
# or
accession = dataset.accession
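
The column listing above contains a concrete case where only the dictionary style applies. The sketch below splits the dataset column names from that listing into those that can be written in dot notation and those that need a key lookup. It uses only built-in string methods, and the variable names are chosen for illustration.

```python
# Column names copied from the dataset.column_definitions listing above
column_names = [
    'id', 'accession', 'title', 'project', 'funding', 'summary',
    'description', 'mouse_genetic', 'human_anatomic', 'study_design',
    'release_date', 'show_in_jbrowse', '_keywords', 'RID', 'RCB', 'RMB',
    'RCT', 'RMT', 'released', 'Requires_DOI?', 'DOI',
    'protected_human_subjects', 'cellbrowser_uri',
]

# Valid identifiers can be written as dataset.<name>
dot_accessible = [n for n in column_names if n.isidentifier()]

# Everything else needs dataset.column_definitions['<name>']
key_only = [n for n in column_names if not n.isidentifier()]

print(key_only)
```

Here the only name that needs the dictionary style is 'Requires_DOI?', i.e. dataset.column_definitions['Requires_DOI?'].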

Final Thought

The model introspection provided in the datapath module (i.e., the PathBuilder) is intended for the narrowly scoped usage required for building paths and accessing data from ERMrest catalogs. It is not intended for general introspection of catalogs and therefore does not include details such as constraints, annotations, ACLs, column data types, etc.
