deriva.seo package
Submodules
deriva.seo.sitemap_builder module
class deriva.seo.sitemap_builder.SitemapBuilder(protocol, host, catalog_id, license_url=None)
Bases: object
Class to build sitemaps from deriva catalogs.
Typical usage is:
- Create a SitemapBuilder object.
- For each table to include in the sitemap, create a table spec and populate it with the table’s data. Typically, a single call to add_table_spec both creates and populates the spec.
- Call write_sitemap to write out the sitemap.
In the simplest case, this would create a sitemap with two tables:
    import sys
    from deriva.seo.sitemap_builder import SitemapBuilder

    sb = SitemapBuilder("https", "myhost.org", 1)
    sb.add_table_spec("schema1", "table1")
    sb.add_table_spec("schema2", "table2")
    sb.write_sitemap(sys.stdout)
If you want to include only a subset of rows of a table, you can pass a datapath to add_table_spec:
    # `catalog` is assumed to be an existing catalog object for the same catalog
    pb = catalog.getPathBuilder()
    path = pb.schema3.table3.filter(pb.schema3.table3.Species == "Homo sapiens")
    sb.add_table_spec("schema3", "table3", datapath=path)
If you need more control over the data, you can populate the spec yourself:
    rows = do_something_complicated()
    spec = sb.add_table_spec("schema4", "table4", populate=False)
    sb.set_table_spec_data(spec, rows)
    sb.add_fkey_times(spec)
Note: if add_table_spec populates the spec for you, it will set the modification time based on the times in the table and on all single-valued incoming foreign keys (because those are the most likely to affect a Chaise record page). If you’re populating the spec yourself (i.e., if you called add_table_spec with populate=False), you can call add_fkey_times to make those time adjustments.
- Limitations:
Sitemaps should be no more than 50 MB in size and should contain no more than 50,000 URLs (https://www.sitemaps.org/faq.html#faq_sitemap_size), but this class doesn’t enforce those limits. (Note: a site can have multiple sitemaps.)
A URL element should have no more than 1000 images associated with it (https://support.google.com/webmasters/answer/178636?hl=en), but this class doesn’t enforce that limit.
This class assumes that all images in a catalog will have the same license.
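Because the class does not enforce these limits, a caller may want to check them before handing rows to a spec. The following is a minimal sketch, not part of deriva.seo: the helper names chunk_rows and cap_images are hypothetical, and it assumes that each row produces exactly one URL. It does not address the 50 MB size limit, which depends on the serialized output.

```python
from typing import Dict, Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # URL-count limit cited above (sitemaps.org)
MAX_IMAGES_PER_URL = 1_000     # per-URL image limit cited above


def cap_images(row: Dict, limit: int = MAX_IMAGES_PER_URL) -> Dict:
    """Return a shallow copy of the row with at most `limit` entries under "images"."""
    capped = dict(row)
    if "images" in capped:
        capped["images"] = list(capped["images"])[:limit]
    return capped


def chunk_rows(rows: List[Dict], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[Dict]]:
    """Yield consecutive groups of rows, each holding at most `size` rows.

    Assumption: one row becomes one sitemap URL, so each group fits in one sitemap.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield [cap_images(row) for row in rows[start:start + size]]
```

Each group could then be written to its own sitemap. Note that when a builder holds several table specs, the URL count that matters is the total across all of them.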
add_fkey_times(spec)
Replace the RMT in each row with the latest of the row’s own RMT and the RMTs of the rows linked to it through single-column foreign keys.
Parameters: spec – a populated spec
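The "greatest RMT" rule can be illustrated without a catalog. The sketch below is not the library's implementation: latest_rmt and bump_rmts are hypothetical names, and the `linked` mapping (from a row's RID to the RMTs of rows that reference it) stands in for whatever foreign-key lookup the class performs. It assumes every timestamp is an ISO 8601 string with a UTC offset, as in the RMT values shown elsewhere on this page.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def latest_rmt(own_rmt: str, linked_rmts: Iterable[str]) -> str:
    """Return the timestamp string that denotes the latest instant."""
    # Parse before comparing so that differing UTC offsets are handled correctly.
    return max([own_rmt, *linked_rmts], key=datetime.fromisoformat)


def bump_rmts(rows: List[Dict], linked: Dict[str, List[str]]) -> None:
    """Update each row's RMT in place.

    `linked` is a hypothetical mapping from a row's RID to the RMTs of the
    rows that reference it through single-column foreign keys.
    """
    for row in rows:
        row["RMT"] = latest_rmt(row["RMT"], linked.get(row["RID"], []))
```

Comparing parsed datetimes rather than raw strings matters because two strings with different offsets do not sort in chronological order.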
add_table_spec(schema, table, datapath=None, populate=True, priority=None)
Create a table spec and add it to the sitemap.
Parameters:
- schema – the name of the table’s schema
- table – the table name
- datapath – a datapath to use to populate the table (regardless of the value of “populate”)
- populate – whether the spec should be populated. If populate==True, the table spec’s data is populated with the results of the query specified by “datapath” (if not None), or otherwise with all the rows of the table. If populate==False (and datapath is not set), the table spec’s data must be set via a call to set_table_spec_data()
- priority – the priority to assign to sitemap entries created from this table (see https://www.sitemaps.org/protocol.html for a discussion of priorities)
static ermrest_time_to_float(ermrest_time)
Converts a time string as returned by ermrest to a floating-point number.
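As an illustration of this kind of conversion, the sketch below turns an ISO 8601 timestamp into seconds since the Unix epoch. It is not the library's implementation: the function name iso_time_to_float is hypothetical, and it assumes the input carries a UTC offset and uses the format of the RMT values shown on this page. Whether ermrest_time_to_float returns epoch seconds is also an assumption.

```python
from datetime import datetime


def iso_time_to_float(time_string: str) -> float:
    """Convert an ISO 8601 timestamp with a UTC offset to seconds since the epoch."""
    # fromisoformat keeps the offset, so timestamp() is independent of the local zone.
    return datetime.fromisoformat(time_string).timestamp()
```

Floats produced this way compare chronologically, which is convenient when picking the most recent of several modification times.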
populate_spec(spec, datapath)
Add data to a spec.
Parameters:
- spec – the spec to populate
- datapath – optional datapath to use (if None, all records from the spec’s table will be included)
set_table_spec_data(spec, rows)
Add data to a table spec. This should be used if you want to associate image data with the rows of this table, or if you want to include only a subset of the table’s rows in the sitemap. Otherwise, you can just use populate=True in your call to add_table_spec().
Parameters:
- spec – a table spec returned by add_table_spec()
- rows – an array of dictionary objects corresponding to the rows of the table. Each row MUST have RID and RMT elements and MAY have an “images” element. If “images” is present, each entry should have:
    - "image_url": the absolute URL of the image (jpeg, etc.)
    - "image_caption": a caption for the image (optional)
    - "image_title": a title for the image (optional)
For example:
    [
        {
            "RID": "1-2345",
            "RMT": "2018-12-03T18:29:38.348231-08:00"
        },
        {
            "RID": "1-2346",
            "RMT": "2018-12-03T18:29:38.348231-08:00",
            "images": [
                {
                    "image_url": "https://myhost.org/hatrac/images/image1.jpg",
                    "image_title": "MyProject Image 2-3456: An Awesome Image",
                    "image_caption": "This is an awesome image"
                },
                {
                    "image_url": "https://myhost.org/hatrac/images/image2.jpg"
                }
            ]
        }
    ]
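When building rows by hand, it can help to check them against the requirements above before calling set_table_spec_data(). The following is a minimal sketch with a hypothetical helper name, check_rows; it is not part of deriva.seo. Treating "absolute URL" as "starts with http:// or https://" is an assumption made for this example.

```python
from typing import Any, Dict, List


def check_rows(rows: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for index, row in enumerate(rows):
        # RID and RMT are required on every row.
        for key in ("RID", "RMT"):
            if key not in row:
                problems.append(f"row {index}: missing required key {key!r}")
        # "images" is optional, but each entry needs an absolute image_url.
        for image_index, image in enumerate(row.get("images", [])):
            url = image.get("image_url", "")
            if not url.startswith(("http://", "https://")):
                problems.append(
                    f"row {index}, image {image_index}: image_url must be an absolute URL"
                )
    return problems
```

A caller might refuse to populate the spec unless check_rows(rows) returns an empty list.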