by Barrett Smith
(Part 2 of the "Open Gov" blog series)
Federal agencies are currently working out what they must do to comply with the White House’s Executive Order on open data. Until something more definitive is published, the best guidance might come from the Implementation Guide page on Project Open Data.
The guide lists eight steps with minimum compliance requirements for each. Of those, four are listed as being due November 5, 2013. In the previous post in this series, we discussed compliance at a high level. In this post, I’ll walk through the first two deadline requirements; a follow-up post will cover the other two.
The four steps facing the November deadline are:
Create and maintain an enterprise data inventory
Create and maintain a public data catalog
Engage with customers to help facilitate and prioritize data release
Clarify roles and responsibilities
One thing to note about these first two deadline-driven requirements is that the public data catalog is a subset of the items in the enterprise data inventory. Agencies are to maintain an inventory of all the datasets they possess; however, only those determined to have a Public Access Level of Public will be listed in the published public data catalog.
A second interesting aspect is that the inventory should list not just datasets but also the APIs made available by the agency. Whether data is offered as a downloadable file or through a set of web services that permit access to it, both forms are to be treated the same.
Finally, datasets in both lists must be described using the common core metadata (CCM) schema.
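To make the subset relationship concrete, here is a minimal Python sketch; it is an illustration, not anything prescribed by the guide. It assumes each inventory entry is a dictionary with an `accessLevel` key holding the Public Access Level. That key name, the helper `public_catalog`, and the sample entries are all hypothetical.

```python
# Hypothetical inventory entries; "accessLevel" is an assumed key name
# for the Public Access Level field.
inventory = [
    {"title": "Budget Summary", "accessLevel": "Public"},
    {"title": "Personnel Records", "accessLevel": "Private"},
    {"title": "Facility Inspections", "accessLevel": "Restricted"},
]


def public_catalog(entries):
    """Return only the entries whose access level is exactly 'Public'."""
    return [entry for entry in entries if entry.get("accessLevel") == "Public"]


catalog = public_catalog(inventory)
print([entry["title"] for entry in catalog])
```

The inventory stays the single source of truth, and the public catalog is derived from it rather than maintained by hand.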
The Common Core Metadata Schema
The CCM schema is an RDF metadata vocabulary based on the W3C’s Data Catalog (DCAT) Vocabulary. The goal is to allow comparison and cross-set searching of different datasets by establishing a common vocabulary. In human-speak, it’s a way of describing the items in a data catalog in a standardized format which can be read by either people or machines.
The schema requires that each dataset, whether it is listed in the enterprise data inventory or the public data catalog, be described using 9 distinct fields, and it makes 7 more fields “required-if-applicable.” In addition, the schema allows agencies to augment the description with any of a set of fields listed as “expanded fields” or fields defined in “any well-known vocabulary.”
Most of the required fields are as simple as they appear to be (and details of each can be found on the Project Open Data schema page). There are a couple of details worth noting, though.
First, the cardinality of the keyword field is listed as 1,n. However, the content of the field is a comma-separated list of keywords, which doesn’t make sense with a field that can occur multiple times for the dataset. In the example data files supplied, none of the entries has more than one keyword field. So my assumption is that this cardinality should really be 1,1.
Secondly, the descriptions for the two contact fields do not explicitly state whether these must contain information for a particular person or could instead reference a position within the agency and a group email address. However, the JSON example data file appears to confirm that a position and a group email address are acceptable.
Finally, where all the other fields are essentially free-text, the value of the Public Access Level must be one of “Public”, “Restricted”, or “Private.” The distinction between the Public and non-public access levels is relatively clear. However, no guidance is given as to what information is Restricted versus Private. Hopefully future writings will clarify those distinctions.
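The points above can be turned into a rough pre-publication check. The Python sketch below is an assumption-laden illustration: the nine field names in `REQUIRED_FIELDS` are my reading of the schema and should be verified against the Project Open Data schema page, the keyword field is treated as a single comma-separated string (per my 1,1 assumption), and the helper names are hypothetical.

```python
# Assumed machine-readable names for the 9 required fields; verify these
# against the Project Open Data schema page before relying on them.
REQUIRED_FIELDS = [
    "title", "description", "keyword", "modified", "publisher",
    "person", "mbox", "identifier", "accessLevel",
]

# The three allowed values for the Public Access Level.
ACCESS_LEVELS = {"Public", "Restricted", "Private"}


def split_keywords(raw):
    """Split a comma-separated keyword string into a clean list."""
    return [word.strip() for word in raw.split(",") if word.strip()]


def validate_entry(entry):
    """Return a list of problems found in one dataset entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing required field: {field}")
    level = entry.get("accessLevel")
    if level and level not in ACCESS_LEVELS:
        problems.append(f"invalid accessLevel: {level}")
    if entry.get("keyword") and not split_keywords(entry["keyword"]):
        problems.append("keyword contains no usable values")
    return problems


# Hypothetical entry using a position and group address as the contact.
example = {
    "title": "Budget Summary",
    "description": "Annual budget totals by program.",
    "keyword": "budget, finance, spending",
    "modified": "2013-09-30",
    "publisher": "Example Agency",
    "person": "Open Data Coordinator",
    "mbox": "opendata@example.gov",
    "identifier": "example-budget-2013",
    "accessLevel": "Public",
}
print(validate_entry(example))
```

Returning a list of problems, rather than stopping at the first one, lets an agency fix every issue in an entry in a single pass.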
The Enterprise Data Inventory
The minimum compliance requirements for the Enterprise Data Inventory are simply that a single listing be produced which contains an entry for each dataset, described using the common core metadata schema. Best-practice guidance for the Inventory includes the use of a data management system, an iterative approach to finding and listing all the datasets in use, and the expansion of the CCM descriptive fields in use to more thoroughly describe the datasets.
The Public Data Catalog
In addition to the aforementioned requirements that the public data catalog include all Public datasets and describe each using the CCM schema, the catalog must also include an entry for the catalog itself and be published at www.[agency].gov/data.json to be in compliance.
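As a rough sketch of how such a file might be assembled, the Python below builds the JSON text from an inventory. The layout it produces (a flat JSON array with the catalog's own entry first), the `accessLevel` key, and the function name `build_data_json` are assumptions for illustration; the example data files on Project Open Data are the authority on the actual structure.

```python
import json


def build_data_json(inventory, catalog_entry):
    """Return JSON text holding the catalog's own entry plus all Public datasets.

    Assumes a flat JSON array layout and an "accessLevel" key on each entry.
    """
    public = [d for d in inventory if d.get("accessLevel") == "Public"]
    return json.dumps([catalog_entry] + public, indent=2)


# Hypothetical entry describing the catalog itself.
catalog_entry = {
    "title": "Example Agency Public Data Catalog",
    "identifier": "example-agency-catalog",
    "accessLevel": "Public",
}

# Hypothetical inventory with one public and one non-public dataset.
inventory = [
    {"title": "Budget Summary", "identifier": "budget-2013", "accessLevel": "Public"},
    {"title": "Personnel Records", "identifier": "hr-2013", "accessLevel": "Private"},
]

data_json = build_data_json(inventory, catalog_entry)
print(data_json)
```

The resulting text would then be saved wherever the agency's web server exposes /data.json.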
One goal of the Public Data Catalog is to provide a listing which the Data.gov service can crawl to populate itself, eliminating the need for agencies to separately submit their datasets to Data.gov. It is not clear, though, on what timeline that will be implemented, so this may continue to be a separate task for some time.
Also not listed in the compliance requirements for the Public Data Catalog is the human-readable page which is discussed elsewhere in relation to the catalog. This may become a requirement later in the implementation cycle or may just be an omission. However, creating a human-readable page with RDF markup should not impose a significant burden once the JSON listing is created.
In the next post in the series, we’ll talk about the other two implementation steps due in November: engaging with customers and clarifying roles and responsibilities.
(The topics covered in this blog are getting a lot of discussion at Drupal Gov Days. For more information go to the event's website.)