SemTechBiz: Schema.org, Knowledge Graph, and prospects for Drupal
by Stéphane Corlosquet
Last week’s SemTechBiz San Francisco was packed with insightful keynotes and sessions showing trends for semantic web technologies. The hot topics of the conference were Google’s, Yahoo’s and Wikidata’s Knowledge Graphs, as well as the adoption of schema.org, an initiative launched 2 years ago by the major search engines to standardize and promote structured data on the Web. An increasing number of companies and organizations presented how they support schema.org and semantic web technologies in their publishing workflow and products. Among the organizations presenting were Autodesk, BBC, Raytheon, Wikimedia, Wells Fargo, Profium, Library of Congress, Viacom and Walmart.
The conference started off with a timely announcement from the BBC about the launch of rNews metadata in their webpages. Developed by the IPTC, a consortium of the world's major news agencies, news publishers and news industry vendors, rNews is also part of schema.org. Later in the week, David Rogers gave a detailed presentation on the BBC's Linked Data Platform, which makes their content more connected and more discoverable. This platform relies on a custom in-house schema developed for the needs of the BBC, but when it comes to publishing content in HTML+RDFa, mappings to schema.org are added where appropriate, making the BBC’s content discoverable by external applications like search engines. As Jim Hendler says: "A little semantics goes a long way".
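To illustrate the idea of keeping an in-house schema while adding schema.org mappings at publishing time, here is a minimal Python sketch. The in-house field names, the mapping table and the render_rdfa helper are all hypothetical and are not taken from the BBC's platform; only the schema.org terms (NewsArticle, headline, datePublished, description) are real.

```python
from html import escape

# Hypothetical in-house property names mapped to schema.org properties.
# This table is an illustrative assumption, not the BBC's actual schema.
IN_HOUSE_TO_SCHEMA_ORG = {
    "title": "headline",
    "published": "datePublished",
    "summary": "description",
}


def render_rdfa(item: dict) -> str:
    """Render an in-house record as an HTML fragment with schema.org RDFa."""
    lines = ['<div vocab="http://schema.org/" typeof="NewsArticle">']
    for key, value in item.items():
        text = escape(str(value))
        schema_property = IN_HOUSE_TO_SCHEMA_ORG.get(key)
        if schema_property is None:
            # No mapping exists: publish the value as plain HTML only.
            lines.append(f"  <span>{text}</span>")
        else:
            lines.append(f'  <span property="{schema_property}">{text}</span>')
    lines.append("</div>")
    return "\n".join(lines)


article = {
    "title": "Example headline",
    "published": "2013-06-10",
    "internal_desk": "Newsroom A",  # in-house field with no schema.org mapping
}
print(render_rdfa(article))
```

Fields without a mapping still appear in the page but carry no semantics, which matches the "where appropriate" approach: the internal model stays as rich as it needs to be, and only the shared subset is exposed to external consumers.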
Richard Wallis from Online Computer Library Center, Inc. (OCLC) made the case for using schema.org for describing library datasets, and told us about worldcat.org, the world’s largest library catalog, which includes more than 290 million resources. While schema.org alone isn’t yet sufficient for the purpose of library data exchange, the Schema Bib Extend group is working on extending schema.org to better support library data.
Eric Freese from codeMantra presented cloudshelf, a proof of concept ebook reader powered by modern browser technologies (HTML5, CSS3, SVG, MathML, etc.). His goal is to enhance the ereading experience by making the knowledge of the book available to the reader. Schema.org semantics present in the EPUB book content are leveraged to offer richer search features, or the ability to add custom annotations on top of the book’s knowledge base.
I was delighted to meet Aaron Bradley, who led the SEO & Semantic Marketing birds of a feather session, where we all exchanged tips and stories about how schema.org is affecting and changing the SEO and marketing landscape. Rich Snippets are still the number one tangible application that marketers can easily grasp, but other products such as Google Plus and Google Now are next on the list. Aaron wrote a great blog post on his key search marketing takeaways from SemTechBiz.
On the last day, the Schema.org panel led by Dan Brickley shed some light on some of the larger challenges around schema design. The focus was on how to grow the scope of schema.org to cover use cases like library data or fictional characters and fictional worlds such as those from Wikia. Schema.org also announced that it was adding JSON-LD to the list of recommended schema.org syntaxes, alongside microdata and RDFa. Schema.org sponsors Google, Yahoo!, Bing and Yandex all had representatives attending the conference, and many of them presented how they use schema.org data in their products.
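For readers who have not seen JSON-LD yet, the following Python sketch builds a small schema.org description and serializes it. The book_to_json_ld helper and the sample values are made up for illustration; the @context, @type, name and author keys follow the JSON-LD and schema.org vocabularies.

```python
import json


def book_to_json_ld(name: str, author_name: str) -> dict:
    """Describe a book with schema.org terms as a JSON-LD document."""
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": name,
        "author": {
            "@type": "Person",
            "name": author_name,
        },
    }


# Sample values are placeholders, not real catalog data.
document = book_to_json_ld("An Example Book", "Jane Doe")
serialized = json.dumps(document, indent=2)
print(serialized)
```

In a web page, output like this is typically placed inside a script element of type application/ld+json, which keeps the structured data separate from the visible markup, unlike microdata and RDFa, which annotate the HTML elements themselves.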
Knowledge Graphs everywhere
Jason Douglas’ keynote about the Google Knowledge Graph attracted a lot of people, all eager to learn how the search giant manages this massive database of entities and facts that drives more and more of its products. In order to provide a better search experience and become more useful in people’s lives, Google needs to have a better understanding of the real world: what things are, and how they interact with and relate to each other. “Things, not strings” is the way Google sees and harvests information from the Web. To help achieve its goals and boost the amount of structured data available online, Google and other search engines are putting a lot of effort into schema.org. Besides the data that people make available using schema.org, another tool Google uses to build its Knowledge Graph is Freebase, a large collaborative knowledge base that Google acquired in 2010. Freebase currently includes 40 million entities and is making its data available as RDF for anyone to use. Overall, Google claims to know about 570 million entities in its Knowledge Graph, which drives many of its products such as Rich Snippets, Google Plus, Google Now, and more.
Wikimedia’s Denny Vrandečić gave an update on the recent progress of Wikidata, a free knowledge base built by the community to organize and manage the structured data that drives all the language versions of Wikipedia. Wikidata already includes 12 million items, and many Wikipedia pages have already been switched to using Wikidata behind the scenes. Before Wikidata, the data about a given entity (city, country, person) had to be maintained separately in each language version, but with Wikidata, the interwiki links as well as the data contained in the info boxes will be centralized and managed in one location. Denny also showed some interesting applications that make use of Wikidata. Like Freebase, Wikidata also offers data dumps in JSON, XML and Linked Open Data RDF with stable URIs. The project is funded by Google among other organizations. During his presentation, Denny announced that Yandex, the main Russian search engine, became the latest organization to fund Wikidata. Needless to say, search engines see Wikidata as a promising source to build their knowledge graphs. Yahoo! also had a presentation about how its Knowledge Base operates. You can read more about all these presentations in this blog post from semanticweb.com: At SemTechBiz, Knowledge Graphs Are Everywhere.
Beyond schema.org - What's next?
While schema.org alone offers a solid base for publishing structured data understood by search engines, several groups are working in parallel to develop more specialized schemas. New to me was this effort from Wells Fargo and the Enterprise Data Management Council to develop the Financial Industry Business Ontology (FIBO), which should help achieve better transparency between businesses, governments and consumers. Taxpayers will also be able to follow how and where their money gets spent.
Hans Constandt told the stories of some family members and friends who are affected by diseases that pharmaceutical companies are not investing in because they are too rare. Ontoforce is developing tools to harvest and make sense of massive amounts of health data in the pursuit of information about these rare diseases: A Higher Calling of Semantic Technology: Linking Data to Save Lives.
What does this all mean for Drupal?
By powering more than 2% of the websites worldwide, ranging from personal blogs to corporate and government sites, Drupal can play a big role in publishing structured data. In fact, Drupal has been pioneering the semantic web for many years. In January 2011, Drupal 7 shipped with its first version of RDF support in core. Unfortunately, when we designed this module back in 2009, a lot of the standards and tools that are available today didn’t exist. Schema.org was launched 6 months after Drupal 7 was released. A new version of RDFa was also published as a W3C Recommendation last year: RDFa 1.1. The current module in core is outdated and contains bugs that need fixing. Contributed modules are available for Drupal 7 to publish schema.org + RDFa 1.1 data, and people are already using them. For example, check out the Yahoo! Glimmer search tool, and select JobPosting in the right sidebar. At the time of this writing, the top result is a Drupal 7 site which uses http://schema.org/JobPosting and appears above monster.co.uk!
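To give a rough idea of what a consumer of this markup sees, here is a small Python sketch that scans an HTML fragment for RDFa typeof and property attributes. It is a simplification for illustration only: it is not a conforming RDFa processor (it does not resolve vocab or prefixes), it is not how Glimmer works, and the sample markup is made up rather than copied from a real Drupal site.

```python
from html.parser import HTMLParser


class RdfaAttributeCollector(HTMLParser):
    """Collect the raw values of RDFa typeof and property attributes."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.properties = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Both attributes may hold several space-separated terms.
        if attributes.get("typeof"):
            self.types.extend(attributes["typeof"].split())
        if attributes.get("property"):
            self.properties.extend(attributes["property"].split())


# Made-up sample of schema.org JobPosting markup in RDFa 1.1 Lite style.
sample = """
<div vocab="http://schema.org/" typeof="JobPosting">
  <h2 property="title">Drupal developer</h2>
  <span property="datePosted">2013-06-10</span>
</div>
"""

collector = RdfaAttributeCollector()
collector.feed(sample)
print(collector.types)
print(collector.properties)
```

A real crawler would use a full RDFa 1.1 parser to turn the markup into triples, but even this naive scan shows why a page typed as JobPosting can be found by type rather than by keyword.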
To make sure we meet Drupal 8’s July 1st code freeze deadline, the unofficial RDF in Drupal initiative is funding Lin Clark and a few others to work on the refactoring of the RDF module for Drupal 8. There will be low-hanging-fruit patches to review and test once the main refactoring patch is committed, so keep an eye on our Twitter streams (Lin, myself), and subscribe to the Drupal Semantic Web group to stay tuned.
In conclusion, attending SemTechBiz was very useful to grasp the trends of the semantic web industry. My main takeaway is that companies producing data (BBC, Wikipedia/Wikidata, Library of Congress) and those consuming data (major search engines) are finally getting on the same page when it comes to agreeing on a universal standard for describing data: schema.org. I’m also glad that the schema.org group is moving away from dictating one single syntax and opening up to the variety of syntaxes available: microdata, RDFa and JSON-LD. Wikidata, Freebase and Google already support all of them. Microdata and RDFa are also supported by other search engines, and I’m sure JSON-LD will be too. Looking forward to the next SemTechBiz.