
Freebase and the Collaborative Semantic Web

Title: Freebase and the Collaborative Semantic Web


Authors: Dan Delany, Diana Tamabayeva


Abstract: Today's World Wide Web treats all linked information as "documents" with no particular semantic meaning. We believe that the advent of the "collaborative semantic web," that is, an Internet which treats linked objects as pieces of information with semantic meaning and structure, and which allows its users to edit and create these objects, is a plausible and interesting future scenario. We believe the web is becoming unmanageably and unsearchably large, and that the concept of not only storing data but structured semantic relational metadata about that data and its relationships has the potential to revolutionize information creation, storage, and access. This project presents a survey of current efforts to create such a semantic web, with an in-depth case study of Freebase, a project currently in the works to create an open, collaborative, structured database of factual information.


Keywords: Freebase, Semantic Web, collaboration, bottom-up approach, top-down approach


Statement of the Problem:

The World Wide Web as we know it is less than 15 years old, and yet it has already dramatically changed the ways in which we find, create, and share information with each other. While the effects of the Web have been far-reaching and profound, it is safe to say that the Internet is still a technological paradigm in its infancy. Google claims to index more than a trillion unique URLs, a number which grows by several billion every day, but nearly all of this data is in the form of unstructured HTML documents which are written to visually display data directly to the user. As this enormous worldwide database continues to grow, many scientists and technologists (including Sir Tim Berners-Lee, the man widely credited with "inventing" the World Wide Web) believe that a new system will evolve to include structured meta-data about each piece of data on the web, creating a next-generation "semantic web."

The web in its earliest incarnation was simply a collection of linked "flat" documents. These documents were written entirely in HTML, a markup language which formats data to be displayed to human beings. To the computers involved in serving this data, the information was simply a string of meaningless characters. Since that time, a huge number of server-side technologies, including Perl, PHP, Python frameworks and Ruby on Rails, have revolutionized the ways in which web servers store and process their data. Web programmers can now use object-oriented programming to control data flow and logic, and relational databases to persistently store this data. These innovations have fueled the creation of "Web 2.0," a loose term that describes websites which leverage this power to build full-featured web applications, complex data-driven systems, and social communities. However, while developers use rich object data structures in their sites' back ends, these technologies all "flatten" this data into an HTML document before presenting it to the user.

The idea behind the semantic web is to alter the infrastructure of the Internet to allow the creation of standardized data structures which store semantic information about the format and contents of each document on the web. This essentially means that each document is represented as a data node which is an instance of an object type, with structured and standardized data and metadata (title, author, description, date, etc.) as its instance variables. It also allows each node to have structured relationships to other data nodes (for example, the "rabbit" node might be linked to the "mammal" node through the "is a species of" relationship). Instead of a series of documents linked together by textual hyperlinks, the web becomes a series of typed data objects linked together and referenced through their relationships to other data.
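
As a rough illustration of this node-and-relationship model (the class names and the "is a species of" label below are our own, not part of any standard), a few lines of Python can sketch how a "rabbit" node might carry typed attributes and named links to other nodes:

    # A minimal sketch of the data model described above: typed nodes with
    # structured attributes and named relationships to other nodes.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        node_type: str
        attributes: dict = field(default_factory=dict)
        relations: dict = field(default_factory=dict)  # relation name -> list of Nodes

        def relate(self, relation: str, other: "Node") -> None:
            self.relations.setdefault(relation, []).append(other)

    mammal = Node("mammal", node_type="biological class")
    rabbit = Node("rabbit", node_type="species", attributes={"diet": "herbivore"})
    rabbit.relate("is a species of", mammal)

    # Software can now traverse the link instead of parsing a flat document.
    print(rabbit.relations["is a species of"][0].name)  # -> mammal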


Rationale:

This paradigm shift would have huge implications for the way information is represented on the web. The ability for documents to reference standardized data via a persistent URL would allow web developers to pull in data directly from its source. For example, if I wanted to post the Boulder SKIP bus schedule on my website, I'd have to go to RTD-Denver.com, copy the HTML of the bus schedule, and paste it on my page. In the semantic web, I'd just include on my page a call to the schedule object living on the RTD Denver servers. Since the schedule would be an object with the bus times as its variables, I could display it however I wanted, showing whichever routes or stops I wanted. Additionally, if I knew the user's current location, I could even compare the current time to all times in the schedule to display the time of the next bus stopping at the nearest stop to their location. In addition to the richness of the data, this method ensures that the data will always be current, as any changes to the data object on RTD's servers will automatically propagate to any sites which reference the object.
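
To make the scenario concrete, here is a hedged Python sketch of the logic described above; the schedule data is a local stand-in for a structured object that would, in the semantic web, be referenced directly from RTD's (hypothetical) servers rather than copied as HTML:

    # Illustrative only: a stand-in for a structured schedule object, plus the
    # "next bus at my nearest stop" logic described in the paragraph above.
    from datetime import datetime, time
    from typing import Optional

    skip_schedule = {
        "route": "SKIP",
        "stop": "Broadway & Euclid",  # hypothetical nearest stop
        "departures": [time(7, 15), time(7, 45), time(8, 15), time(17, 30), time(22, 5)],
    }

    def next_departure(schedule: dict, now: datetime) -> Optional[time]:
        """Return the first departure later than the current time, if any."""
        return next((t for t in schedule["departures"] if t > now.time()), None)

    print(next_departure(skip_schedule, datetime.now()))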

The semantic web also has the potential to drastically change how we interact socially on the Internet. Instead of a set of profiles stored on various social networking sites all over the web, the semantic web might store a "you" node, which contains all of your personal data, including name, current location, contact information, usernames and passwords, and so on. This highlights the necessity of an infrastructure-level security and access permissions system to ensure tight control over personal data privacy. However, if this security were well implemented, one can imagine a system in which users log in once and are then authenticated on every website they visit. Instead of relying on many different third-party services, your node, which would be of type "person," could contain structured relationships with all of your friends, your calendar and to-do list, events you attended, content you published, and almost any information associated with you. The effects this system would have on the current system of social networking sites are hard to predict, but they would certainly be widespread and game-changing.

While the semantic web in its true form is many years away, efforts are underway to apply these concepts to change the way the web works today. This work can loosely be separated into two ways of thinking: bottom-up and top-down. Those who take a top-down approach apply algorithms to existing data on the web to interpret its meaning and give it semantic structure, primarily through machine learning and natural-language processing. Those who take a bottom-up approach believe the best way of creating the semantic web is the same way the original web was created: from the ground up, by content publishers and those who contribute to public knowledge bases. Tomorrow's web will most likely be a mix of the two; an amalgamation of recycled information from the old Internet given new meaning by semantic algorithms, and new content created according to semantic standards.

In the past few years, several startup companies have had some success leveraging top-down algorithms to provide semantically enhanced search, both for the web at large (e.g., Powerset, Hakia) and in more focused domains such as travel and consumer electronics (e.g., Retrevo, TripIt, Spock). These services provide structured data about the search topic, allowing users to compare prices of products or aggregate data from several social networks to obtain a full profile of a person. Their results are useful applications of semantic technology, but they all use proprietary software and closed databases, and are therefore not actively contributing to the semantic web. However, these technologies are of some interest, as they display some of the characteristics expected of next-generation "Web 3.0" applications: the pertinent content is accurate because it is pulled directly from the source, and the data becomes readily accessible and easily cross-referenced.

The top-down effort to create the semantic web is not entirely dominated by proprietary systems: recently, several tools have become available that aid developers in crawling and structuring data on the web, most notably Yahoo! Pipes and Dapper. These applications let users set up rules for parsing displayed HTML into structured database fields. Dapper scrapes a set of given, similar HTML pages and allows users to visually select the elements on the page they are trying to extract. It then crawls a set of HTML pages with the same format and intelligently pulls data from these elements on every page, outputting the results as a structured XML feed or RDF document. Yahoo! Pipes takes existing feeds or structured data and allows users to merge, transform and filter them into new, restructured feeds. These two tools alone allow almost any content on the web to be syndicated and republished as raw, structured data, and may be used to feed some of the bottom-up database creation efforts currently in progress.
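
The general pattern these tools automate can be sketched in a few lines of Python (this uses the BeautifulSoup library for illustration and does not reproduce Dapper's or Yahoo! Pipes' actual interfaces): select repeated elements from rendered HTML and republish them as structured records.

    # A sketch of scraping displayed HTML into structured fields.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = """
    <div class="product"><h2>Camera</h2><span class="price">$199</span></div>
    <div class="product"><h2>Tripod</h2><span class="price">$35</span></div>
    """

    soup = BeautifulSoup(html, "html.parser")
    records = [
        {"name": div.h2.get_text(), "price": div.select_one(".price").get_text()}
        for div in soup.select("div.product")
    ]
    print(records)  # [{'name': 'Camera', 'price': '$199'}, {'name': 'Tripod', 'price': '$35'}]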

The bottom-up approach to this problem has made most of its progress on two fronts: the definition of standards which allow content publishers to publish their data with semantic structure and relationships, and the creation of large, structured, Wikipedia-style knowledge bases of facts.

The World Wide Web Consortium, or W3C, is the main international standards organization for the web, and has been working since 2001 to create new standard formats to support the linked data of the coming semantic web. This organization, which created the standards for HTML and CSS, today's web markup languages, has nearly ultimate power over the technologies used by those who publish their content to the web, and will likely be the engine that drives much of the progress towards this new vision of the Internet. The W3C has defined several standards for publishing structured content, most notably RDF and OWL.

RDF, or Resource Description Framework, describes relationships between data nodes using "RDF triples," each consisting of a subject and an object - two URLs referencing two data nodes - connected by a predicate, a third persistent URL that references a standardized relationship type between the two objects. RDF triples can be visualized as two nodes connected by a labeled edge, and RDF files combine many of these triples to form a structured graph of data. OWL, or Web Ontology Language, describes "ontologies," high-level statements which define rich class characteristics and relationships between classes. An RDF file, therefore, might be used to state that a particular car has a "color" attribute that is "red," while an OWL file would be used to lay out high-level characteristics of the car class, including the notions that every car is a vehicle and every car has four wheels. In addition to RDF and OWL, the W3C has also created a standard called SPARQL, a SQL-like query language for referencing structured data through its relationships to other nodes, allowing web developers to display or utilize any data in the semantic web.
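
The car example above can be written out concretely with the rdflib Python library; the http://example.org/ namespace and the "color" property are made up for illustration, but the triple and SPARQL mechanics are as the standards define them:

    # One RDF triple per fact: subject, predicate, object.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.my_car, RDF.type, EX.Car))          # my_car is an instance of Car
    g.add((EX.my_car, EX.color, Literal("red")))  # my_car has color "red"

    # A SPARQL query that finds nodes purely through their relationships.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?thing WHERE { ?thing ex:color "red" . }
    """)
    for row in results:
        print(row.thing)  # -> http://example.org/my_car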

RDF and OWL are powerful technologies, but they suffer from several shortcomings. First of all, while the OWL language allows developers to write their own semantic ontologies and relate them to other ontologies on the web, the W3C currently has no plans to establish any "canonical" ontologies, schemas for structured data that are universal standards. This may be a good thing for publishers creating unique types of content, but for data types which are commonly used across the web, different users may create different ontologies with different characteristics, preventing interoperability across systems (which is the point of the semantic web in the first place). The W3C hopes that the adoption of OWL will fuel the creation of community-built ontologies which become de facto standards, but the potential for warring, incompatible standards exists. Additionally, RDF and OWL are designed to be powerful languages, but this comes at the cost of simplicity of use - the creation of these documents by hand is a difficult, complex process that is prone to errors. Many proponents of the formats claim that this doesn't matter, as most of these documents will be created automatically. However, few of these tools currently exist, making it very difficult for an average web publisher with little knowledge of the foundations of linked data documents to publish semantically structured content. Finally, as there are currently very few applications or web sites which actually make use of RDF data, publishers have little or no motivation to create this extra layer of data. While the RDF and OWL standards are steps in the right direction, they are still far from a complete solution to the problem of publishing structured data.

While we can rely on publishers to create a significant amount of linked data on the web, another bottom-up approach focuses on creating a centralized database of structured factual information about the world. This would operate like an object-oriented version of Wikipedia, with rich, structured relationships defined between any related nodes. Several efforts to create this type of database are in the works: True Knowledge is a service currently in beta which claims to index more than 100 million facts about 4.4 million individual objects, and provides a natural-language search engine for finding information. However, while True Knowledge allows some limited contribution, it uses a mostly closed knowledge base and provides a very limited API, making it difficult for developers to access meaningful data.

Another, more promising effort to create a factual, semantic database is Freebase, a Wikipedia-like set of knowledge which all users can contribute to. Freebase has a set of standardized, hierarchical ontologies which are well-defined but also user-editable. More importantly, Freebase has a full-featured, open API for both reading from and writing to the database, allowing developers to automatically add their structured data. Freebase was the main focus of our research, and is discussed in further depth later.


Relationship to the Course:

While it is interesting to review the technical details of each of these systems on their own, the true semantic web envisioned by Internet architects will only be built through a massive collaborative effort, and will likely involve some combination of the above approaches. It is hard to guess at a final picture of a functional semantic web, but it will no doubt contain some "authoritatively" true information in databases like Freebase, and some user-generated content; it is this blend which makes today's Internet so compelling. However, there are several challenges which this paradigm must overcome if it is ever to reach the ubiquity of today's web. First, and perhaps most importantly, the collaborative semantic web effort suffers from the same chicken-and-egg problem that all social networks have to deal with: the paradigm gets its value and power from ubiquity and large-scale adoption, which means that until it is adopted on a large scale, it does not provide much value or power to its users! At present, the percentage of the web with semantic data is very small, and it is not realistic to believe that publishers and collaborators will manually build a network which is not immediately useful, so it is safe to assume that large-scale top-down efforts which automatically perform semantic analysis will be necessary to bootstrap this process towards a critical mass.

Additionally, the future of the web depends largely on the motivation of users and developers to contribute to it. As mentioned, there is currently little reason for a content publisher to post data in a semantically annotated format. As the idea catches on, however, developers will begin to realize the power of providing high-quality data to the web, and will use it as a tool to proliferate their data across the Internet. Still, the semantic web drastically changes the notion of value - previously, publishers created high-quality, valuable pages, and how this will translate into creating high-value raw data which can be referenced by anyone is not immediately apparent.

This brings up another issue - that of copyright and intellectual property laws. Currently, it's illegal to republish copyrighted content in its entirety on a personal website. However, in the semantic web, referencing external content is the entire point. Perhaps an infrastructure-level mechanism can be implemented to impose limitations on references to licensed content, or at least provide attribution to the content creators. Regardless, this is another case in which we will have to rethink and rewrite the old rules to fit new circumstances.

Finally, while collaborative factual databases will certainly play a large role in the future of the semantic web, the role of content publishers hosting their own structured data cannot be ignored - after all, it was these creators who built the massive existing World Wide Web. While central databases have become easier and easier to contribute to, the RDF and OWL formats are still relatively nascent, and are very complex and difficult to create. Much work needs to be done on both improving these formats and building tools which automatically generate these complicated frameworks to ease the burden on developers before their adoption by the web community at large can be reasonably expected.

While much of this project was intended to be a survey of current collaborative structured data efforts, most of our research was done on the Freebase system as a case study. Our research began by reading literature on the subject - while papers have been written about the semantic web and about collaborative communities on the web, there is very little written which ties these subjects together. Additionally, we spent several weeks exploring the ins and outs of the Freebase site's functionality, data, and social aspects. Finally, we spoke to four experts in the field about their thoughts on the Freebase community: Jack Alves, former director of engineering at Metaweb (the company behind Freebase); David Huynh, a current Metaweb engineer who is building Freebase; Shawn Simister, an active contributor to Freebase's application development and data modeling communities; and Professor Joe Hellerstein, a professor of computer science at UC Berkeley whose research interests include understanding how data is managed and created in collaborative environments.

Freebase was launched in March 2007 by Metaweb, a startup company based in San Francisco, California. Metaweb was spun off from Applied Minds, Inc., a technology and research & development firm, in 2005, and the company built its database in "stealth mode" over the next two years. At launch, Freebase had 200 users and around 2.2 million individual topics. Today, these numbers have grown to 25,379 users and over 5.3 million individual topics, some of them scraped and automatically annotated from public sources such as Wikipedia, the SEC, and MusicBrainz, and others contributed individually by community members. See Figures 1 and 2, below, for more detail about Freebase adoption over time.

Fig. 1 and 2: Freebase user and topic count growth over time. Source: MQL Query, Freebase.com

Freebase is currently the largest attempt underway to build a factual semantic database out of completely free information. All data stored in the database carries a Creative Commons "attribution" license, which allows unfettered republishing as long as the source is credited. The data is available via an API using a query language called "MQL," which uses JSON data strings, and it is also published in RDF format to encourage the adoption of that new standard. MQL allows not only read but also write access to the database, as long as an API key is used so that changes can be tracked, making it a powerful tool not just for web application development but also for expert user collaboration. Additionally, Freebase offers GUI-based tools on its website which allow users to explore this data, edit facts on an individual basis, and even change the database schema itself. These potent tools provide the functionality necessary to allow users not only to add missing information, but to build the entire Freebase knowledge base from the ground up.
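
As an example of what an MQL read looks like, the sketch below builds the classic "albums by The Police" query; MQL queries are JSON templates in which null and [] mark the values to be filled in by the database. The mqlread endpoint shown is the one documented at the time of writing and is included as an assumption rather than a tested call.

    import json
    import urllib.parse
    import urllib.request

    # Ask for the list of album names attached to this artist topic.
    query = {"type": "/music/artist", "name": "The Police", "album": []}

    envelope = json.dumps({"query": query})
    url = ("http://api.freebase.com/api/service/mqlread?query="
           + urllib.parse.quote(envelope))

    # The service returns a JSON envelope; the matched object is under "result".
    with urllib.request.urlopen(url) as response:
        print(json.load(response)["result"]["album"])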

The Freebase community essentially consists of three sub-communities, each of which plays a different collaborative role: data modelers build the schemas and ontologies which act as the framework holding the entire database together; knowledge contributors add pieces of knowledge to the database, either individually or by automated import; and application developers build applications which query and leverage Freebase's data to perform some task. All three of these communities have ways in which they interact and collaborate - the developers and data modelers each have a mailing list which is used to make announcements and provide support for users with questions and problems. Freebase also allows any topic in its system to have comments or discussions attached to it; most of the interaction among knowledge contributors happens on these nodes. Finally, there is an IRC chat room which offers general knowledge and guidance for using Freebase. All of these systems are at least semi-active and are generally populated by a mix of helpful experts and Metaweb employees: the Freebase system is so young that nearly all of its most knowledgeable collaborators are the engineers who created it. While these technologies do foster community creativity, there is much that Freebase could be doing but is not: there is no central discussion board for general topics, and while a web-based archive of all mailing list posts exists, it is not directly linked from anywhere on the Freebase site. Furthermore, users cannot directly message other users, nor can they add friends. It is hard to say exactly how these features might affect the community, but they would almost certainly make expert knowledge about the Freebase system more available to those interested in tinkering with or contributing to it and would, ideally, encourage a community atmosphere.

The smallest of these three sub-communities is the data modeling community, which focuses on defining the ontologies for Freebase's data. This small group of developers, most of whom started as application developers and moved into data modeling to massage data to fit their needs, has done some incredible work creating standard ontologies, but it struggles with the same challenges the W3C has dealt with time and time again - defining standards is hard work, and it takes a long time to get them right. This leads directly to several problems: first, while a schema is in the works but not yet committed to the database - often a long stretch of time - contributors cannot add data to the proposed new fields. Additionally, when data modelers have to replace legacy schemas with newer, more intuitive ones, they run the risk of breaking the code of anyone referring to the data through the old schema. Finally, new schemas do not necessarily reflect the fields data contributors would like to see; rather, they are often created by the data modelers to fill their own needs.

Jack Alves, who was the director of engineering at Metaweb from the time of its inception to just after its initial launch, spoke with us by telephone and had several thoughts about how to revamp the data modeling process to make it better fit the needs of the community. Most notably, Alves proposed a system which he calls "Sloppy Freebase." In this scenario, when a knowledge contributor has information about a topic which does not yet fit into a field defined in the topic's data schema, he or she would have the ability to add a "tag pair," a piece of semi-structured data holding a user-defined field name and fact. For example, if I found a book in the database which I knew won the Pulitzer Prize in 1997, but the "book" schema did not yet have any formally defined field for awards, I could add a tag pair that looked something like ("Won Award","1997 Pulitzer Prize"). This tag would not be truly semantically linked data, but it would provide visitors to the topic with relevant information about it. Furthermore, these tags would later be parsed, either individually by data modelers or automatically with an intelligent crawling technique, and added to the canonical Freebase as soon as the schema was built. This solves the problem of contributors not being able to add data to unwritten ontologies, and it also gives contributors a way to influence the schemas being built. However, the problem of new schemas breaking old code still remains, a challenge which Alves noted is an unsolved problem with the semantic web idea in general. This raises the question: is it the responsibility of application developers to constantly monitor the schemas they are referencing to make sure their code does not break, or should there be a system by which publishers who define ontologies are responsible for notifying those who reference their data when a schema changes?
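
A minimal sketch of the tag-pair idea, as we understand Alves's proposal (this is an illustration, not an existing Freebase feature; the book and its schema fields are invented):

    from dataclasses import dataclass, field

    @dataclass
    class Topic:
        name: str
        schema_fields: dict = field(default_factory=dict)  # formally modeled data
        tag_pairs: list = field(default_factory=list)       # semi-structured extras

        def add_tag_pair(self, field_name: str, value: str) -> None:
            """Record a fact the topic's schema cannot hold yet."""
            self.tag_pairs.append((field_name, value))

    book = Topic("Example Novel", schema_fields={"author": "Example Author"})
    book.add_tag_pair("Won Award", "1997 Pulitzer Prize")

    # Once the schema gains an awards field, data modelers (or a crawler) could
    # promote tag pairs like this one into properly structured fields.
    print(book.tag_pairs)  # [('Won Award', '1997 Pulitzer Prize')]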

Freebase's application developer community, which seems to be about twice the size of its data modeling community (based on the number of subscribers to each mailing list), has many useful tools at its disposal. The Freebase API is fleshed out and well-documented, and the site hosts a tool which allows users to test their MQL queries to ensure they will return the correct data. Additionally, Metaweb recently released an in-browser application development platform called Acre, which lets developers with no semantic database experience get a Freebase application up and running in minutes using JavaScript, an MQL query, and a simple templating language included with Acre. Since Acre development is done directly on the server, deploying an application is as easy as saving the file in the IDE and opening another browser window with the application's URL. While the documentation for Acre is very sparse (mainly because it is so new), it gives developers a huge amount of power with very little overhead, and allows beginners to get a feel for querying and programming with semantic data without having to spend hours getting things configured and set up. To show off the power of this system, we've created a simple application, hosted at http://mapabiz.dandelany.user.dev.freebaseapps.com, which queries all businesses in Denver with business hours data, checks the current time, and shows whether or not the business is currently open. Our plan is to flesh this out further by mapping the business addresses on a Google Map in the application. While this demonstrates the abilities of the Acre system, it also highlights one of many areas in which data is very sparse: of 1016 businesses listed in the Denver area, only 19 have linked hours information.
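
For reference, the core of the application's open/closed check amounts to the logic below; the shape of the hours data is our own simplification, not the exact Freebase schema the application queries:

    from datetime import datetime, time

    # Simplified stand-in for query results: each business with opening hours.
    businesses = [
        {"name": "Example Diner",  "opens": time(6, 0),  "closes": time(14, 0)},
        {"name": "Example Tavern", "opens": time(16, 0), "closes": time(23, 30)},
    ]

    def is_open(biz: dict, now: time) -> bool:
        return biz["opens"] <= now <= biz["closes"]

    now = datetime.now().time()
    for biz in businesses:
        print(f"{biz['name']} is currently {'open' if is_open(biz, now) else 'closed'}")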

This brings us to the third and largest community, and possibly the hardest to pin down due to the lack of a central communication system: Freebase's knowledge contributors. Currently, contributors are creating approximately 150,000 new topics every month, and this rate is growing. However, Shawn Simister, a Freebase application developer, showed us an application he made, hosted at http://leaderboard.narphorium.user.dev.freebaseapps.com, which displays the top contributors to Freebase. Of Freebase's 25 top contributors, 22 are (or were) employees of Metaweb, indicating that the majority of content creation is being done by Freebase's creators importing large datasets. So how can Freebase encourage more community contributions? Answers varied among those we asked, but they mostly seemed to involve either lowering the barrier or raising the ceiling for user contributions. Jack Alves, who proposed the "Sloppy Freebase" idea, noted that some users are casual contributors, for whom the data contribution task needs to be made as simple and flexible as possible, while others are academics or power users who need powerful tools to contribute large datasets. Simister also stressed the importance of an easy way to import large, structured datasets: "As a programmer, I feel that I'm most effective when I'm contributing large data sets … there [are some] facilities built into Freebase that let users upload lists of topics, but they're still limited to specific situations so I prefer writing my own tools." David Huynh, a Metaweb engineer, told us that Metaweb is currently working on a tool to do exactly this, allowing users to upload spreadsheets directly into Freebase, and several other developers also mentioned the project. Additionally, Professor Joe Hellerstein noted that the portability of Freebase's data (and of the semantic web in general) implies the ability to collect contributed data from applications hosted anywhere on the web, saying, "My experience… is that the community is likely to form in another space that is focused on a topic of interest. The data-centric features need to feel organic there -- not in an offsite place about data [like Freebase]."

These three communities are the heart of Freebase's database, and while their systems have their shortcomings, Freebase seems to be head-and-shoulders above anyone else in the bottom-up semantic web realm in terms of openness of data, sheer database size, and community activity. The dream of the semantic web, with all its abstractions, hierarchies, ontologies and references is and will remain a dream for the near future, but companies like Metaweb, along with many others with their own unique approaches, have begun to create the future of the Internet today, and will continue to do their best to fuel the fires of creativity and collaboration to constantly improve our World Wide Web.


Contribution of Individual:

Daniel Delany:

  • Research on Freebase structure
  • Freebase API - how easy is it for developers to access graph data?
  • Freebase contribution - How do content publishers publish to the graph?
  • Most writing, including abstract
  • Research on "Giant Global Graph" concept

Diana Tamabayeva:

  • Research on general semantic web, other examples for comparison
  • Researching credible references and scientific papers on the topic
  • Creation of PowerPoint presentations

Description of Independent Research:


Previous work:

"Assignment 10" PowerPointPresentation is in the attachment


References:

Phone Interview, Jack Alves, former Metaweb Director of Engineering, 5:30PM 12/8/08

E-mail exchange, Shawn Simister, Freebase application developer, 12/1/08

E-mail exchange, David Huynh, Metaweb UI Architect, 12/4/08

E-mail exchange, Prof. Joe Hellerstein, Computer Science, UC Berkeley, 12/7/08

"The Semantic Web " by Tim Berners-Lee

"The Vocabulary Problem in Human-System Communication" by Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T.

"Freebase: an open, shared database of the world's knowledge" Official Freebase Website

"SnapShot of freebase.com" freebase.com Website Analytics

"Freebase Parallax: A New Way to Browse or Explore Data" Video presentation of Freebase

"Freebase: A Collaboratively Created Graph Database For Structuring Human Knowledge" by Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor

"Freebase Parallax a promising search tool" by Martin Heller, August 23, 2008

"Giant Global Graph" by Tim Berners-Lee, Nov 21, 2007

"Bye bye World Wide Web, welcome Giant Global Graph" by Tim Leberecht, Dec 2, 2007

"The Semantic Web: An Introduction"

"State of the Semantic Web"

"W3C Semantic Web Frequently Asked Questions"

"Top 10 Semantic Web Products of 2008 " By RICHARD MACMANUS, Published: December 2, 2008

"The Semantic Web's biggest problem" by Mathew Ingram

"Semantic Web: Difficulties with the Classic Approach" by Alex Iskold

"Entrepreneurs See a Web Guided by Common Sense " By JOHN MARKOFF

"Top-Down: A New Approach to the Semantic Web" Written by Alex Iskold

"Semantic Web: Difficulties and Opportunities" by Yihong Ding

"10 Semantic Apps to Watch" by Richard MacManus

Papers:

"The Vocabulary Problem in Human-System Communication" by Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T.

"The Changing Web and COPYRIGHT" by Butler, Rebecca, rbutler@niu.edu

"A software engineering approach to ontology building " by De Nicola, Antonio

"The Freebase Experience." by By Mattison, David

