Warehouses with Interactive Content
How Massive Data Warehouses Manage Content
Team Awesome
Adam Jackson, Brionna Lopez, Ian Smith, Aaron Vimont
Abstract
The Internet is evolving into its next stage where static pages no longer exist and every page you visit is customized specifically for you. Often referred to as Web 2.0, the new Internet features content more often than not created by the users themselves and, through the use of Human Centered Computing practices, it is delivered to you based on input from thousands or even millions of other users like you.
The next generation of massive data warehouses like Netflix, Amazon, YouTube, Wikipedia, and Facebook contain applications, images, video, and other data that provide a more complex set of challenges for designers. Users not only need an interface that allows them to rapidly find the content they desire, but also simple ways to contribute to these environments.
This paper explores whether these warehouses with interactive content succeed in allowing people to quickly explore, find, and add content, what metrics they provide to help users preview content, and what strengths, weaknesses, requirements, and problems these dynamic repositories have.
Keywords
design, user content, personalized content, data warehouses, Netflix, Amazon, YouTube, Wikipedia, Facebook, rating, voting, recommendation, review, contribute, upload, collaborate, web 2.0
Problem
No matter what type of data massive warehouses with interactive content feature, they all share one common problem: filtering through vast amounts of data in order to provide their users with whatever it is they want.
Goal
There are a number of potential solutions to the problem massive warehouses share. The goal of this paper is to determine whether warehouses with interactive content succeed in allowing people to quickly explore, find, and add content. Next, we discuss what metrics they provide to help users preview content. Finally, we ascertain the strengths, weaknesses, requirements, and problems these dynamic repositories have.
Statement of the Problem
For every minute of every day, YouTube users upload 35 hours’ worth of content (1). At this rate, if you attempted to watch all of the video content uploaded on a given day end-to-end without sleeping or eating, it would take you more than five and a half years. Needless to say, YouTube has a data problem, the same data problem faced by almost every other large data warehouse. These websites need to quickly present their users with the content they request and also allow users to find related content. Over the course of our research, we have come to realize the all-encompassing nature of the problem of optimizing users’ access to content: it affects every aspect of the website, from its UI to its system architecture to the suggestion algorithms it uses.
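The five-and-a-half-year figure follows from simple arithmetic, sketched here in Python. The only input taken from the text is the 35 hours uploaded per minute (1); the constant and function names are our own, and a 365-day year is assumed.

```python
HOURS_UPLOADED_PER_MINUTE = 35  # figure cited in the text (1)
MINUTES_PER_DAY = 24 * 60


def years_to_watch_one_day_of_uploads(hours_per_minute=HOURS_UPLOADED_PER_MINUTE):
    """Years of nonstop viewing needed to watch a single day's uploads."""
    hours_of_video = hours_per_minute * MINUTES_PER_DAY  # 35 * 1,440 = 50,400 hours
    days_of_viewing = hours_of_video / 24                # 2,100 days
    return days_of_viewing / 365                         # assumes a 365-day year


print(round(years_to_watch_one_day_of_uploads(), 2))
```

One day of uploads works out to 2,100 days of continuous viewing, which is roughly 5.75 years.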
Rationale
Data warehouses with interactive content present an interesting and important problem in how they go about obtaining content from a single user and distributing that content to the rest of their users. Amazon and Netflix rely on rating and recommendation systems to help users find the product they desire. Facebook and YouTube track how you and your fellow users behave in order to bring you content they think will captivate you and encourage you to contribute. Wikipedia wouldn’t exist without its user base, and the millions of people around the world who use Wikipedia every day have only a small handful of contributors and editors to thank.
These problems are highly important and are, in fact, the driving force behind many new and innovative ideas related to the next generation of the Internet. It all comes back to Web 2.0, creating a web customized to every individual user. Only by spurring interest in these fields can we foster the innovation and creativity needed to help overcome these resounding problems.
Methodologies
The majority of the research conducted for this project was done through the use of data analysis and interviews. Our interviewees include John Bacus, a member of the Google SketchUp team, and Holger Dick, a Computer Science Graduate Student at the University of Colorado specializing in Human-Centered Computing. We also reference articles written by recommendation algorithm engineers at Google and Facebook.
Related Work
While all of these websites have been greatly researched individually, there is very little research focusing on all of them together. Specifically, our research aims to reveal similarities and differences between how these sites manage their massive data warehouses. Additionally, we intend to add information about our own experiences with these websites and whether or not we find their content distribution systems successful.
Characterization of the Individual Contributions
Ian conducted research on YouTube, specifically how the recommendation algorithm works, and also investigated the real-world performance of said algorithm. Aaron’s research involved learning about how Facebook’s News Feed and advertising systems deliver customized content to users. Adam investigated Amazon and Netflix, specifically how their rating and review systems deliver different recommendations to users. Brionna researched Wikipedia and how its structure enables users to quickly find content related to the page they’re currently on.
Findings
No matter how you look at it, the underlying structure of all these sites could not function without their users. By utilizing the input of their communities, these websites deliver the best experience possible to each individual using the system. Below we will explain how each website obtains input from individual users and how they in turn deliver results to the community.
Netflix
Netflix provides a collection of over 100,000 movies and television shows to its user base of over ten million people (2). It comes as no surprise that Netflix’s biggest problem is enabling users to filter through this extremely expansive collection in order to find titles they’re actually interested in. Search alone is not enough to help users find movies they like, and Netflix found itself in need of a recommendation system. This need was so substantial that Netflix offered a $1,000,000 cash prize to any individual or team who could improve its current recommendation system by 10% or more. While the users themselves do not contribute their own content to Netflix’s library, there is one aspect that makes Netflix’s entire recommendation system possible: user ratings.
Any Netflix subscriber can rate any title listed on Netflix. The rating system is so expansive that a user can give a different rating to every single episode of every single television series. Netflix takes these ratings and, through the use of complex machine learning algorithms, uses them to determine what types, genres, and aspects of movies you enjoy. The system then finds movies that fit these criteria and recommends them to you. The system also compares your interests with those of other users, and recommends movies enjoyed by the users most similar to you.
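The idea of recommending movies enjoyed by the users most similar to you can be illustrated with a minimal sketch of user-based collaborative filtering. This is not Netflix’s actual algorithm, which the sources do not describe in detail; the titles, ratings, and function names below are hypothetical, and cosine similarity is just one of several similarity measures that could be used.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 star scale: user -> {title: stars}.
RATINGS = {
    "you":   {"Alien": 5, "Blade Runner": 4, "Notting Hill": 1},
    "alice": {"Alien": 5, "Blade Runner": 5, "The Matrix": 5, "Notting Hill": 1},
    "bob":   {"Alien": 1, "Notting Hill": 5, "Love Actually": 5},
}


def similarity(a, b):
    """Cosine similarity computed over the titles both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank unseen titles by similarity-weighted ratings from other users."""
    seen = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = similarity(seen, theirs)
        for title, stars in theirs.items():
            if title not in seen:
                scores[title] = scores.get(title, 0.0) + sim * stars
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("you", RATINGS))
```

In this toy data, "alice" rates shared titles almost exactly as "you" do, so her unseen favorite outranks the title favored by "bob", whose tastes run opposite.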
Amazon
Amazon is very similar to Netflix in its use of rating and recommendation systems. Amazon features millions of products, ranging from music and movies to clothing and outdoor gear (2). In addition to search, ratings, and recommendations, Amazon also features reviews for every product on the site. The reviews themselves can then be rated by other users, and the top-rated reviews appear on the main product pages. Product pages also feature links to analogous products which other users may have considered or purchased instead of the product you’re currently viewing.
Similar to Netflix, Amazon’s rating, recommendation, and review systems would not exist without direct contribution from their user base. While only a small number of users take the time to rate or review a product, it is estimated that more than 90% of consumers take product ratings and reviews into account during the purchase process (15). Amazon and Web 2.0 are ushering in an age where consumers are no longer passive, but actively seek out information about their purchases, most often from their fellow consumers.
Wikipedia
Wikipedia features almost three and a half million articles in English alone. Search is the most common method used for finding articles, but one of the site’s most useful features for finding additional articles is inherent to its structure. Since all of the content of Wikipedia is created by the users themselves, a new page can be added at any time by any person. Once the new article has been created, the author can return to the original page and insert a link to the new page. This simple method for linking related content gives users the power to gain further knowledge about any subject they wish.
One of the main features of Wikipedia is that any user can create a page and any user can edit a page; this, of course, allows biased or incorrect information to be added. To minimize this, software robots are used, and users are able to add pages to their “watch list.” By doing so, users can monitor when pages are changed, what those changes are, and who made them. For this assignment, a few key words were changed on a well-known page, and within a couple of days the mistake was corrected. This experiment helped illustrate that Wikipedia is monitored fairly well and can generally be trusted, but a user should also know that not every page is absolutely correct all the time.
Also, some pages do not allow anonymous users to edit them. This occurs on highly frequented pages that are protected; in order to edit such a page, a user must be logged in as an administrator. Wikipedia advocates that everyday users can edit any content, but a user must petition to become an administrator of a protected page, and should do so only if that user has deep insight into the subject and is not merely looking for editorial recognition. In addition to users editing the pages themselves, a discussion tab is attached to every Wikipedia page where people can voice their concerns about the page and its content; these concerns can be viewed by other people who are willing and able to change the page. Also, if a particular user is noticed continuously adding inappropriate or inaccurate information, that user will be put on a list that prevents any further changes to the page by that person. Finally, to stop a recurrent problem, editing of an article can be temporarily suspended and/or user names and IP addresses can be blocked from editing.
Another important component of Wikipedia is that it runs on MediaWiki. MediaWiki is a free wiki application that is run and developed by the Wikimedia Foundation. This software is written in PHP and has a backend database. With this software, links are color-coded: red indicates that a page does not yet exist and can be created for a specific keyword, whereas blue indicates a page that exists and is relevant to understanding the content better. Linking is also governed by an extensive set of rules for users, which cover how many links can be present, their relevancy, and where the links go. These links help point readers to more information that is relevant and related to the page being viewed.
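The red and blue link convention can be sketched in a few lines. This is only an illustration in Python, not MediaWiki’s PHP implementation: the set of existing pages and the sample text are hypothetical, and real MediaWiki also normalizes titles and handles namespaces, which this sketch skips. The double-bracket link syntax with an optional label after a pipe is standard wiki markup.

```python
import re

# Hypothetical set of article titles that already exist.
EXISTING_PAGES = {"3G", "Mobile phone", "Wikipedia"}

# Matches [[Target]] and [[Target|label]], capturing only the target title.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def link_color(title, existing_pages=EXISTING_PAGES):
    """Blue for a page that exists, red for one that could still be created."""
    return "blue" if title in existing_pages else "red"


def classify_links(wikitext, existing_pages=EXISTING_PAGES):
    """Map every link target found in the text to its display color."""
    return {title: link_color(title, existing_pages)
            for title in WIKI_LINK.findall(wikitext)}


SAMPLE = "[[3G]] succeeded [[2G]] on the [[Mobile phone|mobile]] network."
print(classify_links(SAMPLE))
```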
YouTube
YouTube’s main interface for enabling users to discover new content is its ‘recommended videos’ panel that accompanies every YouTube video. The point of recommending other videos is to entice the user to continue watching videos, and the best way to do this is to try to guess exactly what the user would want to watch. YouTube uses two different recommendation algorithms, both of which are based upon YouTube’s Adsorption Algorithm (10). The Adsorption Algorithm recommends videos to users based upon the user’s viewing history and also on how popular any given video is in terms of its total views. The algorithm gives weight to videos watched by users with similar viewing histories (for example, a user who watches many videos that feature cats will be recommended videos watched by other users who watched the same videos) and also to videos with a similar number of views. One version of Adsorption (Adsorption-N) tends to recommend more popular videos, whereas the other version (Adsorption-D) tends to recommend videos with similar content.
The Adsorption Algorithms are where the idea of The Long Tail really comes into play, because the casual YouTube user tends to want to watch very popular videos (like the latest music videos) and the more dedicated YouTube user tends to want to watch relatively obscure videos. As a result, YouTube tailors recommendation results based upon whether a given user is more or less casual. For a user that fits the casual profile (i.e., the head of the popularity distribution), the Adsorption Algorithm will recommend popular videos that don’t necessarily have much in common with a video that user has recently watched. For dedicated users (i.e., users that comprise the Long Tail), the algorithm will recommend videos that aren’t necessarily popular but are related to videos the user has recently watched. In this way, YouTube ensures that it can cater to its whole spectrum of users.
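The contrast between popularity-driven and history-driven recommendations can be illustrated with a small sketch. This is not the Adsorption algorithm itself, which propagates information through random walks on the view graph (10); it is a much simpler co-view count that we use only to show the two behaviors. The viewing histories, video names, and the `casual` flag are all hypothetical.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of video ids.
HISTORIES = {
    "dana": {"typewriter_repair"},
    "eli":  {"typewriter_repair", "typewriter_history"},
    "fay":  {"typewriter_repair", "typewriter_history"},
    "gus":  {"music_hit"},
    "hal":  {"music_hit"},
    "ivy":  {"music_hit", "typewriter_repair"},
    "jo":   {"music_hit"},
}


def view_counts(histories):
    """Total number of viewers per video (a stand-in for popularity)."""
    return Counter(video for videos in histories.values() for video in videos)


def co_view_scores(user, histories):
    """Score unseen videos by how strongly their viewers overlap with the user."""
    mine = histories[user]
    scores = Counter()
    for other, theirs in histories.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        for video in theirs - mine:
            scores[video] += overlap  # adds 0 when histories do not overlap
    return scores


def recommend(user, histories, casual, top_n=3):
    """Casual users get popular videos; dedicated users get related ones."""
    mine = histories[user]
    ranked = view_counts(histories) if casual else co_view_scores(user, histories)
    return [video for video, _ in ranked.most_common() if video not in mine][:top_n]


print(recommend("dana", HISTORIES, casual=True))
print(recommend("dana", HISTORIES, casual=False))
```

With this toy data, the casual profile is led by the widely watched music video, while the dedicated profile is led by the obscure typewriter video watched by users with similar histories.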
Facebook
Facebook has become a huge repository of personal information for millions of users. Because of this, ads are personalized to fit a specific user. Facebook provides tools, such as Ad Manager, to allow business pages to target a specific audience. These ads can be broken down to target different ages, sexes, locations, and interests (11). This tool can also be used to determine how much the company is willing to pay per click. To put an ad on Facebook, the company simply needs to develop a page for itself, add an image for the company, and submit its ad information (11). Being able to use the immense amount of information on Facebook gives companies a huge audience for their products. In addition, the ads themselves can be very effective. Research has shown that seeing a Facebook friend that “likes” a product will significantly improve the chance of a user “liking” or buying that product (12).
Because the ad tools are so customizable, it becomes easier to reach a specific audience. These tools cut down on the number of people the ads need to be targeted towards. The ad algorithms do not need to try to cycle ads through every single Facebook page. Instead, they can take data from Ad Manager and simplify their results. This makes searching and utilizing data stored in Facebook’s data warehouses much easier to do.
Another important aspect of Facebook is the News Feed that initially shows up when a user visits the homepage. The News Feed contains stories and other information that a user’s friends have recently contributed. This feed constantly updates with new information. The News Feed uses what Facebook developers call the EdgeRank algorithm (13). The EdgeRank algorithm treats every item that can show up in the News Feed as an object. When a user interacts with an object in the News Feed, which could be adding a comment or simply clicking a link to a profile, an Edge is created. These Edges are given a rank based on the type of interaction with the News Feed object, how often a user interacts with a certain profile, and how old the Edges are. Edges with higher rank are given higher precedence in the News Feed (13). Algorithms like this help make sense of how much data people interact with on a regular basis. If the Facebook News Feed displayed every new post from every friend, people would be overwhelmed. Developing a system to organize the data is critical for the use of any Web 2.0 site.
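The three ranking factors named above (interaction type, how often the user interacts with the profile, and the age of the Edge) can be combined in a small sketch. Multiplying the factors and summing over an object’s Edges is one common way to describe EdgeRank; the specific weights, the decay function, and all names below are our assumptions for illustration, not Facebook’s actual values.

```python
from dataclasses import dataclass

# Hypothetical weights per interaction type; real values are not given in the text.
TYPE_WEIGHT = {"comment": 3.0, "like": 2.0, "click": 1.0}


@dataclass
class Edge:
    kind: str         # type of interaction, a key of TYPE_WEIGHT
    affinity: float   # how often the viewer interacts with the poster (0 to 1)
    age_hours: float  # how old the Edge is


def edge_score(edge):
    """Affinity x type weight x time decay (the decay formula is an assumption)."""
    decay = 1.0 / (1.0 + edge.age_hours)
    return edge.affinity * TYPE_WEIGHT[edge.kind] * decay


def rank_feed(stories):
    """stories maps a story id to its list of Edges; highest total score first."""
    return sorted(stories,
                  key=lambda story: sum(edge_score(e) for e in stories[story]),
                  reverse=True)


STORIES = {
    "old_post":           [Edge("comment", 0.9, 200)],
    "acquaintance_link":  [Edge("click", 0.1, 1)],
    "close_friend_photo": [Edge("comment", 0.9, 2), Edge("like", 0.9, 1)],
}
print(rank_feed(STORIES))
```

Here the recent, heavily interacted story from a close friend ranks first, and the old post sinks to the bottom despite its strong affinity, because its Edge has decayed.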
Conclusions
Now that we’ve examined how these websites allow users to explore and find content, we must ask whether or not they are successful in doing so. Alexa.com, a website analytics tool, ranks Facebook as the 2nd most visited site in the world, YouTube as the 3rd, Wikipedia as the 7th, and Amazon.com as the 16th (7). According to Compete.com’s traffic statistics, these sites have a combined total traffic of more than 410 million visitors (8). Netflix was the number one ranking company in customer satisfaction in 2009, with Amazon coming in as a close second (9). According to these statistics and the extremely widespread use of these sites, it appears that these sites are indeed successful at enabling users to explore and find content. Warehouses such as these can collect, analyze, and share data from millions of different users. The algorithms have grown and produced sites completely customizable for a single user. Warehouses with interactive content provide a way to connect people all over the world in a way never seen before.
Results, Experiences, & Recommendations
We have all taken the time to look through the various sites we researched. We wanted to see how effective their search engines are for finding content, how accurate their recommendation systems are for new products or movies and how quickly the sites are updated.
Netflix
The Netflix recommendation algorithm is intended to help you find movies you’re interested in, but how well does this system work? The answer is surprisingly well, but only to a certain extent. A Netflix “power user” is described as someone who has rated thousands of movies. The general feedback from these users is that as they rate more and more content, the content recommended to them is more and more accurate in terms of what they are predicted to like. Too many ratings can be a bad thing though, as one user who rated more than 20,000 items no longer receives any recommendations at all (16). By rating content he had never seen and recommending titles in genres he effectively blocked, this user “broke” the Netflix recommendation algorithm. In more than five years this user has still not received a single recommendation.
Amazon
Amazon uses an algorithm to make recommendations for new products based on the frequency of purchases and the actual items purchased. We have found these recommendations to be a fairly good representation of past purchases. However, purchases made as gifts for other people also get included in this algorithm; therefore some of the recommendations are not an accurate representation of products we would actually want to buy. The algorithm seems quite sensitive in that a single purchase can change the outcome of recommendations. For instance, buying a single Christmas CD can flood the recommendations with other Christmas music favorites. However, the recommendations more often suggest products that would be useful for the user.
Wikipedia
When searching for a page on Wikipedia, generally a page will automatically load. However, if there are multiple pages with the same title and it cannot be determined which specific page the user wants, then a Disambiguation page will be loaded with relevant page titles (17). When experimenting with various page titles, we found that many pages loaded automatically and were relevant, but occasionally a page loaded that we were not looking for. For example, when looking for “bank” in the geographic sense, a page about the financial institution loaded automatically.
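The search behavior we observed can be approximated with a short sketch. This is our guess at the visible behavior, not a description of how MediaWiki resolves titles internally; the title index and the `resolve` function are hypothetical.

```python
# Hypothetical title index: lowercase search term -> matching article titles.
TITLE_INDEX = {
    "3g": ["3G"],
    "bank": ["Bank", "Bank (geography)", "Bank (disambiguation)"],
    "mercury": ["Mercury (planet)", "Mercury (element)", "Mercury (mythology)"],
}


def resolve(term, index=TITLE_INDEX):
    """Return ("article", title), ("disambiguation", titles), or ("search", [])."""
    candidates = index.get(term.lower(), [])
    exact = [title for title in candidates if title.lower() == term.lower()]
    if exact:
        return ("article", exact[0])          # an exact title wins outright
    if len(candidates) > 1:
        return ("disambiguation", candidates)  # ambiguous: let the user choose
    if len(candidates) == 1:
        return ("article", candidates[0])
    return ("search", [])


print(resolve("bank"))
print(resolve("Mercury"))
```

In this sketch, "bank" loads the article with the exact title even though a geographic sense exists, which mirrors our experience, while "Mercury" has no exact match and falls through to a disambiguation list.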
As a team, we generally liked how this method worked because it was accurate more often than not, and if a wrong page loaded it was fairly simple to find what we were looking for. Also, the sidebar of the Wikipedia site features buttons for random articles and featured content. The random articles button literally loads any random page in the Wikipedia warehouse. One recommendation for Wikipedia that we believe would be useful is a setting for the random button that restricts it to certain types of pages, such as protected or featured content, so that it returns more reliable pages. Another recommendation is to have additional recommended articles for every page, not just some of them.
For example, when looking at the 3G page, a list of related articles was there, but for other pages a related articles box was missing. Thus, having a related articles box would be helpful for every page. A final recommendation from our use of the Wikipedia site is to have a better system for checking edits on a page. For this project, a few key words were changed on the 3G page and it took a few days for them to be fixed. While the error was fixed fairly quickly, we felt it could have been done faster. Perhaps a solution could be implemented that checks whether changes to key words fit the page and flags obvious spelling errors, or that requires administrators to review changes more quickly in order to keep their role on the page. All of these recommendations would enhance the use of Wikipedia as a whole and make the user experience better.
YouTube
Searches on YouTube can produce results ranging from a few hundred to over 100,000 videos. This makes it incredibly difficult for users to find the video they are looking for. They need to be very specific with their searches and include many keywords. Without this, the desired video can be buried many pages back. YouTube does have different search options located at the top of the page; however, these do not greatly cut down on the number of videos presented in the search.
One recommendation for YouTube would be a search rating system. The user could enter a search and then vote on the relevance of a video to the search. This would certainly add a level of complexity to YouTube’s warehousing, but it could potentially help reduce the number of videos presented to the user. The search could first list the videos by relevance to the user’s keywords. If the user wished to see all of the videos, they could click for that search option. This method may be one way for YouTube to help control the immense amount of content that is uploaded.
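A minimal sketch of how such a search rating system might work is given below. It illustrates our proposal only and says nothing about how YouTube search is built; the class name, method names, and video ids are hypothetical.

```python
from collections import defaultdict


class SearchVotes:
    """Stores per-query relevance votes and reorders search results by them."""

    def __init__(self):
        # query -> video id -> net votes (relevant minus irrelevant)
        self._votes = defaultdict(lambda: defaultdict(int))

    def vote(self, query, video_id, relevant):
        """Record one user's judgment of a video's relevance to a query."""
        self._votes[query.lower()][video_id] += 1 if relevant else -1

    def rank(self, query, results):
        """Sort results by net votes; unvoted videos keep their original order."""
        votes = self._votes.get(query.lower(), {})
        return sorted(results, key=lambda video: votes.get(video, 0), reverse=True)


votes = SearchVotes()
votes.vote("feynman lecture", "v3", relevant=True)
votes.vote("feynman lecture", "v3", relevant=True)
votes.vote("Feynman Lecture", "v1", relevant=False)
print(votes.rank("feynman lecture", ["v1", "v2", "v3"]))
```

Because votes are stored per query, a video voted irrelevant for one search is not penalized in unrelated searches.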
In our experience with YouTube, we found the recommendation system to be hit-or-miss. The system worked best when there are already many related videos with a similar number of views. For example, when we started with a video lecture by Richard Feynman (a physicist) about scientific thought, YouTube recommended videos that were all either lectures by Richard Feynman or lectures about physics. When we watched these recommended videos, they in turn were linked to videos about physics. We eventually got to the point that YouTube mostly recommended us videos about general relativity, but these videos were still about similar subject matter and also had a similar number of views.
However, when YouTube is faced with videos that have a small number of views, the recommendation system does not generally yield such high-quality recommendations. Perhaps the best example of this is that one of the group members checked to see what YouTube recommended in response to one of his own videos. The video in question records his interactions with a typewriter and has about fifty views. YouTube recommends a couple of videos about typewriters that have a similar number of views, but also recommends a Lady Gaga music video that has over three hundred million views: this recommendation could hardly be less accurate. We think the recommendation system fails here because there is so little data upon which it can base its recommendation. Perhaps in this case, YouTube should offer fewer recommendations rather than populate its recommendation list with irrelevant videos.
Facebook
Facebook’s EdgeRank algorithm successfully updates the News Feed page with stories from the friends a user visits more often. We experimented with the algorithm by finding friends whose pages we rarely visited and interacting with those pages. As we increased the number of “edges” we had with those people, their stories began to show in the News Feed. The current algorithm is a great way to follow stories of the friends a user visits the most. However, stories from other friends get lost with this algorithm. We would recommend that the algorithm have an increased level of randomness, so as to include News Feed stories from friends a user does not connect with as much. That way those people are not lost behind dozens of stories from a single friend.
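One way to picture the randomness we recommend is to interleave a randomly chosen story from rarely visited friends after every few top-ranked stories. The sketch below is purely illustrative of our suggestion; the function name, the `every` parameter, and the story ids are hypothetical.

```python
import random


def mix_feed(ranked, overlooked, every=3, rng=None):
    """Insert one randomly chosen overlooked story after every `every` ranked ones."""
    rng = rng or random.Random()
    pool = list(overlooked)
    rng.shuffle(pool)  # random order, so no overlooked friend is favored
    feed = []
    for position, story in enumerate(ranked, start=1):
        feed.append(story)
        if position % every == 0 and pool:
            feed.append(pool.pop())
    return feed


RANKED = ["a1", "a2", "a3", "a4", "a5", "a6"]  # stories ordered by the ranking algorithm
OVERLOOKED = ["x", "y", "z"]                    # stories from rarely visited friends
print(mix_feed(RANKED, OVERLOOKED, every=3, rng=random.Random(0)))
```

The ranked stories keep their relative order, so the feed still favors close friends while guaranteeing some exposure for everyone else.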
References
1. Walk, Hunter. "Great Scott! Over 35 Hours of Video Uploaded Every Minute to YouTube." YouTube Blog. Blogger, 10/11/2010. Web. 13 Nov 2010. <http://youtube-global.blogspot.com/2010/11/great-scott-over-35-hours-of-video.html>
2. "Netflix." Wikipedia. Wikimedia Foundation, 2010. Web. <http://en.wikipedia.org/wiki/Netflix>.
3. "Amazon." Wikipedia. Wikimedia Foundation, 2010. Web. <http://en.wikipedia.org/wiki/Amazon>.
4. "Wikipedia." Wikipedia. Wikimedia Foundation, 2010. Web. <http://en.wikipedia.org/wiki/Wikipedia>.
5. "YouTube." Wikipedia. Wikimedia Foundation, 2010. Web. <http://en.wikipedia.org/wiki/YouTube>.
6. "Facebook." Wikipedia. Wikimedia Foundation, 2010. Web. <http://en.wikipedia.org/wiki/Facebook>.
7. "Alexa Top 500 Global Sites." Alexa. Alexa Internet, 12/11/2010. Web. 14 Nov 2010. <http://www.alexa.com/topsites>.
8. "Compete.com Site Comparison." Compete. Compete.com, 12/12/2010. Web. 14 Nov 2010. <http://siteanalytics.compete.com/wikipedia.org+amazon.com+youtube.com+netflix.com+facebook.com/>.
9. "Netflix leads the pack in customer satisfaction, says survey." Internet Retailer. Internet Retailer Magazine, 06/05/2010. Web. 14 Nov 2010. <http://www.internetretailer.com/2010/05/06/netflix-leads-pack-customer-satisfaction-says-survey>.
10. S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, M. Aly (2008), Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph, Proceedings of WWW-2008 (WWW-2008). <http://www.esprockets.com/papers/adsorption-yt.pdf>
11. Butler, Chris. “A Practical Guide to Social Media: How to Use Facebook to Promote Your Business.” (2009). <http://www.newfangled.com/how_to_use_facebook_to_promote_your_business>.
12. Carter, Brian. “Facebook Ad Algorithm Cracked: The Secrets Behind Successful Facebook Advertisers.” (2010). <http://www.allfacebook.com/facebook-ad-algorithm-2010-10>.
13. Kincaid, Jason. “EdgeRank: The Secret Sauce That Makes Facebook’s News Feed Tick.” (2010). <http://techcrunch.com/2010/04/22/facebook-edgerank/>.
14. Bennet, James. "The Netflix Prize." N.p., 2007. Web. 15 Nov 2010. <http://74.125.155.132/scholar?q=cache:0M847r8IrnsJ:scholar.google.com/+netflix+queue&hl=en&as_sdt=4000>.
15. Barone, Lisa. "Checking in With Customer Reviews." Small Business Trends 10/11/2010: n. pag. Web. 15 Nov 2010. <http://smallbiztrends.com/2010/11/checking-in-with-customer-reviews.html>.
16. Madrigal, Alexis. "Meet a King of Netflix." Atlantic 17 09 2010: n. pag. Web. 10 Dec 2010. <http://www.theatlantic.com/technology/archive/2010/09/meet-a-king-of-netflix/63191/>.
17. Ronghuai, Huang, Yang Qiang, and Pei Jian. "Advanced Data Mining and Applications." Google Books. Springer, 2009. Web. 10 Dec 2010. <http://books.google.com/books?id=cY77E4JHB-wC&pg=PA279&dq=wikipedia+%22related+articles%22&hl=en&ei=aZICTdXgMsSXnAeRlpjmDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CDQQ6AEwAg#v=onepage&q&f=false>.