Digging Out Worthwhile Content on the Web
October 21, 2008
A page from digg.com captured on September 17, 2008.
"Content, content everywhere and not a drop to read."
With that pessimistic but arguably accurate description of the Internet, Kristina Lerman opened her talk, "The Dynamics of Social Voting," at this year's SIAM Annual Meeting. A 21st-century version of "57 Channels (And Nothin' On)," Lerman's declaration refers to the fraction of Web content most users deem worth reading, compared with the massive volume available: the signal-to-noise ratio of the Internet, so to speak.
Lerman, a project leader at the University of Southern California's Information Sciences Institute, highlighted recent data suggesting that for each gigabyte of "authored" content created for the Web---such as an article on a newspaper's Web site---between 2 and 7 gigabytes of user-generated content are created: blog posts, comments on other posts, MySpace updates, book reviews on Amazon, and other end-user-generated information.
With such massive quantities of data being generated daily---much of it of interest to no one but its creator and possibly a few friends---how do users search out content that's worth their time?
Lerman, whose talk was part of the mini-symposium Toward Real-Time Analysis of Networks, organized by Tina Eliassi-Rad of Lawrence Livermore National Laboratory, discussed data mining research involving one resource many individuals use to find interesting content: the social news-aggregation Web site Digg.
Digg users promote, or "digg," news from across the Web on the site, where the stories are posted in order according to when they first "arrived." If a story gains enough votes---diggs---it is moved to the site's front page. Stories with fewer votes remain on the site's "Upcoming Stories" pages for a day, after which they expire.
Which stories, of the hundreds submitted each day, make it to the front page?
The number of votes an article gets is not the only thing that affects how quickly it is moved from the "upcoming" pages to the front page, or whether it is moved at all, Lerman explained. Story promotion is also a function of social networking. Much as a disease or a rumor spreads through a community, a Digg story is often promoted by people who know each other, or who have frequent close contact. Because many acquaintances have similar tastes, they often prefer the same stories.
Digg users can designate "friends"---people they know, whether on or off the Web---and Lerman's research revealed that stories recommended by "better-connected" users (those with more Digg "friends") were promoted to the front page more quickly than those submitted by less well-connected users.
Lerman wondered what happened to those stories once they reached the front page.
As she explained in her talk, articles can continue to get votes once promoted to the front page, and front-page stories get far more attention than those still in the upcoming pages: The majority of Digg visitors read only the front page. As Lerman discovered, however, an article that makes it to the front page is not necessarily of interest to a wide audience.
Using story submission and voting data obtained from Digg, she developed a decision tree to predict whether a given story would be "interesting." The decision tree was based not only on the total number of votes a story had received, but also on the number of the votes that were from friends of previous voters (including the original submitter), as well as on the size of the submitter's network. To qualify as "interesting" in her study, a story had to obtain at least 500 votes. (She chose this threshold based on an analysis of previous Digg data---numbers of votes per story, numbers of stories receiving certain numbers of votes; for data collected over the course of a year, about 20% of front-page stories received fewer than 500 votes.)
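The three inputs to the decision tree can be made concrete with a short sketch. The Python below is not Lerman's code. It only illustrates how the features named above might be computed from a story's early votes, with the 500-vote criterion as the label. The function and field names, the `friends` map, and the reading of "friend of a previous voter" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The "interesting" criterion stated in the article: at least 500 votes.
INTERESTING_THRESHOLD = 500


@dataclass
class StoryFeatures:
    votes_seen: int              # number of early votes examined
    friend_votes: int            # early votes cast by friends of earlier participants
    submitter_network_size: int  # number of friends the submitter has


def extract_features(submitter, voters, friends, first_n=10):
    """Compute early-vote features for one story.

    voters:  user names in the order they voted (submitter not included).
    friends: hypothetical map from a user name to the set of that user's friends.
    first_n: how many early votes to look at (10 in the article's comparison).
    """
    early = voters[:first_n]
    seen = {submitter}
    friend_votes = 0
    for voter in early:
        # Assumption: a vote counts as "in-network" when the voter is a friend
        # of the submitter or of anyone who voted earlier.
        if any(voter in friends.get(prev, set()) for prev in seen):
            friend_votes += 1
        seen.add(voter)
    network_size = len(friends.get(submitter, set()))
    return StoryFeatures(len(early), friend_votes, network_size)


def is_interesting(final_votes):
    """Label a story using the article's 500-vote threshold."""
    return final_votes >= INTERESTING_THRESHOLD


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_friends = {"alice": {"bob", "carol"}, "bob": {"dave"}}
    print(extract_features("alice", ["bob", "erin", "dave", "carol"], toy_friends))
```

Feature rows like these, paired with the `is_interesting` label, are the kind of input a standard decision-tree learner would take. The split points the tree actually chose are not given in the talk summary, so none are guessed here.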
Using the decision tree, Lerman correctly identified, after just 10 votes, 57% of the stories that went on to reach "interesting" status; Digg's algorithm, in contrast, needed more than 40 votes to predict only 36% of "interesting" stories.
Lerman's end results were surprising: Although stories submitted by well-connected users and "dugg" by many of the users' friends made it to the front page quickly, those stories were less likely to meet Lerman's criterion for "interesting." On the other hand, front-page stories submitted by less well-connected users were deemed to be more "interesting" in the end: They received higher numbers of total votes. The bottom line, according to Lerman, is that initial interest among a more varied collection of users may be a better predictor of a story's overall popularity.
"Better-connected users might be able to recommend a story that will end up on the front page more easily," she explained, "but [the story] may end up with less overall votes."
Digg has since changed its story-promotion algorithm to devalue recommendations from friends, apparently in an attempt to reduce the power of what Lerman calls "authoritative," or well-connected, users. Still, Lerman sees much in the dynamics of information spread on networks that remains to be explored, including the myriad ways in which social networks can spread ideas and the ways in which networks shape connectivity.
She highlighted numerous potential applications, including viral marketing and recommendation systems for companies like Amazon, quoting studies showing that while music ratings do affect a listener's choices, recommendations from friends do not lead to new Amazon purchases. Amazon is a commercial system, but Lerman emphasizes that mathematical analysis can do the same things for a free content system like Digg: predict the behavior of the system, control it by preventing unwanted behaviors, and determine how individual users affect overall behavior.
Michelle Sipics is a contributing editor at SIAM News.