Wanted: An Accurate Map of the Internet
June 10, 2005
By Sara Robinson
The World Wide Web can be viewed as a growing, ever-changing network graph, in which the Web sites are the nodes and the links between them are the edges. The same is true of the physical infrastructure of the Internet, which consists of router nodes connected by edges of copper and optical fiber. Accurate models of the growth of these graphs are critical for such applications as search engines, business models for Internet transport, and network protocol design.
Over the past decade, researchers have looked at the degree distribution of the nodes in graphs of these and other such large-scale networks because it's a key, large-scale attribute that tends to stay constant over time. In many cases, the distribution has turned out to be "heavy-tailed," with high-degree nodes occurring with higher probability than would be expected in a standard random graph. The degree distributions of the World Wide Web graph and, following a well-known 1999 study, the graph of the physical Internet are thought to be heavy-tailed.
In fact, heavy-tailed degree distributions are common to so many man-made and natural-world networks that they have been called "a signature of human activity." Following several years of what some researchers describe as "irrational exuberance" about heavy tails, however, the recent mood in random networks research leans toward caution and skepticism. "People were thinking that all that matters is the degree distribution, but the degree distribution alone doesn't tell you all that much," says Jon Kleinberg, a professor of computer science at Cornell University.
Riding the recent wave of caution are three papers that raise new questions about the degree distribution of the physical Internet. The papers demonstrate a systematic bias in the measurement technique used to gather the data for the 1999 Internet study--pronounced enough to make even classic random graphs look heavy-tailed.
While most researchers still believe the Internet degree distribution to be heavy-tailed, the tail may be less heavy than once thought, suggesting a need to re-evaluate Internet models. "Now we know that the complexity we see may arise from the tools that we use to observe that complexity," says Aaron Clauset, a graduate student in computer science at the University of New Mexico and, with his adviser, Cristopher Moore, a co-author of two of the three papers. "By accounting for the observational bias, we can begin to move toward more realistic models that capture more of the complexity of the Internet."
Looking for Power Laws
In 1998, three physicists at the University of Notre Dame began to explore the network topology of the Web, with the goal of characterizing its large-scale properties. The researchers--Réka Albert, Albert-László Barabási, and Hawoong Jeong--built a Web-crawling robot, similar to the ones search engines were using, and used the data they culled to approximate the probability that a given page has k incoming links or k outgoing links.
They expected that the Web would look more or less like an Erdős–Rényi graph, a random graph model introduced in 1959 in which edges between nodes are included with some fixed probability p. Erdős–Rényi graphs have a bell-shaped degree distribution, with the fraction of nodes of degree d decreasing exponentially in d. Such a picture made sense, the Notre Dame researchers reasoned, because "people follow their unique interests" when linking their Web sites to others. Given the diversity of interests and the sheer volume of sites, they concluded, the pattern of links "should appear fairly random."
The physicists' data, however, showed something quite different. The degree distribution of the Web had no peak, and it tailed off not exponentially, but polynomially. That is, the distribution obeyed a power law: The probability that a node is connected to k other nodes is proportional to 1/k^n (n in this case was approximately 2). Thus, the researchers concluded, the Web contains many more large-scale "hubs"--Web sites with very high degrees--than might be expected. Networks of this type, in which the node degrees are not clustered around an average value, are sometimes called "scale-free."
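The contrast between the two distributions is easy to see in simulation. The following sketch, using only Python's standard library, compares the degrees of an Erdős–Rényi graph with degrees drawn from a power law with exponent 2; the graph sizes, edge probability, and random seed are illustrative choices, not values from the study.

```python
import random
from collections import Counter

random.seed(42)

def er_degrees(n, p):
    """Degrees of the nodes in an Erdos-Renyi G(n, p) random graph."""
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                deg[i] += 1
                deg[j] += 1
    return deg

def power_law_degrees(n, exponent=2.0, kmax=10**4):
    """n degrees drawn from P(k) proportional to 1/k**exponent."""
    ks = list(range(1, kmax + 1))
    weights = [k ** -exponent for k in ks]
    return random.choices(ks, weights=weights, k=n)

n, p = 1000, 0.01            # mean ER degree is n*p, about 10
er = er_degrees(n, p)
pl = power_law_degrees(n)

# ER degrees cluster near the mean and the tail dies off exponentially;
# the power-law sample routinely produces "hubs" far above its mean.
print(max(er), max(pl))
```

With these parameters the largest ER degree stays within a small multiple of the mean, while the power-law sample almost always contains at least one node of degree in the hundreds.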
The bias in the Web's link pattern, the physicists hypothesized, is due to the phenomenon known as "the rich-get-richer." People are far more likely to link their Web sites to high-profile sites than to little-known ones, causing the high-degree sites to grow at a much faster rate.
The physicists were astonished by their findings, and in 1999, when a paper on their work appeared in Science, other researchers had a similar reaction. Soon, scientists in many fields were examining the patterns in an array of real-world networks and demonstrating that power laws govern many of them. Studies showed power law distributions in collaboration networks of researchers and film stars, in networks of the calls between telephone numbers, and in cellular and ecological processes, to name a few examples.
At the same time, researchers began to devise models explaining the power law growth in different networks. These models generally fall into two categories: One is the rich-get-richer, or "preferential attachment," model, in which power laws arise because of the bias for linking to high-degree nodes. The other, a model called Highly Optimized Tolerance (HOT), introduced in 1999 by Jean Carlson of the University of California at Santa Barbara and John Doyle of the California Institute of Technology, shows that heavy-tailed degree distributions may be an effect of optimal yet reliable design in the presence of a hazard. As an example, the researchers attributed the power law distribution governing the size of forest fires to the optimized placement of firebreaks.
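The preferential attachment mechanism can be sketched in a few lines. This is a minimal version of the idea, not the exact model from any of the papers discussed; the network size, the number of edges per newcomer, and the seed graph are all illustrative assumptions.

```python
import random

random.seed(7)

def preferential_attachment(n, m=2):
    """Grow a graph node by node; each newcomer adds m edges whose
    endpoints are chosen with probability proportional to degree."""
    # 'endpoints' lists each node once per incident edge end, so a
    # uniform draw from it is exactly a degree-proportional draw.
    endpoints = [0, 1]                  # seed graph: one edge, nodes 0-1
    deg = [1, 1] + [0] * (n - 2)
    for new in range(2, n):
        targets = set()
        while len(targets) < min(m, new):
            targets.add(random.choice(endpoints))
        for t in targets:
            endpoints += [new, t]
            deg[new] += 1
            deg[t] += 1
    return deg

deg = preferential_attachment(2000)
# The oldest, best-connected nodes keep attracting new links, producing
# hubs whose degree dwarfs the mean of roughly 2*m.
print(max(deg), sum(deg) / len(deg))
```

Running the sketch shows the rich-get-richer effect directly: the maximum degree is an order of magnitude above the average, and repeating the growth yields a heavy-tailed degree sequence.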
Another trend was the rediscovery of old power law research. Delving into the literature, researchers learned that the mathematical basis of the preferential attachment model dates back at least to 1925, when it was used to explain a power law in the distribution of species among genera of plants. In a paper published in the 1950s (and cited by the HOT researchers), Benoît Mandelbrot had introduced a model that had something of the flavor of HOT to explain a power law in the rank-frequency distribution of words.
Eventually, in the eyes of some, the search for networks that follow power laws became an end unto itself, and some papers were making larger-than-life claims about a particular power law's significance. Capturing the prevailing attitude, a footnote in a 2002 research paper quotes a computer scientist who described a power law as "a signature of a particular human activity: looking for power laws."
In response to the perceived overreaching, much recent work has emphasized the need to validate new graph models that yield power laws, and to study other graph invariants, such as the eigenvalues of a graph's adjacency matrix, in addition to its degree distribution.
Helping to define this environment of caution is the new thread of research questioning the conclusions of the 1999 study of the Internet. The 1999 study--by Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos, three brothers who are computer scientists, respectively, at the Universities of California at Riverside and Los Angeles, and Carnegie Mellon University--indicated that a power law, with an exponent between 2 and 3, governs the topology of routers and cables in the physical Internet. "The Faloutsos paper was important because it showed us that all the previous Internet models had the wrong connectivity distribution," Clauset says. The data used in the study had been gathered by other researchers several years earlier.
The three recent papers indicate that the degree distribution given in the Faloutsos paper, while a step in the right direction, may not be wholly accurate because of a systematic bias in the measurement method used to gather the data. The bias is so extreme that it can make the bell-shaped distributions of Erdős–Rényi graphs look like power laws, though the researchers are quick to say that they don't believe that is the case for the Internet. "We believe it follows a power law, but with a different exponent than what we are observing," Moore says.
The Internet Graph
The World Wide Web--the network of Web sites and the links between them--is the face of the Internet that everyone sees. Underlying that public face is the physical Internet: the conduits for information, the highways of copper and optical fiber through which routers, following the rules of Internet Protocol, direct Web traffic.
Like the Web, the physical Internet is a vast entity that is constantly growing without central control. Local Internet service providers and corporate and university networks control their own networks, which are attached to backbone networks operated by telephone companies like AT&T and Sprint. There are no accurate, publicly available maps of the connections within and between networks because much of that information is proprietary.
An accurate picture of the Internet is important, because researchers need models to test such things as the effects on traffic of new pricing mechanisms for ISPs or new routing protocols. A good model must not only give a static snapshot, but also approximate the way the Internet grows and changes over time. With dynamic models, researchers can forecast effects of network changes that will be seen in, say, six months or two years.
Without an explicit picture to work from, researchers modeling the Internet have had to resort to a clever sampling technique to get an approximate picture of the topology. Traceroute, a standard probing strategy, reveals the sequence of hops a packet takes when traveling from a source computer to a particular destination. Some network administrators disable the traceroute command, and paths shift over time in response to congestion and technical problems. Nonetheless, studies have shown that the routes packets take to their destinations are close to the shortest paths. By aggregating traceroute paths from a few sources to a representative sample of destinations, researchers thought that they had a reasonable approximation to the Internet graph that at least captured its large-scale characteristics.
The 1999 paper by the three Faloutsos brothers, which used single-source traceroute data collected four years earlier by Jean-Jacques Pansiot and Dominique Grad, computer scientists at the Université Louis Pasteur, reported that the in and out degrees of Internet nodes (either routers or networks) seemed to obey a power law. "The main message from our work was that the degrees are highly skewed," says Michalis Faloutsos. Following the Faloutsos paper, models of the Internet were changed to reflect a heavy-tailed degree distribution.
That the physical Internet would follow a power law was surprising to researchers, because a simple preferential attachment model could not explain it. In a 2002 paper, Alex Fabrikant, Elias Koutsoupias, and Christos Papadimitriou, of UC Berkeley, UCLA, and UC Berkeley, respectively, proposed a model they called Heuristically Optimized Tradeoffs, which builds on the Highly Optimized Tolerance model. The three researchers showed how the competing desires to minimize "last mile" costs and transmission delays can result in a heavy-tailed degree distribution for the Internet graph.
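The tradeoff at the heart of the Fabrikant–Koutsoupias–Papadimitriou model can be sketched as a growing tree in the unit square. This is a simplified illustration of the idea, not the paper's exact formulation: each new node, placed at random, attaches to the earlier node that minimizes a weighted sum of geometric distance (the "last mile" cost) and hop count to the root (a proxy for transmission delay). The tradeoff parameter alpha, the network size, and the seed are illustrative assumptions.

```python
import math
import random

random.seed(3)

def fkp_tree(n, alpha):
    """Tradeoff-driven growth: each new node, placed uniformly at
    random in the unit square, attaches to the earlier node minimizing
    alpha * (Euclidean distance) + (hops to the root)."""
    pts = [(random.random(), random.random()) for _ in range(n)]
    hops = [0]                # node 0 is the root
    deg = [0] * n
    for i in range(1, n):
        best = min(range(i), key=lambda j:
                   alpha * math.dist(pts[i], pts[j]) + hops[j])
        hops.append(hops[best] + 1)
        deg[i] += 1
        deg[best] += 1
    return deg

deg = fkp_tree(1000, alpha=10.0)
# For intermediate alpha, the tradeoff yields a heavy-tailed degree
# sequence: a few well-placed nodes serve a large share of newcomers.
print(max(deg))
```

For very small alpha the tree degenerates toward a star, and for very large alpha toward nearest-neighbor attachment; the heavy tail appears in the intermediate regime, which is what makes the model a candidate explanation for Internet-like topologies.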
Still, network researchers were unsure of the importance of such factors in determining the structure of the Internet. "There are many ways to generate networks with power law degree distributions, but it's not clear that any of them accurately model the Internet," Clauset says.
The Bias of Traceroute Sampling
Researchers were aware before 2003 that traceroute sampling was imperfect, and that resulting approximations to the Internet graph were likely to be biased. Still, the exact nature of the bias was unknown.
In 2003, four researchers from Boston University--Anukool Lakhina, John W. Byers, Mark Crovella, and Peng Xie--showed empirically that when graphs are sampled with traceroute methods, the measured degree distribution can differ sharply from that of the underlying graph. In particular, for a sparse Erdős–Rényi graph, the subgraph formed from traceroute samples from a small number of sources "can look remarkably like a power law." The BU researchers provided an intuitive explanation for the bias: Edges close to the root are more likely to be revealed, and, more important, high-degree vertices will be sampled with higher probability than low-degree ones. Their work was published in the Proceedings of IEEE INFOCOM.
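The bias is easy to reproduce in miniature. In the sketch below, a breadth-first-search tree from a single source stands in for the union of traceroute shortest paths, with randomized tie-breaking as a rough stand-in for route selection; the graph size and edge probability are illustrative. The true degrees of the Erdős–Rényi graph concentrate near the mean, but the degrees observed in the sampled tree peak at 1 and fall off slowly, because low-degree nodes are mostly discovered as leaves while high-degree nodes are reached early and often.

```python
import random
from collections import Counter, deque

random.seed(5)

def er_adjacency(n, p):
    """Adjacency lists of an Erdos-Renyi G(n, p) random graph."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def traceroute_degrees(adj, source=0):
    """Degrees seen by a single-source 'traceroute': one shortest path
    to every reachable node, approximated by a BFS tree with
    randomized tie-breaking."""
    n = len(adj)
    parent = [None] * n
    seen = [False] * n
    seen[source] = True
    queue = deque([source])
    while queue:
        u = queue.popleft()
        nbrs = adj[u][:]
        random.shuffle(nbrs)          # break shortest-path ties at random
        for v in nbrs:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                queue.append(v)
    tree_deg = [0] * n
    for v in range(n):
        if parent[v] is not None:
            tree_deg[v] += 1
            tree_deg[parent[v]] += 1
    return [d for d in tree_deg if d > 0]

adj = er_adjacency(1000, 0.01)        # true degrees concentrate near 10
observed = traceroute_degrees(adj)
true_hist = Counter(len(a) for a in adj)
obs_hist = Counter(observed)

# True distribution: bell-shaped, mode near 10. Observed distribution:
# mode at 1 with a slowly decaying tail -- a power-law-like shape.
print(true_hist.most_common(1), obs_hist.most_common(1))
```

Plotting the two histograms on log-log axes makes the effect vivid: the sampled degrees trace out a rough straight line even though the underlying graph has no hubs at all.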
In a 2005 paper in Physical Review Letters, Moore and Clauset made the empirical observations rigorous. The researchers showed analytically that the distributions of traceroute samples of Erdős–Rényi graphs do follow a power law, with an exponent of 1, even though the graphs' actual degree distribution is Poisson. To get their result, the researchers modeled a single-source traceroute sample as a spanning tree of the graph, using differential equations to model the growth of the tree. They also demonstrated empirically that for a graph with an underlying power law distribution in which edges far outnumber nodes, traceroute sampling can significantly underestimate the exponent.
Most recently, Clauset and Moore joined forces with David Kempe of the University of Southern California and Dimitris Achlioptas of Microsoft Research. Using sophisticated statistical techniques, they showed how to calculate the observed distributions under traceroute sampling for random graphs of nearly arbitrary connectivity. As an example, the group showed how traceroute sampling finds power laws in both Poisson and regularly distributed graphs. Their paper appeared in the Proceedings of the 2005 ACM Symposium on the Theory of Computing.
"What makes the recent work interesting is not simply the statement that traceroute sampling is biased, but that the authors can rigorously analyze the consequences of this bias," Kleinberg says. "That this inaccuracy systematically turns relatively uniform graphs into graphs that appear to have power-law degree distributions is striking, and it suggests the need for deeper investigation."
Using their most recent result, Clauset and Moore are able to explore the correspondence between underlying and observed distributions, without having to do computationally expensive simulations. So far, the only distribution they have looked at that would yield the observed exponent for the Internet is a power law, although the Internet graph could have a smaller exponent than what has been observed. What they hope to do is to invert their mathematical result, which would enable them to take the observed distribution and derive the family of underlying distributions that produce it. Then, using additional empirical data or simulations, they might be able to validate an estimate of the bias that is truly present in the observed distribution.
They also hope to determine the number of traceroute sources needed to ensure that the observed degree distribution is unbiased. They conjecture that the necessary number is somewhere between 10 and 50. "To our knowledge, the largest number in any published study is 12," Moore says.
Meanwhile, Kleinberg points out that models of the Web graph, too, would benefit from better measurement techniques. Web crawls yield only samples of pages, and those samples can be biased, because parts of the Web either have dynamic content or are inaccessible to crawlers. Developers of search algorithms need accurate Web models, he says, as do people seeking to ferret out efforts to influence search engine rankings. Web models can also serve as the basis for studies of how ideas spread and how Web sites rise out of obscurity and become popular.
Sara Robinson is a freelance writer based in Los Angeles, California.