The Challenges of Knowledge Discovery in Adversarial Settings
September 24, 2008
Antonio Badia and David Skillicorn
Most work in knowledge discovery or data mining is an attempt to maximize the fit between a model and some, often large, set of data. In certain settings, such as counterterrorism and law enforcement, or in the search for signs of fraud, money laundering, or corporate malfeasance, this strategy is ineffective and perhaps even dangerous. These settings are adversarial---some group is actively trying to conceal its traces from analysis. For those doing the analysis, the greatest interest might lie in the places where the fit between the model and the data is poor. Adversaries who know that the analysis strategy is to maximize the model fit can exploit this to conceal their traces, or cause the error rates to become so large that the analysis is discarded as unusable. Minimizing mean-squared error, for example, is a statistically principled way to fit a model to data, but it allows a single carefully chosen data record to alter the model completely.
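The fragility of mean-squared-error fitting can be seen in a minimal sketch (our illustration, with invented values): an ordinary least-squares line fit to clean data, then refit after a single adversarially placed, high-leverage record is added.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=20)  # clean data near y = 2x + 1

def ols_slope(x, y):
    """Slope of the least-squares line, the minimizer of mean-squared error."""
    A = np.vstack([x, np.ones_like(x)]).T
    slope, _ = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope

clean = ols_slope(x, y)

# One carefully chosen record: far from the other x values (high leverage)
# and deliberately inconsistent with the trend.
x_bad = np.append(x, 1000.0)
y_bad = np.append(y, -1000.0)
poisoned = ols_slope(x_bad, y_bad)

print(f"slope on clean data:    {clean:.2f}")   # close to the true slope, 2
print(f"slope after one record: {poisoned:.2f}")  # dragged far from 2
```

One record out of twenty-one flips the sign of the fitted slope, which is exactly the vulnerability the text describes.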
Adversarial knowledge discovery is different from and harder than mainstream knowledge discovery. Opponents may try to prevent the collection of data; they may provide wrong data (noise or, worse still, data that will explicitly mislead the analysis algorithms), or they may act, as much as possible, to make data about them look similar to that of others (much easier for an adversary who is aware of the analysis being done). Most knowledge-discovery tools deal with this possibility only indirectly, as they try to handle the incomplete or noisy nature of some data.
Knowledge-discovery techniques that look at the relationships among objects (people, places, things) are attractive because properties are emergent, depending on both the individual objects and the connections between them. This makes it much harder for an adversary to manipulate the analysis process. The relationship between any two objects depends, potentially, on all the other objects and their interconnections. So although local manipulations may have predictable local effects, their global effects are difficult to determine. Local affinities between connected neighboring nodes can be extended to global affinities in ways that take into account the existence of multiple paths between nodes---influence and communication are naturally modeled this way.
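One standard way to extend local affinities to global ones over multiple paths is the Katz index, which sums the contributions of walks of every length, discounted by length. This is a sketch on an invented toy graph, not a method named in the article:

```python
import numpy as np

# Toy undirected graph: triangle 0-1-2 with node 3 hanging off node 2.
# Nodes 0 and 3 share no edge, yet are linked through indirect paths.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

beta = 0.1  # must stay below 1 / (largest eigenvalue of A) for convergence
n = A.shape[0]

# Katz index: sum over k >= 1 of beta^k * A^k, in closed form.
katz = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

print(katz[0, 3])  # positive, even though A[0, 3] is zero
```

The global affinity between 0 and 3 depends on every path through the rest of the graph, so a local manipulation (adding or deleting one edge elsewhere) shifts it in ways that are hard for an adversary to predict.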
Analysis of graph or relational data, often called link analysis, is a branch of network (or graph) theory that concentrates on what the edges of a network reveal about the properties of the graph as a whole and the vertices within it. Link analysis explores the nature of the relationships that different links represent, the reliability of apparent links, the probability that hidden links might actually exist, and the inferences that links of different types support. It has a history of use in adversarial settings. Insurance and credit card companies have used it to combat fraud. Law enforcement agencies have also used it. Link analysis is a natural choice when surveillance reveals local connections between individuals, places, and phone numbers; these ties between individuals, and of individuals to places and times, are often extremely revealing. As an example of an adversarial activity, consider a group of people who agree to fake a car accident, submit a claim, and obtain a benefit from an insurance company. If analysis reveals that they are not random individuals, but rather are connected to each other in some way, there is a good chance that the claim is fraudulent; such connections are not usually present in the case of a genuine accident.
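The insurance example can be sketched as a simple reachability check over known ties: are the claimants on one claim all connected to each other? Names and ties below are invented for illustration.

```python
from collections import defaultdict, deque

ties = [("ann", "bob"), ("bob", "carl"), ("dina", "ed")]  # observed links
claimants = {"ann", "carl"}  # parties to the same "accident"

graph = defaultdict(set)
for a, b in ties:
    graph[a].add(b)
    graph[b].add(a)

def connected(group, graph):
    """True if every member of group is reachable from every other via ties."""
    group = list(group)
    seen, queue = {group[0]}, deque([group[0]])
    while queue:  # breadth-first search from one member
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(p in seen for p in group)

print(connected(claimants, graph))  # ann and carl are linked through bob
```

Here ann and carl never interact directly, but a shared acquaintance connects them, which is the kind of tie a genuine accident between strangers would not produce.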
The Web can be an adversarial setting: Ranking algorithms like the well-known Google PageRank are based on link analysis, where the links are hyperlinks from a page to others. People (search engine spammers) who try to improve a site's ranking by giving the impression that it is better connected to the whole Web than it actually is are acting as adversaries. One technique used by these spammers consists of creating many pages, all of them linking to a page whose ranking is to be improved. Because such pages can have a significant impact only if they, in turn, have many pages pointing at them, spammers create networks of pages pointing at each other. Most search engines perform link analysis to discover such networks; when detected, the guilty pages are punished by being ignored.
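The effect of such a farm can be seen in a small power-iteration PageRank over an invented toy web (this is a generic PageRank sketch, not Google's actual algorithm): spam pages that point at one another and at a target page lift the target above an honest page.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [out-links]}."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - d) / n)  # teleport mass
        for v, outs in adj.items():
            if outs:
                share = d * r[idx[v]] / len(outs)
                for w in outs:
                    new[idx[w]] += share
            else:  # dangling node: spread its mass uniformly
                new += d * r[idx[v]] / n
        r = new
    return {v: r[idx[v]] for v in nodes}

# Honest pages a, b, c link among themselves; spam pages s1..s3 form a
# farm that points at "target" and at each other.
web = {
    "a": ["b"], "b": ["c"], "c": ["a"],
    "target": ["s1"],
    "s1": ["target", "s2"], "s2": ["target", "s3"], "s3": ["target", "s1"],
}
ranks = pagerank(web)
print(ranks["target"] > ranks["a"])  # the farm boosts the target
```

The spam pages themselves carry little rank from outside, yet by pooling and recirculating it they raise the target's score, which is why search engines look for densely interlinked networks of this shape.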
Existing techniques for link analysis have not been adapted for the adversarial case, in which many interesting and difficult problems arise. Google, for example, struggles constantly to remove Web spam from its search results. Researchers in the field of social network analysis have developed ways to understand social groups via their connections but are unable to distinguish, say, Osama bin Laden, with connections to only a few trusted lieutenants, from a hanger-on who may have the same local connection pattern.
In general, structures that are common in graph data represent normality and so are not very interesting in an adversarial setting. Of course, not every abnormal structure in a graph represents traces of adversarial activity---but such structures are a good place to start looking. Most existing techniques for analyzing graph data emphasize regularities and commonalities. Approaches that emphasize the irregularities---which are more difficult to devise---are needed.
Visualization is an important way for humans to understand graph data. Most graph-drawing algorithms, however, try to emphasize the regularities of a graph, whereas in an adversarial setting the interest usually lies in the irregularities. Telephone call graphs contain many regularities, because most people make calls in similar ways, reinforced by social mediation (calling at 3 A.M., say, is frowned on). A terrorist group may make fewer calls, and to a smaller set of numbers, than most ordinary callers. Existing graph-drawing algorithms push the irregularities of a graph into the corners of renderings; algorithms that place them front and center are needed.
Spectral graph techniques, which again tend to emphasize global regularities, have been used to find good ways to partition and cluster graphs using eigenvectors from one end of the spectrum, where eigenvector entries associate large regions that are structurally similar. Eigenvectors from the other end of the spectrum reveal graph nodes with unusual neighborhoods, and eigenvectors near the "middle" of the spectrum reveal unusual regions of other types.
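The low-end partitioning behavior can be sketched on an invented toy graph, using the unnormalized graph Laplacian and its second-smallest eigenvector (the Fiedler vector); this is a generic illustration of the technique, not a specific method from the article.

```python
import numpy as np

# Two triangles, {0, 1, 2} and {3, 4, 5}, joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = vecs[:, 1]            # eigenvector of the second-smallest eigenvalue

# Signs of the Fiedler vector split the graph into its two dense regions.
left = {i for i in range(n) if fiedler[i] < 0}
print(sorted(left))  # one of the two triangles
```

The signs of the Fiedler vector recover the two structurally similar regions; eigenvectors from the opposite end of the spectrum would instead single out nodes whose neighborhoods deviate from that regularity.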
Given the feedback loop or arms race between analysts and adversaries, new variants of standard approaches will have to be developed for knowledge discovery. As adversaries develop new ways to exploit weaknesses in certain approaches, analysts will have to look for ways to overcome the weaknesses. And as mainstream knowledge-discovery settings become more adversarial---if some people get better service from a business because of their attributes, I may be motivated to manipulate my attributes to become one of them---these new variants will have to become the standard way of doing things. The whole field of knowledge discovery has a lot to learn from a consideration of adversarial settings.
These problems are addressed at the Link Analysis, Counterterrorism, and Security Workshop, held each year at the SIAM International Conference on Data Mining, including this year's meeting, which was held in Atlanta in April. Proceedings of the workshop can be found at http://research.cs.queensu.ca/home/skill/lacts2008schedule.html.
Antonio Badia is director of the Database Laboratory in the Department of Computer Engineering and Computer Science, Speed Scientific School, at the University of Louisville. David Skillicorn is a professor in the School of Computing at Queen's University, an adjunct professor in the Department of Mathematics and Computer Science at the Royal Military College, and coordinator, Research in Information Security, Kingston (RISK), all in Kingston, Ontario, Canada.