Automated Learning and Data Visualization
Automated numeric methods of data mining, statistics, and machine learning adapt themselves to systematic patterns in data to carry out
predictive tasks, or to describe the patterns in a way that provides fundamental understanding. Data visualization is critical in all phases of the analysis of data, from the moment of arrival when data checking and cleaning are needed, to the final presentation of results. Visualization allows us to learn which patterns occur out of an immensely broad collection of possible patterns; it is difficult to select and carry out, a priori, automated learning methods that cover nearly as broad a collection of possibilities.
It is widely accepted that an effective knowledge of patterns is necessary for fundamental understanding. But the knowledge can be of
immense benefit for predictive tasks as well, because it gives us valuable information about which automated numeric methods will likely produce the best performance. Selecting the best automated methods by trying a number of them in a training-test framework runs the risk of simply finding the best among a collection of poor performers. So visualization supports the automated methods. But the reverse is true,
too. It is difficult to make progress just displaying raw data without the benefit of automated methods that provide fits to patterns, which are then displayed, and provide displays of remaining variation in the data after adjusting for the fits. Automation and visualization are symbiotic.
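As a minimal sketch of this interplay, the Python below fits a pattern and then computes the variation that remains after adjusting for the fit. The abstract does not name a fitting method; the least-squares line, the function names, and the data are all assumptions chosen for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares line y = a + b*x (an illustrative choice of fit)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def fit_and_residuals(x, y):
    """Return fitted values and residuals (data minus fit)."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

# Made-up data with curvature that a straight line cannot capture.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

fitted, residuals = fit_and_residuals(x, y)

# The residuals are positive at both ends and negative in the middle.
# Displaying them against x exposes curvature that the fit missed.
for xi, ri in zip(x, residuals):
    print(f"x={xi}  residual={ri:+.1f}")
```

A display of the residuals against x, rather than of the raw data alone, is what shows that the straight-line fit is inadequate here.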
Today, an immense challenge to data visualization, as it is to all technical areas of data analysis, is the rapid expansion in the size
and complexity of datasets. This should not weaken our commitment to understanding patterns in data, but it does require new frameworks for how we approach data visualization.
One such framework is the visualization database. For a single complex dataset, it consists of a large number of displays, many of which
consist of many pages. The displays become a new database that is queried and studied on an as-needed basis. Producing, managing,
and viewing a visualization database require many new ideas. For example, methods are needed for view selection to populate the database
when the number of views can be millions or more. Examples are statistical sampling methods that find a representative collection
of views, and automation algorithms that find interesting views by searching for certain patterns.
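The two kinds of view selection mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: each candidate view is assumed to carry one numeric summary, the stratified sampling scheme and the "largest absolute summary" interest score are hypothetical choices, and none of the names come from the original work.

```python
import random

def sample_representative_views(view_stats, n_strata=5, per_stratum=2, seed=0):
    """Stratified sample of views.

    view_stats maps a view id to a numeric summary (hypothetical).
    Views are ordered by the summary, cut into n_strata equal-sized
    groups, and per_stratum views are drawn at random from each group,
    so the sample spans the whole range of the summary.
    """
    rng = random.Random(seed)
    ordered = sorted(view_stats, key=view_stats.get)
    n = len(ordered)
    chosen = []
    for s in range(n_strata):
        stratum = ordered[s * n // n_strata:(s + 1) * n // n_strata]
        k = min(per_stratum, len(stratum))
        chosen.extend(rng.sample(stratum, k))
    return chosen

def top_interesting_views(view_scores, k=3):
    """Return the k views with the highest interest score."""
    return sorted(view_scores, key=view_scores.get, reverse=True)[:k]

# Made-up example: 1000 candidate views, each with a simulated summary.
data_rng = random.Random(1)
view_stats = {f"view_{i}": data_rng.gauss(0, 1) for i in range(1000)}

# Representative collection: 2 views from each of 5 strata.
selected = sample_representative_views(view_stats)

# "Interesting" views: here, those with the most extreme summaries.
extreme = top_interesting_views({v: abs(s) for v, s in view_stats.items()})
```

Sampling keeps the database small while still covering the range of behavior, whereas the search surfaces unusual views that a random sample would likely miss.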
William S. Cleveland, Purdue University