A Word Count Statistic in Computational Biology

Michael Waterman
University of Southern California

Sequence comparison and database searching are among the most frequent and useful activities in computational biology and bioinformatics. The goal is to discover relationships between sequences and thus to suggest previously unknown biological features. As biological sequence databases grow, more efficient comparison methods are required to carry out the large number of comparisons. The statistic considered in this talk is based on the number of k-words common to two random sequences. Estimates of significance use both Poisson and normal approximations.
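
As a rough illustration of what a k-word count statistic can look like, the sketch below counts the matching k-word pairs between two sequences: every occurrence of a word in the first sequence is paired with every occurrence of the same word in the second. This is one common way to define such a count and is an assumption here, since the abstract does not give the exact definition used in the talk. The function names `kword_counts` and `shared_kword_matches` are hypothetical.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-word (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kword_matches(a, b, k):
    """Number of (position in a, position in b) pairs with identical k-words.

    Assumed definition: the sum, over all k-words w, of
    (occurrences of w in a) * (occurrences of w in b).
    """
    counts_a = kword_counts(a, k)
    counts_b = kword_counts(b, k)
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


if __name__ == "__main__":
    # "ACG" occurs twice in the first sequence and once in the second.
    print(shared_kword_matches("ACGTACG", "ACGG", 3))
```

Counting words with a hash table takes time roughly linear in the two sequence lengths, which is why word-based statistics are attractive for large database searches. Assessing significance, for example with the Poisson or normal approximations mentioned above, is a separate step that this sketch does not attempt.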

