Retrieval testing is difficult for several reasons. First, the document collections on which actual retrieval is carried out are so large that the creation of completely human-judged test sets is not possible. To address this, we have completed a study of a large collection in the area of molecular biology. For this large database the retrieval operation is simulated, and relative measures of performance are compared with absolute measures. The simulations are based on the McCarn-Lewis model equation. We find that the relative measure, as we have constructed it, is a sensitive indicator of absolute performance, but one that deviates slightly from the absolute measure in a systematic way that can, in some circumstances, be predicted from easily measurable properties of the retrieval method being tested.

The second factor that makes retrieval testing difficult is the lack of a model of actual retrieval performance, which makes it hard to assess the significance of test results. Classical significance tests rest on special assumptions, for example about the symmetry properties of the results obtained by different retrieval methods, and such assumptions are seldom satisfied. We have therefore investigated bootstrap methods and, based on these results, conclude that they provide a reliable and practical means of significance testing for retrieval performance results.

The third factor is the lack of reproducibility of human relevance judgements. We take the approach that relevance must therefore be understood as a probability. This allows a unique interpretation of the probability ranking principle. Our studies show that such an interpretation provides a stable approach to testing, one that places limits on performance and allows the comparison of human and machine performance on an equal footing.
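As a concrete illustration of how a bootstrap significance test of retrieval performance can be carried out, the sketch below resamples queries with replacement and asks how often the observed advantage of one method over another disappears. The per-query average-precision scores, the function name, and the number of resamples are illustrative assumptions, not values from our study.

```python
import numpy as np

# Hypothetical per-query average-precision scores for two retrieval
# methods evaluated on the same queries (illustrative values only).
scores_a = np.array([0.42, 0.55, 0.31, 0.68, 0.49, 0.60, 0.37, 0.52])
scores_b = np.array([0.39, 0.50, 0.35, 0.61, 0.47, 0.58, 0.30, 0.45])

def paired_bootstrap(a, b, n_resamples=10_000, seed=0):
    """Estimate how often a resampled set of queries would show method B
    doing at least as well as method A, given the paired score differences."""
    rng = np.random.default_rng(seed)
    diffs = a - b
    observed = diffs.mean()
    flips = 0
    for _ in range(n_resamples):
        sample = rng.choice(diffs, size=diffs.size, replace=True)
        if sample.mean() <= 0:          # advantage of A vanishes in this resample
            flips += 1
    return observed, flips / n_resamples

mean_diff, p = paired_bootstrap(scores_a, scores_b)
print(f"mean AP difference = {mean_diff:.3f}, bootstrap p-value ~ {p:.4f}")
```

The attraction of this style of test is that it uses only the empirical distribution of the per-query differences, so no symmetry or normality assumption about retrieval results is required.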
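The probabilistic view of relevance can be illustrated in the same spirit. In the following sketch, each document's relevance is treated as a probability (for instance, the fraction of judges who called it relevant), documents are ranked by that probability as the probability ranking principle prescribes, and precision at a cutoff is replaced by its expectation. All document identifiers and probability values are hypothetical.

```python
from typing import Dict, List

# Hypothetical relevance probabilities: the fraction of human judges who
# considered each retrieved document relevant to the query.
doc_relevance_prob: Dict[str, float] = {
    "d1": 0.90, "d2": 0.40, "d3": 0.75, "d4": 0.10, "d5": 0.60,
}

def expected_precision_at_k(ranking: List[str], probs: Dict[str, float], k: int) -> float:
    """Expected precision at cutoff k when relevance is a probability
    rather than a binary judgement."""
    top = ranking[:k]
    return sum(probs[d] for d in top) / k

# Probability ranking principle: order documents by decreasing
# probability of relevance.
prp_ranking = sorted(doc_relevance_prob, key=doc_relevance_prob.get, reverse=True)

for k in (1, 3, 5):
    print(k, round(expected_precision_at_k(prp_ranking, doc_relevance_prob, k), 3))
```

Because the expectation is taken over the judges' disagreement, the same computation can score a human ranker and a machine ranker on an equal footing, and the probabilities themselves bound the best expected precision any ranking can achieve.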