Estimating sampling selection bias in human genetics: a phenomenological approach

Risso, Davide; Taglioli, Luca; De Iasio, Sergio; Gueresi, Paola; Alfani, Guido; Nelli, Sergio; Rossi, Paolo; Paoli, Giorgio; Tofanelli, Sergio

doi:10.1371/journal.pone.0140146

This research is the first systematic empirical attempt to calculate the various components of the statistical error associated with routinely used sampling strategies in human genetics. We reconstructed surname distributions of 26 Italian communities with different demographic features across the last six centuries (years 1447-2001). The degree of overlap between “reference founding core” distributions and the distributions obtained by sampling the present-day communities with probabilistic and selective methods was quantified under different conditions and models. When only one individual per surname was taken into account (low kinship model), the average degree of error was 59.5%, with a peak error of 84% under random sampling. When multiple individuals per surname were considered (high kinship model), the average error decreased by 8-30% at the cost of a larger variance. Criteria aimed at maximizing monophyly and long-term residency appeared to be influenced by recent gene flows much more than expected. Selection of the more frequent family names following low kinship criteria proved to be a suitable approach only for stable communities. In all other cases, true random sampling, despite its high statistical error, did not return more biased estimates than other selective methods. Our results indicate that the sampling of individuals bearing historically documented surnames (founders’ method) should be applied to prevent an over-stratification of ancient and recent genetic components that heavily biases inferences and statistics.
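The comparison between a founding core and a present-day sample can be illustrated with a minimal Python sketch. It is not the authors' procedure: the error definition used here (the share of sampled individuals whose surname is absent from the founding core), the function names, and the toy surnames are all assumptions made for illustration, and the numbers it produces bear no relation to the figures reported above.

```python
import random


def low_kinship_sample(population, rng):
    """Keep one randomly chosen individual per surname (low kinship model)."""
    by_surname = {}
    for person_id, surname in population:
        by_surname.setdefault(surname, []).append(person_id)
    # Sort surnames so the result order does not depend on insertion order.
    return [(rng.choice(ids), surname) for surname, ids in sorted(by_surname.items())]


def sampling_error(sample, founding_core):
    """Assumed error measure: fraction of sampled individuals whose
    surname does not appear in the founding core."""
    if not sample:
        raise ValueError("sample must not be empty")
    misses = sum(1 for _, surname in sample if surname not in founding_core)
    return misses / len(sample)


# Hypothetical toy community: (individual id, surname) pairs.
population = (
    [(i, "Alpha") for i in range(6)]
    + [(6, "Beta"), (7, "Beta"), (8, "Gamma"), (9, "Delta")]
)
# Hypothetical set of historically documented (founding) surnames.
founding_core = {"Alpha", "Beta"}

rng = random.Random(0)
one_per_surname = low_kinship_sample(population, rng)

# High kinship model: every individual counts, so common surnames weigh more.
high_kinship_error = sampling_error(population, founding_core)
# Low kinship model: each surname counts once.
low_kinship_error = sampling_error(one_per_surname, founding_core)

print(high_kinship_error)  # 0.2
print(low_kinship_error)   # 0.5
```

In this toy case the two recent surnames make up 2 of 10 individuals but 2 of 4 surnames, so counting one individual per surname gives the higher error. Under the same assumptions, a founders' method sample would correspond to filtering `population` to surnames in `founding_core` before sampling.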