Estimating sampling selection bias in human genetics: a phenomenological approach
ALFANI, GUIDO; ROSSI, PAOLO
2015
Abstract
This research is the first systematic empirical attempt to calculate the various components of the statistical error associated with routinely used sampling strategies in human genetics. We reconstructed surname distributions of 26 Italian communities with different demographic features across the last six centuries (years 1447-2001). The degree of overlap between “reference founding core” distributions and the distributions obtained from sampling the present-day communities by probabilistic and selective methods was quantified under different conditions and models. When taking into account only one individual per surname (low kinship model), the average degree of error was 59.5%, with a peak error of 84% by random sampling. When multiple individuals per surname were considered (high kinship model), the average error decreased by 8-30% at the cost of a larger variance. Criteria aimed at maximizing monophyly and long-term residency appeared to be influenced by recent gene flow much more than expected. Selection of the more frequent family names following low kinship criteria proved to be a suitable approach only for stable communities. In all other cases, true random sampling, despite its high statistical error, did not return more biased estimates than other selective methods. Our results indicate that the sampling of individuals bearing historically documented surnames (founders’ method) should be applied to prevent an over-stratification of ancient and recent genetic components that heavily biases inferences and statistics.

File | Size | Format |
---|---|---|
PlosONE_2015_0140146.pdf (open access; description: article; type: publisher's PDF, publisher's layout; license: Creative Commons) | 816.15 kB | Adobe PDF |