Estimating sampling selection bias in human genetics: a phenomenological approach

Alfani, Guido; Rossi, Paolo
2015

Abstract

This research is the first systematic empirical attempt to calculate the various components of the statistical error associated with routinely used sampling strategies in human genetics. We reconstructed surname distributions of 26 Italian communities with different demographic features across the last six centuries (years 1447-2001). The degree of overlap between “reference founding core” distributions and the distributions obtained from sampling the present-day communities by probabilistic and selective methods was quantified under different conditions and models. When taking into account only one individual per surname (low kinship model), the average degree of error was 59.5%, with a peak error of 84% by random sampling. When multiple individuals per surname were considered (high kinship model), the average error decreased by 8-30% at the cost of a larger variance. Criteria aimed at maximizing monophyly and long-term residency appeared to be influenced by recent gene flows much more than expected. Selection of the more frequent family names following low kinship criteria proved to be a suitable approach only for stable communities. In all other cases, true random sampling, despite its high statistical error, did not return more biased estimates than other selective methods. Our results indicate that the sampling of individuals bearing historically documented surnames (founders’ method) should be applied to prevent an over-stratification of ancient and recent genetic components that heavily biases inferences and statistics.
Risso, Davide; Taglioli, Luca; De Iasio, Sergio; Gueresi, Paola; Alfani, Guido; Nelli, Sergio; Rossi, Paolo; Paoli, Giorgio; Tofanelli, Sergio
Files in this item:
File: PlosONE_2015_0140146.pdf
Access: open access
Description: article
Type: Publisher's PDF (publisher's layout)
License: Creative Commons
Size: 816.15 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11565/3985829