A real data-based simulation procedure to select an imputation strategy for mixed-type trait data

bioRxiv (2022)

Abstract
Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Imputation offers an alternative to removing cases with missing values from datasets. Imputation techniques that incorporate phylogenetic information into their estimations have demonstrated improved accuracy over standard techniques. However, previous studies of phylogenetic imputation tools are largely limited to simulations of numerical trait data, with categorical data not evaluated. It also remains to be explored whether the type of genetic data used affects imputation accuracy. We conducted a real data-based simulation study to compare the performance of imputation methods using a mixed-type trait dataset (lizards and amphisbaenians; order: Squamata). Selected methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Known values were removed from a complete-case dataset to simulate different missingness scenarios: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each method (with and without phylogenetic information derived from mitochondrial and nuclear gene trees) was used to impute the removed values. The performances of the methods were evaluated for each trait and in each missingness scenario. A random forest method supplemented with a nuclear-derived phylogeny performed best overall, and this method was used to impute missing values in the original squamate dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to the complete-case data. However, phylogeny did not always improve performance for every trait and in every missingness scenario, and caution should be taken when imputing trait data, particularly in cases of extreme bias.
Ultimately, these results support the use of a real data-based simulation procedure to select a suitable imputation strategy for a given mixed-type trait dataset. Moreover, they highlight the potential biases that complete-case usage may introduce into analyses.

### Author summary

The issue of missing data is problematic in trait datasets as observations for rare or threatened species are often missing disproportionately. When only complete cases are used in an analysis, derived results may be biased. Imputation is an alternative to complete-case analysis and entails filling in the missing values using known observations. It has been demonstrated that including phylogenetic information in the imputation process improves accuracy of predicted values. However, most previous evaluations of imputation methods for trait datasets are limited to numerical, simulated data, with categorical traits not considered. Using a reptile dataset comprising both numerical and categorical trait data, we employed a real data-based simulation strategy to select an optimal imputation method for the dataset. We evaluated the performance of four different imputation methods across different missingness scenarios (e.g. missing completely at random, values missing disproportionately for smaller species). Results indicate that imputed data better reflected the original dataset characteristics compared to complete-case data; however, the optimal imputation strategy for a given scenario was contingent on missingness scenario and trait type. As imputation performance varies depending on the properties of a given dataset, a real data-based simulation strategy can be used to provide guidance on best imputation practices.

### Competing Interest Statement

The authors have declared no competing interest.
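
The simulation procedure described in the abstract (hide known values under MCAR, MAR, and MNAR, impute them, then score each method against the hidden truth) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a single numerical trait and one fully observed predictor trait, it compares only mean imputation with a simple k-nearest-neighbour rule, and it omits random forests, MICE, categorical traits, and phylogenetic information. All function names and the toy data are hypothetical.

```python
import math
import random
import statistics

def mask_mcar(n, frac, rng):
    """MCAR: hide a uniformly random subset of indices."""
    return set(rng.sample(range(n), round(frac * n)))

def mask_biased(driver, frac, rng):
    """Hide indices with probability weighted toward small values of `driver`.
    Driver = another observed trait gives MAR; driver = the trait itself gives MNAR."""
    n = len(driver)
    order = sorted(range(n), key=lambda i: driver[i])
    weight = {idx: n - rank for rank, idx in enumerate(order)}  # smallest value, largest weight
    candidates, hidden = list(range(n)), set()
    while len(hidden) < round(frac * n):
        pick = rng.choices(candidates, weights=[weight[i] for i in candidates])[0]
        hidden.add(pick)
        candidates.remove(pick)
    return hidden

def impute_mean(values, hidden, predictor):
    """Fill every hidden value with the mean of the observed values."""
    fill = statistics.fmean(v for i, v in enumerate(values) if i not in hidden)
    return {i: fill for i in hidden}

def impute_knn(values, hidden, predictor, k=3):
    """Fill each hidden value with the mean of its k nearest observed
    neighbours, where distance is measured on the predictor trait."""
    observed = [i for i in range(len(values)) if i not in hidden]
    filled = {}
    for i in hidden:
        nearest = sorted(observed, key=lambda j: abs(predictor[j] - predictor[i]))[:k]
        filled[i] = statistics.fmean(values[j] for j in nearest)
    return filled

def rmse(values, filled):
    """Root-mean-square error between imputed values and the hidden truth."""
    return math.sqrt(statistics.fmean((values[i] - v) ** 2 for i, v in filled.items()))

def simulate(values, predictor, frac=0.2, reps=20, seed=0):
    """Average RMSE of each imputer in each missingness scenario."""
    rng = random.Random(seed)
    scenarios = {
        "MCAR": lambda: mask_mcar(len(values), frac, rng),
        "MAR": lambda: mask_biased(predictor, frac, rng),
        "MNAR": lambda: mask_biased(values, frac, rng),
    }
    imputers = {"mean": impute_mean, "knn": impute_knn}
    results = {}
    for s_name, make_mask in scenarios.items():
        for i_name, imputer in imputers.items():
            errors = [rmse(values, imputer(values, make_mask(), predictor))
                      for _ in range(reps)]
            results[(s_name, i_name)] = statistics.fmean(errors)
    return results

# Toy complete-case data: a trait that depends linearly on the predictor plus noise.
toy_rng = random.Random(1)
predictor = [toy_rng.uniform(0, 10) for _ in range(100)]
trait = [2.0 * x + toy_rng.gauss(0, 1) for x in predictor]

results = simulate(trait, predictor)
for (scenario, method), error in sorted(results.items()):
    print(f"{scenario:5s} {method:5s} RMSE = {error:.2f}")
```

Because every hidden value is known, the error of each method can be measured directly per scenario, which is the idea behind choosing an imputation strategy from the dataset's own complete cases. A fuller version would add one error measure for categorical traits and repeat the comparison per trait.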