GeoSVM: an efficient and effective tool to predict species' potential distributions

JOURNAL OF PLANT ECOLOGY(2008)

Abstract
Patterns of species distribution have long been an important topic of ecological study (Brown and Lomolino 1998). In this brief communication, we introduce a new program, GeoSVM, that uses a support vector machine (SVM) to predict species’ potential distributions. (GeoSVM is now available at http://www.unm.edu/;wyzuo/GEO.htm.) We also report the results of our evaluation of GeoSVM’s performance. We used data for 30 species of Rhododendron in China as a case study to compare GeoSVM with the Genetic Algorithm for Rule-Set Prediction (GARP), one of the most popular models for predicting species’ potential distributions. We found that GeoSVM is more accurate and efficient than GARP. Furthermore, GeoSVM can handle more environmental information, which significantly improves prediction accuracy.

Patterns of species distribution can potentially answer many fundamental questions in ecology: where the original habitats of a species are; how species are distributed on earth; how species achieve their distribution patterns; how the distribution patterns of different species relate to one another; and how to set up policies to conserve endangered species. The development of computer technology and machine learning methods enables the use of environmental factors to simulate species’ potential distributions.

Various statistical models have been explored in previous work for predicting species distributions, e.g. generalized linear models, generalized additive models, logistic regression, neural networks, decision trees, principal components analysis (PCA), Mahalanobis distance, the maximum entropy method, genetic algorithms and regression tree analysis (see the survey in Zuo et al. 2007). These statistical models are commonly used in a wide range of other applications. However, when they are applied to the prediction of potential species distributions, a common problem arises: high dimensionality combined with small sample size.
This problem is caused by the nature of the task: the prediction of potential species distributions generally depends on specimen data, which are accumulated through fieldwork. Fieldwork, being an expensive and difficult process, limits the quantity of data available. China has >400 species of Rhododendron, but only 161 of them have >20 location samples (the lower limit of sample size for GARP). On the other hand, >100 environmental factors can potentially affect species distribution, including meteorological factors (annual, monthly, maximum and minimum values of temperature, precipitation and relative humidity) and geographical factors (altitude, slope, soil type and vegetation type). Most statistical methods rely on the large-sample assumption that ‘the number of samples is much larger than the number of parameters’. This assumption does not hold for species distribution data. In this situation, these models usually perform well on training samples but badly on new testing data, a phenomenon called ‘over training’ (also known as overfitting). Some dimension-reducing methods, such as PCA, can mitigate this problem, but only to some extent.

SVM is a model for classification and regression based on statistical learning theory, created by Vapnik (1995) at AT&T Bell Labs. It is based on the structural risk minimization principle, an improvement over the traditional empirical risk minimization principle. Because of its outstanding empirical performance, SVM has been well accepted by many scientific communities (Gunn 1998).

We implemented a system for predicting potential species distributions, called GeoSVM, based on SVM. The detailed system architecture of GeoSVM is described in Zuo et al. (2007). First, GeoSVM randomly generates negative sample points, five times as many as the positive ones. GeoSVM assumes that the species does not exist at the negative sample points.
A weight of 1/5 is given to each negative sample and a weight of 1 to each positive sample, so the negative samples together carry the same total weight as the positive ones. Environmental features are extracted from the environmental digital map based on the training samples’ locations. These environmental
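The sampling and weighting scheme described above can be illustrated with a minimal sketch. It is not GeoSVM's actual code: the function name `make_training_set`, the rectangular bounding box, the uniform sampling of negative points and the fixed random seed are all assumptions made for illustration. Only the 5:1 ratio and the weights 1 and 1/5 come from the text.

```python
import random

def make_training_set(presences, bbox, ratio=5, seed=0):
    """Build a weighted training set from presence locations.

    presences: list of (lon, lat) points where the species was recorded.
    bbox: (min_lon, min_lat, max_lon, max_lat); a hypothetical study area.
    ratio: negatives generated per positive (5 in the text).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    min_lon, min_lat, max_lon, max_lat = bbox

    # Randomly generated negative points; the species is assumed absent there.
    n_neg = ratio * len(presences)
    negatives = [
        (rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
        for _ in range(n_neg)
    ]

    points = list(presences) + negatives
    labels = [1] * len(presences) + [-1] * n_neg
    # Weight 1 per positive and 1/ratio per negative, so both classes
    # carry the same total weight.
    weights = [1.0] * len(presences) + [1.0 / ratio] * n_neg
    return points, labels, weights

# Made-up presence coordinates, for illustration only.
presences = [(100.2, 26.9), (99.7, 27.8), (101.5, 25.1), (98.9, 28.3)]
points, labels, weights = make_training_set(presences, (97.0, 21.0, 106.0, 29.5))
```

In a full pipeline, the environmental features at each point would replace the raw coordinates, and the weights would be passed to an SVM implementation that accepts per-sample weights.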