Complex systems: Unzipping Zipf's law.

Nature (2011)

Abstract
Perhaps the only thing more abundant in both natural and man-made systems than power laws are the models that have been developed to explain them. Writing in the New Journal of Physics, Baek et al. [1] argue that because such models depend on the specifics of each system, they fail to capture the shared cause of this regularity. The authors instead propose a general model that can be applied to any division of items into groups, and that can, for example, account for Zipf's law of word frequencies in text, the popularity of last names, and city and county populations.

Scientists have been captivated by power laws with reason. Whereas other probability distributions invariably bend on log–log scales, the power law continues as a perfectly straight line over as many orders of magnitude as the system size allows. A power-law distribution, in special cases referred to as Zipf's law or a Pareto distribution, specifies that the probability of observing an item of size k is proportional to k^(−α), with α being typically between 1 and 3. The implications of the distribution are even more striking than its heavy-tailed shape: there are a few mega-cities but many small towns; a small number of individuals hold a substantial fraction of the total wealth; and there are roughly 2.5 million Smiths in the United States, whereas most last names are uncommon. In fact, a heavy-tailed distribution of sizes tends to hold for a wide range of systems in which items are assigned to bins: species to genera, readers to books, visitors to websites, written words to vocabulary (Fig. 1), and even casualties to wars [2].

Various models have been proposed to explain one or several of the observed power laws. Two main criticisms are commonly aimed at such models. First, many distributions deviate, at least slightly, from a straight line on a log–log scale. Often the deviation is an exponential cut-off in the tail of the distribution and is not captured by the model.
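The first criticism can be illustrated numerically. The sketch below compares the local log–log slope of a pure power law with that of the same law multiplied by an exponential cut-off. The exponent `ALPHA`, the cut-off scale `K_CUT` and the function names are assumptions chosen for illustration; they are not taken from any of the models discussed here.

```python
import math

ALPHA = 2.0      # assumed exponent, inside the typical 1 to 3 range
K_CUT = 1000.0   # assumed cut-off scale (hypothetical value)

def pure_power_law(k):
    # Unnormalised p(k) proportional to k^(-ALPHA)
    return k ** (-ALPHA)

def cut_off_power_law(k):
    # The same power law multiplied by an exponential cut-off exp(-k / K_CUT)
    return k ** (-ALPHA) * math.exp(-k / K_CUT)

def local_slope(pmf, k):
    # Slope of log p against log k, measured between k and 2k
    return (math.log(pmf(2 * k)) - math.log(pmf(k))) / math.log(2)

for k in (1, 10, 100, 1000):
    print(k,
          round(local_slope(pure_power_law, k), 3),
          round(local_slope(cut_off_power_law, k), 3))
```

The pure power law keeps a slope of −ALPHA at every scale, whereas the cut-off version steepens as k approaches `K_CUT`, which is the bend that a straight-line model misses.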
Second, models tend to contain system-specific elements that limit their generalizability. Early pursuits of more general models were undertaken by, among others, Herbert Simon, who wrote [3]: “No one supposes that there is any connection between horse-kicks suffered by soldiers in the German army and blood cells on a microscopic slide other than that the same urn scheme provides a satisfactory abstract model for both phenomena.”

The urn model proposed by Simon is related to other preferential-attachment growth models, also known as cumulative-advantage or 'rich-get-richer' processes. Yule developed [4] one of the oldest such models, proposing that genera grow in proportion to the number of species they contain, by assuming that each species has an equal likelihood of generating a speciation event. Whereas preferential-attachment models continue to be used to explain power-law distributions observed in various contexts [5, 6], some power laws prompt different explanations. For example, Zipf's distribution of word frequencies can result from a principle of least effort [7, 8] in the evolution of language, or even from random sequences of letters and spaces [9]. This leaves open the possibility that there is a more general, global explanation of power laws that is independent of system-specific details.

Just such an explanation has been developed by Baek and colleagues [1]. Their random group formation (RGF) model is built on the only common feature among all the systems modelled: that M items are divided among N groups. Entropy is maximized when an item is equally likely to be found at any of M 'addresses' across the groups. Next, one derives a distribution of group sizes that minimizes the amount of information needed to locate an item knowing only the size of the group to which it belongs.
This objective, in addition to the constraints of total number of groups and items, and the maximum group size, is sufficient to derive the RGF function, a power-law distribution of group sizes with an exponential cut-off.

There are several remarkable aspects to this finding. The RGF function closely fits observed group size distributions without incorporating any knowledge of system-specific dynamics. In contrast to previous models, which would typically tune their parameters by fitting the empirically observed distribution, the RGF model requires no tuning. The power-law exponent in the RGF function is given directly once one specifies M, N and the maximum observed group size. Furthermore, the exponential cut-off observed in empirical data is an essential component of the RGF model, rather than a correction introduced to fit the data. The RGF model just as easily fits word-frequency distributions representing entire books as it fits random subsamples of the same texts, something that alternative models generally cannot do. Finally, the approach is flexible enough to incorporate system-specific constraints, as needed.

The work of Baek and colleagues [1] may be the first to provide a truly general explanation of the prevalence of power-law distributions in frequency counts. But it is not yet ready to replace other models entirely. For many, if not all, systems the intuition behind the assumption that one wishes to minimize the information cost of locating an item needs to be further developed. By contrast, growth models usually integrate intuition about a system's evolution. Furthermore, the power-law exponents produced by the RGF model in some cases differ from those estimated previously using maximum-likelihood fits to data [2]. Nevertheless, by deriving power-law distributions from very general system-independent principles, Baek et al. have raised the bar for other models.
A model purporting to explain a power-law distribution should be as general as Baek and colleagues' model, or it should be able to reproduce additional features of the system it models, beyond the familiar straight line on a log–log plot.
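For comparison with the RGF approach, the 'rich-get-richer' growth process described earlier can be simulated in a few lines. What follows is a minimal sketch in the spirit of Simon's urn scheme, not a reproduction of any published implementation. The function name `simon_urn`, the new-group probability of 0.1 and the number of items are all assumptions made for illustration.

```python
import random
from collections import Counter

def simon_urn(num_items, new_group_prob, seed=0):
    """Grow groups one item at a time under a rich-get-richer rule.

    Returns a list holding the final size of each group.
    """
    rng = random.Random(seed)
    membership = [0]   # membership[i] is the group id of item i
    num_groups = 1     # start with a single item in group 0
    while len(membership) < num_items:
        if rng.random() < new_group_prob:
            # The new item founds a new group
            membership.append(num_groups)
            num_groups += 1
        else:
            # Copying the group of a uniformly chosen existing item picks
            # a group with probability proportional to its current size
            membership.append(rng.choice(membership))
    return list(Counter(membership).values())

sizes = simon_urn(num_items=20000, new_group_prob=0.1)
histogram = Counter(sizes)   # histogram[k] is the number of groups of size k
for k in (1, 2, 4, 8, 16, 32):
    print(k, histogram.get(k, 0))
print("groups:", len(sizes), "largest:", max(sizes))
```

Most groups stay very small while a few grow very large, the heavy-tailed pattern discussed above. Unlike the RGF function, the exponent here depends on a tunable parameter (`new_group_prob`), and the sketch makes no attempt to reproduce an exponential cut-off.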
Keywords
Physics