Complex systems approach to natural language

PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS (2024)

Abstract
The science of complexity aims to answer the question of what rules nature chooses when assembling the basic constituents of matter and energy into structures and dynamical patterns that cascade through the entire hierarchy of scales in the Universe. A related phenomenon, natural language, can successfully mirror such structures, as reflected by its ability to encode and transmit information about them and among them. It is thus legitimate to expect that natural language carries the essence of complexity. Indeed, in human speech and writing it is particularly true that more is different. Natural language thus deserves a central place in the related quantitative study within the science of complexity. With this in mind, the present review summarizes the main methodological concepts used in this domain and documents their applicability and utility in identifying universal as well as system-specific features of natural language in its written representation in several major Western languages. In particular, three main complexity-related current research trends in quantitative linguistics are exhaustively covered. The first part addresses the issue of word frequencies in texts and, in particular, demonstrates that taking punctuation into consideration largely restores the scaling whose violation in Zipf's law for the most frequent words is commonly modelled by the so-called Mandelbrot correction. The second part introduces methods inspired by time series analysis, used in studying various kinds of long-range correlations in written texts. The related time series are generated by partitioning a text into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems: the presence of long-range correlations along with fractal or even multifractal structures.
Moreover, it appears that the distances between consecutive punctuation marks follow, quite universally across languages, the discrete variant of the Weibull distribution, which often appears in survival analysis. In the third part, the application of the network formalism to natural language is reviewed, particularly in the context of word-adjacency networks whose structure reflects word co-occurrence in texts. Various parameters characterizing the topology of such networks can be used for classification of texts, for example, from a stylometric perspective. The network approach can also be applied in semantic analysis to represent a hierarchy of words and associations between them based on their meaning. The structure of such networks turns out to be significantly different from that observed in random networks, revealing genuine properties of language. Finally, punctuation appears to have a significant impact not only on the language's information-carrying ability but also on its key statistical properties, hence it seems advisable to treat punctuation marks on a par with words. (c) 2023 Elsevier B.V. All rights reserved.
Keywords
Natural language, Complexity, Power laws, Fractals, Complex networks, Punctuation
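
The abstract's claim about distances between consecutive punctuation marks can be made concrete with a minimal sketch. It is not the review's own code and rests on several assumptions: the distance is taken to be the number of words between two consecutive punctuation marks, the set of marks is limited to `. , ; : ! ?`, tokenization is a crude regular expression, and the discrete Weibull distribution is written in its common type-I form, P(k) = (1 - p)^((k - 1)^beta) - (1 - p)^(k^beta) for k = 1, 2, ... The names `punctuation_intervals` and `discrete_weibull_pmf` are hypothetical.

```python
import re

# Assumed set of punctuation marks; the review may use a different one.
PUNCTUATION = set(".,;:!?")


def punctuation_intervals(text):
    """Return the word counts between consecutive punctuation marks.

    Simplifications: tokens are runs of word characters, empty intervals
    (two adjacent marks such as "?!") are skipped, and words after the
    last punctuation mark are ignored.
    """
    tokens = re.findall(r"\w+|[.,;:!?]", text)
    intervals = []
    count = 0
    for token in tokens:
        if token in PUNCTUATION:
            if count > 0:
                intervals.append(count)
            count = 0
        else:
            count += 1
    return intervals


def discrete_weibull_pmf(k, p, beta):
    """Type-I discrete Weibull probability of an interval of length k >= 1."""
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


if __name__ == "__main__":
    sample = "Hello, world. This is a test, indeed!"
    print(punctuation_intervals(sample))
    # Probabilities for k = 1..5 under illustrative (not fitted) parameters.
    print([round(discrete_weibull_pmf(k, 0.2, 1.5), 4) for k in range(1, 6)])
```

With beta = 1 the formula reduces to the geometric distribution, so in this parameterization beta measures how far the punctuation intervals depart from a memoryless process. Fitting p and beta to a real text would require an estimation step (for example maximum likelihood) that is omitted here.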