nimCSO: A Nim package for Compositional Space Optimization
arXiv (2024)
Abstract
nimCSO is a high-performance tool implementing several methods for selecting
components (data dimensions) in compositional datasets that optimize data
availability and density for applications such as machine learning. Making this
choice is a combinatorially hard problem for complex compositions in
high-dimensional spaces because the components present in each datapoint are
interdependent. Such spaces are encountered, for instance, in materials science, where
datasets on Compositionally Complex Materials (CCMs) often span 20-45 chemical
elements, 5-10 processing types, and several temperature regimes, for up to 60
total data dimensions.
At its core, nimCSO leverages the metaprogramming ability of the Nim language
(nim-lang.org) to optimize itself at compile time, both in terms of speed
and memory handling, to the specific problem statement and dataset at hand
based on a human-readable configuration file. As demonstrated in this paper,
nimCSO reaches the physical limits of the hardware (L1 cache latency) and can
outperform an efficient native Python implementation by a factor of over 400 in
speed and 50 in memory usage (not counting the interpreter), while also
outperforming a NumPy implementation by factors of 35 and 17, respectively, when
checking a candidate solution.
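To make "checking a candidate solution" concrete, the following is a minimal Python sketch, not nimCSO's implementation. It assumes each datapoint is encoded as a bitmask of the components it contains and that a candidate solution is the set of components to remove; a datapoint is then lost whenever it shares at least one bit with the removed set. The function names, the element list, and the toy dataset are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


def encode(components: Iterable[str], index: Dict[str, int]) -> int:
    """Pack a collection of component names into an integer bitmask."""
    mask = 0
    for name in components:
        mask |= 1 << index[name]
    return mask


def prevented_count(datapoint_masks: List[int], removed_mask: int) -> int:
    """Count datapoints that contain at least one removed component."""
    return sum(1 for m in datapoint_masks if m & removed_mask)


# Hypothetical toy data: five elements, four datapoints.
elements = ["Fe", "Cr", "Ni", "Co", "Al"]
index = {name: i for i, name in enumerate(elements)}
dataset = [["Fe", "Cr", "Ni"], ["Co", "Ni"], ["Al", "Fe"], ["Cr", "Co"]]
masks = [encode(row, index) for row in dataset]

# Candidate solution: drop Al and Co from the compositional space.
removed = encode(["Al", "Co"], index)
lost = prevented_count(masks, removed)
print(lost, "of", len(masks), "datapoints lost")
```

In this toy case only the Fe-Cr-Ni datapoint survives, since every other datapoint contains Al or Co. The check reduces to one bitwise AND per datapoint, which is the kind of operation that benefits from a compact, cache-friendly data layout.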
nimCSO is designed to be both (1) a user-ready tool, implementing two efficient
brute-force approaches (for handling up to 25 dimensions), a custom search
algorithm (for up to 40 dimensions), and a genetic algorithm (for any
dimensionality), and (2) a scaffold for building even more elaborate methods in
the future, including heuristics going beyond data availability. All
configuration is done with a simple human-readable YAML config file and plain
text data files, making it easy to modify the search method and its parameters
with no knowledge of programming and only basic command line skills.
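As a rough illustration of what an exhaustive search over candidate solutions involves, the sketch below enumerates every way of removing k components and keeps the one that loses the fewest datapoints. It is a simplified stand-in under stated assumptions, not either of nimCSO's brute-force methods: the objective (minimizing lost datapoints for a fixed k), the function name `best_removal`, and the toy data are all hypothetical. The number of subsets of n components is 2^n, about 33.5 million for n = 25, which gives a sense of why exhaustive approaches are limited to moderate dimensionality.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def best_removal(
    dataset: Sequence[Sequence[str]],
    components: Sequence[str],
    k: int,
) -> Tuple[Tuple[str, ...], int]:
    """Try every k-component removal; return the one losing the fewest datapoints.

    Ties are resolved in favor of the first combination encountered.
    """
    rows = [frozenset(row) for row in dataset]
    best_combo: Tuple[str, ...] = ()
    best_lost = len(rows) + 1  # larger than any achievable loss
    for combo in combinations(components, k):
        removed = frozenset(combo)
        # A datapoint is lost if it contains any removed component.
        lost = sum(1 for row in rows if row & removed)
        if lost < best_lost:
            best_combo, best_lost = combo, lost
    return best_combo, best_lost


# Hypothetical toy data: five elements, four datapoints.
elements = ["Fe", "Cr", "Ni", "Co", "Al"]
dataset = [["Fe", "Cr", "Ni"], ["Co", "Ni"], ["Al", "Fe"], ["Cr", "Co"]]

combo, lost = best_removal(dataset, elements, k=2)
print("remove", combo, "-> datapoints lost:", lost)
```

Here removing Fe and Al costs two datapoints, while every other pair costs at least three. The search and genetic algorithms mentioned above address the same kind of objective without visiting every subset.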