S-464 Automated Occupational Encoding to the Canadian National Occupation Classification using an Ensemble Classifier from TF-IDF and Doc2Vec Embeddings

Cesar Augusto Suarez Garcia, Anil Adisesh, Christopher J. O. Baker

Occupational and Environmental Medicine (2021)

Abstract
Introduction Occupational encoding is a technique that allows job titles provided by study participants to be categorized according to their role in the labor force. Encoding has primarily been a slow, error-prone manual process that is ripe for automation.
Objectives Our goal was to design and test an automated coding prototype using machine learning techniques.
Methods The prototype classification system ENENOC (the ENsemble Encoder for the National Occupational Classification) comprises a series of steps: data cleaning, exact match search, multi-classifier ensembling, hierarchical classification, and multiple output selection. When a job title input has no exact match among the NOC category descriptions, the input is embedded using the TF-IDF algorithm and Doc2Vec. The embeddings are fed into a hierarchical ensemble classifier that uses classical machine learning techniques: Random Forests, Support Vector Machine, and K-Nearest Neighbour. Ensemble encoding is achieved using a majority-voting system. In the hierarchical two-tier classification methodology, the first tier predicts the first digit of the NOC code, and the second tier predicts the second, third, and fourth digits of the NOC code for the input data. The combined approach produces a single 4-digit code as the top choice, as well as four alternate NOC codes that serve as additional ranked choices based on the Doc2Vec model.
Results The prototype was benchmarked on a manually annotated data set comprising 64,000 records. It produced a top-1 Per-Digit Macro F1-Score of 0.65 and a top-5 Per-Digit Macro F1-Score of 0.76, both of which fall within published accuracy ranges for manual coding (44% to 89% inter-annotator agreement). ENENOC coded 30,000 job titles in 3 hours.
Conclusion The ENENOC prototype is a sophisticated ENsemble Encoder for the National Occupational Classification that achieves state-of-the-art accuracy with significant speed improvements over manual coding.
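The majority-voting and two-tier steps described in the Methods can be sketched as follows. This is a minimal illustration, not the ENENOC implementation: the helper names (`majority_vote`, `keyword_classifier`, `encode_title`), the keyword-based stand-in classifiers, the tie-breaking rule, and the choice to select second-tier classifiers by the predicted first digit are all assumptions made for the example. The codes are toy labels, and the real system votes over Random Forest, Support Vector Machine, and K-Nearest Neighbour models trained on TF-IDF and Doc2Vec embeddings.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# A classifier here is any function mapping a job title to a label string.
Classifier = Callable[[str], str]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the earliest prediction (assumed rule)."""
    if not predictions:
        raise ValueError("majority_vote needs at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable")


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Hypothetical stand-in for a trained model: first matching keyword wins."""
    def classify(title: str) -> str:
        lowered = title.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default
    return classify


def encode_title(
    title: str,
    tier1: List[Classifier],
    tier2: Dict[str, List[Classifier]],
) -> str:
    """Tier 1 votes on the first digit; tier 2 votes on the remaining three digits."""
    first_digit = majority_vote([clf(title) for clf in tier1])
    remaining = majority_vote([clf(title) for clf in tier2[first_digit]])
    return first_digit + remaining


# Toy ensemble: three voters per tier, with illustrative (not authoritative) codes.
TIER1 = [
    keyword_classifier({"nurse": "3"}, default="0"),
    keyword_classifier({"nurse": "3", "teacher": "4"}, default="0"),
    keyword_classifier({"teacher": "4"}, default="0"),
]
TIER2 = {
    "0": [keyword_classifier({}, default="000")],
    "3": [
        keyword_classifier({"nurse": "012"}, default="111"),
        keyword_classifier({"registered": "012"}, default="111"),
        keyword_classifier({}, default="111"),
    ],
    "4": [keyword_classifier({"teacher": "032"}, default="000")],
}

if __name__ == "__main__":
    print(encode_title("Registered Nurse", TIER1, TIER2))
```

Predicting the first digit before the rest narrows the second-tier label space, which is one plausible motivation for a hierarchical design. The abstract does not state how the second tier is conditioned on the first, so the dictionary lookup above is only one possible arrangement.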