CogLTX: Applying BERT to Long Texts

NeurIPS 2020.

Keywords:
text classification, question answering, long text, open-domain question answering, multi-hop reading comprehension
Weibo:
We present CogLTX, a cognition-inspired framework to apply BERT to long texts

Abstract:

BERT is incapable of processing long texts due to its quadratically increasing memory and time consumption. The most natural ways to address this problem, such as slicing the text by a sliding window or simplifying transformers, suffer from insufficient long-range attention or need customized CUDA kernels. The maximum length limit in BERT …

Introduction
  • Pretrained language models, pioneered by BERT [12], have emerged as silver bullets for many NLP tasks, such as question answering [38] and text classification [22].
  • Researchers and engineers breezily build state-of-the-art applications following the standard finetuning paradigm, but may end up disappointed to find texts longer than the length limit of BERT
  • This situation may be rare for standard benchmarks, for example SQuAD [38] and GLUE [47], but very common for more complex tasks [53] or real-world textual data.
  • The O(L²) space complexity of attention in BERT implies a rapid increase in memory with the text length L (see the back-of-the-envelope calculation below)
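
    A quick back-of-the-envelope comparison makes the quadratic growth concrete. This is only a sketch: it assumes a single attention head, ignores constant factors, and uses 2,560 tokens (5 × 512) as a stand-in for the ~2,500-token example mentioned in the related-work discussion.

        \[
        L = 512:\;\; 512^2 = 262{,}144 \text{ score entries}, \qquad
        L = 2560:\;\; 2560^2 = 6{,}553{,}600 \text{ score entries} \;(\approx 25\times\ \text{more}).
        \]
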
Highlights
  • Pretrained language models, pioneered by BERT [12], have emerged as silver bullets for many NLP tasks, such as question answering [38] and text classification [22]
  • The direct and superficial obstacle for long texts is that the pretrained max position embedding is usually 512 in BERT [12]
  • We assume there exists a short text z, composed of some sentences from the long text x, satisfying reasoner(x+) ≈ reasoner(z+) (Eq. 1; restated in LaTeX after this list)
  • We split each long text x into blocks [x0 … xT−1] by dynamic programming, which restricts the block length to a maximum of B; in our implementation B = 63 when the BERT length limit L = 512
  • We present CogLTX, a cognition-inspired framework to apply BERT to long texts
  • CogLTX defines a pipeline for long text understanding under the “key sentences” assumption
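
    For reference, the "key sentences" assumption above can be restated compactly in LaTeX, using the notation from the bullets. The ordering constraint reflects that selected blocks keep their original relative order, and the length constraint |z| ≤ L makes explicit that z must fit within the BERT limit; both are paraphrases of the statements in this page, not additional claims.

        % "Key sentences" assumption (Eq. 1), restated:
        \[
        \exists\, z = [x_{z_0} \dots x_{z_{n-1}}],\quad z_0 < \dots < z_{n-1},\quad |z| \le L,
        \qquad \text{such that} \qquad
        \mathrm{reasoner}(x^{+}) \approx \mathrm{reasoner}(z^{+}).
        \]
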
Methods
  • 3.1 The CogLTX methodology

    The basic assumption of CogLTX is that “for most NLP tasks, a few key sentences in the text store sufficient and necessary information to fulfill the task”.
  • The authors assume there exists a short text z, composed of some sentences from the long text x, satisfying reasoner(x+) ≈ reasoner(z+) (Eq. 1).
  • The authors split each long text x into blocks [x0 … xT−1] by dynamic programming, which restricts the block length to a maximum of B; in the implementation B = 63 when the BERT length limit L = 512 (see the block-splitting sketch after this list).
  • The key short text z should be composed of some blocks of x, i.e. z = [xz0 … xzn−1].
  • All blocks in z are automatically sorted to maintain the original relative ordering in x
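
    A minimal Python sketch of the block-splitting step. Assumptions: sentences are already tokenized into token-id lists, each shorter than B; the paper chooses split points by dynamic programming, whereas this illustration simply packs whole sentences greedily into blocks of at most B tokens, so it shows the constraint rather than the authors' exact algorithm.

        from typing import List

        B = 63  # max block length used when the BERT length limit L = 512

        def split_into_blocks(sentences: List[List[int]], max_len: int = B) -> List[List[int]]:
            """Greedy stand-in for CogLTX's dynamic-programming splitter:
            pack whole (tokenized) sentences into blocks of at most max_len tokens."""
            blocks, current = [], []
            for sent in sentences:
                if current and len(current) + len(sent) > max_len:
                    blocks.append(current)
                    current = []
                current = current + sent      # never cut inside a sentence
            if current:
                blocks.append(current)
            return blocks

        # usage: blocks = split_into_blocks(tokenized_sentences)  # -> [x0, x1, ..., xT-1]
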
Results
  • Table 1 shows that CogLTX-base outperforms well-established QA models, for example BiDAF [41] (+17.8% F1), the previous SOTA DECAPROP [43], which incorporates elaborate self-attention and RNN mechanisms (+4.8% F1), and even RoBERTa-large with a sliding window (+4.8% F1).
  • The max-pooling results of the RoBERTa-large sliding window are worse than CogLTX (−7.3% Macro-F1)
  • The authors hypothesize this is due to the tendency of max-pooling to assign higher probabilities to very long texts, which highlights the efficacy of CogLTX
Conclusion
  • The authors present CogLTX, a cognition-inspired framework to apply BERT to long texts.
  • CogLTX defines a pipeline for long text understanding under the “key sentences” assumption.
  • Hard sequence-level tasks might violate this assumption; efficient variational Bayes methods with affordable computation are still worth investigating.
  • A drawback of CogLTX is that it may miss antecedents right before the selected blocks; this is alleviated by prepending the entity name to each sentence in the HotpotQA experiments, and could be solved by position-aware retrieval competition or coreference resolution in the future
Tables
  • Table1: NewsQA results (%)
  • Table2: Results on HotpotQA distractor (dev). (+hyperlink) means usage of extra hyperlink data in Wikipedia. Models beginning with “−” are ablation studies without the corresponding design
  • Table3: 20NewsGroups results (%)
  • Table4: Alibaba result (%)
Related work
  • As mentioned in Figure 1, the sliding window method suffers from the lack of long-distance attention. Previous works [49, 33] tried to aggregate results from each window by mean-pooling, max-pooling, or an additional MLP or LSTM over them; but these methods are still weak at long-distance interaction and need O(512² · L/512) = O(512L) space, which in practice is still too large to train a BERT-large on a 2,500-token text on an RTX 2080 Ti with a batch size of 1. Besides, these late-aggregation methods mainly optimize classification, while other tasks, e.g. span extraction, have L BERT outputs and need O(L²) space for self-attention aggregation. A sketch of this sliding-window baseline follows.
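
    A minimal Python sketch of the sliding-window max-pooling baseline described above (not the authors' code): BERT is run independently on overlapping 512-token windows and the per-window class logits are max-pooled, so no attention ever crosses window boundaries. `model` is assumed to be a Hugging Face-style sequence-classification model returning `.logits`; window and stride values are illustrative.

        import torch

        def sliding_window_logits(model, input_ids, window=512, stride=256):
            """input_ids: 1-D LongTensor of token ids for one long document."""
            per_window = []
            for start in range(0, input_ids.size(0), stride):
                chunk = input_ids[start:start + window].unsqueeze(0)   # (1, <=window)
                per_window.append(model(chunk).logits)                 # (1, num_classes)
                if start + window >= input_ids.size(0):
                    break
            # training keeps activations for every window, hence the O(512 * L) memory;
            # max-pooling (or mean-pooling) is the only cross-window interaction.
            return torch.cat(per_window, dim=0).max(dim=0).values
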

    [Figure: MemRecall on an example, with initial z+ = [Q] and long text x = [x0 … x40]. Each block is concatenated with z+ and scored in the retrieval competition (scores such as 0.83 and 0.79 are shown); for instance, x0 reads “Quality Cafe is the name of two different former locations in Downtown Los Angeles, California …” and x8 reads “The Quality Cafe (aka. Quality Diner) is a now-defunct diner … but has appeared as a location featured in a number of Hollywood films, including ‘Training Day’, ‘Old School’ …”. The highest-scoring blocks form the new z+ = [Q, x0, x8], while the other blocks decay and are forgotten.]
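
    A minimal Python sketch of one MemRecall retrieval-competition step as illustrated above. The scoring function, the names, and the single-step structure are illustrative assumptions rather than the authors' implementation; the facts carried over from the figure and the method summary are that blocks compete by relevance to the current z+, winners join z+ within the length limit, losers are forgotten, and retained blocks keep their original order.

        def memrecall_step(judge_score, z_plus, blocks, length_limit=512):
            """z_plus, blocks: lists of (block_index, token_list);
            judge_score(z_plus, block) -> relevance score (higher is better)."""
            ranked = sorted(blocks, key=lambda b: judge_score(z_plus, b), reverse=True)
            new_z = list(z_plus)
            used = sum(len(tokens) for _, tokens in new_z)
            for idx, tokens in ranked:
                if used + len(tokens) <= length_limit:
                    new_z.append((idx, tokens))       # winners of the retrieval competition
                    used += len(tokens)
                # losing blocks decay and are dropped ("forgotten")
            return sorted(new_z, key=lambda b: b[0])  # keep the original relative ordering in x
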
Funding
  • The work is supported by NSFC for Distinguished Young Scholar (61825602), NSFC (61836013), and a research fund supported by Alibaba
Study subjects and analysis
long-text datasets with different tasks: 4
We conducted experiments on four long-text datasets with different tasks. Token-wise tasks (Figure 2 (c)) are not included because they mostly need barely any information from adjacent sentences and are finally transformed into multiple sequence-level samples

long news articles: 12744
Given a question and a paragraph, the task is to predict the answer span in the paragraph. We evaluate the performance of CogLTX on NewsQA [44], which contains 119,633 human-generated questions posed on 12,744 long news articles. Since the previous SOTA [43] on NewsQA is not BERT-based (due to the long texts), to keep a similar scale of parameters for a fair comparison, we finetune the base version of RoBERTa [26] for 4 epochs in CogLTX

documents: 18846
As one of the most general tasks in NLP, text classification is essential for analyzing the topic, sentiment, intent, etc. We conduct experiments on the classic 20NewsGroups [22], which contains 18,846 documents from 20 classes. We finetune RoBERTa for 6 epochs in CogLTX

articles: 30000
Alibaba is a dataset of 30,000 articles extracted from an industry scenario in a large e-commerce platform; each article advertises several items from 67 categories. Owing to the large capacity of BERT, we share the model across all labels by prepending the label name at the beginning of the document as input, i.e., [[CLS] label [SEP] doc], for binary classification (a sketch of this input construction follows)
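
    A minimal sketch of the [[CLS] label [SEP] doc] input construction described above, using a Hugging Face tokenizer as an assumed tool. Note that RoBERTa actually uses <s>/</s> rather than [CLS]/[SEP], and plain truncation here stands in for the key text z that CogLTX would retrieve; the function name is illustrative.

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("roberta-base")

        def build_label_doc_input(label_name: str, document: str, max_len: int = 512):
            """Encode one (label, document) pair for the shared binary classifier."""
            return tokenizer(label_name, document,
                             truncation=True, max_length=max_len,
                             return_tensors="pt")   # label/doc pair in one sequence

        # usage: scoring every label name against one article turns multi-label
        # category detection into 67 independent binary decisions.
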

label-article pairs: 20000
The detection of mentioned categories is naturally modeled as multi-label classification. To accelerate the experiment, we sampled 80,000 and 20,000 label–article pairs for training and testing, respectively. For this task, we finetune RoBERTa for 10 epochs in CogLTX

Reference
  • [1] A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations, 2019.
  • [2] A. Baddeley. Working memory. Science, 255(5044):556–559, 1992.
  • [3] P. Barrouillet, S. Bernardin, and V. Camos. Time constraints and resource sharing in adults’ working memory spans. Journal of Experimental Psychology: General, 133(1):83, 2004.
  • [4] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  • [5] J. Brown. Some tests of the decay theory of immediate memory. Quarterly Journal of Experimental Psychology, 10(1):12–21, 1958.
  • [6] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1870–1879, 2017.
  • [7] M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70, 2005.
  • [8] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • [9] M. Daneman and P. A. Carpenter. Individual differences in working memory and reading. Journal of Memory and Language, 19(4):450, 1980.
  • [10] R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum. Multi-step retriever-reader interaction for scalable open-domain question answering. 2018.
  • [11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • [12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • [13] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694–2703, 2019.
  • [14] Y. Fang, S. Sun, Z. Gan, R. Pillai, S. Wang, and J. Liu. Hierarchical graph network for multi-hop question answering. arXiv preprint arXiv:1911.03631, 2019.
  • [15] A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen. Mrqa 2019 shared task: Evaluating generalization in reading comprehension. In EMNLP 2019 MRQA Workshop, page 1, 2019.
  • [16] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
  • [17] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
  • [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [19] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
  • [20] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  • [21] S. Kundu and H. T. Ng. A question-focused multi-factor attention network for question answering. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [22] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.
  • [23] K. Lee, M.-W. Chang, and K. Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, 2019.
  • [24] T. Lee and Y. Park. Unsupervised sentence embedding using document structure-based context. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 633–647.
  • [25] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218, 2020.
  • [26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [27] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 63(2):81, 1956.
  • [28] S. Min, V. Zhong, R. Socher, and C. Xiong. Efficient and robust question answering from minimal context over documents. arXiv preprint arXiv:1805.08092, 2018.
  • [29] S. Min, V. Zhong, L. Zettlemoyer, and H. Hajishirzi. Multi-hop reading comprehension through question decomposition and rescoring. arXiv preprint arXiv:1906.02916, 2019.
  • [30] K. Nishida, K. Nishida, M. Nagata, A. Otsuka, I. Saito, H. Asano, and J. Tomita. Answering while summarizing: Multi-task learning for multi-hop qa with evidence extraction. arXiv preprint arXiv:1905.08511, 2019.
  • [31] K. Oberauer, H.-M. Süß, R. Schulze, O. Wilhelm, and W. W. Wittmann. Working memory capacity—facets of a cognitive ability construct. Personality and individual differences, 29(6):1017–1045, 2000.
  • [32] R. Pappagari, J. Villalba, and N. Dehak. Joint verification-identification in end-to-end multiscale cnn framework for topic identification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203. IEEE, 2018.
  • [33] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, and N. Dehak. Hierarchical transformers for long document classification. arXiv preprint arXiv:1910.10781, 2019.
  • [34] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [35] J. Qiu, H. Ma, O. Levy, S. W.-t. Yih, S. Wang, and J. Tang. Blockwise self-attention for long document understanding. Findings of EMNLP’20, 2020.
  • [36] L. Qiu, Y. Xiao, Y. Qu, H. Zhou, L. Li, W. Zhang, and Y. Yu. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, 2019.
  • [37] J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
  • [38] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
  • [39] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
  • [40] G. Salton, A. Wong, and C.-S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
  • [41] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations, 2017.
  • [42] S. Sukhbaatar, É. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335, 2019.
  • [43] Y. Tay, A. T. Luu, S. C. Hui, and J. Su. Densely connected attention propagation for reading comprehension. In Advances in Neural Information Processing Systems, pages 4906–4917, 2018.
  • [44] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. Newsqa: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, 2017.
  • [45] M. Tu, K. Huang, G. Wang, J. Huang, X. He, and B. Zhou. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. arXiv preprint arXiv:1911.00484, 2019.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [47] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics.
  • [48] S. Wang and J. Jiang. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905, 2016.
  • [49] W. Wang, M. Yan, and C. Wu. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1705–1714, 2018.
  • [50] Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang. Multi-passage bert: A globally normalized bert model for open-domain question answering. arXiv preprint arXiv:1908.08167, 2019.
  • [51] D. Weissenborn, G. Wiese, and L. Seiffe. Making neural qa as simple as possible but not simpler. arXiv preprint arXiv:1703.04816, 2017.
  • [52] C. M. Wharton, K. J. Holyoak, P. E. Downing, T. E. Lange, T. D. Wickens, and E. R. Melz. Below the surface: Analogical similarity and retrieval competition in reminding. Cognitive Psychology, 26(1):64–101, 1994.
  • [53] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
  • [54] L. Yao, C. Mao, and Y. Luo. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377, 2019.