# Learning in Natural Language

Learning in Natural Language, pp. 898-904 (2000)

Abstract

Statistics-based classifiers in natural language are typically developed by assuming a generative model for the data, estimating its parameters from training data, and then using Bayes rule to obtain a classifier. For many problems the assumptions made by the generative models are evidently wrong, leaving open the question of why these approaches…


Introduction

- Generative probability models provide a principled approach to the study of statistical classification in complex domains such as natural language.
- In the context of natural language, most classifiers are derived from probabilistic language models which estimate the probability of a sentence s, say, using Bayes rule, and decompose this probability into a product of conditional probabilities according to the generative model's assumptions.
- The generative models used to estimate these terms typically make Markov or other independence assumptions.
- It is evident from looking at language data that these assumptions are often patently false and that there are significant global dependencies both within and across sentences.
- Classifiers built on these false assumptions nevertheless seem to behave quite robustly in many cases.
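The Bayes-rule decomposition described above can be sketched with a toy bigram HMM for part-of-speech tagging. The tag set, transition and emission probabilities, and function names below are hypothetical illustrations, not values from the paper:

```python
import math

# Toy bigram HMM for part-of-speech tagging (all probabilities are made up).
# Markov assumption:       P(t_i | t_1..t_{i-1}) = P(t_i | t_{i-1})
# Independence assumption: P(w_i | all tags, other words) = P(w_i | t_i)
TRANS = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4}
EMIT = {("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02}

def joint_log_prob(tags, words):
    """log P(tags, words) = sum_i [log P(t_i | t_{i-1}) + log P(w_i | t_i)]."""
    lp, prev = 0.0, "<s>"
    for t, w in zip(tags, words):
        lp += math.log(TRANS[(prev, t)]) + math.log(EMIT[(t, w)])
        prev = t
    return lp

lp = joint_log_prob(["DT", "NN", "VB"], ["the", "dog", "barks"])
```

Note how the product form discards exactly the global dependencies the introduction mentions: each word's probability depends only on its own tag, and each tag only on the previous one.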

Highlights

- Generative probability models provide a principled approach to the study of statistical classification in complex domains such as natural language.
- When using a (Hidden) Markov Model (HMM) as a generative model for the problem of part-of-speech tagging, estimating the probability of a sequence of tags involves assuming that the part-of-speech tag t_i of the word w_i is independent of the other words in the sentence, given the preceding tag t_{i-1}.
- We show that a variety of models used for learning in natural language make their predictions using Linear Statistical Queries (LSQ) hypotheses.
- We show how different models used in the literature can be cast as Linear Statistical Queries hypotheses by selecting the statistical queries appropriately, and how this choice affects the robustness of the derived hypothesis.
- Our goal is to show that an algorithm that is able to learn under these restrictions is guaranteed to produce a robust hypothesis.
- In addition to providing better learning techniques, developing an understanding of when and why learning works in this context is a necessary step in studying the role of learning in higher-level natural language inferences.
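The LSQ view in the highlights above can be illustrated with naive Bayes, whose prediction is a linear function over features whose coefficients are computed from conditional-probability estimates — the statistical queries. The function name, counts, and class labels below are hypothetical, not from the paper:

```python
import math

def lsq_predict(features, priors, cond_counts, class_counts):
    """Predict argmax_c [log P(c) + sum_f log P(f | c)]: a linear sum whose
    coefficients are functions of expectations estimated from data."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)  # coefficient of the constant query
        for f in features:
            # coefficient of the query for feature f under class c,
            # estimated with add-one smoothing
            p = (cond_counts[c].get(f, 0) + 1) / (class_counts[c] + 2)
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical counts from a toy two-class training set
priors = {"pos": 0.5, "neg": 0.5}
cond_counts = {"pos": {"good": 8, "bad": 1}, "neg": {"good": 1, "bad": 8}}
class_counts = {"pos": 10, "neg": 10}
label = lsq_predict(["good"], priors, cond_counts, class_counts)  # "pos"
```

The point of the LSQ framing is that the hypothesis depends on the data only through such estimated expectations, which is what yields the robustness guarantees discussed in the paper.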

Conclusion

- In the last few years there has been a surge of empirical work in natural language, a significant part of it done using statistical machine learning techniques. Roth [1998] has investigated the relations among some of the commonly used methods and taken preliminary steps towards developing a better theoretical understanding of why and when different methods work.
- In addition to providing better learning techniques, developing an understanding of when and why learning works in this context is a necessary step in studying the role of learning in higher-level natural language inferences.

References

- [Anthony and Holden, 1993] M. Anthony and S. Holden. On the power of polynomial discriminators and radial basis function networks. In Proc. 6th Annu. Workshop on Comput. Learning Theory, pages 158-164. ACM Press, New York, NY, 1993.
- [Aslam and Decatur, 1995] J. A. Aslam and S. E. Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. In Proc. 8th Annu. Conf. on Comput. Learning Theory, pages 437-446. ACM Press, New York, NY, 1995.
- [Darroch and Ratcliff, 1972] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
- [Decatur, 1993] S. E. Decatur. Statistical queries and faulty PAC oracles. In Proceedings of the Sixth Annual ACM Workshop on Computational Learning Theory, pages 262-268. ACM Press, 1993.
- [Delcher et al., 1993] A. Delcher, S. Kasif, H. Goldberg, and W. Hsu. Application of probabilistic causal trees to analysis of protein secondary structure. In National Conference on Artificial Intelligence, pages 316-321, 1993.
- [Duda and Hart, 1973] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
- [Gale et al., 1993] W. Gale, K. Church, and D. Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439, 1993.
- [Golding and Roth, 1999] A. R. Golding and D. Roth. A winnow based approach to context-sensitive spelling correction. Machine Learning, 1999. Special issue on Machine Learning and Natural Language. Preliminary version appeared in ICML-96.
- [Golding, 1995] A. R. Golding. A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the 3rd Workshop on Very Large Corpora, ACL-95, 1995.
- [Grove and Roth, 1998] A. Grove and D. Roth. Linear concepts and hidden variables: An empirical study. In Neural Information Processing Systems. MIT Press, 1998.
- [Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, September 1992.
- [Höffgen and Simon, 1992] K. Höffgen and H. Simon. Robust trainability of single neurons. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 428-439, New York, NY, 1992. ACM Press.
- [Jaynes, 1982] E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939-952, September 1982.
- [Kearns et al., 1992] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 341-352. ACM Press, New York, NY, 1992.
- [Kearns, 1993] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392-401, 1993.
- [Kupiec, 1992] J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225-242, 1992.
- [Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.
- [Ratnaparkhi et al., 1994] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, NJ, March 1994.
- [Ratnaparkhi, 1997] A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In EMNLP-97, The Second Conference on Empirical Methods in Natural Language Processing, pages 1-10, 1997.
- [Roth and Zelenko, 1998] D. Roth and D. Zelenko. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142, 1998.
- [Roth, 1998] D. Roth. Learning to resolve natural language ambiguities: A unified approach. In Proc. National Conference on Artificial Intelligence, pages 806-813, 1998.
- [Schütze, 1995] H. Schütze. Distributional part-of-speech tagging. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, 1995.
- [Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.
- [Vapnik, 1982] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
- [Vapnik, 1995] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
- [Yamanishi, 1992] K. Yamanishi. A learning criterion for stochastic rules. Machine Learning, 1992.
