Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

Michelle A. Lee
Krishnan Srinivasan
Parth Shah

International Conference on Robotics and Automation (ICRA), 2019.

Keywords:
video prediction, convolutional neural network, manipulation skill, haptic feedback, Toyota Research Institute

Abstract:

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, …

Introduction
  • Even in routine tasks such as inserting a car key into the ignition, humans effortlessly combine the senses of vision and touch to complete the task.
  • This policy is learned through self-supervision and generalizes over variations of the same contact-rich manipulation task in geometry, configurations, and clearances.
  • Using a self-supervised learning objective, this network is trained to predict optical flow, whether contact will be made in the control cycle, and concurrency of visual and haptic data.
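
To make the fusion described in these bullets concrete, below is a minimal PyTorch-style sketch of a multimodal encoder that maps an RGB image, a short force/torque history, and proprioception into a single compact vector. The 128-d latent follows the paper's description; the modality-specific encoders, layer sizes, and input shapes are illustrative assumptions, not the authors' released architecture.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Fuse camera, force/torque, and proprioception into one 128-d vector.

    Hedged sketch: the sub-encoders and fusion MLP below are illustrative
    assumptions, not the architecture published with the paper.
    """
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # RGB image encoder: a small CNN standing in for the paper's image encoder.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 64),
        )
        # Force/torque encoder: consumes a window of 32 six-axis wrench readings.
        self.force_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 6, 64), nn.ReLU(), nn.Linear(64, 64),
        )
        # Proprioception encoder: end-effector position and velocity.
        self.proprio_enc = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 32))
        # Fusion MLP producing the compact multimodal representation.
        self.fusion = nn.Sequential(nn.Linear(64 + 64 + 32, 128), nn.ReLU(),
                                    nn.Linear(128, latent_dim))

    def forward(self, image, force_window, proprio):
        z = torch.cat([self.image_enc(image),
                       self.force_enc(force_window),
                       self.proprio_enc(proprio)], dim=-1)
        return self.fusion(z)

if __name__ == "__main__":
    enc = MultimodalEncoder()
    z = enc(torch.zeros(1, 3, 128, 128), torch.zeros(1, 32, 6), torch.zeros(1, 6))
    print(z.shape)  # torch.Size([1, 128])
```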
Highlights
  • Even in routine tasks such as inserting a car key into the ignition, humans effortlessly combine the senses of vision and touch to complete the task
  • The resulting compact representation of the high-dimensional and heterogeneous data is the input to a policy for contact-rich manipulation tasks using deep reinforcement learning
  • The primary goal of our experiments is to examine the effectiveness of the multimodal representations in contact-rich manipulation tasks
  • We first conduct an ablative study in simulation to investigate the contributions of individual sensory modalities to learning the multimodal representation and manipulation policy
  • Our transfer learning results indicate that the multimodal representations from visual and haptic feedback generalize well across variations of our contact-rich manipulation tasks
  • To enable efficient real robot training, we proposed a novel model to encode heterogeneous sensory inputs into a compact multimodal representation
Results
  • The resulting compact representation of the high-dimensional and heterogeneous data is the input to a policy for contact-rich manipulation tasks using deep reinforcement learning.
  • The authors' goal is to learn a policy on a robot for performing contact-rich manipulation tasks.
  • The authors design a set of predictive tasks that are suitable for learning visual and haptic representations for contact-rich manipulation tasks, where supervision can be obtained without manual labeling. [Figure: image encoder with skip connections, action encoder, and a flow predictor producing action-conditional optical flow.]
  • Given the robot action and the compact representation of the current sensory data, the model has to predict (i) the optical flow generated by the action and (ii) whether the end-effector will make contact with the environment in the control cycle.
  • The authors' final goal is to equip a robot with a policy that leverages multimodal feedback to perform contact-rich manipulation tasks.
  • The authors formulate contact-rich manipulation as a model-free reinforcement learning problem to investigate its performance when relying on multimodal feedback and when acting under uncertainty in geometry, clearance, and configuration.
  • The policy network is a 2-layer MLP that takes as input the 128-d multimodal representation and produces a 3D displacement ∆x of the robot end-effector.
  • The authors first conduct an ablative study in simulation to investigate the contributions of individual sensory modalities to learning the multimodal representation and manipulation policy.
  • The authors apply their full multimodal model to a real robot and train reinforcement learning policies for the peg insertion tasks from the learned representations with high sample efficiency.
  • To investigate the importance of each modality for contact-rich manipulation tasks, the authors perform an ablative study in simulation, learning the multimodal representations with different combinations of modalities.
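
The bullets above describe two action-conditional prediction targets (optical flow and next-step contact) plus, per the Introduction, a visual/haptic concurrency signal. Below is a hedged sketch of how such self-supervised heads and their combined loss might look, consuming a 128-d representation z (as from the encoder sketch earlier) and the robot action. The head architectures, the coarse 32×32 flow resolution, and the equal loss weights are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfSupervisedHeads(nn.Module):
    """Predict optical flow, contact, and modality concurrency from (z, action)."""
    def __init__(self, latent_dim: int = 128, action_dim: int = 3, flow_hw: int = 32):
        super().__init__()
        self.flow_hw = flow_hw
        in_dim = latent_dim + action_dim
        # Flow head: predicts a coarse 2-channel (dx, dy) flow map from z and the action.
        self.flow_head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 2 * flow_hw * flow_hw))
        # Contact head: will the end-effector touch the environment in the next control cycle?
        self.contact_head = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Alignment head: are the visual and haptic streams from the same time step?
        self.align_head = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z, action):
        x = torch.cat([z, action], dim=-1)
        flow = self.flow_head(x).view(-1, 2, self.flow_hw, self.flow_hw)
        contact_logit = self.contact_head(x).squeeze(-1)
        align_logit = self.align_head(z).squeeze(-1)
        return flow, contact_logit, align_logit

def self_supervised_loss(flow, contact_logit, align_logit,
                         flow_target, contact_target, align_target):
    # Per-pixel flow regression (the paper reports flow quality as endpoint error),
    # plus binary cross-entropy for the two classification tasks.
    loss_flow = F.mse_loss(flow, flow_target)
    loss_contact = F.binary_cross_entropy_with_logits(contact_logit, contact_target)
    loss_align = F.binary_cross_entropy_with_logits(align_logit, align_target)
    return loss_flow + loss_contact + loss_align  # equal weighting is an assumption
```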
Conclusion
  • The authors make the task tractable on a real robot by freezing the multimodal representation model, which generates action-conditional flows with low endpoint errors, and training only a shallow neural network controller on top of it.
  • The authors' transfer learning results indicate that the multimodal representations from visual and haptic feedback generalize well across variations of the contact-rich manipulation tasks.
  • To enable efficient real robot training, the authors proposed a novel model to encode heterogeneous sensory inputs into a compact multimodal representation.
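
As the Results and Conclusion bullets note, the controller itself is small: a 2-layer MLP over the frozen 128-d representation that outputs a 3-D end-effector displacement Δx. A hedged sketch follows; the 128-d input and 3-D output follow the text, while the hidden width, tanh scaling, and displacement bound are illustrative assumptions. In the paper the policy is trained with a model-free policy-gradient method (trust-region policy optimization appears in the references); the sketch only defines the network, not the RL update.

```python
import torch.nn as nn

class DisplacementPolicy(nn.Module):
    """2-layer MLP: 128-d multimodal representation -> 3-D end-effector displacement."""
    def __init__(self, latent_dim: int = 128, hidden: int = 64, max_disp: float = 0.01):
        super().__init__()
        self.max_disp = max_disp  # metres per control cycle; illustrative value
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3), nn.Tanh())

    def forward(self, z):
        return self.max_disp * self.net(z)

# During policy learning the representation model is frozen, so only this
# small policy network is updated on the real robot, e.g.:
#   encoder.requires_grad_(False)
#   policy = DisplacementPolicy()
#   delta_x = policy(encoder(image, force_window, proprio).detach())
```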
Related work

    A. Contact-Rich Manipulation

    Contact-rich tasks, such as peg insertion, block packing, and edge following, have been studied for decades due to their relevance in manufacturing. Manipulation policies often rely entirely on haptic feedback and force control, and assume sufficiently accurate state estimation [56]. They typically generalize over certain task variations, for instance, peg-in-chamfered-hole insertion policies that work independently of peg diameter [55]. However, entirely new policies are required for new geometries. For chamferless holes, manually defining a small set of viable contact configurations has been successful [12] but cannot accommodate the vast range of real-world variations. [48] combines visual and haptic data for inserting two planar pegs with more complex cross sections, but assumes known peg geometry.
Funding
  • This work has been partially supported by JD.com American Technologies Corporation (“JD”) under the SAIL-JD AI Research Initiative and by the Toyota Research Institute (“TRI”).
Reference
  • F. J. Abu-Dakka, B. Nemec, J. A. Jørgensen, T. R. Savarimuthu, N. Krüger, and A. Ude, “Adaptation of manipulation skills in physical contact with the environment to reference force profiles”, Autonomous Robots, vol. 39, no. 2, pp. 199–217, 2015.
  • P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics”, in Advances in Neural Information Processing Systems, 2016, pp. 5074–5082.
  • M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation”, ArXiv preprint arXiv:1808.00177, 2018.
  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction”, ArXiv preprint arXiv:1710.11252, 2017.
  • Y. Bekiroglu, R. Detry, and D. Kragic, “Learning tactile characterizations of object- and pose-specific grasps”, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 1554–1560.
  • Y. Bekiroglu, D. Song, L. Wang, and D. Kragic, “A probabilistic framework for task-oriented grasp stability assessment”, in 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 3040–3047.
  • A. Bicchi, M. Bergamasco, P. Dario, and A. Fiorillo, “Integrated tactile sensing for gripper fingers”, in Int. Conf. on Robot Vision and Sensory Control, 1988.
  • R. Blake, K. V. Sobel, and T. W. James, “Neural synergy between kinetic vision and touch”, Psychological science, vol. 15, no. 6, pp. 397–402, 2004.
  • J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic, S. Schaal, and G. Sukhatme, “Interactive perception: Leveraging action in perception and perception in action”, IEEE Transactions on Robotics, vol. 33, pp. 1273–1291, Dec. 2017.
  • K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping”, in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 4243–4250.
  • T. de Bruin, J. Kober, K. Tuyls, and R. Babuška, “Integrating state representation learning into deep reinforcement learning”, IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1394–1401, 2018.
  • M. E. Caine, T. Lozano-Perez, and W. P. Seering, “Assembly strategies for chamferless parts”, in Proceedings of the 1989 International Conference on Robotics and Automation, 1989, pp. 472–477, vol. 1.
  • R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine, “More than a feeling: Learning to grasp and regrasp using vision and touch”, IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300–3307, 2018.
  • R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. H. Adelson, and S. Levine, “The feeling of success: Does touch sensing help predict grasp outcomes?”, Conference on Robot Learning (CoRL), 2017.
  • Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, “Path integral guided policy search”, in ICRA, 2017.
  • F. Conti, F. Barbagli, R. Balaniuk, M. Halg, C. Lu, D. Morris, L. Sentis, J. Warren, O. Khatib, and K. Salisbury, “The CHAI libraries”, in Proceedings of Eurohaptics 2003, Dublin, Ireland, 2003, pp. 496–500.
  • F. Conti and O. Khatib, “A framework for real-time multi-contact multi-body dynamic simulation”, in Robotics Research, Springer, 2016, pp. 271–287.
  • G. M. Edelman, Neural darwinism: The theory of neuronal group selection. Basic books, 1987.
  • C. Eppner, R. Deimel, J. Álvarez-Ruiz, M. Maertens, and O. Brock, “Exploitation of environmental constraints in human and robotic grasping”, Int. J. Rob. Res., vol. 34, no. 7, pp. 1021–1038, Jun. 2015.
  • N. Fazeli, S. Zapolsky, E. Drumwright, and A. Rodriguez, “Fundamental limitations in performance and interpretability of common planar rigid-body contact models”, ArXiv preprint arXiv:1710.04979, 2017.
  • C. Finn and S. Levine, “Deep visual foresight for planning robot motion”, in Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE, 2017, pp. 2786–2793.
  • P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks”, CoRR, vol. abs/1504.06852, 2015. arXiv: 1504.06852.
  • J. Fu, S. Levine, and P. Abbeel, “One-shot learning of manipulation skills with online dynamics adaptation and neural network priors”, in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2016, pp. 4019–4026.
  • S. Ganguly and O. Khatib, “Experimental studies of contact space model for multi-surface collisions in articulated rigid-body systems”, in International Symposium on Experimental Robotics, Springer, 2018.
  • Y. Gao, L. A. Hendricks, K. J. Kuchenbecker, and T. Darrell, “Deep learning for tactile understanding from visual and haptic data”, in Robotics and Automation (ICRA), 2016 IEEE International Conference on, IEEE, 2016, pp. 536–543.
  • C. Garcia Cifuentes, J. Issac, M. Wüthrich, S. Schaal, and J. Bohg, “Probabilistic articulated real-time tracking for robot manipulation”, IEEE Robotics and Automation Letters (RA-L), vol. 2, no. 2, pp. 577–584, Apr. 2017.
  • H. van Hoof, N. Chen, M. Karl, P. van der Smagt, and J. Peters, “Stable reinforcement learning with autoencoders for tactile and visual data”, in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, IEEE, 2016, pp. 3928–3934.
  • M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force control policies for compliant manipulation”, in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 4639–4644.
  • D. Kappler, P. Pastor, M. Kalakrishnan, M. Wuthrich, and S. Schaal, “Data-driven online decision making for autonomous manipulation”, in Proceedings of Robotics: Science and Systems, Rome, Italy, 2015.
  • O. Khatib, “Inertial Properties in Robotic Manipulation: An Object-Level Framework”, Int. J. Rob. Res., vol. 14, no. 1, pp. 19–36, 1995.
  • S. Lacey and K. Sathian, “Crossmodal and multisensory interactions between vision and touch”, in Scholarpedia of Touch, Springer, 2016, pp. 301–315.
  • Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, no. 7553, p. 436, 2015.
  • T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat., “State representation learning for control: An overview”, CoRR, vol. abs/1802.04181, 2018.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies”, J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, Jan. 2016.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning”, ArXiv preprint arXiv:1509.02971, 2015.
  • G.-H. Liu, A. Siravuru, S. Prabhakar, M. Veloso, and G. Kantor, “Learning end-to-end multimodal sensor policies for autonomous navigation”, ArXiv preprint arXiv:1705.10422, 2017.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning”, in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
  • J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in Atari games”, in Advances in Neural Information Processing Systems, 2015, pp. 2863–2871.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio”, ArXiv preprint arXiv:1609.03499, 2016.
  • A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features”, ECCV, 2018.
  • X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization”, in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 1–8.
  • B. Ponton, A. Herzog, S. Schaal, and L. Righetti, “A convex model of humanoid momentum dynamics for multi-contact motion generation”, 2016, pp. 842–849.
  • M. Posa, C. Cantu, and R. Tedrake, “A direct method for trajectory optimization of rigid bodies through contact”, The International Journal of Robotics Research, vol. 33, no. 7, pp. 1044–1044, Jun. 2014.
  • L. Righetti, M. Kalakrishnan, P. Pastor, J. Binney, J. Kelly, R. C. Voorhies, G. S. Sukhatme, and S. Schaal, “An autonomous manipulation system based on force control and optimization”, Autonomous Robots, vol. 36, no. 1, pp. 11–30, 2014.
  • J. M. Romano, K. Hsiao, G. Niemeyer, S. Chitta, and K. J. Kuchenbecker, “Human-inspired robotic grasp control with tactile sensing”, IEEE Transactions on Robotics, vol. 27, no. 6, pp. 1067–1079, 2011.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization”, in International Conference on Machine Learning, 2015, pp. 1889–1897.
  • J. Sinapov, C. Schenck, and A. Stoytchev, “Learning relational object categories using behavioral exploration and multimodal perception”, in Robotics and Automation (ICRA), 2014 IEEE International Conference on, IEEE, 2014, pp. 5691–5698.
  • H. Song, Y. Kim, and J. Song, “Automated guidance of peg-in-hole assembly tasks for complex-shaped parts”, in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014, pp. 4517–4522.
  • N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines”, in Advances in neural information processing systems, 2012, pp. 2222–2230.
  • Z. Su, K. Hausman, Y. Chebotar, A. Molchanov, G. E. Loeb, G. S. Sukhatme, and S. Schaal, “Force estimation and slip detection/classification for grip control using a biomimetic tactile sensor”, in 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), 2015, pp. 297–303.
  • J. Sung, J. K. Salisbury, and A. Saxena, “Learning to represent haptic feedback for partially-observable tasks”, in Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE, 2017, pp. 2802–2809.
  • S. Tonneau, A. Del Prete, J. Pettré, C. Park, D. Manocha, and N. Mansard, “An efficient acyclic contact planner for multiped robots”, IEEE Transactions on Robotics, vol. 34, no. 3, pp. 586–601, 2018.
  • H. Van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features”, in Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, IEEE, 2015, pp. 121–127.
  • F. Veiga, H. Van Hoof, J. Peters, and T. Hermans, “Stabilizing novel objects by learning to predict tactile slip”, in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE, 2015, pp. 5065–5072.
  • D. E. Whitney, “Quasi-Static Assembly of Compliantly Supported Rigid Parts”, Journal of Dynamic Systems, Measurement, and Control, vol. 104, no. 1, pp. 65–77, 1982.
  • D. E. Whitney, “Historical perspective and state of the art in robot force control”, Int. J. Rob. Res., vol. 6, no. 1, pp. 3–14, Mar. 1987.
  • X. Yang, P. Ramesh, R. Chitta, S. Madhvanath, E. A. Bernal, and J. Luo, “Deep multimodal representation learning from temporal data”, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5066–5074.
  • Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, et al., “Reinforcement and imitation learning for diverse visuomotor skills”, ArXiv preprint arXiv:1802.09564, 2018.
Best Paper
Best Paper of ICRA, 2019