F.A.I.

The Forum for Artificial Intelligence meets every other week (or so) to discuss scientific, philosophical, and cultural issues in artificial intelligence. Both technical research topics and broader interdisciplinary aspects of AI are covered, and all are welcome to attend! Recordings will be made available online by the end of the day on each Friday there is a talk.

If you would like to be added to the FAI mailing list, subscribe here. If you have any questions or comments, please send an email to Catherine Andersson.

Upcoming Talks

To Be Announced

Check back soon!

Past Talks 2023-2024

Friday, August 25, 2023, 11:00 AM
GDC 6.302 | Recording

Models of Human Preference for RLHF

Brad Knox [homepage]
Research Fellow, University of Texas at Austin

Abstract:
The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. I will discuss why this assumption is flawed and will propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. After covering theoretical and empirical comparisons of the two models of human preference, I will end with a novel framing of the common RLHF algorithm used to fine-tune large language models (LLMs) based upon the regret preference model, explaining and removing the undesirable assumption that multi-turn interaction, a sequential decision-making task, should be tuned with reward in a bandit environment.
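For readers unfamiliar with the two preference models, the sketch below contrasts them on a toy example. It is a minimal illustration under assumed definitions (a Bradley-Terry-style choice model with stand-in reward and value arrays), not the speaker's implementation.

```python
import numpy as np

def preference_prob(score_a, score_b):
    """Bradley-Terry / logistic model: P(segment A preferred over segment B)."""
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))

def partial_return(rewards):
    """Sum of rewards along a segment."""
    return float(np.sum(rewards))

def negated_regret(rewards, values, gamma=1.0):
    """Illustrative regret-style score: how close a segment is to optimal
    decision-making, approximated here with value estimates at the segment's
    endpoints (values[0] at the start, values[-1] at the end)."""
    return partial_return(rewards) + gamma * values[-1] - values[0]

# Toy example: two segments with identical partial return but different
# value estimates, so the two models disagree about which is preferred.
rew_a, val_a = np.array([1.0, 0.0, 1.0]), np.array([5.0, 4.0, 4.0, 6.0])
rew_b, val_b = np.array([0.0, 1.0, 1.0]), np.array([2.0, 2.0, 3.0, 2.0])

print("partial-return model:",
      preference_prob(partial_return(rew_a), partial_return(rew_b)))
print("regret model:",
      preference_prob(negated_regret(rew_a, val_a), negated_regret(rew_b, val_b)))
```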

About the speaker:
Brad is a research scientist at the University of Texas at Austin. His research has largely focused on the human side of reinforcement learning. He is currently concerned with how humans can specify reward functions that are aligned with their interests. Brad’s dissertation, “Learning from Human-Generated Reward”, comprised early pioneering work on human-in-the-loop reinforcement learning and won the 2012 best dissertation award from the UT Austin Department of Computer Science. His postdoctoral research at the MIT Media Lab focused on creating interactive characters through machine learning on puppetry-style demonstrations of interaction. Stepping away from research during 2015–2018, Brad founded and sold his startup Bots Alive, working in the toy robotics sector. In recent years, Brad co-led the Bosch Learning Agents Lab at UT Austin and was a Senior Research Scientist at Google. He has won multiple best paper awards and was named to IEEE Intelligent Systems' AI's 10 to Watch in 2013.

Friday, September 1, 2023, 11:00 AM
GDC 6.302 | Recording

Deployable Reinforcement Learning: Dealing with Challenges of Sparse Rewards in Robotics via Heavy-Tailed Policy Gradient

Amrit Singh Bedi [homepage]
Research Scientist, University of Maryland

Abstract:
Recent advancements in Artificial Intelligence (AI), such as AlphaZero and ChatGPT, have significantly impacted various fields. Reinforcement learning (RL) plays a crucial role in these achievements. However, deploying RL in real-world applications, including robotics, finance, and healthcare, presents challenges such as efficient exploration, scalability, domain adaptation, and safety. One key aspect common to all these challenges in RL is the design of effective reward functions, which are often assumed to be known but remain elusive in practice. In this talk, we will discuss our recent results in addressing these challenges, specifically focusing on sparse rewards in robotic applications. While designing sparse rewards may seem easier, it introduces significant exploration challenges that make traditional algorithms inefficient. To tackle this, we propose heavy-tailed policy gradient algorithms, which provide a promising solution. We derive precise sample complexity bounds for the proposed algorithms and demonstrate their effectiveness in both simulators and real robots.
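To convey the intuition behind heavy-tailed exploration, the sketch below compares a Gaussian and a heavy-tailed Cauchy action distribution on a toy sparse-reward problem. It illustrates the general idea rather than the speaker's algorithm; all names and numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(mean, scale, n, heavy_tailed):
    """Gaussian policy vs. heavy-tailed Cauchy policy around the same mean."""
    if heavy_tailed:
        return mean + scale * rng.standard_cauchy(n)   # heavy tails: rare, large jumps
    return rng.normal(mean, scale, n)                  # light tails: stays near the mean

def sparse_reward(action, goal=8.0, tol=0.5):
    """Reward is 1 only in a narrow region far from the initial policy mean."""
    return float(abs(action - goal) < tol)

# With a light-tailed policy centered at 0, the goal at 8 is essentially never
# reached, so the policy-gradient signal is zero; the Cauchy policy still
# occasionally lands in the rewarding region and produces a learning signal.
for heavy in (False, True):
    actions = sample_actions(0.0, 1.0, 100_000, heavy)
    rewards = np.array([sparse_reward(a) for a in actions])
    print("heavy-tailed" if heavy else "gaussian",
          "fraction of rewarded actions:", rewards.mean())
```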

About the speaker:
Amrit Singh Bedi is a Research Scientist at the University of Maryland (UMD), College Park, MD, USA. Prior to his time at UMD, he worked with Dr. Alec Koppel and Dr. Brian Sadler at the US Army Research Laboratory, where he gained valuable experience and insight into the real-world applications of his research. He earned his Ph.D. in Electrical Engineering from the Indian Institute of Technology (IIT) Kanpur under the supervision of Prof. Ketan Rajawat, where his thesis focused on distributed and online learning with stochastic gradient methods.

Friday, September 8, 2023, 11:00 AM
GDC 6.302 | Recording

Learning Generalizable and Interpretable Embodied AI from Humans


Bolei Zhou [homepage]
Assistant Professor, University of California, Los Angeles

Abstract:
The recent progress of deep learning paves the way for embodied AI to go beyond visual recognition and become an active participant that interacts with its environment. However, it remains challenging to ensure the AI's generalizability to unseen situations and its alignment with human intents. I will talk about our work on incorporating humans in the learning of embodied agents for driving and locomotion. I will show that human interaction not only brings high sample efficiency and safety to the learning process but also facilitates the alignment of agent behaviors and human-AI shared control. Lastly, I will briefly introduce our ongoing effort to build an open-source driving simulator called MetaDriverse for generalizable embodied AI, which incorporates a massive number of real-world scenarios and learns to generate novel ones.

About the speaker:
Bolei Zhou is an Assistant Professor in the Computer Science Department at the University of California, Los Angeles (UCLA). He earned his Ph.D. from MIT in 2018. His recent research interest lies at the intersection of computer vision and machine autonomy, focusing on enabling interpretable human-AI interaction. He has developed many widely used neural network interpretation methods, such as CAM and Network Dissection, as well as the computer vision benchmarks Places and ADE20K. He has received MIT Tech Review's Innovators under 35 in Asia-Pacific Award and Intel's Rising Star Faculty Award. More details are available on his webpage: https://boleizhou.github.io/.

Friday, September 22, 2023, 11:00 AM
GDC 6.302 | Recording

Teach Large Neural Models to Self-Learn Symbolic Knowledge

Heng Ji [homepage]
Professor, University of Illinois at Urbana-Champaign

Abstract:
Recent large neural models have shown impressive performance on various data modalities, including natural language, vision, programming languages, and molecules. However, they still show surprising deficiencies (near-random performance) in acquiring certain types of knowledge, such as structured knowledge and action knowledge. In this talk I propose a two-way knowledge acquisition framework that lets symbolic and neural learning approaches enhance each other. In the first stage, we elicit and acquire explicit symbolic knowledge from large neural models. In the second stage, we leverage the acquired symbolic knowledge along with external knowledge to augment and enhance these large neural models.

I will present three recent case studies to demonstrate this framework:
(1) The first task is to induce event schemas (stereotypical structures of events and their connections) from large language models by incremental prompting and verification [Li et al., ACL2023], and apply the induced schemas to enhance event extraction and event prediction.
(2) In the second task, we noticed that current large video-language models rely on object recognition abilities as a shortcut for action understanding. We utilize a Knowledge Patcher network to elicit new action knowledge from the current models by designing specialized probing tasks and loss functions, and a Knowledge Fuser component to integrate the Patcher into frozen video-language models.
(3) In the third task, we use large-scale molecule-language models to discover molecule subgraph structures ("building blocks") that contribute to blood-brain barrier permeability in the kinase inhibitor family, and propose several candidate kinase inhibitor variants with improved ability to pass the blood-brain barrier, accelerating drug discovery. We then encode such graph-pattern knowledge using lightweight adapter modules, bottleneck feed-forward networks that are inserted into different locations of the backbone large molecule-language models (a generic adapter sketch follows below).
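For the adapter modules mentioned in (3), a generic bottleneck adapter in PyTorch looks roughly like the following sketch. This is the standard adapter pattern from the literature, not the specific module from the talk, and the hidden sizes and placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, nonlinearity, up-project, residual add.
    Inserted after a frozen backbone sub-layer so that only the adapter is trained."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps the backbone output intact

# Usage: wrap a frozen backbone layer's output (shapes are assumptions).
hidden = torch.randn(2, 16, 768)           # (batch, tokens, hidden_dim) from a frozen backbone
adapter = BottleneckAdapter(hidden_dim=768)
patched = adapter(hidden)                  # only adapter parameters would receive gradients
```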

About the speaker:
Heng Ji is a professor in the Computer Science Department, and an affiliated faculty member of the Electrical and Computer Engineering Department and the Coordinated Science Laboratory, at the University of Illinois Urbana-Champaign. She is an Amazon Scholar and the Founding Director of the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE). She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Multimedia Multilingual Information Extraction, Knowledge-enhanced Large Language Models, Knowledge-driven Generation, and Conversational AI.

She was selected as a Young Scientist to attend the 6th World Laureates Association Forum and to participate in DARPA AI Forward in 2023, and was selected as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She was named one of the Women Leaders of Conversational AI (Class of 2023) by Project Voice. Her awards include the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, the PACLIC2012 Best Paper runner-up, "Best of ICDM2013" and "Best of SDM2013" paper awards, an ACL2018 Best Demo paper nomination, the ACL2020 Best Demo Paper Award, the NAACL2021 Best Demo Paper Award, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and a Bosch Research Award in 2014-2018.

She was invited by the Secretary of the U.S. Air Force and AFRL to join the Air Force Data Analytics Expert Panel to inform the Air Force Strategy 2030, and was invited to speak at the Federal Information Integrity R&D Interagency Working Group (IIRD IWG) briefing in 2023. She leads many multi-institution projects and tasks, including the U.S. ARL projects on information fusion and knowledge network construction, the DARPA ECOLE MIRACLE team, the DARPA KAIROS RESIN team, and the DARPA DEFT Tinker Bell team. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She was an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and has served as Program Committee Co-Chair of many conferences, including NAACL-HLT2018 and AACL-IJCNLP2022. She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2023. Her research has been widely supported by U.S. government agencies (DARPA, NSF, DoE, ARL, IARPA, AFRL, DHS) and industry (Amazon, Google, Facebook, Bosch, IBM, Disney).

Friday, October 6, 2023, 11:00 AM
GDC 6.302 | Recording

High-Speed Off-Road Autonomy

Byron Boots [homepage]
Amazon Professor of Machine Learning, University of Washington

Abstract:
State-of-the-art self-driving technology takes advantage of the fact that a vehicle's interactions with the road are engineered to be simple, repeatable, and therefore predictable. By contrast, natural terrain lacks man-made structure and contains features rarely encountered in on-road driving, including vegetation, uneven and low-friction surfaces, reduced visibility, obstacles, and rapidly changing terrain surface properties. When navigating at high speed in these conditions, existing approaches to perception, planning, and control fail. In this talk I'll present some of our recent work on high-speed off-road autonomous driving. I'll discuss advances in perception, planning, and control for complex natural and degraded terrain, and demonstrate the resulting capabilities as we race full-sized vehicles across multiple biomes, including deserts, hills, and forests.

About the speaker:
Byron Boots is the Amazon Professor of Machine Learning in the Allen School of Computer Science and Engineering at the University of Washington, where he directs the Robot Learning Laboratory. He is also co-founder and CEO of Overland AI, a Seattle-based startup building off-road ground vehicle autonomy. Byron's research is in machine learning, artificial intelligence, and robotics, with a focus on developing theory and systems that tightly integrate perception, learning, and control. He has published over 150 technical papers and has been honored with several awards for his work, including "Best Paper" awards at ICML, AISTATS, RSS, IJRR, and RAL, the DARPA Young Faculty Award, the NSF CAREER Award, and the Robotics: Science and Systems Early Career Award. Byron received his PhD from the Machine Learning Department at Carnegie Mellon University.

Friday, October 13, 2023, 11:00 AM
GDC 6.302 | Recording

Unlocking Lifelong Robot Learning with Modularity

Jorge Mendez-Mendez [homepage]
Postdoctoral Fellow, Massachusetts Institute of Technology

Abstract:
Embodied intelligence is the ultimate lifelong learning problem. If you had a robot in your home, you would likely ask it to do all sorts of varied chores, like setting the table for dinner, preparing lunch, and doing a load of laundry. The things you would ask it to do might also change over time, for example to use new appliances. You would want your robot to learn to do your chores and adapt to any changes quickly. In this talk, I will explain how we can leverage various forms of modularity that arise in robot systems to develop powerful lifelong learning mechanisms. My talk will then dive into two algorithms that exploit these notions. The first approach operates in a pure reinforcement learning setting using modular neural networks. In this context, I will also introduce a new benchmark domain designed to assess the compositional capabilities of reinforcement learning methods for robots. The second method operates in a novel, more structured framework for task and motion planning systems. I will close my talk by describing a vision for how we can construct the next generation of home assistant robots that leverage large-scale data to continually improve their own capabilities.
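As a rough picture of the modular idea described above (a generic sketch, not the speaker's method), the snippet below shows a shared library of neural modules from which each task composes its own policy, so new tasks can reuse previously learned modules; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """A shared library of modules; each task composes a fixed subset of them.
    A lifelong learner can add modules for new tasks while reusing old ones."""
    def __init__(self, obs_dim, act_dim, num_modules=4, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.library = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(num_modules)]
        )
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, module_ids):
        h = torch.relu(self.encoder(obs))
        for i in module_ids:          # task-specific composition of shared modules
            h = self.library[i](h)
        return self.head(h)

# Two tasks that share module 0 but differ in their second module.
policy = ModularPolicy(obs_dim=10, act_dim=4)
obs = torch.randn(5, 10)
print(policy(obs, module_ids=[0, 1]).shape)  # task A
print(policy(obs, module_ids=[0, 2]).shape)  # task B
```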

About the speaker:
Jorge Mendez-Mendez is a postdoctoral fellow at MIT CSAIL. He received his Ph.D. (2022) and M.S.E. (2018) from the GRASP Lab at the University of Pennsylvania, and his Bachelor's degree (2016) in Electronics Engineering from Universidad Simon Bolivar in Venezuela. His research focuses on creating versatile, intelligent, embodied agents that accumulate knowledge over their lifetimes, leveraging techniques from transfer and multitask learning, modularity and compositionality, reinforcement learning, and task and motion planning. His work has been recognized with an MIT-IBM Distinguished Postdoctoral Fellowship, a third-place prize in the Two Sigma Ph.D. Diversity Fellowship, and a Best Paper Award at the Lifelong Machine Learning Workshop (ICML).

Friday, October 27, 2023, 11:00 AM
GDC 6.302 | Recording

Intelligence Augmentation: Effective Human-AI Interaction to Supercharge Scientific Research

Daniel Weld [homepage]
Chief Scientist & General Manager of Semantic Scholar, Allen Institute for AI | Professor Emeritus at the University of Washington

Abstract:
Recent advances in Artificial Intelligence are powering revolutionary interactive tools that will transform the capabilities of everyone, especially knowledge workers. But in order to create synergy, where humans’ augmented intelligence and creativity reaches its true potential, we need improved interaction methods. AI presents several challenges to existing UI paradigms, including nondeterminism, inexplicable behavior, and significant errors (such as LLM hallucinations). We discuss principles and pitfalls for effective human-AI interaction, grounding our discussion in Semantic Scholar - a free, open, AI-powered scientific discovery platform aimed at augmenting the intelligence of human researchers.

About the speaker:
Daniel S. Weld is Chief Scientist and General Manager of Semantic Scholar at the Allen Institute for Artificial Intelligence and Professor Emeritus at the University of Washington. After formative education at Phillips Academy, he received bachelor's degrees in both Computer Science and Biochemistry from Yale University in 1982. He received a Ph.D. from the MIT Artificial Intelligence Lab in 1988, a Presidential Young Investigator's award in 1989, and an Office of Naval Research Young Investigator's award in 1990. He is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), the American Association for the Advancement of Science (AAAS), and the Association for Computing Machinery (ACM). Dan was a founding editor of the Journal of AI Research, an area editor for the Journal of the ACM, and a member of the editorial board of the Artificial Intelligence journal. Weld is a Venture Partner at the Madrona Venture Group and has co-founded three companies: Netbot (sold to Excite), Adrelevance (sold to Media Metrix), and Nimble Technology (sold to Actuate).

Friday, November 3, 2023, 11:00 AM
GDC 6.302 | Recording

Replicating and Auditing Black-box Language Models

Tatsu Hashimoto [homepage]
Assistant Professor, Stanford University

Abstract:
Instruction-following language models have driven remarkable progress in a range of NLP tasks and have been rapidly adopted across the world. However, academic research into these models has lagged behind due to the lack of open, reproducible, and low-cost environments with which to develop and test instruction-following models. In this talk, I will discuss how new, emerging approaches that study an LLM's ability to emulate human annotators and API endpoints hold promise in improving and critiquing LLMs.

To improve instruction-following methods, recent work from our group such as AlpacaFarm shows how an LLM-based simulator can help test scientific hypotheses (e.g., is reinforcement learning helpful?), develop better instruction-following methods, and red-team LLMs in a more open and reproducible way. At the same time, there are major limits to LLMs' ability to simulate annotators, such as in the opinions they reflect or the consistency of their responses, and we will discuss how these gaps raise important open problems in the trustworthiness of existing LLMs.

About the speaker:
Tatsunori Hashimoto is an Assistant Professor in the Computer Science Department at Stanford University. He is a member of the statistical machine learning and natural language processing groups at Stanford, and his research uses tools from statistics to make machine learning systems more robust and trustworthy — especially in complex systems such as large language models. He is a Kavli fellow, a Sony and Amazon research award winner, and his work has been recognized with best paper awards at ICML and CHI.

Friday, December 1, 2023, 11:00 AM
GDC 6.302 | Recording

From Sparse to Dense, and back to Sparse again?

Fuxin Li [homepage]
Associate Professor, Oregon State University

Abstract:
Computer vision architectures used to be built on a sparse sample of points in the 80s and 90s. In the 2000s, dense models became popular for visual recognition because heuristically defined sparse models do not cover all the important parts of an image. However, with deep learning and end-to-end training, this no longer has to be the case, and sparse models may still have significant advantages in saving unnecessary computation as well as being more flexible. In this talk, I will describe the deep point cloud convolutional backbones that we have developed in the past few years, including our most recent work, PointConvFormer, which outperforms grid-based convolutional approaches. As applications of these point-based networks, I will discuss two recent works, including AutoFocusFormer, which uses point cloud backbones and decoders for 2D image recognition, with a novel adaptive downsampling module that enables end-to-end learning of adaptive downsampling. This is very helpful for detecting tiny objects far away in the scene that would otherwise be decimated by conventional grid downsampling. Finally, I will illustrate the use of point convolution backbones in generative models with recent work on diverse point cloud completion.
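For readers new to point-based backbones, the sketch below shows a generic point convolution in the spirit of this line of work: a small MLP maps each neighbor's relative coordinates to aggregation weights, which are applied to that neighbor's features. It is a simplified illustration, not the PointConvFormer architecture; the shapes and the naive k-NN step are assumptions.

```python
import torch
import torch.nn as nn

class SimplePointConv(nn.Module):
    """Generic point convolution: an MLP maps the relative coordinates of each
    neighbor to aggregation weights, which are applied to neighbor features."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.weight_mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, in_ch))
        self.project = nn.Linear(in_ch, out_ch)

    def forward(self, xyz, feats, neighbor_idx):
        # xyz: (N, 3) point coordinates, feats: (N, C), neighbor_idx: (N, K)
        nbr_xyz = xyz[neighbor_idx]                      # (N, K, 3)
        nbr_feats = feats[neighbor_idx]                  # (N, K, C)
        rel = nbr_xyz - xyz.unsqueeze(1)                 # relative neighbor coordinates
        weights = self.weight_mlp(rel)                   # (N, K, C) data-dependent weights
        aggregated = (weights * nbr_feats).sum(dim=1)    # weighted sum over neighbors
        return self.project(aggregated)                  # (N, out_ch)

# Toy usage with random points and a naive k-NN (for illustration only).
N, K = 256, 16
xyz = torch.randn(N, 3)
feats = torch.randn(N, 8)
neighbor_idx = torch.cdist(xyz, xyz).topk(K, largest=False).indices  # (N, K)
out = SimplePointConv(8, 32)(xyz, feats, neighbor_idx)
print(out.shape)  # torch.Size([256, 32])
```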

About the speaker:
Fuxin Li is currently an associate professor in the School of Electrical Engineering and Computer Science at Oregon State University. He has held research positions at Apple Inc., the University of Bonn, and the Georgia Institute of Technology. He obtained his Ph.D. from the Institute of Automation, Chinese Academy of Sciences, in 2009. He has won an NSF CAREER award and an Amazon Research Award, (co-)won the PASCAL VOC semantic segmentation challenges from 2009 to 2012, and led a team to a fourth-place finish in the DAVIS Video Segmentation challenge 2017. He has published more than 70 papers in computer vision, machine learning, and natural language processing. His main research interests are point cloud deep networks, human understanding of deep learning, video object segmentation, multi-target tracking, and uncertainty estimation in deep learning.

Friday, February 2, 2024, 11:00 AM
GDC 6.302 | Recording

Biosignal-based Digital Biomarkers for Aging

Najim Dehak [homepage]
Associate Professor, Johns Hopkins University

Abstract:
According to the US Census Bureau, there are currently more Americans aged 65 and older (over 49 million) than at any other time in history. A significant increase in individuals with severe chronic conditions will have profound social and economic effects on society. Three aspects describe the human aging process: functional (the motor system), cognitive, and behavioral (social and psychological stressors). In this talk, we will describe several tools to detect, assess, and monitor the functional and cognitive decline of elderly adults. These tools, called digital biomarkers, are based on multimodal biosignals such as speech, handwriting, and eye movement. In addition, we will describe our current work on emotion recognition from speech, which can be used to assess social and psychological stressors.

About the speaker:
An expert in machine learning and speech processing/speaker identification, Prof. Najim Dehak is internationally known as the lead developer of the i-vector, a factor analysis-based speaker recognition technique. His research focuses on speech processing and modeling, audio segmentation, and speaker, language, and emotion recognition. One of his interests has been building robust emotion detection systems that can be useful in several areas, including call centers, mental health, and social applications. He is also currently interested in topics related to human aging, for which he and his team are developing non-invasive, artificial intelligence-based tools to detect, assess, and monitor the functional and cognitive decline of elderly adults. Dr. Dehak is an associate professor of electrical and computer engineering at Johns Hopkins University. Prior to joining JHU, he was a research scientist in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory.

Friday, March 22, 2024, 11:00 AM
GDC 4.304 | Recording

Co-hosted with UT Good Systems

Conceptualising Trust – and what it means for Artificial Intelligence

Joel Fischer [homepage]
University of Nottingham

Abstract:
The trustworthy development and use of AI is no longer optional; it is now mandated by governments around the world, including in the US and the UK. But what does it mean to ‘trust AI’, and how do we build systems that are worthy of our trust? In this talk I will go back to basics and unpack the sociological concept of trust. On the basis of an understanding of trust as a precondition for action, I then present case studies of AI adoption. Our studies of contact-tracing apps and robotic disinfection during the pandemic in the UK, and of the social media discourse surrounding the launch of ChatGPT, highlight the contextual role trust plays in the (non-)adoption of AI. I will close with some considerations on the design of trustworthy systems.

About the speaker:
Dr. Joel Fischer is Professor of Human-Computer Interaction (HCI) at the School of Computer Science, University of Nottingham, UK, and Research Director of the UKRI Trustworthy Autonomous Systems (TAS) Hub and of Responsible AI UK (RAI UK), a new national programme on Responsible and Trustworthy AI in the UK. He is a Visiting Professor at the University of Texas at Austin and co-leads the Strategic Partnership project between Good Systems at UT Austin and the TAS Hub in the UK. His research in Human-AI Interaction takes a human-centred view to understand the adoption and embedding of AI-infused technologies into everyday life and work. He has a particular interest in language- and speech-based interaction and is known for his work on the empirical study of conversational interfaces. The technologies he studies include collaborative robotics and mobile, IoT, and web-based applications, in a diverse range of applications and settings, from digital contact-tracing to robotic telepresence, home life, disaster response, and Large Language Models (LLMs) in legal advice. He has published more than 120 articles in journals and conferences across AI and HCI, including CHI, ToCHI, IJHCI, IJCAI, JAIR, AAMAS, IMWUT, CSCW, JMIR, AI and Ethics, CUI, RO-MAN and HRI.

Monday, April 8, 2024, 11:00 AM
GDC 4.304 | Recording

We are (still!) not giving Data enough credit!

Alyosha Efros [homepage]
University of California, Berkeley

Abstract:
For most of Computer Vision's existence, the focus has been solidly on algorithms and models, with data treated largely as an afterthought. Only recently did our discipline begin to appreciate the singularly crucial role played by data. In this talk, I will begin with some historical examples illustrating the importance of large visual data in both computer vision and human visual perception. I will then share some of our recent work demonstrating the power of very simple algorithms when used with the right data. Recent results in visual in-context learning and visual data attribution will be presented.

About the speaker:
Alexei (Alyosha) Efros joined UC Berkeley in 2013. Prior to that, he spent a decade on the faculty of Carnegie Mellon University, and has also been affiliated with École Normale Supérieure/INRIA and the University of Oxford. His research is in the area of computer vision and computer graphics, especially at the intersection of the two. He is particularly interested in using data-driven techniques to tackle problems where large quantities of unlabeled visual data are readily available. Efros received his PhD in 2003 from UC Berkeley. He is a recipient of the CVPR Best Paper Award (2006), a Sloan Fellowship (2008), a Guggenheim Fellowship (2008), an Okawa Grant (2008), the SIGGRAPH Significant New Researcher Award (2010), three PAMI Helmholtz Test-of-Time Prizes (1999, 2003, 2005), the ACM Prize in Computing (2016), the Diane McEntyre Award for Excellence in Teaching Computer Science (2019), the Jim and Donna Gray Award for Excellence in Undergraduate Teaching of Computer Science (2023), and the PAMI Thomas S. Huang Memorial Prize (2023).

Tuesday, April 9, 2024, 2:00 PM
GDC 6.302 | Recording

What Do Pre-Trained Speech Representation Models Know?

Karen Livescu [homepage]
Toyota Technological Institute at Chicago

Abstract:
Pre-trained speech representation models have become ubiquitous in speech processing over the past few years. They have both improved the state of the art and made it feasible to learn task-specific models with very little labeled data. However, it is not well understood what linguistic information is encoded in pre-trained models, where in the models it is encoded, and how best to apply this information to downstream tasks. In this talk I will describe recent work that begins to build an understanding of pre-trained speech models, through both layer-wise analysis and benchmarking on tasks. We consider a number of popular pre-trained models and investigate the extent to which they encode spectral, phonetic, and word-level information. The results of these analyses also suggest some ways to improve or simplify the application of pre-trained models for downstream tasks. Finally, I will describe our efforts to benchmark model performance on a variety of spoken language understanding tasks, in order to broaden our understanding of the semantic capabilities of speech models.
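One common form of the layer-wise analysis mentioned above is to train a small probe on frozen features from each layer and compare held-out accuracies. The sketch below is a generic illustration, not the speaker's protocol; `extract_layer_features` is a hypothetical placeholder for pooling a layer's hidden states per utterance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen features and return held-out accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return probe.score(x_te, y_te)

# Hypothetical loop over layers of a pre-trained speech model:
# `extract_layer_features(layer)` stands in for pooling that layer's hidden
# states over each utterance; `labels` are e.g. phone or word classes.
# for layer in range(num_layers):
#     feats, labels = extract_layer_features(layer)
#     print(f"layer {layer}: probe accuracy = {probe_layer(feats, labels):.3f}")
```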

About the speaker:
Karen Livescu is a Professor at TTI-Chicago. This year she is on sabbatical, splitting her time between the Stanford NLP group and the CMU Language Technologies Institute. She completed her PhD at MIT in 2005. She is an ISCA Fellow and a recent IEEE Distinguished Lecturer. She has served as a program chair/co-chair for ICLR, Interspeech, and ASRU, and is an Associate Editor for TACL and IEEE T-PAMI. Her group's work spans a variety of topics in spoken, written, and signed language processing, with a particular interest in representation learning, cross-modality learning, and low-resource settings.

Friday, April 26, 2024, 11:00 AM
GDC 6.302 | Recording

Towards Generalizable Motion-Level Intelligence with Foundation Models

Fei Xia [homepage]
Senior Research Scientist, Google DeepMind

Abstract:
This talk introduces a few novel approaches to motion-level embodied intelligence through integrating Vision Language Models with Robotics. We present a series of works that explore leveraging the capabilities of large vision-language models (VLMs) and language models (LLMs) for robotic control and spatial reasoning tasks. First, we investigate whether we can fundamentally improve VLMs' spatial reasoning capabilities by enriching the data. Second, we look into finding a more efficient interface between robotics control and VLM inference. We propose Prompting with Iterative Visual Optimization (PIVOT), a novel visual prompting approach for VLMs that enables zero-shot control of robotic systems, navigation in various environments, and other spatial reasoning capabilities without requiring task-specific fine-tuning. PIVOT casts tasks as iterative visual question answering, where the VLM selects the best proposals (e.g., robot actions, localizations, or trajectories) annotated on the image, which are then iteratively refined to converge on the optimal answer. Finally, when the VLMs don't work well out of the box, we look into improving their teachability. We investigate fine-tuning robot code-writing LLMs to enhance their teachability and adaptability to human inputs in long-term interactions. We introduce Language Model Predictive Control (LMPC), a framework that formulates human-robot interactions as a partially observable Markov decision process and trains the LLM to complete previous interactions, effectively learning a transition dynamics model. LMPC is combined with model predictive control to discover shorter paths to success, improving non-expert teaching success rates and reducing the average number of human corrections required on a wide range of robot tasks and embodiments. Together, these works highlight the potential and limitations of leveraging foundation models for robotic and spatial reasoning domains, demonstrating promising approaches for generalizable motion-level intelligence.
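The iterative loop behind PIVOT can be conveyed with a short sketch, reconstructed only from the description above rather than from the actual implementation; `annotate_image` and `vlm_select_best` are hypothetical placeholders for drawing candidate proposals on the image and querying the VLM.

```python
import numpy as np

def pivot_optimize(image, vlm_select_best, annotate_image,
                   action_dim=2, num_candidates=16, num_iters=3, top_k=4):
    """Iterative visual optimization: sample candidate actions, let a VLM choose
    the best ones, then refit the proposal distribution around the selections."""
    rng = np.random.default_rng(0)
    mean, std = np.zeros(action_dim), np.ones(action_dim)

    for _ in range(num_iters):
        candidates = rng.normal(mean, std, size=(num_candidates, action_dim))
        annotated = annotate_image(image, candidates)        # draw numbered candidates on the image
        chosen = vlm_select_best(annotated, top_k=top_k)     # indices of VLM-preferred candidates
        selected = candidates[chosen]
        mean, std = selected.mean(axis=0), selected.std(axis=0) + 1e-3  # refit and shrink
    return mean  # converged action (e.g., a 2-D image location or waypoint)
```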

About the speaker:
Fei Xia is a Senior Research Scientist at Google DeepMind, where he works on the Robotics team. He received his PhD degree from the Department of Electrical Engineering at Stanford University, co-advised by Silvio Savarese in SVL and Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments. His research has been awarded a CoRL Special Innovation award and an ICRA Outstanding Robot Learning Paper Award, and has been featured in popular media outlets such as The New York Times, Reuters, and WIRED. Most recently, he has been exploring the use of foundation models for spatial reasoning and low-level control in robotics.