Authors:
(1) An Yan, UC San Diego, ayan@ucsd.edu;
(2) Zhengyuan Yang, Microsoft Corporation, zhengyang@microsoft.com with equal contributions;
(3) Wanrong Zhu, UC Santa Barbara, wanrongzhu@ucsb.edu;
(4) Kevin Lin, Microsoft Corporation, keli@microsoft.com;
(5) Linjie Li, Microsoft Corporation, lindsey.li@microsoft.com;
(6) Jianfeng Wang, Microsoft Corporation, jianfw@microsoft.com;
(7) Jianwei Yang, Microsoft Corporation, jianwei.yang@microsoft.com;
(8) Yiwu Zhong, University of Wisconsin-Madison, yzhong52@wisc.edu;
(9) Julian McAuley, UC San Diego, jmcauley@ucsd.edu;
(10) Jianfeng Gao, Microsoft Corporation, jfgao@microsoft.com;
(11) Zicheng Liu, Microsoft Corporation, zliu@microsoft.com;
(12) Lijuan Wang, Microsoft Corporation, lijuanw@microsoft.com.
Editor’s note: This is part 1 of a paper evaluating the use of generative AI to navigate smartphones. You can read the rest of the paper via the table of links below.
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.
Building autonomous agents capable of interacting with computing devices and following human commands has been a long-standing topic in the machine learning community (Bolt, 1980; Lieberman et al., 1995). Since the advent of smartphones, there has been a practical demand for creating virtual assistants, like Siri, Cortana, and Google Assistant, which have the potential to significantly enhance user experience and assist individuals who are physically or situationally impaired. Ideally, these assistants would competently carry out everyday tasks based on natural language instructions, ranging from simple actions like setting a timer to more complex tasks such as locating the ideal hotel for a family vacation.
Recent studies have started to explore mobile device control and smartphone task automation following human instructions (Rawles et al., 2023; Wen et al., 2023; Zhan and Zhang, 2023; Wang et al., 2023). Representative approaches include describing screen images with text and processing the converted text with large language models (LLMs) (Rawles et al., 2023; Wen et al., 2023), or training a vision-language model to generate actions in a supervised manner (Rawles et al., 2023; Zhan and Zhang, 2023). However, these supervised models, when trained on specific types of screens and instructions (Rawles et al., 2023), exhibit limited effectiveness in generalizing to real-world scenarios. On the other hand, the LLM-based approaches generalize better, but the intermediate step of converting screen images to text results in information loss and consequently hurts performance. Inspired by the efficacy and broad applicability of recent large multimodal models (LMMs), we explore utilizing an LMM, GPT-4V (OpenAI, 2023a,b,c; gpt, 2023; Yang et al., 2023c), for zero-shot smartphone GUI navigation, aiming to set a new strong baseline for this intriguing task.
We identify two primary challenges for GUI navigation with LMMs, namely intended action description and localized action execution. First, the model should understand the screen image and text instruction input, and reason over the query to determine the appropriate action to take, such as providing a natural language description “clicking the Amazon icon in the third row and fourth column.” Second, the model should convert such high-level understanding into a formatted action that can be easily executed based on rules, such as “{Action: Click, Location: (0.31, 0.57)}.” In our approach, we prompt GPT-4V with an image and text for action planning, and place set-of-mark tags (Yang et al., 2023b) to anchor the generated outputs. Specifically, we associate these marks with spatial locations with the help of segmentation or OCR models. With this design, our proposed GPT-4V-based system, namely MM-Navigator, can generate executable actions conditioned on the screen image, the text instruction, and its interaction history.
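To make the second step concrete, the following is a minimal sketch of how a mark chosen by the model might be turned into a formatted action. It is an illustration under stated assumptions, not the authors' implementation: the `Mark` class, the `mark_to_action` function, the pixel bounding boxes, and the reading of the location as an (x, y) pair normalized by screen width and height are all hypothetical. In practice, the boxes would come from a segmentation or OCR model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Mark:
    """A numbered tag overlaid on one detected UI element (hypothetical structure)."""
    mark_id: int
    # Bounding box in pixels: (left, top, right, bottom).
    box: Tuple[int, int, int, int]


def mark_to_action(mark: Mark, screen_w: int, screen_h: int, action: str = "Click") -> dict:
    """Convert a chosen mark into a formatted action.

    Assumption: the location is the box center, expressed as (x, y)
    normalized to [0, 1] by the screen width and height.
    """
    left, top, right, bottom = mark.box
    cx = (left + right) / 2 / screen_w
    cy = (top + bottom) / 2 / screen_h
    return {"Action": action, "Location": (round(cx, 2), round(cy, 2))}


# Toy marks standing in for the output of a segmentation or OCR model.
marks: Dict[int, Mark] = {
    1: Mark(1, (0, 0, 200, 200)),
    2: Mark(2, (300, 1000, 500, 1200)),
}

# Suppose the model's answer refers to mark 2 on a 1000 x 2000 pixel screen.
chosen_id = 2
action = mark_to_action(marks[chosen_id], screen_w=1000, screen_h=2000)
print(action)  # {'Action': 'Click', 'Location': (0.4, 0.55)}
```

Because the model only has to name a mark, the conversion to coordinates is a deterministic lookup rather than something the model must estimate from pixels.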
We benchmark MM-Navigator on two datasets. We start with an iOS GUI navigation dataset with screenshots and user instructions that we manually collected. This clean analytic dataset is designed to probe insights for the two challenges in GUI navigation: intended action description and localized action execution. Human evaluations are used to assess GPT-4V on these two tasks, with accuracy rates of 91% and 75%, respectively. Additionally, we assess the model on a random subset from the recently released Android navigation benchmark (Rawles et al., 2023). We follow the proposed evaluation protocol in the benchmark, together with extra human evaluations. The strong performance demonstrates that MM-Navigator is an effective GUI navigator for smartphones, significantly outperforming previous LLM-based approaches. We provide in-depth analyses of the representative success and failure cases. We find that the current state of GPT-4V may already be effective in aiding humans in various real-world GUI navigation scenarios, as evidenced by the multi-screen results in Figure 4. However, continued enhancements are still essential to further increase the system’s reliability, as revealed in our analyses.
Our contributions are summarized as follows:
• We present MM-Navigator, an agent system built on GPT-4V for smartphone GUI navigation. MM-Navigator effectively incorporates action histories and set-of-mark tags to produce precise executable actions.
• We collect a new analytic dataset with diverse iOS screens and user instructions, which evaluates two main challenges in GUI navigation with LMMs: intended action description and localized action execution.
• We perform extensive evaluations, both automatic and human, on two datasets and provide detailed analyses. The impressive results demonstrate the effectiveness of MM-Navigator for GUI navigation.
This paper is available on arXiv under a CC BY 4.0 DEED license.