Exploring the Use of Virtual Assistants in a Restaurant

This proposal introduces an experimental method for studying recommendations and explores the use of virtual assistants in the context of a restaurant; it was awarded a distinction mark.

Introduction

“Hey Siri, find me a restaurant.” “OK, here is what I found,” replies Siri, presenting a search engine results page. This exchange is a common scene for those who use virtual assistants (VAs) in their daily lives. Defined as “an artificial agent that simulates a human assistant and prompts users to provide feedback by asking questions” (Ozeki, Maeda, Obata & Nakamura, 2009), VAs are expected to become the next generation of Human Computer Interaction (HCI). Vying to lead this new generation of HCI, numerous enterprises are developing virtual assistants, such as Google Now, Amazon’s Alexa, Apple’s Siri and Facebook M. These virtual assistants bring dialogue-style interaction to HCI and allow users to access numerous services, such as launching apps, making phone calls, reading messages and playing music (Lee & Choi, 2017; Rong, Fourney, Brewer, Morris & Bennett, 2017).

Amongst these functionalities, recommendation is an existing feature of a few virtual assistants. To date, a wide range of literature exists on the topic. Most studies focus on certain aspects of virtual assistants’ recommendations, such as self-disclosure and reciprocity (Lee & Choi, 2017). However, limited literature regards the virtual assistant’s recommendation as an independent factor and explores its effect on user performance (satisfaction, error, effort) in different contexts. Nor have these researchers studied user expectations when receiving recommendations from virtual assistants. Therefore, in this proposal, both a quantitative study and a qualitative study are proposed to address these two problems. The quantitative study looks at the effects of virtual assistants’ recommendations on user performance when users have different intentions, that is, whether users have a clear goal in mind. The qualitative study analyses the expectations that users have when selecting restaurants with virtual assistants. The quantitative study results can inform interaction design by suggesting when it is suitable for a virtual assistant to recommend more options to users. The qualitative study results can help refine current designs by considering user expectations that have not been met. This proposal focuses on a scene where people use virtual assistants to select restaurants. This avoids the influence of confounding factors in more complicated scenes such as booking a flight or requesting an Uber ride. Accordingly, the proposal will be structured around answering the following two research questions:

          1) Intention & Recommendation Structure: Whether there is an effect of the user intention and the recommendation structure on the user performance (satisfaction, effort, error) when selecting restaurants with virtual assistants.

          2) Expectation: What are users’ expectations when selecting restaurants with virtual assistants under different intentions?

Given the two research questions above, the study will employ a mixture of methods. The first research question is studied by designing a mixed factorial experiment, which will be followed by a mixed design Analysis of Variance (ANOVA). Regarding user expectations, a semi-structured interview is designed before employing grounded theory to investigate the totality of this context. Afterwards, the merits and demerits of these studies will be discussed.

Literature Review

A body of related literature exists on the topic of virtual assistants. Crestani and Du (2006) claim that spoken queries in mobile search are longer but more “natural”. This finding shows the potential of dialogue-style interactions. Furthermore, a wide range of research focuses on the use of virtual assistants, including the feature of recommendations. Smyth et al. (2004) studied how to rely on user preferences and needs to automatically generate compound critiques that can be presented to the user, whereas Matejka et al. (2009) researched the algorithms and succeeded in increasing the number of good recommendations by 2.1 times.

However, little work involves user intention and the recommendation effects on user performance. An exception is Kang et al. (2017), who investigated how users deploy natural language to ask for recommendations. In this study, the taxonomy was employed and the user goals were divided into three levels: navigational, informational and transactional. The follow-up queries were applied and surprisingly, the finding showed that the user employed a ‘critiquing style’ when refining the initial request. Additionally, the users were more likely to query with deep features when speaking as opposed to texting. This indicated the imperfection of current systems and the potential of the conversational assistant. Similarly, Chai et al. (2002) compared the menu-driven system with the natural language system, and found that the number of clicks and average time were reduced by 63.2 percent and 33.3 percent respectively – the efficiency and promise of the natural language system were demonstrated in the research.

Regarding user intention, Lee and Choi (2017) proposed the research question as “How does self-disclosure and reciprocity between a user and a conversational assistant affect user satisfaction and intention to use?”. They found that self-disclosure and reciprocity were paramount factors affecting user satisfaction. Furthermore, interactional enjoyment and the feeling of trust would make the recommendations more believable and persuasive. Although the above factors can lead to intention to use, the study does not examine users’ original intentions. Therefore, in this proposal, the effects of user goals and recommendation structures will be discussed. The method of follow-up queries used by Kang et al. (2017) can be applied in this proposal.

There are also several works relevant to the evaluation of satisfaction with virtual assistants. Kiseleva et al. (2016) divided structured search dialogues into two groups: single-task search dialogues and multi-task search dialogues. As defined by Kiseleva et al. (2016), a multi-task search dialogue consists of multiple interactions with the intelligent assistant that lead towards one final goal. This proposal focuses on the particular context of selecting restaurants, which is a multi-task search dialogue. Therefore, a set of analysis methods related to these types of interactions can be employed. Kiseleva et al. (2016) proposed an equation that evaluates user satisfaction:

Equation 1: SAT(T1, …, Tm) = h(I(T1, …, Tm))

This evaluation describes a function h that, given a set of interaction signals, predicts user satisfaction (SAT). T1, …, Tm denote the tasks in the multi-task search dialogue, while I(T1, …, Tm) describes the combination of a set of interaction signals across the tasks (Kiseleva et al., 2016). This method shows an improvement of 71% to 81% over the baseline in Kiseleva’s (2016) study. Additionally, follow-up queries with a five-point Likert scale were also employed in Kiseleva’s (2016) research. These two methods can also be used in this proposal due to their accuracy and acceptance. Jiang et al. (2015) modelled action patterns and assessed SAT by proposing the equation:

Equation 2: P(v|s,u) = (c(s,u,v) + α·Pml(v|u) + β·Pml(v)) / (Σvi∈U c(s,u,vi) + α + β)

However, this equation is complicated and involves numerous action factors. Therefore, only Equation 1 will be used in this proposal.
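As a concrete illustration of Equation 1, the sketch below predicts session satisfaction from combined interaction signals. The signal names, weights and the linear form of h are all hypothetical assumptions for illustration; in Kiseleva et al. (2016), h is a model learned from data.

```python
# Illustrative sketch of Equation 1: SAT(T1, ..., Tm) = h(I(T1, ..., Tm)).
# Signal names, weights and the linear form of h are hypothetical.

def combine_signals(tasks):
    """I(T1, ..., Tm): combine per-task interaction signals into one vector
    (here, simple averages across the m tasks of the dialogue)."""
    m = len(tasks)
    return [sum(t["num_queries"] for t in tasks) / m,
            sum(t["num_clicks"] for t in tasks) / m,
            sum(t["dwell_time"] for t in tasks) / m]

def h(signals, weights=(-0.2, 0.1, 0.05), bias=3.0):
    """A stand-in for the learned function h: a linear score clamped
    to the five-point satisfaction scale."""
    score = bias + sum(w * x for w, x in zip(weights, signals))
    return max(1.0, min(5.0, score))

def predict_sat(tasks):
    """SAT(T1, ..., Tm) = h(I(T1, ..., Tm))."""
    return h(combine_signals(tasks))
```

In practice the weights would be fitted against labelled satisfaction data rather than fixed by hand.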

Quantitative Study

Method

Participants

A total of 30 participants will be recruited through the Internet, including via emails and social networks. Participants should be recruited only when they meet two conditions: they should have a good command of English, and they should be familiar with using digital devices. In other words, those who have never used smartphones or cannot speak English will not be recruited. After the experiment, participants will receive a £10 gift card as compensation.

Design

A factorial experiment will be employed. The primary independent variables are user goal and recommendation structure. User goal is a within-subject factor. It denotes whether the user has a clear goal in mind. For example, “Find a Japanese restaurant around me” will be a vague goal, in contrast to a clear goal like, “Find a Japanese restaurant within 0.5 miles with an average price below £30”. As for the recommendation structure, it is a between-subject factor that refers to three different types of recommendation design. In terms of dependent variables, the five-point Likert scale results, completion time, completion rate, the number of queries and clicks, dwell time and Automatic Speech Recognition (ASR) rate will be collected during the experiment. These variables measure the error rate, effort, efficiency and throughput.
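The independent and dependent variables above could be captured in a per-trial record along the following lines. This is a minimal sketch; the field names are assumptions, not a fixed logging schema.

```python
from dataclasses import dataclass

# Hypothetical per-trial record for the variables listed above.
# Field names are illustrative assumptions only.

@dataclass
class TrialRecord:
    participant_id: int
    goal: str                  # "vague" or "clear" (within-subject factor)
    structure: str             # "a", "b" or "c" (between-subject factor)
    likert_satisfaction: int   # 1-5 scale response
    completion_time: float     # seconds
    completed: bool            # task completion
    num_queries: int
    num_clicks: int
    dwell_time: float          # seconds
    asr_rate: float            # fraction of utterances recognised correctly
```

One record per trial would give a tidy table for the later ANOVA, with the two factors stored alongside each measurement.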

Material

In order to prepare the particular task for this experiment, a virtual assistant prototype will be developed. The prototype should be able to use natural language to interact with users. There should also be a Graphical User Interface (GUI) that displays the list of results whenever users request it. The prototype has three different modes, explained by flowcharts (a), (b) and (c). The participants from the three groups will experience these modes respectively. Flowchart (a) only provides a list of restaurants from which users can select, whereas flowcharts (b) and (c) can recommend options to the user. The difference between flowcharts (b) and (c) is that flowchart (b) provides a recommendation when the user is dissatisfied, while flowchart (c) recommends after the user has selected a restaurant. In flowcharts (b) and (c), the recommended item should be better than the item the user selected in some respects (price, distance, etc.).

Figure 1. (a) User flow without recommendations, (b) User flow with recommendations after dissatisfaction, (c) User flow with recommendations after satisfaction
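The decision logic that distinguishes the three flowcharts can be summarised in a small sketch. The mode labels and boolean inputs below are assumptions for illustration, not part of the prototype specification.

```python
# Hypothetical sketch of the three prototype modes in flowcharts (a)-(c).

def should_recommend(mode, user_dissatisfied, user_selected):
    """Decide whether the assistant offers a recommendation:
    (a) never recommends;
    (b) recommends when the user is dissatisfied with the listed results;
    (c) recommends after the user has selected a restaurant."""
    if mode == "a":
        return False
    if mode == "b":
        return user_dissatisfied
    if mode == "c":
        return user_selected
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the modes behind one function would let the same prototype serve all three between-subject conditions.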

Procedure

There will be a two-minute introduction at the beginning of the experiment. This is to introduce the prototype’s basic functions and to inform participants how to use the multi-task search dialogue. After introducing the virtual assistant, a trial task around selecting restaurants will be conducted before the formal experiment. The trial task is aimed at allowing participants to become familiar with the virtual assistant. During the trial task, participants can choose from numerous example query questions that are presented. These questions range from generally asking for restaurants to asking for a particular distance or price. For example, “Find a Japanese restaurant around me” is a general question, but “Find a Japanese restaurant within 0.5 miles with an average price below £30” contains more detail.

Then the 30 participants will be divided randomly into three groups, each consisting of 10 people. These three groups will conduct the above three user flows separately. Each participant will conduct the experiment twice with different goals. The first time, participants are assigned the vague goal, “Find a Japanese restaurant around me”, and the following prompt will be presented.

          Imagine that, at this moment, you would like to have dinner in a Japanese restaurant, and that you are about to use a virtual assistant to select one Japanese restaurant, but you do not have very clear criteria for selection. Please use natural language to interact with the virtual assistant. The virtual assistant may give you many choices; if so, select the one that suits you best.

The second time, participants are assigned the clear goal, “Find a Japanese restaurant within 0.5 miles with an average price below £30”, and the following prompt will be presented.

          Imagine that, at this moment, you are about to use a virtual assistant to select one Japanese restaurant, but this time you have very clear criteria for selection: you would prefer one that is within 0.5 miles with an average price below £30. Please use natural language to interact with the virtual assistant. The virtual assistant may give you many choices; if so, select the one that suits you best.

During the experiment, the users’ steps, completion time, number of queries and clicks, dwell time and ASR rate will be collected. After completing the task, all participants will be asked to use the five-point Likert scale to answer the following questions:

          1) How satisfied are you with your experience in this task in general?

          2) How satisfied are you with your experience in each subtask?

          3) How satisfied are you with the final restaurant you selected?

          4) Did you put in a lot of effort to complete the task?

Participants in the groups with recommendations are required to answer an additional question:

          5) How satisfied are you with the recommendation?
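The random division of the 30 participants into three groups of 10, described at the start of this procedure, can be sketched as follows. The fixed seed is an assumption, included only so the split is reproducible.

```python
import random

# Sketch of the random group assignment: 30 participants into three
# groups of 10, one per recommendation structure (flowcharts a, b, c).

def assign_groups(participant_ids, n_groups=3, seed=42):
    ids = list(participant_ids)
    rng = random.Random(seed)   # fixed seed: assumption for reproducibility
    rng.shuffle(ids)
    size = len(ids) // n_groups
    return {chr(ord("a") + g): ids[g * size:(g + 1) * size]
            for g in range(n_groups)}
```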

Data Analysis Plan

The quantitative study deals with the research question about the effects of user goal and recommendation structure on user performance. Because this study is a factorial design, a mixed-design ANOVA will be employed to analyse the effects. Regarding user intention, this independent variable has two levels: the vague goal and the clear goal. In terms of the effects of the recommendations of virtual assistants, attention should be paid to three dependent factors: effort, error rate and satisfaction.

Initially, data involving severe failures of automatic speech recognition should be excluded, as should trials in which users have queried three times without completing the goal. This is to decrease the influence of errors on user satisfaction. Secondly, several measurement methods will be utilised for the main factors. Effort can be measured by the number of steps and the completion time, as well as by the five-point Likert scale response to the question “Did you put in a lot of effort to complete the task?”. Error can be measured by the ASR rate and the completion rate. The degree of satisfaction can be measured by the five-point Likert scale responses to the satisfaction questions. Additionally, Equation 1, proposed by Kiseleva et al. (2016), can also measure satisfaction. Therefore, there will be two ways of evaluating satisfaction: the five-point Likert scale and Equation 1. The reason for using two methods is to improve accuracy: what participants answer may not be consistent with what they really think, and comparing the two data sets can reveal such inconsistency.
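The exclusion step above can be sketched as a simple filter over trial records. The field names and the ASR threshold used to define a "severe" failure are assumptions for illustration.

```python
# Sketch of the exclusion step: drop trials with severe ASR failure, or
# trials where the user queried three times without completing the goal.

ASR_FAILURE_THRESHOLD = 0.5  # hypothetical cut-off for "severe" ASR failure

def keep_trial(trial):
    if trial["asr_rate"] < ASR_FAILURE_THRESHOLD:
        return False                      # severe speech-recognition failure
    if trial["num_queries"] >= 3 and not trial["completed"]:
        return False                      # three queries, goal not completed
    return True

def filter_trials(trials):
    return [t for t in trials if keep_trial(t)]
```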

Furthermore, a 2×3 mixed-design ANOVA on the participants’ satisfaction scores will be run to investigate the effects of user intention and recommendation structure on user satisfaction. By comparing the p-value with 0.05, the following questions can be answered:

          1) Whether there is a significant main effect of the user intention on user satisfaction when using virtual assistants in the context of restaurant selection

          2) Whether there is a significant main effect of the recommendation structure on user satisfaction when using virtual assistants in the context of restaurant selection

          3) Whether there is a significant interaction effect of user intention × recommendation structure on user satisfaction when using virtual assistants in the context of restaurant selection

As the effort the user puts in can be one of the attributes of user satisfaction, a similar 2×3 mixed-design ANOVA on the participants’ effort scores can be run to investigate the effects of user intention and recommendation structure on user workload. Accordingly, three similar questions can be answered. Regarding the error rate, a similar analysis can be applied.
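To make the analysis concrete, the sketch below decomposes a balanced 2×3 data set into the sums of squares behind the three tested effects (goal, structure, and their interaction). It is a simplified illustration only: a real mixed-design ANOVA also needs the correct between- and within-subject error terms, which a statistics package would provide.

```python
# Illustrative sums-of-squares decomposition for a balanced 2x3 design.
# data[goal][structure] is the list of scores for that cell; every cell
# is assumed to have the same sample size (balanced design).

def two_way_ss(data):
    goals = list(data)
    structures = list(data[goals[0]])
    cells = {(g, s): data[g][s] for g in goals for s in structures}
    n = len(next(iter(cells.values())))          # per-cell sample size
    all_scores = [x for cell in cells.values() for x in cell]
    grand = sum(all_scores) / len(all_scores)    # grand mean

    mean = lambda xs: sum(xs) / len(xs)
    g_mean = {g: mean([x for s in structures for x in cells[(g, s)]])
              for g in goals}
    s_mean = {s: mean([x for g in goals for x in cells[(g, s)]])
              for s in structures}
    c_mean = {k: mean(v) for k, v in cells.items()}

    ss_goal = n * len(structures) * sum((g_mean[g] - grand) ** 2 for g in goals)
    ss_struct = n * len(goals) * sum((s_mean[s] - grand) ** 2 for s in structures)
    ss_inter = n * sum((c_mean[(g, s)] - g_mean[g] - s_mean[s] + grand) ** 2
                       for g in goals for s in structures)
    return ss_goal, ss_struct, ss_inter
```

Dividing each sum of squares by its degrees of freedom and the appropriate error term would yield the F statistics behind the three p-values compared with 0.05.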

Additionally, as there are two sets of user satisfaction scores, calculated by Equation 1 and the five-point Likert scale respectively, the results of the two evaluation methods should be compared. If they differ to a large extent, the procedure and analysis will require in-depth investigation, which can enhance accuracy.
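One straightforward way to carry out this comparison is to correlate the two satisfaction scores per participant; a low correlation would signal the kind of large discrepancy that warrants in-depth investigation. A minimal sketch:

```python
# Pearson correlation between per-participant Likert satisfaction scores
# and Equation-1 predictions, as a simple consistency check.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A rank correlation (Spearman) would also be defensible here, since the Likert responses are ordinal rather than interval data.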

Qualitative Study

Method

Participants

Ten participants will take part in the study; they will be a subset of the participants from the quantitative study. Amongst them, five will be participants who interacted with the system that recommended before selection, whereas the other five will be those who interacted with the system that recommended after selection.

Procedure

The participants will be asked to attend an interview after the quantitative experiment. During the quantitative experiment, the researchers should take notes when observing the participants who are about to be interviewed. These notes are aimed at providing insight that allows researchers to ask relevant and valuable questions during the interview. The interview will be semi-structured; this ensures a certain degree of freedom while keeping the focus on the research question. The questions can be adjusted based on what the researchers have observed. The research question of the qualitative study is “What are users’ expectations when selecting restaurants with virtual assistants under different intentions (clear goal or vague goal)?” Therefore, around the topic of “expectation”, the interview will include the following eight main questions, which can also be adapted to the situation:

          1) Did you complete the task?

          2) How much effort did you put into the task?

          3) How satisfied are you with your final selection?

          4) Did you have any obstacles when doing the task?

              4.1) If yes, what obstacles did you have during the task?

          5) Are you satisfied with the recommendation?

              5.1) If yes, did you have any expectations when using it?

              5.2) If no, why are you dissatisfied with it, and do you have any suggestions?

          6) Do you have expectations when you use the virtual assistant?

          7) Is there any difference between the recommendation and your expectation?

          8) Do you have any comments about the virtual assistants?

The interview will be audio recorded and transcribed later. The researchers are also required to take notes during the interview if an answer is regarded as important. If an interviewee’s answer is insightful, the researcher should ask for an explanation, especially for the questions about attitudes towards the recommendation. Some participants may select the restaurant that is consistent with the recommendation; some may not. Therefore, the interview questions should be adjusted to discover the deeper reasons why they did or did not follow the recommendation. For example, a user may be dissatisfied with the recommendation because they feel offended or interrupted. The interview will be limited to 60 minutes. Participants will receive a £10 gift card at the end of the interview.

Data Analysis Plan

Grounded theory will be employed to analyse the data, through a three-stage analysis. Open Coding is the first stage, in which the audio will be transcribed. In the transcript, the important small chunks of text should be coded, and keywords and quotes should be marked. More specifically, more attention should be paid to words about satisfaction, expectation, obstacles and suggestions, especially words that describe the gap between what users expect and how the virtual assistant actually responds.

The following stage is Axial Coding, in which correlations amongst those keywords are expected to be found. There will be relationships between satisfaction and obstacles, as obstacles can account for dissatisfaction. Similarly, expectation can be related to satisfaction and obstacles, as satisfaction and obstacles reveal the demerits of, and possible improvements to, the current virtual assistant design.

The final stage is Selective Coding, in which the most frequent keywords or categories should be attended to, because these frequent concepts carry more weight than other factors. By investigating these frequent factors, a link that can hold all categories together is expected to be found, from which the common expectations that users have when asking virtual assistants for recommendations can be concluded.
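A first frequency pass over the coded transcripts could be sketched as follows. The keyword list and the sample transcripts are hypothetical; real coding would of course involve researcher judgement, not just string matching.

```python
from collections import Counter

# Sketch of the selective-coding frequency pass: count how often each
# open-coding keyword family appears across interview transcripts.
# The code list is a hypothetical example.

CODES = ["satisfaction", "expectation", "obstacle", "suggestion"]

def code_frequencies(transcripts):
    counts = Counter()
    for text in transcripts:
        tokens = text.lower().split()
        for code in CODES:
            # prefix match so plurals and punctuation-attached forms count
            counts[code] += sum(1 for t in tokens if t.startswith(code))
    return counts
```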

Discussion

The results of both the quantitative study and the qualitative study will have implications for designing the interaction of virtual assistants. As for the quantitative study, if there is a significant main effect of user intention on user satisfaction, it can be concluded that designers should apply different strategies to deal with different user goals. Additionally, the results can indicate the suitable recommendation structure in different contexts. If there is a significant main effect of the recommendation structure on user satisfaction, designers should adopt the recommendation structure with better user performance: higher satisfaction, lower error rate and less effort. Furthermore, the interaction effect of user intention × recommendation structure suggests the suitable recommendation form for virtual assistants when users have different goals.

One merit of the quantitative study is the factorial design. This design is more efficient than one-factor-at-a-time experiments and can reveal interaction effects. Additionally, the within-subject design for user goal decreases the number of trials, whereas the between-subjects design for recommendation structure decreases “carryover effects”, which means that “user participation in one condition may affect performance in other conditions” (Ludlow & Gutierrez, 2014). In terms of the demerits, although accuracy can be improved by employing two evaluation methods (Equation 1 and the five-point Likert scale), such practice takes plenty of time. Additionally, the requirements on the prototype are high: in order to allow users to experience the three types of recommendation structures, the development of the prototype will probably be time-consuming and costly. Moreover, failures in speech recognition can be a confounding variable, so the prototype should be highly capable of recognising audio, which increases the development workload. The reason for not using existing virtual assistants such as Siri and Cortana is the inconvenience of their fixed structure; in other words, a purpose-built prototype offers much more flexibility. Although Klemmer et al. (2000) proposed Suede, a Wizard of Oz prototyping tool for speech user interfaces, this proposal suggests using an actual working prototype, because it offers better ecological validity than a Wizard of Oz setup. Therefore, it is necessary to employ an actual virtual assistant. Furthermore, as there are plenty of open-source virtual assistant APIs online, the difficulty of development can be decreased.

Regarding the qualitative study, the current obstacles to the recommendations of virtual assistants are expected to be found. This can help designers decrease the obstacles and improve user satisfaction. User expectations also suggest improvements to virtual assistant design: based on the gap between what users expect and how the virtual assistant actually responds, designers can fulfil user expectations better. The strength of the qualitative study is its simplicity. The semi-structured interview does not require a complicated prototype, so it is easy to conduct and can generate a large amount of detail. However, the flexibility of the semi-structured interview may decrease reliability, as participants’ answers can be too open to extract keywords from. Furthermore, the study relies on grounded theory to investigate the interview data; this method helps develop an understanding of user expectations in a context that is not pre-formed by existing theories (Engward, 2013). However, it is often difficult to manage, because it requires considerable insight and skill. It can be hard to link keywords together into theoretical models at the Selective Coding stage, and finding a central category that holds everything together is not an easy task. Therefore, the difficulty of analysis and investigation is the weakness of this qualitative study.

Limitations

The proposed study has a number of limitations. Firstly, it is limited by the number of participants, due to considerations of time and compensation costs. Secondly, the proposal is aimed at researching the recommendation effects for virtual assistants in the context of restaurants, and it is unclear whether the same effects hold in other particular contexts. In other words, the external validity may not be high. This limitation leaves space for future work. Furthermore, as participants from different groups may have different criteria for satisfaction when filling in the five-point Likert scale, this can be a confounding factor when investigating the effects of recommendation structure on user satisfaction. Although this problem could be avoided if the recommendation structure were a within-subject factor, that practice would require each participant to conduct the experiment six times, which significantly increases the workload of a single participant. As participants may lose patience or feel tired after a few rounds, the data about user satisfaction would be influenced. Therefore, the current practice is probably a practicable trade-off. On the other hand, some participants will be required to take part in both the quantitative study and the qualitative study, and the experimental time can exceed one and a half hours. In this case, participants may feel bored and impatient, leading to inaccuracy in the data that the researchers collect.

References

Chai, J., Horvath, V., Nicolov, N., Stys, M., Kambhatla, N., Zadrozny, W., & Melville, P. (2002). Natural language assistant: A dialog system for online product recommendation. AI Magazine, 23(2), 63.

Crestani, F., & Du, H. (2006). Written versus spoken queries: A qualitative and quantitative comparative analysis. Journal of the Association for Information Science and Technology, 57(7), 881-890.

Engward, H. (2013). Understanding grounded theory. Nursing Standard, 28(7), 37-41.

Jiang, J., Hassan Awadallah, A., Jones, R., Ozertem, U., Zitouni, I., Gurunath Kulkarni, R., & Khan, O. Z. (2015, May). Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web (pp. 506-516). International World Wide Web Conferences Steering Committee.

Kang, J., Condiff, K., Chang, S., Konstan, J. A., Terveen, L., & Harper, F. M. (2017, August). Understanding How People Use Natural Language to Ask for Recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems (pp. 229-237). ACM.

Kiseleva, J., Williams, K., Hassan Awadallah, A., Crook, A. C., Zitouni, I., & Anastasakos, T. (2016, July). Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (pp. 45-54). ACM.

Kiseleva, J., Williams, K., Jiang, J., Hassan Awadallah, A., Crook, A. C., Zitouni, I., & Anastasakos, T. (2016, March). Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval (pp. 121-130). ACM.

Klemmer, S. R., Sinha, A. K., Chen, J., Landay, J. A., Aboobaker, N., & Wang, A. (2000, November). Suede: a Wizard of Oz prototyping tool for speech user interfaces. In Proceedings of the 13th annual ACM symposium on User interface software and technology (pp. 1-10). ACM.

Lee, S., & Choi, J. (2017). Enhancing user experience with conversational agent for movie recommendation: Effects of self-disclosure and reciprocity. International Journal of Human-Computer Studies, 103, 95-105.

Ludlow, A., & Gutierrez, R. (2014). Developmental psychology. Basingstoke: Palgrave Macmillan.

Matejka, J., Li, W., Grossman, T., & Fitzmaurice, G. (2009, October). CommunityCommands: command recommendations for software applications. In Proceedings of the 22nd annual ACM symposium on User interface software and technology (pp. 193-202). ACM.

Ozeki, M., Maeda, S., Obata, K., & Nakamura, Y. (2009). Virtual assistant: enhancing content acquisition by eliciting information from humans. Multimedia Tools and Applications, 44(3), 433-448.

Rong, X., Fourney, A., Brewer, R. N., Morris, M. R., & Bennett, P. N. (2017, May). Managing Uncertainty in Time Expressions for Virtual Assistants. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 568-579). ACM.

Smyth, B., McGinty, L., Reilly, J., & McCarthy, K. (2004, September). Compound critiques for conversational recommender systems. In Web Intelligence, 2004. WI 2004. Proceedings. IEEE/WIC/ACM International Conference on (pp. 145-151). IEEE.