What Impacts the Quality of the User Answers when Asked about the Current Context?

Ivano Bison (University of Trento, via Sommarive 9, Trento, Italy), Haonan Zhao (University of Trento, via Sommarive 9, Trento, Italy) and Fausto Giunchiglia (University of Trento, via Sommarive 9, Trento, Italy)

Abstract.

Sensor data provide an objective view of reality but fail to capture the subjective motivations behind an individual’s behavior. This latter information is crucial for learning about the various dimensions of the personal context, thus increasing predictability. The main limitation is the human input, which is often not of the quality that is needed. The work so far has focused on the usually high number of missing answers. The focus of this paper is on the number of mistakes made when answering questions. This paper makes three main contributions. First, we show that the user’s reaction time, i.e., the time before starting to respond, is the main cause of low answer quality, where its effects are both direct and indirect, the latter relating to its impact on the completion time, i.e., the time taken to compile the response. Second, we identify the specific exogenous (e.g., the situational or temporal context) and endogenous (e.g., mood, personality traits) factors which have an influence on the reaction time, as well as on the completion time. Third, we show how reaction and completion time combine their effects on the answer quality. The paper concludes with a set of actionable recommendations.

Human-Machine interaction, Quality of user answers, Context, Respondent characteristics, Data quality, Ecological Momentary Assessment, EMA, Experience Sampling Method, ESM, Smartphones, Cognition, Working memory.

CCS Concepts: Human-centered computing → Ubiquitous and mobile computing design and evaluation methods

1. Introduction

Various studies have highlighted how predictable various aspects of human behavior are, see, for instance, the work on mobility (Brockmann et al., 2006; Gonzalez et al., 2008; Song et al., 2010; Cuttone et al., 2018; Alessandretti et al., 2020), social interactions (Eagle and Pentland, 2009; Eagle et al., 2009), or people’s preferences for their favorite places (Alessandretti et al., 2018) and friends (Miritello et al., 2013). Some of these studies show that contextual information is useful in applications such as health and physical activity monitoring (Rabbi et al., 2015; Intille, 2016; Yue et al., 2020, 2021a), mental health monitoring (Wang et al., 2018b, 2020; Yue et al., 2019) or elderly care (Lee and Dey, 2015; Berke et al., 2011; Wang et al., 2021; Yue et al., 2021b), and also for predicting individuals’ behaviours and traits (Do and Gatica-Perez, 2012; Harari et al., 2016; Wang et al., 2018a; Peltonen et al., 2020). In this latter case, the challenge is how to compute a high-quality characterization of this type of information (Lane et al., 2010). In this line of thought, the work in (Bettini et al., 2010; Riboni and Bettini, 2011; Helaoui et al., 2013; Suchman, 1987; Holtzblatt and Beyer, 1997) and in (Radu et al., 2018; Vaizman et al., 2017, 2018; Bradley and Dunlop, 2005) are, respectively, examples of early and more recent work on the general topic of context recognition. Nevertheless, most such studies have concentrated on the use of (only) sensor data. This, in turn, has generated various kinds of errors, most noticeably concerning data validity, namely the accuracy of the indicator of the phenomenon being measured, and data completeness, where the occurrence of missing data should be random rather than systematic.

The main limitation is that, while providing a good objective representation of reality, sensors are unaware of the subjective motivations behind a given individual’s activities. Starting from the consideration that human behaviour is based on the individual’s subjective perception of the current context, a few studies have put the role of context at the core of the analysis (Zhang et al., 2021). These studies use sensor data as additional, even if crucial, information. They show that when a user-provided subjective description of the current context is available, any target modality (e.g., where the person is, or what s/he does) becomes substantially more predictable if one also exploits information about the other modalities (e.g., time, user characteristics, social ties). This opens the possibility for the machine to collect information and learn about every aspect of the daily life of a person, with high-impact applications in all research areas focusing on the flow of individuals’ behaviour and thinking (Hormuth, 1986; Wilhelm et al., 2012) and the ecology of human development (Bronfenbrenner, 1977). Examples of applications are in Medical behaviour (Stone et al., 2007), Clinical Psychology, Social Sciences, Human-Computer Interaction and, lately, Human-in-the-loop Artificial Intelligence (AI) (Bontempelli et al., 2022). However, for this information to become available, the user must actively collaborate with the machine, and the main limitation of this collaboration is that the human input is often not accurate, see, e.g., (Huang et al., 2016; Wang et al., 2014).

The goal of this paper is to provide an in-depth study of which factors influence the quality of the answers that users provide when asked in the wild. Here by quality, we mean a low number of errors in the answers themselves. We focus on two main types of factors, that we call exogenous and endogenous, which influence the overall behaviour of an individual and, in turn, the answer quality. Examples of exogenous factors are the physical and social situational context (e.g., where users are, what they are doing, who they are with), the temporal context (e.g., the day of the week), and the computing context (e.g., network connectivity, communication bandwidth). Examples of endogenous factors are the user’s personality, cognitive and emotional states (e.g., mood, burden). The analysis provided follows the Experience Sampling Methodology (ESM), see, e.g., (Larson and Csikszentmihalyi, 2014; Stone et al., 2007; Zeni et al., 2021), where the reference dataset has been built via an interval-based data collection from 158 University students over a period of two weeks, including 58,340 answers with corresponding GPS positions (Bison et al., 2021). (Footnote 1: A clean version of this dataset, GDPR compliant and suitably anonymized, can be downloaded from the LivePeople catalog at the URL https://datascientiafoundation.github.io/LivePeople. The search keyword SU2 returns all the datasets associated to this data collection. LivePeople contains extensive information about the datasets it indexes, including: (a) a technical report describing the details of the data collection; (b) the dataset documentation and metadata; and (c) the procedure to be followed in order to request the dataset. To be fully compliant with GDPR, a licence must be signed before downloading the dataset.) This is in line with recent ESM studies (Van Berkel et al., 2017), where the choice of a duration of 2-4 weeks follows the recommendation by Stone et al. (Stone et al., 1991).
This experiment was crucially based on the use of an EMA/ESM application named iLog (Zeni et al., 2014; Zhao et al., 2024), which allows one to collect sensor data, typically but not only from smartphones, and to ask questions about the user’s situational context (Giunchiglia et al., 2017; Giunchiglia, 1993), in the form of time diaries, i.e., sets of questions asked multiple times, in various programmable moments of the day (Bowers, 1939). (Footnote 2: The version of time surveys used in this paper is based on HETUS (Harmonised European Time Use Surveys), see the details at the URL https://ec.europa.eu/eurostat/web/microdata/harmonised-european-time-use-surveys.)

The analysis described in this paper provides the following insights:

(1)

The elapsed time from notification to answer, i.e., the reaction time, relates to the situational and temporal context, as well as to mood and procrastination. The behaviour also changes during the observation period where, over time, users seem to find their own routine in the usage of iLog.
(2)

The time to fill in a question, i.e., the completion time, relates to delays in the reaction time, the cognitive ability of the respondent and also possible disturbance or multitasking effects due to the social context.
(3)

The quality of an answer is influenced by both the reaction and the completion time. However, the main impact comes from the reaction time which, as stated in the previous item, also has an effect on the completion time.
(4)

The number of wrong answers is substantial. In the test case we used in the experiment there were 730 answering errors out of a total number of 7624 questions, for an overall percentage of 9.6% of wrong answers. This percentage looks even more relevant if we take into account the specific question we tested, that is, whether the user was at home, a type of question that would seem impossible to get wrong.
(5)

Two main lessons emerge about how to improve the quality of answers in future experiments where the user is asked to answer questions about the current context. The first is that reaction time is the main variable to keep under control. The second is that the most effective way to control reaction time is to operate on the various types of context, as detailed below, since these are more controllable than endogenous factors, e.g., procrastination, personality or mood.

The paper is organized as follows. In Section 2 we provide the related work. In Section 3 we describe the overall methodology. In Section 4 we describe the analysis on reaction time, completion time and answer quality, respectively. In Section 5 we discuss and highlight the main take-away lessons which result from this work, including a set of recommendations for future ESM studies (Section 5.3). Finally, Section 6 concludes the paper.

2. Related Work

Mobile self-reports are a popular technique for collecting data from participants in the wild. Because of its potential ability to record participants’ behaviour, this type of data allows for the development of a ground-truth, where the meaning of the data itself is provided directly by the user. One of the main problems is the impossibility of capturing the real causes of mistakes, mainly because of the impossibility of observing the behaviour of the respondent in-the-wild (e.g., which causes? which conditions?) (Wenz et al., 2019). A further complication is that researchers have little or no control over the surrounding environment. The user cannot be “supervised” by a human interviewer who can decide when and where s/he should answer the query. Despite all of this, the assessment of the accuracy of responses has received little attention in the literature. Researchers tend to wrongly assume that responses are accurate, without further validating this assumption (van Berkel et al., 2019).

The problem of low-quality responses has been extensively studied in the EMA/ESM research (Shiffman et al., 2008; Larson and Csikszentmihalyi, 2014), where the results have then been incorporated in more complex and sophisticated research protocols (Larson and Csikszentmihalyi, 2014; Csikszentmihalyi, 1992; Hektner et al., 2007). However, the literature has mainly focused on increasing the response rate of participants. For example, Boukhechba et al. (Boukhechba et al., 2018) found a higher response rate when ESM questions are sent after phone calls, with respect to social media usage. Similarly, van Berkel et al. (van Berkel et al., 2020) found that active phone use before an ESM question could increase the response rate. Looking at the context, some work has focused on how location (Sun et al., 2021), activities (Mishra et al., 2017), different times during the day (Pielot et al., 2017), and social interactions (Boukhechba et al., 2018) impact the response rate. Van Berkel et al. (van Berkel et al., 2019) concentrated on the problem of correctness and found that the highest accuracy values are obtained when the participant’s screen is turned off at notification arrival (that is, when the phone is not being used). They also show that longer completion times of questions generate answers with lower accuracy; the main limitation of this work is that the only contextual factor considered is the smartphone usage.

In Artificial Intelligence, related work has been done on understanding the users’ reactions to notifications. Here the focus is on the reaction time and on how to predict the probability of a user checking a notification on time. Iqbal et al. (Iqbal and Horvitz, 2010) show that users more easily accept disruptions from messages carrying useful information. Fischer et al. (Fischer et al., 2010) found that content is more important than notification time. The participants are willing to be interrupted by what they are interested in. Moreover, the same authors (Fischer et al., 2011) have investigated the dependence of the response time on the previous types of interaction (e.g., a phone call or an SMS). Ho and Intille (Oh et al., 2015; Ho and Intille, 2005) developed an application based on the idea that a transition between two physical activities (e.g., sitting and standing) is a suitable moment for a notification (i.e., phone calls, messages and reminders). In another study, Mehrotra et al. (Mehrotra et al., 2015) suggested using context information and application usage in order to predict the best notification time. Aminikhanghahi et al. (Aminikhanghahi et al., 2019) worked on combining the ESM technology with knowledge of the user activities.

The Total Survey Error paradigm, as elaborated in Sociology, provides a theoretical framework for optimizing surveys by maximizing data quality within budgetary constraints (Sen et al., 2019; Amaya et al., 2020). According to this framework there are four main sources of error which are independent among one another while correlating with the phenomena under study. These factors can be summarised as follows:

(1)

The situational and temporal context in which the user inputs information into the smartphone (Lavrakas et al., 2007; Force, 2010; Wenz et al., 2019).
(2)

The cognitive task involved in the response process (Weisberg, 2009), in time-related questions in the multi-component approaches (Lynn and Kaminska, 2013) as well as in respondent motivation two-track theories (Cannell et al., 1981; Krosnick, 1991; Krosnick et al., 1987).
(3)

Those conscious or unconscious factors, e.g., personality, attitudes and habits, that influence the user’s behaviour (Lynn and Kaminska, 2013; Read, 2019).
(4)

The technical problems related to the functioning of the technology, e.g., phone and phone app (Grammenos et al., 2018; Schilit et al., 1994; Fielding et al., 2016; Sarmadi et al., 2023; Balto et al., 2016; Struminskaya et al., 2020).

These four sources are assumed to generate errors with any question-answering process and, therefore, also with mobility self-reports. This hypothesis is the main theoretical foundation on which the work presented in this paper is based. As far as we know, this hypothesis has never been applied to mobility self-reports, in particular in the scenario of asking questions about context. In this perspective, the work in this paper can be seen as validating and detailing the specific factors of how the hypotheses of the Total Survey Error paradigm get instantiated in this scenario.

3. Methodology

We organize this section as follows. In Section 3.1 we articulate the motivations behind the selection of the model and, consequently, the three research questions Q1, Q2 and Q3. In Section 3.2 we describe the sample selection process and the scheduling of the time diaries. Finally, in Section 3.3, we synthetically describe the three statistical models used for the analysis of Q1, Q2, and Q3, respectively.

3.1. The overall model

The key issue is to identify the main factors which influence the quality of an answer, where these factors must be operationalizable as part of a strategy which allows a machine to get the best possible results out of the user answers. The starting point is the work from the Total Survey Error paradigm (see Section 2). Based on this work, we assume the existence of a causal chain of events which influences the quality of answers and that we abstractly model according to the schema sketched in Fig. 1.

Figure 1. The causal chain of events impacting the answer quality.

Proceeding from left to right we distinguish between endogenous and exogenous factors, that play a role and have an impact on the quality of answers. These two sources of error can be detailed as follows.

•

exogenous factors, mainly related to the context of use, that we organize along three dimensions, that is: the physical and social situational context, the temporal context and the computing context. The first two dimensions relate to factors such as the degree and the type of distractions, the presence of others, or the multitasking behaviour – whether on the same device, on a different device, or even on a different medium (Lavrakas et al., 2007, 2010; Lynn and Kaminska, 2013; Link et al., 2014; Read, 2019; Wenz et al., 2019). The third dimension, i.e., the computing context, relates to the technical problems connected to the functioning of iLog, taking into account all the possible things which may go wrong (Heron et al., 2017).
•

endogenous factors, that we organize along two main dimensions. The first is the user characteristics and behaviour, for instance, the familiarity or comfort with the device, and how the respondent uses the device (i.e., frequency of use, duration and frequency of browsing sessions) (Link et al., 2014; Groves et al., 2011; Lai et al., 2009; Raento et al., 2009; Lynn and Kaminska, 2013; Read, 2019; Jäckle et al., 2017). The second is the willingness and ability to follow and complete a task on a device (Groves et al., 2011; Lynn and Kaminska, 2013; Oulasvirta et al., 2005). Relevant to this factor is a series of psycho-physical attitudes, personality, behaviour and habits, conscious or unconscious, that influence the individual (Lynn and Kaminska, 2013; Read, 2019). These cognitive performances differ across individuals (Cowan, 2012), where human cognition is also time-variant. Notice that the person’s mental state (Beilock and DeCaro, 2007; Cowan, 2010), as well as some exogenous factors, may influence cognitive performance. Examples of these exogenous factors are the time of the day (Schmidt et al., 2007; West et al., 2002), and smartphone usage (Hyman Jr et al., 2010; Kushlev et al., 2016).

We organize our analysis around three research questions, Q1, Q2, Q3, which allow us to model the chain of effects of how exogenous and endogenous factors influence the quality of answers. We have the following.

•

Reaction time (Q1). The goal of Q1 is to understand which endogenous as well as exogenous factors have an impact on the reaction time, where the reaction time is defined as the time between receiving a notification and initiating a response.
•

Completion time (Q2). The goal of Q2 is to understand which endogenous as well as exogenous factors have an impact on the completion time, where the completion time is defined as the time it takes to complete a response, starting from the end of the reaction time.
•

Answer quality (Q3). The goal of Q3 is to assess and validate the causal chain, as represented in Fig.1, by providing a quantitative evaluation of how reaction and completion times jointly affect the answer quality.
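As a minimal sketch of how the two timing variables defined in Q1 and Q2 could be derived from event timestamps, consider the following. The field names (`notified_at`, `started_at`, `submitted_at`) and the example timestamps are hypothetical and are not taken from the iLog data schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Response:
    # Hypothetical field names; the real iLog schema may differ.
    notified_at: datetime   # notification time
    started_at: datetime    # the user starts answering
    submitted_at: datetime  # the user finishes answering


def reaction_time_minutes(r: Response) -> float:
    """Q1: time between the notification and the start of the response."""
    return (r.started_at - r.notified_at).total_seconds() / 60.0


def completion_time_seconds(r: Response) -> float:
    """Q2: time taken to complete the response, from the end of the reaction time."""
    return (r.submitted_at - r.started_at).total_seconds()


# Made-up example: notification at 10:00:00, answer started at 10:12:30,
# answer submitted at 10:12:48.
example = Response(
    notified_at=datetime(2024, 5, 6, 10, 0, 0),
    started_at=datetime(2024, 5, 6, 10, 12, 30),
    submitted_at=datetime(2024, 5, 6, 10, 12, 48),
)
```

Under these assumptions the completion time starts exactly where the reaction time ends, so the two intervals never overlap.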

Let us focus on Q1, Q2 and Q3.

3.1.1. Q1: The reaction time

The optimal user behaviour in data collection would be that the user responds as soon as s/he receives the notification. The reason is that, the longer the elapsed time, the higher is the risk of a memory error (mainly related to forgetting) and, consequently, a higher probability of a wrong answer (McCabe et al., 2012).

We focus our attention on three factors which may have an impact on the reaction time, as follows:

•

context history (Chen and Kotz, 2000) (an exogenous factor), namely a mix of user context (the social situation), physical context, computing context (network connectivity, etc.) and time context. The social and the physical context are taken into account with three multiple-choice, self-reported time diary questions: (a) “What are you doing?” captures the distracting effects of the activities the student is doing at the time; (b) “Where are you?” captures the environmental context and the consequent distracting effects due to where s/he is at that moment; (c) “With whom are you?” captures the disturbing effects of the social context. The details of these three questions and of the possible answers are reported in Fig.2. The question “How are you moving?” is asked any time the user answers that s/he is travelling. The computing context is considered as the time elapsed, in seconds, from when the notification is sent from the server to when the smartphone receives it. Finally, the time context accounts for how different weekly activities can affect the reaction time. In the model, this latter idea has been operationalized as a distinction between weekdays and weekends.
•

motivation and burden (an endogenous factor) concerning both the effort (time and resources) and the degree of difficulty. In the model, this factor is modeled as a quadratic function of the day of study, quantified in a range from 0 (first day in the experiment) to 13 (fourteenth day in the experiment).
•

user characteristics (an endogenous factor), modeled in terms of psychological traits and emotional status over the day. In fact, mood states and personality traits play a crucial role in the processing of emotion-congruent information across different cognitive tasks (Rusting, 1998). This factor was taken into account by asking the user about his/her procrastination syndrome and emotional mood state. The first was measured using the Irrational Procrastination Scale (IPS) (Steel, 2007, 2010). The second was measured with a 5-point Likert scale, with options ranging from happy (0) to sad (4) (Lorr, 1989; Killgore, 1999), see Fig.2.
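As an illustration of the quadratic specification used for motivation and burden, the contribution of the day of study can be written as follows. The symbols $d$, $\beta_1$ and $\beta_2$ are notation introduced here for the sketch and are not taken from the fitted model.

```latex
f(d) = \beta_1 d + \beta_2 d^2, \qquad d \in \{0, 1, \dots, 13\}
```

A quadratic term allows the effect to be non-monotone over the two weeks: when $\beta_2 \neq 0$, the curve has a turning point at $d^{*} = -\beta_1 / (2\beta_2)$, which may or may not fall inside the observation window.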

3.1.2. Q2: The completion time

As discussed above, a delayed answer may be the cause of a memory error. This applies to the reaction time but also to the completion time. In particular, the cognitive process enabling the response processing may generate a long completion time and, consequently, have an impact on the response accuracy (McCabe et al., 2012). Furthermore, according to the multi-component approaches (Tourangeau et al., 2000), the respondent’s ability to focus on something specific while ignoring other stimuli (Kellogg, 2015) has an impact on the quality of an answer; also in this case, the longer the completion time, the lower the expected answer quality. This seems to suggest that shorter completion times generate a higher answer quality, as they decrease the risk of memory errors (Sudman and Bradburn, 1973). However, this is not necessarily the case. First of all, with a fast completion time the probability of inaccuracy or typing errors is high (Callegaro et al., 2015; McCabe et al., 2012). Furthermore, according to (Tourangeau et al., 1984, 2000), the answer completion process is organized in four steps as follows: (1) comprehension of the question; (2) retrieval of relevant information from memory; (3) judgment required by the question; and (4) selection of an answer. An increase in the average completion time may mean more time spent retrieving relevant information from memory, thus having a positive effect on the quality of an answer. While the negative effect of an increase of the reaction time seems to be established, the same cannot be said in the case of the completion time.

In the following we consider four factors as the possible sources of changes in the completion time.

•

the competence in the usage of iLog. This competence grows over time, and the result is a reduction of the completion time which is quite rapid even in the early days. We model this learning process using two components. The first is the date/time of use, measured as the day of study. The second is the memory of the response alternatives as they physically appear in the list of answers. The more a response modality is used, the more likely a respondent will remember its exact location and the faster the user will answer. We model the list of activities in the time diary as a pseudo-continuous variable where each activity is associated with its percentage of appearance. Thus, for instance, the activity “studying” is replaced with its occurrence percentage 19.04%, “eating” with 4.61%, and so on (Chen and Wang, 2014).
•

the reaction time. One or more notifications may remain unanswered, so as to form blocks of notifications that the user can quickly fill after one another in a single response session. We model this factor as the total number of pending notifications which exists when the respondent starts the completion process.
•

the social context. This factor models the disturbance or multitasking effects due to the presence of other people during the completion process (Kellogg, 2015). We model this factor by taking into account whether the respondent is alone at the time s/he has to respond (see Fig.2).
•

the psycho-social aspects. According to the literature, both mood and procrastination syndrome have an important role on memory and motivation (Forgas, 2013; Steel, 2007) and, in turn, on the completion process.
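The first two factors above lend themselves to a short sketch: replacing each activity label with its percentage of appearance, and counting the notifications still pending when a response session starts. The function names and the toy data are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
from collections import Counter
from typing import Dict, Sequence


def frequency_encode(activities: Sequence[str]) -> Dict[str, float]:
    """Map each activity label to its percentage of appearance in the diary."""
    counts = Counter(activities)
    total = len(activities)
    return {label: 100.0 * n / total for label, n in counts.items()}


def pending_notifications(
    sent_minutes: Sequence[float],
    answered: Sequence[bool],
    session_start: float,
) -> int:
    """Count notifications already sent but still unanswered at session start."""
    return sum(
        1
        for sent, done in zip(sent_minutes, answered)
        if sent <= session_start and not done
    )


# Toy diary (made-up labels and counts, not the study data).
toy_diary = ["studying", "studying", "eating", "studying", "travelling"]
encoding = frequency_encode(toy_diary)

# Four half-hourly notifications; only the first was answered before the
# session that starts at minute 75.
n_pending = pending_notifications([0, 30, 60, 90], [True, False, False, False], 75)
```

In this toy case the label “studying” would be replaced by 60.0, in the same way as the paper replaces it with 19.04 on the real data.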

3.1.3. Q3: The chain of causal effects on the quality of the data

To test the chain of events on the error, we use the position detected by the GPS of the smartphone when the user claims to be “at home”. As shown in Fig.2, “at home” is one of the possible answers to the question “Where are you?”, while the GPS position is collected every minute, as indicated in the list of collected sensor data reported in Fig. 3 (here the GPS is labeled as “location”). We have selected the variable “at home” for three reasons. The first is that we know the GPS position of the respondent’s home. The second is that home usually covers a very small area, differently from what is the case with other locations, for instance “at the university”. The third is that this is the type of question which would seem impossible to get wrong; it requires no thinking or reasoning of any kind, and it is also self-evident from the perceptual and habit point of view. Being at home is most likely the best known situation for everybody.

The variables used here are the same four used for Q2 (see Section 3.1.2 above), that is, competence with iLog, reaction time, social context, mood and procrastination, plus the accuracy of the GPS (considered as an additional effect of the computing context). The resulting number of wrong answers turned out to be substantial, thus justifying the research hypothesis motivating this research, that is, that it cannot be assumed that, modulo a minor number of local mistakes, all the user-provided answers are correct (see Section 2 on the Related Work). We had in fact 730 answering errors out of a total number of 7624 questions, for an overall percentage of 9.6% of wrong answers.
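The comparison between an “at home” answer and the GPS position can be sketched with a great-circle distance. This is only an illustration: the haversine formula, the `threshold_m` parameter and the coordinates used in the tests are assumptions introduced here, since the text does not state how the distance is computed or which cut-off marks an answer as wrong.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_wrong_at_home(home: tuple, gps_fix: tuple, threshold_m: float) -> bool:
    """Flag an 'at home' answer whose GPS fix is farther than threshold_m from home.

    threshold_m is a hypothetical cut-off, not a value taken from the paper.
    """
    return haversine_m(home[0], home[1], gps_fix[0], gps_fix[1]) > threshold_m


# Share of wrong answers reported in the text: 730 errors out of 7624 questions.
error_rate_pct = round(100 * 730 / 7624, 1)
```

The last line reproduces the reported share of wrong answers from the two counts given in the text.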

3.2. Sample Recruitment and questionnaires scheduling

The data was collected as part of the Smart UNITN 2 project, as preliminarily approved by the Ethical Committee and GDPR Committee of the University of Trento, Italy. The project lasted for a total of four weeks (28 days) from the beginning of May to the beginning of June. The data collection was organized in five phases, as follows.

Phase 1: Process bootstrap. A first short questionnaire was sent to a random sub-sample of 10006 students asking them if they regularly attended classes and if they had an Android smartphone. The sample was selected from the entire student population of University of Trento.

Phase 2: Sample set-up and profiling. For those who filled out the questionnaire and responded positively to both previous requests, the next step was an invitation to participate in the survey. The invitation explained the aim of the study and that students could choose to participate in the study for two or four weeks, and that they would receive a notification every half-hour in the first two weeks, and every two hours in the second two weeks. Students were also informed of the fact that various types of sensor data would be collected (see Fig.3). As stated in literature (Keusch et al., 2019; Aminikhanghahi et al., 2019), the willingness to participate in mobile data collection is strongly influenced by the incentive promised for study participation. A reward of 20 euros was promised to each participant for each of the two weeks of participation. In addition, each participant was informed that, at the end of the survey, there would be a lottery among those who responded to more than 75% of the notifications; and that the lottery would assign three prizes of 100 euros for the first two weeks and three prizes of 150 euros for the second two weeks. The invitation also included a second questionnaire asking for additional information. The collected data was about the general characteristics of participants, their university experience, and the profiling of their procrastination syndrome. In this phase also the personal email plus the signed GDPR-compliant consent were collected.

Phase 3: Sample finalisation. The recruitment process resulted in 1042 applicants. The response rate is aligned with that of other web surveys in which no reminders are sent. From these, those over the age of 25 were excluded in order to limit the number of students with non-regular careers or who were close to the dissertation. A stratified sample of 318 students, proportional to the student population of each department of the University of Trento, was drawn from the remaining 860 applicants. A second questionnaire was sent to the 318 sampled students, whose goal was to investigate their university life, their habits and their routines. All students who filled out the questionnaire and signed a second informed consent form were sent a password which allowed them to install iLog. 275 students completed the questionnaire and installed iLog.

Phase 4: Time diary data collection - first two weeks. As declared in Phase 2, students received a notification, with the questions shown in Fig.2, every half hour. For each notification, participants had 720 minutes (12 hours) to provide an answer. After this period the question would be dropped and treated as missing. Lastly, to reduce the burden, the user could stop data collection for 6 hours when going to sleep. Of the 275 students, only 237 answered the notifications at least once. In the first days 25 students dropped out, partly because of technical problems, partly because they were no longer interested in participating. At the end of the first two weeks, 184 students had provided valid responses to notifications for at least 13 out of 14 days.

Phase 5: Time diary data collection - second two weeks. Two days before the end of the first two weeks, a third questionnaire was sent out to ask students about their iLog user experience. In this questionnaire, students were also asked whether they were willing to continue for the following two weeks. Of the 237 students, 202 declared their willingness to continue. In these last two weeks, as declared in Phase 2, students received a notification every hour. Of these students, 113 completed the task for at least 12 days and provided more than 100 valid answers.

3.3. The Statistical Models

In the analysis of Q1 and Q2 we have used a multilevel discrete-time event history model (MLM) (Steele, 2008; Tekle and Vermunt, 2012) and a Cox regression model (Allison, 2018; Steele, 2008; Cox, 1972). An MLM is a generalized linear model where repeated notifications over time (level 1) are nested within users (level 2) (Goldstein, 2011; Singer et al., 2003) by modeling variability across upper-level units using random effects. Furthermore, MLMs allow one to estimate the parameters at multiple distinct levels. The most common example of MLM is that of a school, in which the competence of a student in a class is due in part to the student (first level, e.g., his/her ability) and in part to the class (second level, e.g., the attitude of the teachers of the class).
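For orientation, a generic two-level discrete-time hazard model with a random intercept can be written as below. This is a textbook-style sketch and not the exact specification estimated in the paper: the logit link, the baseline function $\alpha(t)$ and the notation are assumptions introduced here.

```latex
\operatorname{logit}\, h_{ij}(t)
  = \log \frac{h_{ij}(t)}{1 - h_{ij}(t)}
  = \alpha(t) + \boldsymbol{\beta}^{\top} \mathbf{x}_{ij}(t) + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Here $h_{ij}(t)$ would be the probability that notification $i$ of user $j$ is answered in interval $t$, given that it has not been answered before; $\mathbf{x}_{ij}(t)$ collects the level-1 and level-2 covariates, and $u_j$ is the user-level random effect capturing the variability across users.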

For Q3, we have used both a multilevel structural equation path analysis model (MSEM) (Holland, 1986; Hox et al., 2013; Joreskog et al., 1979) and a standard SEM path analysis that integrates the evidence from the first two analyses into a single chain of causal events. In this case, reaction and completion time are influenced by the user features, while the response error is modeled as a random variable. For this reason, an MSEM has been chosen where reaction and completion time are formalized as a two-level model (responses nested within respondents), while the distance between the device and the user statement of being “at home” is treated at the response level.

4. Analysis

In the analysis reported below, we use the data from the 158 respondents who have filled out the notifications for all 14 days of the first two weeks, who have provided at least 300 valid answers, and who have agreed to share their GPS location. We have used the data only from the first two weeks based on the guidelines from (Stone and Shiffman, 2002). The main motivation is that, because of the different types of data collection, given the analysis performed in this article, the datasets from the first and the second two weeks are not comparable. In the first two weeks, a total of 58,340 observations were collected, after excluding sleep-related events and GPS values whose accuracy radius was greater than 100 meters. The analysis for Q3 is based on a sub-sample of 78 respondents and 7507 events. This reduction in number is motivated by the lack of information about the precise location of the relatives’ home; this has forced us to exclude all respondents who, during the observation window, moved from the university to the parents’ home and vice versa. (Both home and the relatives’ home are possible answers in Fig. 2.)

In the analysis of Q1, the dependent variable modeling the reaction time is the time, measured in minutes, elapsed between the server sending the notification to the device and the moment the user starts answering. To limit the effects of memory errors, all reaction times exceeding 8 hours are not considered. Similarly, answers affected by technical problems, e.g., noise due to transmission errors, or whose reception time exceeds 150 minutes, are also excluded. In the analysis of Q2, the dependent variable modeling the completion time is based on the time taken to complete the four questions in Fig. 2. In the analysis we do not consider filling times of more than 75 seconds. In the analysis of Q3, the reaction time and completion time are those computed by the models for Q1 and Q2 respectively. No cases are dropped, and the available cases are analyzed using a linear regression model. Sleep events, compilation times of more than 75 seconds and notifications of more than 8 hours of age are excluded from the analysis. Furthermore, to reduce the uncertainty, we have considered only GPS positions with an accuracy radius of less than 100 meters. To exclude other errors, for instance, confusing the parents’ house with the student house, we have also considered only distances of less than 4000 meters. Given the above assumptions, the total number of notifications we have analyzed is 7507.
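The exclusion rules listed above for the Q3 analysis can be summarized as a single predicate. The following is a minimal sketch: the record layout and field names are hypothetical, and the treatment of values lying exactly on a threshold is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    # Hypothetical record layout; field names are illustrative only.
    reaction_min: float     # minutes between notification and start of answer
    completion_sec: float   # seconds needed to fill in the questions
    gps_accuracy_m: float   # reported GPS accuracy radius, in meters
    home_distance_m: float  # distance between the phone and the declared home
    is_sleep: bool          # True if the reported activity is sleeping


def keep_for_q3(a: Answer) -> bool:
    """Return True if the answer passes the exclusion rules stated for Q3."""
    return (
        not a.is_sleep                  # sleep events are excluded
        and a.reaction_min <= 8 * 60    # notifications older than 8 hours are excluded
        and a.completion_sec <= 75      # filling times above 75 seconds are excluded
        and a.gps_accuracy_m < 100      # only GPS fixes with accuracy below 100 m
        and a.home_distance_m < 4000    # only distances below 4000 m
    )


sample = [
    Answer(20, 9, 30, 500, False),   # kept
    Answer(500, 9, 30, 500, False),  # dropped: reaction time above 8 hours
    Answer(20, 9, 150, 500, False),  # dropped: GPS accuracy too coarse
]
kept = [a for a in sample if keep_for_q3(a)]
print(len(kept))
```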

Finally, the students’ activities and locations, as from Fig. 2, have been reclassified as follows:

• Activities: (1) Personal care; (2) Eating; (3) Study alone or with others; (4) Classroom lecture; (5) Social life & Break: Social life, Coffee break; (6) Watching YouTube/Tv-shows, etc.; (7) social media/Phone/Chat: Facebook, Instagram, etc., on the phone/chat; (8) Free time: Reading a book; listen to music; Movie, Theater, Concert, Exhibit etc.; Shopping; Sport; Rest/nap; Hobbies; (9) Work: Housework, Work, Other activities; (10) Travel: By car, By foot, By bike, By bus, By train, By motorbike;

• Place: (1) Home, Apartment, Room; (2) Relatives (house); (3) House (friends others); (4) University: Classroom/Lab., Classroom/Study Hall, Library; Other university place, Canteen; (5) Shop/Pub/Theatre: Shop supermarket etc., Pizzeria, pub, bar, restaurant, Movie Theater, Museum, etc.; (6) Workplace; (7) Other place: Other Library; Gym; Other place; (8) Outdoors; (9) Moving.

The further data considered in the analysis are:

• the time context, namely the specific moment and time when a question or an answer occurred, as collected by APPX;

• the computing context, automatically collected using the smartphone's internal hardware (e.g., GPS, accelerometer, gyroscope), as well as the data collected by the so-called software sensors, e.g., the applications running on the device, see Fig. 3.

Based on the assumptions described above, Section 4.1 reports the analysis for Q1, Section 4.2 reports the analysis for Q2, and Section 4.3 reports the final and conclusive analysis for Q3, which builds on top of the results of the first two subsections.

Features	Cox regression (Coef. B)	Multilevel (Coef. B)
(Spatial context): Where are you?
Home/Room	Reference	Reference
Relatives Home	-0.061***	-0.100***
House(friends/others)	-0.021	-0.029
University	0.188***	0.277***
Shop/Pub/Theatre	0.028	0.013
Work place	-0.234***	-0.331***
Other place	-0.071***	-0.128***
Outdoors	-0.218***	-0.324***
(Event context): What are you doing?
Personal care	Reference	Reference
Eating	-0.141***	-0.193***
Study alone or with others	-0.174***	-0.277***
Class room lecture	-0.130***	-0.175***
Social life/Break	-0.137***	-0.167***
Watching YouTube TV, etc.	-0.080***	-0.126***
Social media, Phone call, chat	0.214***	0.311***
Free time	-0.298***	-0.406***
Work	-0.195***	-0.271***
Travel	0.007	0.043
(Social Context): Who are you with?
Alone	Reference	Reference
Friend(s)	-0.188***	-0.292***
Relative(s)	-0.046***	-0.079***
Classmate(s)	-0.222***	-0.329***
Roommate(s)	0.025	0.018
Colleague(s)	-0.238***	-0.396***
Partner	-0.291***	-0.432***
Other	-0.575***	-0.834***
(Time Context): When questions sent
Sunday	Reference	Reference
Monday	0.089***	0.122***
Tuesday	0.115***	0.154***
Wednesday	0.068***	0.076***
Thursday	0.131***	0.172***
Friday	0.032**	0.049**
Saturday	-0.022	-0.032
Study day	-0.076***	-0.123***
Study day²	0.003***	0.005***
Question delivery delay time (sec.)	0.0002***	0.0003***
User Characteristics
Mood	0.060***	0.092***
Procrastination	-0.010**	-0.021**
var(cons[user])		0.643***
var(cons[user>notid])		0.604***
Observations	58,340	2,565,159
Number of groups	158	158

Table 1. Cox Regression model and Multilevel discrete time model with random intercept on the reaction time. (Note: (*) p<0.1; (**) p<0.05; (***) p<0.01).

4.1. Q1: Reaction time

The results are reported in Tab. 1. The median survival time of the reaction time is 20 minutes, i.e., fifty per cent of all notifications receive a response in 20 minutes or less. As assumed (see Section 3.1), the reaction time is influenced by both the historical context and the characteristics of the user. For example, the median reaction time varies from a minimum of 10 minutes when the user is involved in social media/phone/chat activities to a maximum of 27 minutes when involved in free time activities (log-rank test: $\chi^{2}(9)=1046.69$, p<0.05). In a social context, the median reaction time varies from 16 minutes when the user is alone to 33 minutes when in a social setting (log-rank test: $\chi^{2}(7)=635.38$, p<0.05). With respect to location, the median reaction time ranges from 15 minutes when the user is at university to 32 minutes when outdoors (log-rank test: $\chi^{2}(8)=708.50$, p<0.05).
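A median survival time of this kind is read off an estimated survival curve as the first time at which the curve drops to 0.5 or below. The sketch below assumes a Kaplan-Meier estimator, which is the usual companion of the log-rank test; the text does not state which estimator was used, and the reaction times in the example are made-up values.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct time with at least one observed event."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        events = sum(1 for ti, ob in zip(times, observed) if ti == t and ob)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """Smallest time at which the estimated survival is 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5


# Toy reaction times in minutes (made-up values). False marks a notification
# that expired unanswered, i.e., a censored observation.
times = [2, 5, 5, 12, 20, 20, 33, 60, 90, 720]
observed = [True, True, True, True, True, True, True, True, True, False]

curve = kaplan_meier(times, observed)
median = median_survival(curve)
print(median)  # 20
```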

To test the net effects of all the features influencing the reaction time considered here, we have run two regression models (Tab. 1), a Cox-regression model with user notification shared probability and a Multilevel discrete time model. For both models we have performed a likelihood ratio test against the null model (Bolker et al., 2009). The results show that the two models are both statistically significant ($\chi^{2}(34)=3437.34$, p<0.05; $\chi^{2}(47)=5819.65$, p<0.05) and capture the same meaning (i.e., the probability of reacting to a notification at time t). The parameters differ slightly, but their signs and interpretation are exactly the same. Moreover, with the Cox regression we can also estimate that the model explains about 4.6% of the variation in reaction time ($R_{D}^{2}=0.0462$) (Royston, 2006). This means that by using the independent variables listed in Table 1, we can explain roughly 5% of the fluctuation in reaction time.
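To make the Cox coefficients in Tab. 1 easier to read, each coefficient B can be converted into a hazard ratio exp(B) relative to its reference category. The sketch below does this for a few coefficients copied from Tab. 1; it illustrates the standard interpretation of a Cox model and is not part of the original analysis.

```python
import math

# Cox coefficients copied from Tab. 1
# (reference categories: Home/Room, Personal care, Alone).
cox_coefficients = {
    "University": 0.188,
    "Outdoors": -0.218,
    "Social media, Phone call, chat": 0.214,
    "Free time": -0.298,
    "Partner": -0.291,
}

hazard_ratios = {name: math.exp(b) for name, b in cox_coefficients.items()}

for name, hr in hazard_ratios.items():
    # A ratio above 1 means a higher instantaneous chance of reacting than in
    # the reference category; a ratio below 1 means a lower one.
    print(f"{name}: HR = {hr:.2f} ({(hr - 1) * 100:+.0f}% vs. reference)")
```

Under this reading, being at university corresponds to a hazard of reacting about 21% higher than being at home, while free time activities correspond to a hazard about 26% lower than personal care.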

4.2. Q2: Completion time

The results are reported in Tab. 2. The median survival time of the completion time is 9 seconds, i.e., fifty per cent of all notifications are filled in 9 seconds or less. As assumed, the completion time is also influenced by both the historical context and the characteristics of the user. For example, the median completion time varies from a minimum of 7 seconds when the user is involved in study (study alone or with others) or lesson (classroom lecture) activities to a maximum of 11 seconds when involved in free time activities (log-rank test: $\chi^{2}(9)=5478.37$, p<0.05). In a social context, the median time varies from 8 seconds when the user is alone to 10 seconds when with the partner (log-rank test: $\chi^{2}(7)=1682.50$, p<0.05). With regard to location, the median time ranges from 7 seconds when the user is at university to 12 seconds when at a Shop/Pub/Theatre (log-rank test: $\chi^{2}(8)=2072.78$, p<0.05).

To disentangle the effects of different contexts on completion time, we again run a Cox regression model and a multilevel discrete-time model (Tab. 2). The likelihood ratio test against the null model provides evidence that the models are both statistically significant ($\chi^{2}(8)=3683.55$, p<0.05; $\chi^{2}(19)=8457.39$, p<0.05) and show the same meaning. As for Q1, the parameters differ only slightly, but the sign is exactly the same. Moreover, with the Cox regression, we can also estimate that the model explains 4.1% of the fluctuation in completion time ($R_{D}^{2}=0.0407$).

Features	Cox regression (Coef. B)	Multilevel (Coef. B)
Event context: Activity	0.038***	0.104***
Social context: Alone vs not alone	0.267***	0.637***
User Characteristics: Mood	0.039***	0.095***
User Characteristics: Procrastination	-0.006*	-0.015*
Study time	0.048***	0.131***
Study time²	-0.002***	-0.005***
Reaction time (min.)	-0.0002***	-0.001***
Pending notification (count)	0.044***	0.093***
Theta/constant	0.0731***	-14.704***
Observations	58,340	2,565,159
Number of groups	158	158

Table 2. Cox regression model and Multilevel model with random intercept on the completion time. (Note: (*) p<0.1; (**) p<0.05; (***) p<0.01).

4.3. Q3: Answer correctness

Fig. 4 reports a first exploratory analysis of the effects of reaction and completion time on the answer quality. This analysis shows that there is a statistically significant negative impact on the response quality from both the reaction time (Fig. 4(a)) and the completion time (Fig. 4(b)). While the first result was expected, the second shows that the negative effects of a longer completion time (e.g., memory errors) are bigger than the positive effects (e.g., more time spent computing the answer), see the discussion in Section 3.1. In fact, a comparison of the two groups of correct and incorrect answers, where a distance greater than 50 meters is considered an incorrect answer, provides the following values:

• Reaction time
  – Mean: Correct (38 minutes), Incorrect (43 minutes) (Fisher $F=5.02$, p<0.05);
  – Median survival reaction time: Correct (17 minutes), Incorrect (19 minutes) (log-rank test: $\chi^{2}(1)=4.93$, p<0.05; non-parametric equality-of-medians test: $\chi^{2}(1)=3.73$, p<0.05) (Mann and Whitney, 1947).
• Completion time
  – Mean: Correct (11.0 seconds), Incorrect (11.8 seconds) (Fisher $F=4.73$, p<0.05);
  – Median survival completion time: Correct (9 seconds), Incorrect (9 seconds) (log-rank test: $\chi^{2}(1)=4.36$, p<0.05; non-parametric equality-of-medians test: $\chi^{2}(1)=5.17$, p<0.05).

For both reaction and completion time the mean time for correct answers is lower than that of incorrect answers. This applies also to the median survival reaction time, while the median survival completion time is the same for correct and incorrect answers.
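The correct/incorrect split used above relies on the distance between the declared home and the phone position, with 50 meters as the threshold. The sketch below shows one way such a label could be computed, assuming the great-circle (haversine) distance between two GPS fixes; the text does not say how the distance was obtained, and the coordinates and function names are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
ERROR_THRESHOLD_M = 50      # above this distance an "at home" answer counts as incorrect


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def at_home_answer_is_correct(home, phone) -> bool:
    """True if the phone is within the threshold distance from the declared home."""
    return haversine_m(*home, *phone) <= ERROR_THRESHOLD_M


home = (46.0670, 11.1500)        # made-up coordinates
phone_near = (46.0672, 11.1501)  # a few tens of meters from home
phone_far = (46.0690, 11.1500)   # more than 200 meters from home

print(at_home_answer_is_correct(home, phone_near))
print(at_home_answer_is_correct(home, phone_far))
```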

However, reaction and completion time are not independent of one another, but are links of the same chain. As shown in Fig. 5, the two models we have developed, i.e., a Multilevel Structural Equation Model (MSEM) (Radu et al., 2018) and a Structural Equation Model or Path Analysis (SEM) (see Section 3.1), both support the theoretical model depicted in Fig. 1 and the idea of a chain of events on answer quality (Read, 2019). (Footnote 3: Due to the correlation between the reaction time and the number of pending notifications, the covariance between the errors of these two variables has been defined in the two models.) In fact, the input variables are (mostly strongly) relevant and the fit indices show that the model fits the observed data very well (RMSEA = 0.022; SRMR = 0.013; $\chi^{2}(20)=95.41$; $R^{2}=0.127$). The model explains about 13% of the fluctuation in the answer quality. Unfortunately, multilevel models present challenges in constructing fit indices because there are multiple levels of hierarchy to account for in establishing goodness of fit. As a result, there is a lack of consensus on suitable fit indices for multilevel models (Comulada, 2021).
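As a plausibility check, the reported RMSEA can be recomputed from the reported chi-square and degrees of freedom with the standard formula $\sqrt{\max(\chi^{2}-df,0)/(df\,(N-1))}$. The sketch below assumes that the sample size N is the number of analyzed events (7507); the text does not state which N the estimation software used, and some packages divide by N instead of N-1, which makes no visible difference here.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# chi2 and df as reported in the text; n = 7507 events is an assumption.
value = rmsea(chi2=95.41, df=20, n=7507)
print(round(value, 3))  # 0.022
```

Under this assumption the recomputed value agrees with the RMSEA reported above.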

As shown in Fig. 5, context history, burden, technology, and personality aspects play a different role in “when” the user decides to respond and in “how” the user fills out the notification. In a chain of causal effects, reaction time (when) and completion time (how) have a direct effect on the quality of the data collected and the associated errors. In the causal model in Fig. 5, the “when” influences both the question-answer process and the quality of the data through the number of notifications to be filled in when the user begins to respond. In turn, this is influenced by the burden, the repetitiveness of the activity, and the compliance with the research protocol, which varies according to the subject’s personality characteristics, such as procrastination and daily mood, and the social context. The “how” is influenced by the burden, the distraction due to the presence of other people, the repetitiveness of the activity, and the subject’s level of procrastination, and directly influences the response error. In a nutshell, burden, activities, personality idiosyncrasy and cognitive aspects have a direct effect on the user behaviour and in turn the user behaviour has a direct effect on the error.

Finally, the overall conclusion is that reaction time is the key variable with the highest impact on the answer quality. First of all, as shown in Fig. 4, the reaction time is much higher than the completion time, where the mean of the former is 38 minutes for correct answers and 43 minutes for wrong answers, while that of the latter is 11 seconds for correct answers and 11.8 seconds for incorrect answers. Second, reaction times of the order of half an hour or longer largely facilitate memory errors and, therefore, wrong answers. It is also worth noticing the relatively small difference between the mean completion time for correct and incorrect answers (from 11 to 11.8 seconds).

5. Take-away lessons

The goal of this section is to draw some conclusive remarks about the general take-away lessons. Section 5.1 analyses the role of the key exogenous factors (i.e., situational and temporal context) as well as of the endogenous factors (i.e., the general user characteristics) on the answer quality. Section 5.2 analyses the chain effect from reaction time to completion time. Finally, Section 5.3 provides a set of general recommendations, based on the results from this project but also on results from the literature, about how to execute future ESM data collections and experiments.

5.1. The role of exogenous and endogenous factors on the answer quality

Although we trust the quality of our memories, as has the previous work on EMA/ESM compliance (see Section 4.1), research on autobiographical memory teaches us that memory can be unreliable (Bradburn et al., 1987; Tourangeau et al., 2000; Stone et al., 2007). Our recollections are not just inaccurate, they are often systematically biased (Shiffman et al., 2008). The more time elapses from what we want to recall, the greater the risk of making mistakes. Everything pivots around the effects that the different contexts have on time and, in turn, on memory and the cognitive process concerning the response quality. Based on the state of the art (Section 2) and on the analysis we have performed (Section 4), we present below a detailed view of how all these aspects have a direct and/or indirect significant effect on the entire response process and, in turn, how these, taken together, influence the quality of responses.

5.1.1. The situational context

Location, activity, and social context (Tab. 1) influence the reaction time in very different ways (Fig. 7, 7, 9). For example, being “at university”, “alone”, and connected with “social media” significantly increases the likelihood that the user will respond very early to the incoming notification. On the other hand, being “outdoors”, in “leisure”, with a “partner” can significantly increase the delay with which the user decides to respond (Fig. 16). As emerges from this model, one cannot simply detect whether the phone is on/off (Wang et al., 2018b), or whether the subject is changing activity (Pielot et al., 2017; Holland, 1986). The interaction between location, activity and social relation, combined with time, plays a decisive role in the reaction time.

Activities and social context also influence completion time at two different levels (Tab. 2). The first level, as expected, is that the repetition of events (Fig. 13) has the effect of reducing the compilation time. In this case, the cause is the cognitive training process in which, notification after notification, the user learns how to classify different activities in a limited list of alternatives. The greater the probability that an event occurs, the greater the probability that the user remembers its position. The second level is that externally induced distraction affects the completion time; thus, being alone decreases completion time (Tab. 2). That is, when alone the user finds it easier to concentrate on the answer.

5.1.2. The temporal context

The time context matters. Over time (Fig. 9), the user tends to increase the reaction time. However, this delay increases rapidly in the first few days and then seems to find a balance (trade-off) between the task and the time at which it is performed.

The effect of app usage on the completion time is the opposite. In this case, the user tends to reduce the completion time over time, see Fig. 15. However, we can assume that this effect of the learning process is only related to the first observation period. In fact, the indirect effect on the quality of responses seems to become constant after one week (Fig. 5). In other words, over time the user finds a balance between the frequency of the notifications sent by the server (every half an hour), the burden of the task, the compliance with the research, their life routine and their cognitive ability. This means that if we want to observe the user over a long period, we must replace the day of the survey with a more sensitive routine of response.

Finally, there is a third type of time, which we can call the social time (Paetzold, 2008). This type of time refers to the daily activities of the students and to how they change during the week. Students do not necessarily attend classes every day nor do they follow precisely the same routine. Looking at the days of the week, the response delay behaviour is opposite to what one might expect. Longer reaction times do not occur during the period of maximum academic activity, i.e., weekdays, but rather during the weekend, i.e., Saturdays and Sundays. Taking time for oneself and relaxation seems to reduce the attention to the task.

5.1.3. User characteristics

Sex and age were found to have no statistically significant or even moderate effects. Procrastination increases the delay for both reaction time and completion time (Fig. 11 & 15) while a sad mood reduces it (Fig. 11 & 13). This confirms the literature (Forgas, 2013) which states that a lower value of mood corresponds to an increase in attention and also memory. Notice that, from the Q3 model in Section 4.3 (Fig. 5), the mood has a significant direct effect only on reaction time and not on completion time. However, Steel (Steel, 2007) argues that there is a relationship between mood and procrastination. Procrastination is a way of temporarily evading anxiety and therefore of improving mood in the shorter term, but with a negative effect in the longer term (because of the increase of tasks not completed). This opens the possibility of a depressive spiral where depression may lead to procrastination and then to a bad mood. Therefore, in an interactive survey process with the smartphone, there is the possibility that mood influences memory, accuracy, motivation, and therefore the completion time.

5.2. The chain effect of reaction and completion time on the answer quality

Completion and reaction time correlate positively with the distance between home and user, measured as the distance between home and the phone. We write below this distance as the variable “Home $\Leftrightarrow$ phone”. This distance is very important in this analysis, since home is the location for which the correctness of the user answers is computed. This correlation has a series of negative consequences on the answer quality, as shown in Fig. 4(a) & 4(b). In fact, a longer reaction time (Section 4.3, Fig. 5) has two effects. The first is a direct effect on increasing the distance “Home $\Leftrightarrow$ phone” (the more the user waits, the more s/he is moving away from home). The second, indirect effect occurs through the increase of the “pending notifications” (the more the user waits, the more unanswered questions accumulate). This effect, in turn (Section 4.2, Tab. 2 & Section 4.3, Fig. 5), induces a shorter completion time which in turn should decrease the error, as from the previous analysis. This result does not contradict our analysis, but it shows the twofold way in which an error can appear. With long completion times we may have memory errors. However, when the same activity is repeated many times (e.g., a student quickly and “automatically” filling a sequence of four notifications with the same answer at the end of a two-hour class), there is a high risk of reducing attention with an increased risk of typing errors.

5.3. Recommendations for future ESM studies

The key lesson learned from this work is that the reaction time is the crucial factor to be controlled in order to improve the answer quality, see the discussion at the end of Section 4.3. The completion time is anyhow much harder to control, because of the many contradicting factors influencing it and also because of the small difference between the time taken to complete a correct answer and the time taken to complete an incorrect answer (see again the discussion at the end of Section 4.3). In order to improve the reaction time, it is of paramount importance to have accurate information about where the user is, what he is doing and with whom, namely the current context in its various dimensions, e.g., situational, social, temporal. As it emerges from this model, one cannot simply note whether the phone is on/off (Wang et al., 2018b) or whether the subject is changing activity (Pielot et al., 2017; Holland, 1986). It is the combination of the different contextual dimensions, combined with time, that defines with greater precision when a user is most likely to respond, and to provide a correct answer.

There is evidence that, if we want to study human activities in the wild, we probably cannot limit our observation windows to a couple of weeks, as also suggested in (Wang et al., 2018b), nor to a single historical moment. Human activities change according to the social time, e.g., weekdays and weekends, change according to the season, and change according to social relations. There is no single best practice for doing ESM research. This depends on (a) the research question and research design (e.g., event-based sampling; time-based sampling; a combination of time- and event-based sampling); (b) the phenomenon under investigation, its frequency and regularity of occurrence, complexity, and interaction with other factors that need to be controlled (e.g., social context); and (c) the duration of the survey. So, as with traditional survey research, we need to continue to evaluate and report the limits and our strategies. The goal is to limit errors and their consequences.

Based on these considerations, and also on the experience gained in the work described in this paper, we can provide the following recommendations about how to concretely organize the various aspects of an EMA/ESM experiment.

(a) Avoid voluntary participation. As several research studies have shown, participants must be paid (Aminikhanghahi et al., 2019; Keusch et al., 2019). The absence of payment increases user dropout and non-cooperation.

(b) Avoid short time limits for responding to notifications. Over time, the user finds his or her own response routine. On the one hand, a long reaction time reduces the workload; on the other hand, it can increase memory errors. However, it is better to have a larger number of responses, which can be possibly excluded from the analysis, than no information at all.

(c) Avoid complex cognitive tasks and limit the questions to simple, factual information, especially in the case of a long observation period (weeks or more).

(d) Evaluate the data collection duration and the notification frequencies according to the topics under study. If habits or specific daily routines are studied, the observation time and notification frequencies should be designed on the temporal shape of the evolution of the phenomenon, e.g., days or weeks.

(e) Consider both too short and too long completion times as part of the response set (in these situations, the user often gives the same answer). These answers are useful to compute the median completion time and the diversity of answers (for instance, when compared with the other users).

(f) Use the smartphone sensors to evaluate the probability that an answer is correct, for example, to estimate the probability of the user being distracted, or to predict the reaction time.

(g) Prioritize sending notifications when the phone is in active use (better when the user is on social media), and when the user is alone (better in places where there is a high probability that the mood is lower or the subject is bored, e.g., travel or self-care). A good example is Profile 1 of Fig. 16, where Fig. 16 provides an estimated survival curve of the reaction time for three different participant profiles, as follows:

• Profile 1: (What) Social media/chat – (With whom) Alone – (Where) University – (Mood) Sad – (Procrastination syndrome) Low. The median reaction time is 4 min.;

• Profile 2: (What) Free time – (With whom) Partner – (Where) Home – (Mood) Happy – (Procrastination syndrome) High. The median reaction time is 27 min.;

• Profile 3: (What) Study – (With whom) Classmate – (Where) University – (Mood) Neutral – (Procrastination syndrome) Average. The median reaction time is 11 min.

(h) Finally, avoid as much as possible involving users with high procrastination syndrome. Look, for example, at Profile 2 in Fig. 16. When 50.0% of Profile 2 users have filled out the notification, about 75.0% of Profile 3 and 95.0% of Profile 1 users have done the same.
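Recommendations (f) and (g) suggest scoring a candidate moment by its context before sending a notification. The sketch below illustrates one simple way to do so, by summing the Cox coefficients of Tab. 1 for the situational part of the three profiles of Fig. 16. It is an illustration under stated assumptions, not the procedure used to draw Fig. 16: mood, procrastination and the time context are left out because their numeric values for the profiles are not given, and the dictionary layout and function name are hypothetical.

```python
import math

# Cox coefficients from Tab. 1; reference categories have coefficient 0.
COX_B = {
    "where": {"Home": 0.0, "University": 0.188},
    "what": {"Social media/chat": 0.214, "Free time": -0.298, "Study": -0.174},
    "with": {"Alone": 0.0, "Partner": -0.291, "Classmate": -0.222},
}


def context_score(where: str, what: str, with_whom: str) -> float:
    """Sum of the situational-context coefficients (a partial linear predictor)."""
    return COX_B["where"][where] + COX_B["what"][what] + COX_B["with"][with_whom]


profile_1 = context_score("University", "Social media/chat", "Alone")
profile_2 = context_score("Home", "Free time", "Partner")
profile_3 = context_score("University", "Study", "Classmate")

# Ratio of the hazards of reacting in the Profile 1 and Profile 2 contexts,
# all other covariates being equal.
relative_hazard_1_vs_2 = math.exp(profile_1 - profile_2)

print(f"scores: P1={profile_1:.3f}, P2={profile_2:.3f}, P3={profile_3:.3f}")
print(f"hazard of reacting, Profile 1 vs Profile 2: {relative_hazard_1_vs_2:.2f}x")
```

The ordering of the three scores (Profile 1 highest, Profile 2 lowest) matches the ordering of the median reaction times listed for the profiles, which is what one would expect if the situational context drives most of the difference.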

Two conclusive remarks. The first is that, assuming that the focus should be on how to improve the reaction time, the next research question that needs to be answered is: What is the best time to ask a question? But this raises the question of how to do it. Our answer is that we should focus on minimizing the exogenous effects due to the context history. Future research will need to develop a holistic approach to the problem, so that systems will be able to learn what the right time to ask a question is, taking into account, e.g., the best situational contexts and, as far as possible, the subject’s endogenous factors. In this perspective, some of them, e.g., procrastination syndrome and personality, can be computed once and for all and can therefore be taken as input parameters to the machine learning algorithm. The second remark is that the analysis provided in this paper is based on the data collected from a population of students. The selection of the sample is motivated by the fact that students are easier to reach and also more prone to innovation and research experiments. It is not by chance that the choice made here follows a long tradition of papers working on this population, see, e.g., (Wang et al., 2015, 2018b; Zhang et al., 2021; Giunchiglia et al., 2018; Ben-Zeev et al., 2017). The full generality of the results provided can, of course, be achieved in future studies by extending this type of study to other populations.

6. Conclusion

In this work, we have investigated the effects of various factors on the correctness of answers. Our study, based on an empirical analysis of a large dataset of daily behaviours, captures a rich and multifaceted picture of individual behaviours on potential interaction errors.

Our results suggest that while they are very useful in predicting certain contextual patterns, subjective annotations present a certain degree of error due to both exogenous and endogenous factors affecting the quality of responses. When focusing on research studies where the user is asked to provide data to a third party, these problems are in addition to many others which are already known. Some examples are: the social desirability effect that may prevent the study participant from reporting certain (socially disapproved) activities (Corbetta, 2003); unreported activities when the participant perceives them as an intrusion on his or her privacy (Callegaro et al., 2015); and the incorrect design of the data collection instrument, for example, the lack of an exhaustive list of response alternatives that are allowed to the respondent (Blaikie and Priest, 2019; Groves et al., 2011). Furthermore, in practical applications, collecting self-reported annotations is not always an option.

References

Alessandretti et al. (2020) Laura Alessandretti, Ulf Aslak, and Sune Lehmann. 2020. The scales of human mobility. Nature 587, 7834 (2020), 402–407.
Alessandretti et al. (2018) Laura Alessandretti, Piotr Sapiezynski, Vedran Sekara, Sune Lehmann, and Andrea Baronchelli. 2018. Evidence for a conserved quantity in human mobility. Nature human behaviour 2, 7 (2018), 485–491.
Allison (2018) Paul D Allison. 2018. Event history and survival analysis. In The reviewer’s guide to quantitative methods in the social sciences. Routledge, 86–97.
Amaya et al. (2020) Ashley Amaya, Paul P Biemer, and David Kinyon. 2020. Total error in a big data world: adapting the TSE framework to big data. Journal of Survey Statistics and Methodology 8, 1 (2020), 89–119.
Aminikhanghahi et al. (2019) Samaneh Aminikhanghahi, Maureen Schmitter-Edgecombe, and Diane J Cook. 2019. Context-aware delivery of ecological momentary assessment. IEEE journal of biomedical and health informatics 24, 4 (2019), 1206–1214.
Balto et al. (2016) Julia M Balto, Dominique L Kinnett-Hopkins, and Robert W Motl. 2016. Accuracy and precision of smartphone applications and commercially available motion sensors in multiple sclerosis. Multiple Sclerosis Journal–Experimental, Translational and Clinical 2 (2016), 2055217316634754.
Beilock and DeCaro (2007) Sian L Beilock and Marci S DeCaro. 2007. From poor performance to success under stress: working memory, strategy selection, and mathematical problem solving under pressure. Journal of Experimental Psychology: Learning, Memory, and Cognition 33, 6 (2007), 983.
Ben-Zeev et al. (2017) Dror Ben-Zeev, Rachel Brian, Rui Wang, Weichen Wang, Andrew T Campbell, Min SH Aung, Michael Merrill, Vincent WS Tseng, Tanzeem Choudhury, Marta Hauser, et al. 2017. CrossCheck: Integrating self-report, behavioral sensing, and smartphone use to identify digital indicators of psychotic relapse. Psychiatric rehabilitation journal 40, 3 (2017), 266.
Berke et al. (2011) Ethan M Berke, Tanzeem Choudhury, Shahid Ali, and Mashfiqui Rabbi. 2011. Objective measurement of sociability and activity: mobile sensing in the community. The Annals of Family Medicine 9, 4 (2011), 344–350.
Bettini et al. (2010) Claudio Bettini, Oliver Brdiczka, Karen Henricksen, Jadwiga Indulska, Daniela Nicklas, Anand Ranganathan, and Daniele Riboni. 2010. A survey of context modelling and reasoning techniques. Pervasive and mobile computing 6, 2 (2010), 161–180.
Bison et al. (2021) Ivano Bison, Fausto Giunchiglia, Mattia Zeni, Enrico Bignotti, Matteo Busso, and Ronald Chenu-Abente. 2021. Trento 2018 - An extended pilot on the daily routines of University students. DataSet soon to be available at https://ri.internetofus.eu. University of Trento Technical Report - DataScientia dataset descriptors.
Blaikie and Priest (2019) N. Blaikie and J. Priest. 2019. Designing social research: The logic of anticipation. John Wiley, New York.
Bolker et al. (2009) Benjamin M Bolker, Mollie E Brooks, Connie J Clark, Shane W Geange, John R Poulsen, M Henry H Stevens, and Jada-Simone S White. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in ecology & evolution 24, 3 (2009), 127–135.
Bontempelli et al. (2022) Andrea Bontempelli, Marcelo Rodas Britez, Xiaoyue Li, Haonan Zhao, Luca Erculiani, Stefano Teso, Andrea Passerini, and Fausto Giunchiglia. 2022. Lifelong Personal Context Recognition. https://doi.org/10.48550/ARXIV.2205.10123
Boukhechba et al. (2018) Mehdi Boukhechba, Lihua Cai, Philip I Chow, Karl Fua, Matthew S Gerber, Bethany A Teachman, and Laura E Barnes. 2018. Contextual analysis to understand compliance with smartphone-based ecological momentary assessment. In Proceedings of the 12th EAI International Conference on Pervasive Computing Technologies for Healthcare. 232–238.
Bowers (1939) Raymond V Bowers. 1939. Time Budgets of Human Behavior.
Bradburn et al. (1987) Norman M Bradburn, Lance J Rips, and Steven K Shevell. 1987. Answering autobiographical questions: The impact of memory and inference on surveys. Science 236, 4798 (1987), 157–161.
Bradley and Dunlop (2005) Nicholas A Bradley and Mark D Dunlop. 2005. Toward a multidisciplinary model of context to support context-aware computing. Human-Computer Interaction 20, 4 (2005), 403–446.
Brockmann et al. (2006) Dirk Brockmann, Lars Hufnagel, and Theo Geisel. 2006. The scaling laws of human travel. Nature 439, 7075 (2006), 462–465.
Bronfenbrenner (1977) Urie Bronfenbrenner. 1977. Toward an experimental ecology of human development. American psychologist 32, 7 (1977), 513.
Callegaro et al. (2015) M. Callegaro, K. L. Manfreda, and V. Vehovar. 2015. Web survey methodology. Sage.
Cannell et al. (1981) Charles F Cannell, Peter V Miller, and Lois Oksenberg. 1981. Research on interviewing techniques. Sociological methodology 12 (1981), 389–437.
Chen and Kotz (2000) Guanling Chen and David Kotz. 2000. A survey of context-aware mobile computing research. (2000).
Chen and Wang (2014) Han-Ching Chen and Nae-Sheng Wang. 2014. The assignment of scores procedure for ordinal categorical data. The Scientific World Journal 2014 (2014).
Comulada (2021) W Scott Comulada. 2021. Calculating level-specific SEM fit indices for multilevel mediation analyses. The Stata Journal 21, 1 (2021), 195–205.
Corbetta (2003) P. Corbetta. 2003. Social research: Theory, methods and techniques. Sage.
Cowan (2010) Nelson Cowan. 2010. The magical mystery four: How is working memory capacity limited, and why? Current directions in psychological science 19, 1 (2010), 51–57.
Cowan (2012) Nelson Cowan. 2012. Working memory capacity. Psychology press.
Cox (1972) David R Cox. 1972. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 2 (1972), 187–202.
Csikszentmihalyi (1992) M Csikszentmihalyi. 1992. The experience of psychopathology: Investigating mental disorders in their natural settings. Cambridge University Press.
Cuttone et al. (2018) Andrea Cuttone, Sune Lehmann, and Marta C González. 2018. Understanding predictability and exploration in human mobility. EPJ Data Science 7 (2018), 1–17.
Do and Gatica-Perez (2012) Trinh Minh Tri Do and Daniel Gatica-Perez. 2012. Contextual conditional models for smartphone-based human mobility prediction. In Proceedings of the 2012 ACM conference on ubiquitous computing. 163–172.
Eagle et al. (2009) Nathan Eagle, Alex Pentland, and David Lazer. 2009. Inferring friendship network structure by using mobile phone data. Proceedings of the national academy of sciences 106, 36 (2009), 15274–15278.
Eagle and Pentland (2009) Nathan Eagle and Alex Sandy Pentland. 2009. Eigenbehaviors: Identifying structure in routine. Behavioral ecology and sociobiology 63, 7 (2009), 1057–1066.
Fielding et al. (2016) Nigel G Fielding, Grant Blank, and Raymond M Lee. 2016. The SAGE handbook of online research methods. (2016).
Fischer et al. (2011) Joel E Fischer, Chris Greenhalgh, and Steve Benford. 2011. Investigating episodes of mobile phone activity as indicators of opportune moments to deliver notifications. In Proceedings of the 13th international conference on human computer interaction with mobile devices and services. 181–190.
Fischer et al. (2010) Joel E Fischer, Nick Yee, Victoria Bellotti, Nathan Good, Steve Benford, and Chris Greenhalgh. 2010. Effects of content and time of delivery on receptivity to mobile interruptions. In Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 103–112.
Force (2010) AAPOR Cell Phone Task Force. 2010. New considerations for survey researchers when planning and conducting RDD telephone surveys in the US with respondents reached via cell phone numbers. Deerfield, IL: American Association for Public Opinion Research (2010).
Forgas (2013) Joseph P Forgas. 2013. Don’t worry, be sad! On the cognitive, motivational, and interpersonal benefits of negative mood. Current Directions in Psychological Science 22, 3 (2013), 225–232.
Giunchiglia (1993) F. Giunchiglia. 1993. Contextual reasoning. Epistemologia, special issue: I Linguaggi e le Macchine 16 (1993), 345–364.
Giunchiglia et al. (2017) Fausto Giunchiglia, Enrico Bignotti, and Mattia Zeni. 2017. Personal context modelling and annotation. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE, 117–122.
Giunchiglia et al. (2018) Fausto Giunchiglia, Mattia Zeni, Elisa Gobbi, Enrico Bignotti, and Ivano Bison. 2018. Mobile social media usage and academic performance. Computers in Human Behavior 82 (2018), 177–185.
Goldstein (2011) Harvey Goldstein. 2011. Multilevel statistical models. John Wiley, New York.
Gonzalez et al. (2008) Marta C Gonzalez, Cesar A Hidalgo, and Albert-Laszlo Barabasi. 2008. Understanding individual human mobility patterns. Nature 453, 7196 (2008), 779–782.
Grammenos et al. (2018) Andreas Grammenos, Cecilia Mascolo, and Jon Crowcroft. 2018. You are sensing, but are you biased? a user unaided sensor calibration approach for mobile sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 1–26.
Groves et al. (2011) R. M. Groves, F. J. Fowler Jr, M. P. Couper, J. M. Lepkowski, E. Singer, and R. Tourangeau. 2011. Survey methodology. John Wiley, New York.
Harari et al. (2016) Gabriella M Harari, Nicholas D Lane, Rui Wang, Benjamin S Crosier, Andrew T Campbell, and Samuel D Gosling. 2016. Using smartphones to collect behavioral data in psychological science: Opportunities, practical considerations, and challenges. Perspectives on Psychological Science 11, 6 (2016), 838–854.
Hektner et al. (2007) Joel M Hektner, Jennifer A Schmidt, and Mihaly Csikszentmihalyi. 2007. Experience sampling method: Measuring the quality of everyday life. Sage.
Helaoui et al. (2013) Rim Helaoui, Daniele Riboni, and Heiner Stuckenschmidt. 2013. A probabilistic ontological framework for the recognition of multilevel human activities. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing. 345–354.
Heron et al. (2017) Kristin E Heron, Robin S Everhart, Susan M McHale, and Joshua M Smyth. 2017. Using mobile-technology-based ecological momentary assessment (EMA) methods with youth: A systematic review and recommendations. Journal of pediatric psychology 42, 10 (2017), 1087–1107.
Ho and Intille (2005) Joyce Ho and Stephen S Intille. 2005. Using context-aware computing to reduce the perceived burden of interruptions from mobile devices. In Proceedings of the SIGCHI conference on Human factors in computing systems. 909–918.
Holland (1986) Paul W Holland. 1986. Statistics and causal inference. Journal of the American statistical Association 81, 396 (1986), 945–960.
Holtzblatt and Beyer (1997) Karen Holtzblatt and Hugh Beyer. 1997. Contextual design: defining customer-centered systems. Elsevier.
Hormuth (1986) Stefan E Hormuth. 1986. The sampling of experiences in situ. Journal of personality 54, 1 (1986), 262–293.
Hox et al. (2013) Joop J Hox et al. 2013. Multilevel regression and multilevel structural equation modeling. The Oxford handbook of quantitative methods 2, 1 (2013), 281–294.
Huang et al. (2016) Yu Huang, Haoyi Xiong, Kevin Leach, Yuyan Zhang, Philip Chow, Karl Fua, Bethany A Teachman, and Laura E Barnes. 2016. Assessing social anxiety using GPS trajectories and point-of-interest data. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 898–903.
Hyman Jr et al. (2010) Ira E Hyman Jr, S Matthew Boss, Breanne M Wise, Kira E McKenzie, and Jenna M Caggiano. 2010. Did you see the unicycling clown? Inattentional blindness while walking and talking on a cell phone. Applied Cognitive Psychology 24, 5 (2010), 597–607.
Intille (2016) Stephen Intille. 2016. The precision medicine initiative and pervasive health research. IEEE Pervasive Computing 15, 1 (2016), 88–91.
Iqbal and Horvitz (2010) Shamsi T Iqbal and Eric Horvitz. 2010. Notifications and awareness: a field study of alert usage and preferences. In Proceedings of the 2010 ACM conference on Computer supported cooperative work. 27–30.
Jäckle et al. (2017) Annette Jäckle, Jonathan Burton, Mick P Couper, and Carli Lessof. 2017. Participation in a mobile app survey to collect expenditure data as part of a large-scale probability household panel: Response rates and response biases. Institute for Social and Economic Research, University of Essex: Understanding Society Working Paper Series No 9 (2017).
Joreskog et al. (1979) Karl G Joreskog, Dag Sörbom, and Jay Magidson. 1979. Advances in factor analysis and structural equation models. Abt books.
Kellogg (2015) Ronald T Kellogg. 2015. Fundamentals of cognitive psychology. Sage Publications.
Keusch et al. (2019) Florian Keusch, Bella Struminskaya, Christopher Antoun, Mick P Couper, and Frauke Kreuter. 2019. Willingness to participate in passive mobile data collection. Public opinion quarterly 83, S1 (2019), 210–235.
Killgore (1999) William D Scott Killgore. 1999. The visual analogue mood scale: can a single-item scale accurately classify depressive mood state? Psychological reports 85, 3_suppl (1999), 1238–1243.
Krosnick (1991) Jon A Krosnick. 1991. Response strategies for coping with the cognitive demands of attitude measures in surveys. Applied cognitive psychology 5, 3 (1991), 213–236.
Krosnick et al. (1987) Jon A Krosnick, Duane F Alwin, Charles Cannell, et al. 1987. Satisficing: A strategy for dealing with the demands of survey questions. (1987).
Kushlev et al. (2016) Kostadin Kushlev, Jason Proulx, and Elizabeth W Dunn. 2016. “Silence your phones”: Smartphone notifications increase inattention and hyperactivity symptoms. In Proceedings of the 2016 CHI conference on human factors in computing systems. 1011–1020.
Lai et al. (2009) Jennie Lai, Lorelle Vanno, Michael Link, Jennie Pearson, Hala Makowska, Karen Benezra, and Mark Green. 2009. Life360: usability of mobile devices for time use surveys. In American Association for Public Opinion Research annual conference, Hollywood, FL. 5582–5589.
Lane et al. (2010) Nicholas D Lane, Emiliano Miluzzo, Hong Lu, Daniel Peebles, Tanzeem Choudhury, and Andrew T Campbell. 2010. A survey of mobile phone sensing. IEEE Communications magazine 48, 9 (2010), 140–150.
Larson and Csikszentmihalyi (2014) Reed Larson and Mihaly Csikszentmihalyi. 2014. The experience sampling method. In Flow and the foundations of positive psychology. Springer, 21–34.
Lavrakas et al. (2007) Paul J Lavrakas, Charles D Shuttles, Charlotte Steeh, and Howard Fienberg. 2007. The state of surveying cell phone numbers in the United States: 2007 and beyond. Public Opinion Quarterly 71, 5 (2007), 840–854.
Lavrakas et al. (2010) Paul J Lavrakas, Trevor N Tompson, Robert Benford, and Christopher Fleury. 2010. Investigating data quality in cell phone surveying. In annual American Association for Public Opinion Research conference, Chicago, Illinois. 13–16.
Lee and Dey (2015) Matthew L Lee and Anind K Dey. 2015. Sensor-based observations of daily living for aging in place. Personal and Ubiquitous Computing 19, 1 (2015), 27–43.
Link et al. (2014) Michael W Link, Joe Murphy, Michael F Schober, Trent D Buskirk, Jennifer Hunter Childs, and Casey Langer Tesfaye. 2014. Mobile technologies for conducting, augmenting and potentially replacing surveys: Executive summary of the AAPOR task force on emerging technologies in public opinion research. Public Opinion Quarterly 78, 4 (2014), 779–787.
Lorr (1989) Maurice Lorr. 1989. Chapter 2 - Models and methods for measurement of mood. In The Measurement of Emotions, Robert Plutchik and Henry Kellerman (Eds.). Academic Press, 37–53. https://doi.org/10.1016/B978-0-12-558704-4.50008-6
Lynn and Kaminska (2013) Peter Lynn and Olena Kaminska. 2013. The impact of mobile phones on survey measurement error. Public Opinion Quarterly 77, 2 (2013), 586–605.
Mann and Whitney (1947) Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60.
McCabe et al. (2012) Kira O McCabe, Lori Mack, and William Fleeson. 2012. A guide for data cleaning in experience sampling studies. (2012).
Mehrotra et al. (2015) Abhinav Mehrotra, Mirco Musolesi, Robert Hendley, and Veljko Pejovic. 2015. Designing content-driven intelligent notification mechanisms for mobile applications. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 813–824.
Miritello et al. (2013) Giovanna Miritello, Rubén Lara, Manuel Cebrian, and Esteban Moro. 2013. Limited communication capacity unveils strategies for human interaction. Scientific reports 3, 1 (2013), 1–7.
Mishra et al. (2017) Varun Mishra, Byron Lowens, Sarah Lord, Kelly Caine, and David Kotz. 2017. Investigating contextual cues as indicators for EMA delivery. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers. 935–940.
Oh et al. (2015) Hyungik Oh, Laleh Jalali, and Ramesh Jain. 2015. An intelligent notification system using context from real-time personal activity monitoring. In 2015 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
Oulasvirta et al. (2005) Antti Oulasvirta, Sakari Tamminen, Virpi Roto, and Jaana Kuorelahti. 2005. Interaction in 4-second bursts: the fragmented nature of attentional resources in mobile HCI. In Proceedings of the SIGCHI conference on Human factors in computing systems. 919–928.
Paetzold (2008) Heinz Paetzold. 2008. REVIEW ESSAY: Respect and toleration reconsidered (Under consideration: Rainer Forst’s Toleranz im Konflikt: Geschichte, Gehalt, und Gegenwart eines umstrittenen Begriffs (Frankfurt am Main: Suhrkamp, 2003)(English translation forthcoming, Cambridge University Press)). Philosophy & social criticism 34, 8 (2008), 941–954.
Peltonen et al. (2020) Ella Peltonen, Parsa Sharmila, Kennedy Opoku Asare, Aku Visuri, Eemil Lagerspetz, and Denzil Ferreira. 2020. When phones get personal: Predicting Big Five personality traits from application usage. Pervasive and Mobile Computing 69 (2020), 101269.
Pielot et al. (2017) Martin Pielot, Bruno Cardoso, Kleomenis Katevas, Joan Serrà, Aleksandar Matic, and Nuria Oliver. 2017. Beyond interruptibility: Predicting opportune moments to engage mobile phone users. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 1–25.
Rabbi et al. (2015) Mashfiqui Rabbi, Min Hane Aung, Mi Zhang, and Tanzeem Choudhury. 2015. MyBehavior: automatic personalized health feedback from user behaviors and preferences using smartphones. In Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing. 707–718.
Radu et al. (2018) Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D Lane, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar. 2018. Multimodal deep learning for activity and context recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 1–27.
Raento et al. (2009) Mika Raento, Antti Oulasvirta, and Nathan Eagle. 2009. Smartphones: An emerging tool for social scientists. Sociological methods & research 37, 3 (2009), 426–454.
Read (2019) Brendan Read. 2019. Respondent burden in a mobile app: Evidence from a shopping receipt scanning study. In Survey Research Methods, Vol. 13. European Survey Research Association, 45–71.
Riboni and Bettini (2011) Daniele Riboni and Claudio Bettini. 2011. OWL 2 modeling and reasoning with complex human activities. Pervasive and Mobile Computing 7, 3 (2011), 379–395.
Royston (2006) Patrick Royston. 2006. Explained variation for survival models. The Stata Journal 6, 1 (2006), 83–96.
Rusting (1998) Cheryl L Rusting. 1998. Personality, mood, and cognitive processing of emotional information: three conceptual frameworks. Psychological bulletin 124, 2 (1998), 165.
Sarmadi et al. (2023) Hassan Sarmadi, Alireza Entezami, Ka-Veng Yuen, and Bahareh Behkamal. 2023. Review on smartphone sensing technology for structural health monitoring. Measurement 223 (2023), 113716.
Schilit et al. (1994) Bill Schilit, Norman Adams, and Roy Want. 1994. Context-aware computing applications. In 1994 first workshop on mobile computing systems and applications. IEEE, 85–90.
Schmidt et al. (2007) Christina Schmidt, Fabienne Collette, Christian Cajochen, and Philippe Peigneux. 2007. A time to think: circadian rhythms in human cognition. Cognitive neuropsychology 24, 7 (2007), 755–789.
Sen et al. (2019) Indira Sen, Fabian Floeck, Katrin Weller, Bernd Weiss, and Claudia Wagner. 2019. A total error framework for digital traces of humans. arXiv preprint arXiv:1907.08228 (2019).
Shiffman et al. (2008) Saul Shiffman, Arthur A Stone, and Michael R Hufford. 2008. Ecological momentary assessment. Annu. Rev. Clin. Psychol. 4 (2008), 1–32.
Singer et al. (2003) Judith D Singer, John B Willett, et al. 2003. Applied longitudinal data analysis: Modeling change and event occurrence. Oxford university press.
Song et al. (2010) Chaoming Song, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. 2010. Limits of predictability in human mobility. Science 327, 5968 (2010), 1018–1021.
Steel (2007) Piers Steel. 2007. The nature of procrastination: a meta-analytic and theoretical review of quintessential self-regulatory failure. Psychological bulletin 133, 1 (2007), 65.
Steel (2010) Piers Steel. 2010. Arousal, avoidant and decisional procrastinators: Do they exist? Personality and Individual Differences 48, 8 (2010), 926–934.
Steele (2008) Fiona Steele. 2008. Multilevel models for longitudinal data. Journal of the Royal Statistical Society: series A (statistics in society) 171, 1 (2008), 5–19.
Stone et al. (2007) Arthur Stone, Saul Shiffman, Audie Atienza, and Linda Nebeling. 2007. The science of real-time data capture: Self-reports in health research. Oxford University Press.
Stone et al. (1991) Arthur A Stone, Ronald C Kessler, and Jennifer A Haythornthwaite. 1991. Measuring daily events and experiences: Decisions for the researcher. Journal of personality 59, 3 (1991), 575–607.
Stone and Shiffman (2002) Arthur A Stone and Saul Shiffman. 2002. Capturing momentary, self-report data: A proposal for reporting guidelines. Annals of Behavioral Medicine 24, 3 (2002), 236–243.
Struminskaya et al. (2020) Bella Struminskaya, Peter Lugtig, Florian Keusch, and Jan Karem Höhne. 2020. Augmenting surveys with data from sensors and apps: Opportunities and challenges. Social Science Computer Review (2020), 0894439320979951.
Suchman (1987) Lucille Alice Suchman. 1987. Plans and situated actions: The problem of human-machine communication. Cambridge university press.
Sudman and Bradburn (1973) Seymour Sudman and Norman M Bradburn. 1973. Effects of time and memory factors on response in surveys. J. Amer. Statist. Assoc. 68, 344 (1973), 805–815.
Sun et al. (2021) Jessie Sun, Mijke Rhemtulla, and Simine Vazire. 2021. Eavesdropping on missing data: What are university students doing when they miss experience sampling reports? Personality and Social Psychology Bulletin 47, 11 (2021), 1535–1549.
Tekle and Vermunt (2012) Fetene B Tekle and Jeroen K Vermunt. 2012. Event history analysis. (2012).
Tourangeau et al. (1984) Roger Tourangeau et al. 1984. Cognitive sciences and survey methods. Cognitive aspects of survey methodology: Building a bridge between disciplines 15 (1984), 73–100.
Tourangeau et al. (2000) Roger Tourangeau, Lance J Rips, and Kenneth Rasinski. 2000. The psychology of survey response. (2000).
Vaizman et al. (2017) Yonatan Vaizman, Katherine Ellis, and Gert Lanckriet. 2017. Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE pervasive computing 16, 4 (2017), 62–74.
Vaizman et al. (2018) Yonatan Vaizman, Nadir Weibel, and Gert Lanckriet. 2018. Context recognition in-the-wild: Unified model for multi-modal sensors and multi-label classification. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 1–22.
Van Berkel et al. (2017) Niels Van Berkel, Denzil Ferreira, and Vassilis Kostakos. 2017. The experience sampling method on mobile devices. ACM Computing Surveys (CSUR) 50, 6 (2017), 1–40.
van Berkel et al. (2020) Niels van Berkel, Jorge Goncalves, Simo Hosio, Zhanna Sarsenbayeva, Eduardo Velloso, and Vassilis Kostakos. 2020. Overcoming compliance bias in self-report studies: A cross-study analysis. International Journal of Human-Computer Studies 134 (2020), 1–12.
van Berkel et al. (2019) Niels van Berkel, Jorge Goncalves, Peter Koval, Simo Hosio, Tilman Dingler, Denzil Ferreira, and Vassilis Kostakos. 2019. Context-informed scheduling and analysis: improving accuracy of mobile self-reports. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12.
Wang et al. (2014) Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T Campbell. 2014. StudentLife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing. 3–14.
Wang et al. (2015) Rui Wang, Gabriella Harari, Peilin Hao, Xia Zhou, and Andrew T Campbell. 2015. SmartGPA: how smartphones can assess and predict academic performance of college students. In Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing. 295–306.
Wang et al. (2018b) Rui Wang, Weichen Wang, Alex DaSilva, Jeremy F Huckins, William M Kelley, Todd F Heatherton, and Andrew T Campbell. 2018b. Tracking depression dynamics in college students using mobile phone and wearable sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 1–26.
Wang et al. (2018a) Weichen Wang, Gabriella M Harari, Rui Wang, Sandrine R Müller, Shayan Mirjafari, Kizito Masaba, and Andrew T Campbell. 2018a. Sensing behavioral change over time: Using within-person variability features from mobile sensing to predict personality traits. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 3 (2018), 1–21.
Wang et al. (2020) Weichen Wang, Shayan Mirjafari, Gabriella Harari, Dror Ben-Zeev, Rachel Brian, Tanzeem Choudhury, Marta Hauser, John Kane, Kizito Masaba, Subigya Nepal, et al. 2020. Social sensing: assessing social functioning of patients living with schizophrenia using mobile phone sensing. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–15.
Wang et al. (2021) Yanda Wang, Weitong Chen, Dechang Pi, Lin Yue, Sen Wang, and Miao Xu. 2021. Self-Supervised Adversarial Distribution Regularization for Medication Recommendation.. In IJCAI. 3134–3140.
Weisberg (2009) Herbert F Weisberg. 2009. The total survey error approach: A guide to the new science of survey research. University of Chicago Press.
Wenz et al. (2019) Alexander Wenz, Annette Jäckle, and Mick P Couper. 2019. Willingness to use mobile technologies for data collection in a probability household panel. In Survey Research Methods, Vol. 13. European Survey Research Association, 1–22.
West et al. (2002) Robert West, Kelly J Murphy, Maria L Armilio, Fergus IM Craik, and Donald T Stuss. 2002. Effects of time of day on age differences in working memory. The Journals of Gerontology Series B: Psychological Sciences and Social Sciences 57, 1 (2002), P3–P10.
Wilhelm et al. (2012) Peter Wilhelm, Meinrad Perrez, and Kurt Pawlik. 2012. Conducting research in daily life: A historical review. (2012).
Yue et al. (2021a) Lin Yue, Hao Shen, Sen Wang, Robert Boots, Guodong Long, Weitong Chen, and Xiaowei Zhao. 2021a. Exploring BCI control in smart environments: intention recognition via EEG representation enhancement learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 5 (2021), 1–20.
Yue et al. (2020) Lin Yue, Dongyuan Tian, Weitong Chen, Xuming Han, and Minghao Yin. 2020. Deep learning for heterogeneous medical data analysis. World Wide Web 23, 5 (2020), 2715–2737.
Yue et al. (2021b) Lin Yue, Dongyuan Tian, Jing Jiang, Lina Yao, Weitong Chen, and Xiaowei Zhao. 2021b. Intention recognition from spatio-temporal representation of EEG signals. In Australasian Database Conference. Springer, 1–12.
Yue et al. (2019) Lin Yue, Haonan Zhao, Yiqin Yang, Dongyuan Tian, Xiaowei Zhao, and Minghao Yin. 2019. A mimic learning method for disease risk prediction with incomplete initial data. In International Conference on Database Systems for Advanced Applications. Springer, 392–396.
Zeni et al. (2021) Mattia Zeni, Ivano Bison, Fernando Reis, Britta Gauckler, and Fausto Giunchiglia. 2021. Improving time use measurement with personal big data collection–the experience of the European Big Data Hackathon 2019. Journal of Official Statistics 37, 2 (2021), 341–365.
Zeni et al. (2014) Mattia Zeni, Ilya Zaihrayeu, and Fausto Giunchiglia. 2014. Multi-device activity logging. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. 299–302.
Zhang et al. (2021) Wanyi Zhang, Qiang Shen, Stefano Teso, Bruno Lepri, Andrea Passerini, Ivano Bison, and Fausto Giunchiglia. 2021. Putting human behavior predictability in context. EPJ Data Science 10, 1 (2021), 42.
Zhao et al. (2024) Haonan Zhao, Ivan Kayongo, Leonardo Malcotti, and Fausto Giunchiglia. 2024. Human-AI Collaborative Big-Thick Data Collection. arXiv preprint arXiv:2404.17602 (2024).