Research on User Behaviors and Tolerance of Faulty Web Interactions

Even computer systems that appear to work flawlessly may behave differently behind the scenes, and faulty web interactions can occur even on popular websites and software. Understanding user behavior is vital for developers who want to create a satisfying computer system. The aim of this study was threefold: first, to determine whether users' tolerance of different kinds of faulty web interactions changes depending on the environment; second, to find out how users' behaviors differ when they encounter a faulty web interaction; and third, to detect how faulty web interactions shape users' perceptions. To achieve these aims, we conducted a test on a manipulated mobile e-commerce website with 11 tasks, five of which were faulty. Participants were not informed that the test included faulty tasks. The faulty tasks consisted of different kinds of web errors: Not Responding, Blank Page, Connection Timeout (HTTP-500), Not Found (HTTP-404), and Redirect (HTTP-301). The other tasks were organized as dummy tasks and were not examined. The study produced both quantitative and qualitative findings; the quantitative data were collected with a Tolerance Evaluation Scale (TES) that we developed for this study. According to the quantitative findings, there is no difference between users' tolerance levels in the different environments. However, when an error includes feedback, user tolerance is affected positively. In addition, users have a low tolerance for giving another chance to any kind of website that has a faulty interaction. In terms of qualitative findings, participants emphasized that, regardless of the purpose a website serves, errors give an amateur impression by damaging usability and professionalism.


Introduction
has two complicated entities trying to "speak" the same language. These entities are human beings and computer-like machines. This kind of communication comes with some difficulties. Lazar, Meiselwitz, and Norcio (2004) stated that there are four different types of error when using the World Wide Web (WWW). These are user errors, system errors, situational errors and poor web design. User errors occur because of incorrect user actions, system errors are related to software or hardware problems, and situational ones are errors such as network errors. On the other hand, poor web design is related to websites designed in a confusing manner. Even though these authors grouped errors in four categories, it can be said that the four error types are connected to two different kinds of errors: user-based (user errors) and system-based (system and situational errors and poor web design). Similar to these error types which are stated by Lazar, Meiselwitz, and Norcio (2004), Ma and Tian (2007) also grouped web-based errors in 3 different categories: host, network and browser errors, source and content errors and user-based errors. Even if these group names are different from Lazar, Meiselwitz and Norcio's (2004), the errors are related to the same types. In other words, except the user-based errors, the first two types are among the error types that can occur independently from users. For this reason, they can also be defined as system-based errors.
As people who use a system/software have different backgrounds and knowledge levels (Graham, 2003), user-based errors might occur differently. Nevertheless, an error which is made by a user in using a computer system for an individual purpose would affect just the user who made it. In these kinds of situations, user guides included in systems/software can be one of the solutions for reducing the errors made by users. On the other hand, system-based errors, which occur independently from users, need to be approached differently, since these kinds of errors, which can be considered as the background functionality of a computer system, will affect all users of that system. When considering each scenario, it can be said that these two kinds of error occurrence do not have the same effect on users. Especially when we imagine that system-based errors might lead to catastrophic user experiences, this case might also result in expensive recovery processes (Heckel & Mariani, 2005). In the light of these explanations, even if user-based errors can be tolerated in the context of individual usage, it is not possible for the system-based errors. In other words, the fact that the final product will most likely contain some faulty interactions that might be user-based or system-based does not mean that system-based errors can be ignored by developers, since faults constitute a critical threat for the dependability of computer systems (Ploski et al., 2007). Meyers (2004) stated that "responsibility for interface usage errors belongs to the interface designer, not the interface user". Similar to Meyers' (2004) statement, we can easily indicate that responsibility for system-based errors always lies with the developers/designers. A fault has been basically defined as a structural imperfection in a system that might end with unexpected results (Munson et al., 2017) or the cause of an error that means the delivery of a service is not performing as expected (Laprie, 1995). 
From the perspective of system-based errors, faults might occur for different reasons. A fault might be a functionality error that went unnoticed during the development process or a design error made by an inexperienced developer. Furthermore, a system that performs well today might stop working in the near or distant future because of an update. This situation can also be counted as an error type. Although making mistakes is part of human nature and user-based errors cannot be avoided, every kind of system-based error can be fixed by the direct intervention of developers. Developers' actions can reduce or eliminate system-based errors completely. The most usual action for this is to remove or repair the cause of the fault (Avizienis, 1978). No matter what kind of error it is, the main points to consider are how system-based errors affect users and which kinds of errors should be taken more seriously in order not to lose users, since the reliability of a computer system depends on understanding the impact of its faults (Ocariza et al., 2013).
Generally speaking, systems always have undetected errors, and users always make mistakes. When a user encounters a fault in a system, she/he can easily blame the system or herself/himself. This endless loop cannot and will not be broken. However, from the perspective of system-based errors, this situation can also be considered a matter of quality. Products that have many faults are considered to be of poor quality (Card, 1998). Rubin and Chisnell (2008) stated that "usability is a quality that many products possess, but many, many more lack". From this perspective, system-based errors should be reduced to a minimum by developers so that a system can be considered high quality or usable.
Usability, which was defined as "the effectiveness, efficiency and satisfaction with which specified users achieve specified goals in particular environments" by ISO (1998), has five fundamental components: learnability, effectiveness, memorability, satisfaction, and errors (Nielsen, 1993; Shneiderman & Plaisant, 2005). There are many studies concerning each component, but in this study we focused on errors, that is, on how user behavior, tolerance, and perception change when users encounter faulty interactions. Some studies that are related to ours and that address user behavior are mentioned below. Ramsey, Barbesi, and Preece (1998) performed a study by injecting delays into the page loading process. Their aim was to examine whether the latency between requesting a page and receiving it affects user perceptions. They found that faster pages are perceived as more interesting than slower ones. In addition, slowness reduces user motivation and increases user frustration. Tzeng (2004) carried out a study aiming to understand how users react to computers' apologies. In this study, in which a computer guessing game was designed, some minor flaws were intentionally integrated into the game, such as repetitively selecting the same keys and clues, an unattractive interface, and irrelevant clues. The aim of this integration was to create a reason for the computers to apologize. This action can be considered a manipulation similar to that of our study. The results show that even though some subjects felt manipulated when the computers offered apologies to them, the computer apologies helped to create more desirable psychological experiences for the users. Another study, which examined the effects of different delays on two websites, was carried out by Galletta et al. (2004). The authors created two manipulated websites in order to observe user behavior in a total of 196 participants. The results of this study show that an increase in delay time affects performance, attitudes, and behavioral intentions negatively.
Another study was conducted by Everard and Galletta (2005), aiming to explore whether website presentation flaws affect consumers' perceived quality of the online store, trust and consumers' intention to purchase from the online store. They used three types of manipulative factors: a poor style (contrast and design flaws), incompleteness (placeholders such as "under construction" or "image not yet available" on each page) and language error (making a grammatical error on every page). The results of this study show that every kind of flaw that was tested affects users' perceived site quality, trust, and users' intention to purchase negatively. Guse et al. (2015) conducted a study assessing how delayed loading and partly loading webpages affected users' perceived quality. The authors of this study, which focused on Task Completion Time (TCT) and Page Load Time (PLT), concluded that PLT and TCT alone are not sufficient quality indicators when considering partial load failures. Another research on exploring the relationship between response time and user perception in the context of smartphone interactions was conducted by manipulating the response times for four tasks in three applications (Tan et al., 2019). The authors of this study found that while switching between pages, interfaces with a loading animation affect user tolerance positively. This loading animation can be understood as feedback, which this study will emphasize and also focus on the way in which this feedback is important for user tolerance.
In this study, in the light of the explanations and the studies mentioned above, we aimed to detect the differences in users' behaviors and perceptions and to investigate users' tolerance when they encounter a faulty web interaction on a manipulated mobile e-commerce website. Our research questions were as follows:
• How does users' tolerance of different kinds of faulty web interactions change depending on the environment?
• How do users' behaviors differ when they encounter a faulty web interaction?
• How do faulty web interactions shape users' perceptions?
In order to answer these questions, we created an e-commerce website that is specific to this study and includes some faulty interactions. Participants were requested to complete all tasks connected to a scenario. The scenario had 11 tasks, including five faulty ones. The remaining tasks were dummy tasks, which were used to convince users that the faulty tasks had not been integrated into the website intentionally and to secure the objectivity of the results. The next section describes the method of the study. The third section presents our findings with various metrics and indicators. In the fourth section, we present our conclusion.

Method
In this study, the usability test method, which is normally used to determine the weaknesses of a product, was used differently. Instead of detecting weaknesses, it was used to examine user behavior, tolerance, and perception on a mobile website containing various intentionally placed faulty web interactions. Before conducting the test, we built a mobile-compatible e-commerce website and placed five kinds of errors in it that appear as system-based errors. After that, we planned a scenario that consisted of 11 tasks, including five faulty ones. Working tasks (dummy tasks) were placed in order to distract the participants' attention so that they would not realize the faulty tasks had been placed intentionally. The details of the method are described in the following subsections.

Digital Test Environment
The website was built as an e-commerce website that included various products such as computers, mobile phones, software, etc. Even though the website did not include any sale or payment process, it was designed as if it had such components. We introduced the website to the participants as newly built and as if it were to be put into service soon.
The design of the website was in a responsive structure in order for it to be compatible with mobile browsers. Thus, the study could be performed on the participants' own mobile devices. A view of the website in a mobile browser is presented in Figure 1. The view of the main page of the website was divided into three parts because of the height of the page.

Participants and Test Environments
Budiu (2014) stated that it is important to choose participants who have used their phone for at least three months. Therefore, we checked this information first in order to identify suitable participants. Then, we focused on choosing participants who had shopped online in the past. After checking all the volunteers, we selected 14 suitable graduate participants (3 females and 11 males). Each participant signed a consent document before the individual sessions started. After the selection process, the participants were divided into two groups, and the sessions were held in different places. The first place, used for the first nine participants, was a room customized for the sessions in Kırklareli University Distance Learning Implementation and Research Center. The other places were the participants' own houses. This type of test environment was called an "Informal Lab" by Barnum (2010). The reason for conducting home sessions was to investigate how participants behave in their natural environments.

Test Scenario and Tasks
In order to keep participants' motivation high, we prepared a scenario. In this way, participants had a single and exact purpose instead of independent tasks, as suggested by Barnum (2010). The introduction scenario was as follows.
"You got a job for the first time and you're waiting for the salary day, excitedly. You said to a friend of yours that you are looking for a budget-friendly e-commerce website for technological shopping. Your friend recommended a website which he/she did shopping on before. Finally, you got the salary and visited the website recommended by your friend!" After the introduction, the scenario continues with the tasks in Table 1, respectively. The "faulty tasks" were taken into account for this study. However, we added "dummy tasks" in order not to lose the motivation of the participants. This type of task is not examined, but it is believed that this is essential to get valid and realistic findings. If the participants realized that the faulty interfaces are fiction, their behaviors could get unrealistic. It was determined at the post-test interview that no participant perceived that faulty tasks were placed intentionally.
In task 3, the participants faced a webpage which did not respond to clicking on the "register" link. In the regular process, they would have seen the registration form normally. Figure 2 shows screenshots from task 3. In the faulty interaction, when the signup button is clicked, the label of the button changes to "processing". However, the process does not continue.
In task 5, we asked the participants to find a product on the website. They tried to access the product page in various ways (interaction 1: reaching the product list from menu or interaction 2: search box, Figure 3). We removed shortcuts that help to add any product to the shopping cart simply in the product lists. Participants were forced to access the product page. Even if they followed the correct way to reach the related product, they were not able to complete the task since the product page had been replaced with a blank page. In other words, they could not add the product to their shopping cart. Figure 3 shows a view of the product list. In task 7, we asked the participants to open the software list to find an antivirus software. This faulty interaction was triggered by clicking the "Software" item on the menu. The category list was adjusted to give the "timeout" error after waiting 10 seconds. The view of the error page is in Figure 4. Timeout duration was defined as 10 seconds according to the proposal of Nielsen (1993) who stated that the loading time of the page should be around 10 seconds in order to avoid the distraction of a user. Even though the newest studies in the literature suggest shorter duration, this study was predicated on Nielsen's statement.
In task 9, similar to task 5, we asked the participants to access the details page of the product. However, the link of the product redirected the participants to the NF error page as in Figure 5. In task 11, we asked the participants to access their shopping carts. But the shopping cart button redirected the user to the main page in every trial. Figure 6 shows the placement of the button. In general, participants tried various ways to trigger all errors. However, the mobile website was manipulated in such a way that it was not possible to avoid encountering the mentioned errors. Thus, the participants were forced to interact with the errors.
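As an illustration only, the four server-side faults described above could be expressed as a small lookup table that a test website consults before answering a request. The sketch below is not the implementation used in the study: the route paths, the `FaultSpec` type, the `plan_response` helper, and the choice of status 200 for the blank page are all assumptions made for the example. The status codes follow the labels given in the paper, and the Not Responding fault of task 3 is left out because it is described as a change of the button label on the page itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultSpec:
    """How the server should misbehave for one manipulated route."""
    status: int                     # HTTP status code sent to the browser
    delay_s: float = 0.0            # seconds to wait before responding
    body: str = ""                  # an empty body renders as a blank page
    location: Optional[str] = None  # redirect target, if any


# Hypothetical routes (assumed for this sketch, not taken from the study).
FAULTS = {
    # Task 5: Blank Page (status 200 is an assumption)
    "/product/42": FaultSpec(status=200, body=""),
    # Task 7: timeout page shown after a 10-second wait
    "/category/software": FaultSpec(status=500, delay_s=10.0,
                                    body="Connection timed out"),
    # Task 9: Not Found
    "/product/77": FaultSpec(status=404, body="Not Found"),
    # Task 11: the cart button always leads back to the main page
    "/cart": FaultSpec(status=301, location="/"),
}


def plan_response(path: str) -> FaultSpec:
    """Return the injected fault for a path, or a normal page otherwise."""
    return FAULTS.get(path, FaultSpec(status=200, body="<html>ok</html>"))
```

Keeping the faults in one table makes it easy to verify that every route a participant could take to the target page ends in the same fault, which is the property the manipulation relies on.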

Data Collection
In the data collection process, both qualitative and quantitative methods were used. While the qualitative findings include the opinions and reactions of the participants during and at the end of the test sessions, the quantitative findings were collected with the Tolerance Evaluation Scale (TES), which was created for this research. TES consists of different kinds of metrics. The definitions and the descriptions of TES are given in Table 2.
Table 2 (excerpt). Retrying Duration (RTD): Elapsed time in the retry process. It is also the difference between FT and FRT.
Although NOR was calculated as described in Table 2 for this research, it can be calculated in different ways in other studies. For this research, we used six different user actions per error in the calculation of NOR, but not every type of user action occurred in every instance. For example, user action 4 was only counted in the NR error, because the participants clicked the submit button again and again without refreshing the page. To sum up, NOR is a flexible variable that can be adapted to the user actions observed in each individual study. The remaining five metrics of TES are calculated from the seven collected metrics. We defined abbreviations for the metrics so that they could be easily used in the text.
In case of a faulty interaction, TES can be used to inspect the process in terms of determining the effects of an error on users. Thus, user behaviors can be foreseen in any kind of faulty interaction that can occur in any kind of system. The values collected by TES might differ for a website, a mobile application or user type. Users of a banking website or of a news website would probably not act in the same way. Consequently, TES can be used for different situations to measure users' tolerance of any kind of system errors. We believe that TES can be developed for different types of errors in future works as well.
We created a form in order to obtain quantitative data about the TES variables. This form was filled in both during the sessions and by watching video records after the sessions were completed. Thanks to TES, we were able to observe the participants while they were struggling with the mentioned errors.

Test Process
In this study, face-to-face interviews were first conducted with the volunteers. In these interviews, the volunteers filled in a form that included demographic questions. After the participant selection process, the day and time when they would participate in the study was agreed. On the test days, the participants signed a consent document and read the introduction of the scenario. Before starting the sessions, all participants were asked to turn their mobile phone to airplane mode. In the sessions that were held in the university environment, Wi-Fi connection provided by the university was used to access the website. Similarly, home Wi-Fi connections were used in home environments. After the participants stated that they were ready, the sessions were started by the moderator. After completing all tasks and conducting the interviews, the sessions were ended.
In order to test whether the faulty tasks worked well, we performed a pilot test with the first two participants. After some corrections were made, the real tests were performed with the other 12 participants. The results of the pilot test were removed from the findings. We did not offer the participants any reward as an incentive; we thanked them instead.

Data Analysis
In the data analysis process, we examined the TES findings, voice, and video records. While qualitative findings were the voice and video records, quantitative findings consisted of the TES variables.
In order to answer our third research question, qualitative findings were clustered by similarity and later discussed. For quantitative findings, firstly, descriptive statistics such as mean, standard deviation, etc. were used. After the explanations of descriptive statistics, we discuss significant test results based on our first two research questions.

Limitation
Problems due to the mobile device, server, or internet infrastructure are the limitations of this study. Also, page loading duration is different for various devices, at various times. However, our controls on these factors demonstrated that there was not any negative effect on the study.

Findings
The findings of the study are divided into two sections as quantitative and qualitative findings. The results of dummy tasks are not given as findings.

Quantitative Findings
In this section, we describe the findings collected by TES into 6 groups: ED, FRD, NOR, RTD, RD, and GUD, which are indicated in Table 2. The duration information in these variables is given in seconds (the NOR variable does not include any time data). Environment 1 (E1) indicates the test room in Kırklareli University Distance Learning Implementation and Research Center; Environment 2 (E2) indicates participants' own houses. All Environments (AE) represents both E1 and E2. The findings for task 3 are given in Table 3. When AE is considered, the mean of ED was found to be 62.08 seconds (SD: 27.22). It was determined that after encountering this error, the participants behaved patiently (FRD: 56.08, SD: 67.24). The case of zero "0" valued FRD, RC, and RTD shows that the participants did not take any action after they had encountered the error. FRD of the participants ranges between 15 seconds and 207 seconds. It can also be seen that the participants tried to recover from the error 2.42 times on average for AE (SD: 2.11). As the number of our observations was small (n=12), we used Spearman's (rho) test in order to find association between the TES variables for this error. The results show that the GUD variable has a strong positive relationship with FRD (rs=.787; p<.05), NOR (rs=.63; p<.05), RTD (rs=.674; p<.05), and RD (rs=.851; p<.05) as expected. Additionally, it can also be seen that FRD has a strong positive relationship with RD (rs=.775; p<.05) and NOR has also a strong positive relationship with RTD (rs=.813; p<.05).
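For readers who want to reproduce this kind of association analysis, the following is a minimal sketch of Spearman's rank correlation written with the Python standard library only. It assumes nothing about the study's data: the function names are ours, and the coefficient is computed as the Pearson correlation of tie-averaged ranks, which is the usual definition when ties are present. Significance testing is not included.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman_rho(frd_values, gud_values)` on two equally long lists of per-participant measurements (hypothetical variable names) returns a coefficient between -1 and 1.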
In task 5, users were faced with a blank page when trying to access a product page. The related findings are given in Table 4. The mean elapsed time to face the error for the first time (ED) was 31.42 for AE (SD: 24.18). It was determined that according to the FRD, FRT, RD, and the GUD variables, the participants behaved more impatiently in this task, compared to task 3. At the same time, the participants realized there was an error faster than in task 3. They interacted with the error after 10.42 seconds on average for AE (SD: 3.23). For the NOR variable, low standard deviation values show that the participants' behaviors are similar to each other. They retried to recover from the error 6.33 times on average (SD: 2.96), and this took 112.83 seconds (SD: 66.86) for AE. The GUD variable shows that participants could tolerate this error for 129.5 seconds on average for AE (SD: 63.35). In addition to these findings, it can be seen that task 5 is the most retried task by the participants. Since there was not any feedback or sign about what was happening, this error type seemed confusing to the participants. According to the findings from Spearman's test, while GUD has a strong positive relationship with RTD (rs=.993; p<.05) and RD (rs=.832; p<.05), RTD has also a strong positive relationship with RD (rs=.818; p<.05).
In task 7, the participants were deliberately made to wait for 10 seconds after they clicked the related menu link, and then the CT error page was shown to them. In this task, some of the participants preferred to be patient until they saw the page, while the others acted in the opposite way. The detailed findings are given in Table 5.
In task 9, similar to the functionality in task 5, the participants faced the NF error page instead of a blank page when they tried to access the product page. The findings are given in Table 6. In this task, even though E2 participants detected the error more quickly than E1 participants (ED), the first reaction of E1 participants was slower (FRD). In addition, E2 participants gave up recovering from the error more quickly (GUD: 30.2) than E1 participants. When both environments are considered, the mean of FRD was 11 seconds (SD: 4.43), and NOR was 1.25 times (SD: 0.75) in 33.82 seconds (SD: 25.95). Another remarkable finding about this task is that even though this task and task 5 have the same functionality, the NOR values of this task are very different from those of task 5. This finding indicates that participants retried much more often when they did not receive an informative response. When we compare the findings of all errors, the findings of this error have the lowest values in general. This suggests that, because feedback was given, the participants knew what they had encountered and spent less time on recovery. In addition to these findings, Spearman's test results show that, similar to task 7, the GUD variable has a strong positive relationship with NOR (rs=.840; p<.05), RTD (rs=.677; p<.05), and RD (rs=.698; p<.05). NOR also has a strong positive relationship with RTD (rs=.844; p<.05) and RD (rs=.691; p<.05).
The findings for task 11 are given in Table 7.
In this task, E2 participants quickly realized that there was an error (RD: 44.8) and acted more impatiently to recover from the error (NOR: 3, RTD: 57.2 and GUD: 88.6). Additionally, the mean of FRD of the participants is 16 seconds (SD: 7.46) for AE. At the same time, the participants tried to recover from the error 4.83 times (SD: 3.16) in 70.33 seconds (SD: 57.01) for both environments. According to Spearman's test results, the GUD variable has a strong positive relationship with NOR (rs=.671; p<.05) and RTD (rs=.930; p<.05), and NOR and RTD also have a strong positive relationship (rs=.696; p<.05).
The findings given in Tables 4, 5, 6, and 7 are summarized and represented, respectively, in Table 8.  When we inspect the FRD indicator, it can be seen that all participants act similarly except for task 3. In task 3, because of simulating the NR error, we waited for the participants for a while. This is the reason for high FRD for task 3. The mean of the FRD value was determined as 21.37 seconds (SD: 17.47) when all of the FRD values were considered.

The mean of NOR is 3.65 (SD: 1.78) for all errors. This value shows that the participants usually tend to retry 2-5 times when facing a faulty interaction. In addition to this, RTD is another important variable which shows the duration of the retry process. The mean of RTD was determined as 58.15 seconds (SD: 33.4). It means that the participants retried for about 30-90 seconds when they interacted with a fault. On the other hand, the mean of RD, which indicates the mean elapsed time of error detection, was determined as 68.45 seconds (SD: 40.26). Additionally, it can be seen that the participants gave up in 95.68 seconds, on average (SD: 42.5).
For the first two research questions of this study, we searched for an answer by conducting hypothesis tests. While conducting these tests, we excluded the ED variable of TES, because data collected for this variable would probably change on every distinct system. By contrast, the data collected for the other variables can be taken to represent actions that users perform similarly in all systems after encountering an error. Therefore, the other five variables, without ED, were considered tolerance indicators for TES. On this basis, our null and alternative hypotheses for the first research question are as follows:
H0: There is no difference between the tolerance levels of the participants in the two environments.
H1: There are some differences between the tolerance levels of the participants in the two environments.
As our observation counts (n=7 for E1 and n=5 for E2) are too small for a parametric test, we used the Mann-Whitney U test to compare the data from the two environments. Each comparison was made separately for each error type and each variable. The findings are given in Table 9. According to the results in Table 9, H0 cannot be rejected. Although the FRD values of the CT and the NF errors come close to rejecting H0, the other twenty-three comparisons show that there is no difference between the tolerance levels of the participants in the two environments.
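With group sizes as small as 7 and 5, the null distribution of the Mann-Whitney U statistic can be enumerated exactly instead of approximated. The sketch below is an illustrative, standard-library-only version and not the software used in the study: the function names are ours, ties receive mid-ranks, and the two-sided p-value is the share of all possible group assignments whose U lies at least as far from its expected value as the observed one.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def mann_whitney_exact(a: Sequence[float],
                       b: Sequence[float]) -> Tuple[float, float]:
    """Return (U for group a, exact two-sided permutation p-value)."""
    n_a, n_b = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    offset = n_a * (n_a + 1) / 2        # smallest possible rank sum for a
    u_obs = sum(ranks[:n_a]) - offset
    center = n_a * n_b / 2              # expected U under H0
    extreme = total = 0
    # every way of choosing which pooled observations belong to group a
    for subset in combinations(ranks, n_a):
        total += 1
        if abs(sum(subset) - offset - center) >= abs(u_obs - center) - 1e-9:
            extreme += 1
    return u_obs, extreme / total


# Hypothetical values for one variable and one error type (not study data).
e1 = [12, 15, 9, 20, 14, 11, 18]   # n = 7
e2 = [10, 13, 16, 8, 17]           # n = 5
u, p = mann_whitney_exact(e1, e2)
```

For 7 and 5 observations the loop visits 792 assignments, so the exact computation is inexpensive; repeating it for each of the five errors and five variables gives the twenty-five comparisons mentioned above.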
Because we did not detect any difference between the two environments, we created another pair of null and alternative hypotheses for the second question of the study by considering E1 and E2 together (AE):
H0: There is no difference between the errors in terms of tolerance indicators (variables).
H1: There are some differences between the errors in terms of tolerance indicators (variables).
For this analysis, in which we used Friedman test, the indicator values collected from all participants were tested based on all errors, individually. In other words, this test was done in order to detect if there was any difference between the errors in terms of indicator values. The findings from Friedman test are given in Table 10. According to the test results in Table 10, it can be seen that H0 can be rejected for each indicator. From this point, our next step was to find out how the error types differentiate user behavior in terms of indicators, which correspond to the answer to our second research question. In order to determine behavioral differences of the participants, we used Wilcoxon Signed Rank Test. For this analysis, we made ten comparisons between the errors for each indicator. The findings based on the FRD indicator are given in Table 11. According to Table 11, except for the RE error, all comparisons made between the NR and the other errors show that the NR error is the error to which the participants had the slowest reaction. In addition to this, the BP and the CT errors caused a faster reaction than the RE error.
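The two-step procedure described above (an omnibus Friedman test followed by pairwise Wilcoxon signed-rank comparisons) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the analysis code of the study: the function names are ours, the Friedman statistic is computed without a tie correction, its p-value uses the closed-form chi-square tail for 4 degrees of freedom (five error types), and the Wilcoxon p-value is obtained by enumerating all sign assignments after dropping zero differences.

```python
from itertools import combinations, product
from math import exp

ERRORS = ["NR", "BP", "CT", "NF", "RE"]
PAIRS = list(combinations(ERRORS, 2))   # the ten pairwise comparisons


def midranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def friedman_chi2(rows):
    """rows: one list per participant holding one value per error type."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(midranks(row)):
            rank_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variable with 4 df."""
    return exp(-x / 2) * (1 + x / 2)


def wilcoxon_exact(x, y):
    """Return (W+, exact two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    ranks = midranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2                           # expected W+ under H0
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - center) >= abs(w_plus - center) - 1e-9:
            extreme += 1
    return w_plus, extreme / total
```

With 12 participants the Wilcoxon enumeration covers at most 4096 sign patterns per comparison. Because ten comparisons are made per indicator, a reader reproducing the analysis may also want to consider a multiple-comparison adjustment, which the sketch does not apply.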
The analysis results based on the NOR indicator are given in Table 12. Although there is no significant difference between the BP and RE errors, these two error types occupied the participants more than any other error type in terms of recovering from the error. It was also determined that the CT error obliged users to try again more often than the NF error (Table 12, Comparison 5). As mentioned above, the NOR and RTD indicators had a strong positive relationship according to Spearman's test results. Because the comparisons based on the RTD indicator point in the same direction as those based on NOR, we did not consider it necessary to include them here. Instead, the analysis results based on the RD indicator are given in Table 13. According to Table 13, although there is no significant difference between the BP and NR errors, the participants detected these two error types later than the other types. Additionally, the error type detected fastest was NF. The correlation tests between the RD and GUD indicators showed a strong positive relationship for all errors except the RE error (rs=.538; p=.071) at the .05 significance level. For this reason, we did not consider it necessary to include the findings of the GUD indicator, either.
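The rank correlation used above to decide which indicators could be omitted can be illustrated with a short sketch. Spearman's rs is the Pearson correlation computed on ranks; the function names and the sample values below are hypothetical and are not data from the study.

```python
def rank_values(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    ranks = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(less + (equal + 1) / 2)
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = rank_values(x), rank_values(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rs is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values of two indicators for the same participants and error type.
indicator_a = [3, 1, 4, 2, 6, 5]
indicator_b = [40, 15, 38, 22, 70, 55]
print(f"rs = {spearman_rs(indicator_a, indicator_b):.3f}")
```

A value of rs close to 1 means that the two indicators order the participants almost identically, which is the rationale given above for reporting only one indicator of each strongly correlated pair.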

Qualitative Findings
Qualitative findings were collected both from interviews after the test and from the participants' verbal comments during the test. In the interviews, the participants were first asked what they thought about the mobile website that was being tested. After that, we asked for their opinions about the effects of the same faults on any website. All collected information, which also corresponds to the answer to our third research question, is divided into three sub-sections (reliability, alternatives, and quality) and presented below:

Reliability
All of the participants stated that the website was insecure. They said that if it were a real-life experience, they would give up shopping at the website. Participant 5, who thought that the failures were caused by his or her own mobile device, indicated that he or she would try another device and would give up if the same errors persisted. Participant 12 said, "If I knew the website before, I would try to warn website administration. But if it was my first encounter with the website, I would never use it again".
When asked about websites serving purposes other than shopping, they stated that although such a website would still seem insecure, they could be more forgiving. For example, participants 5 and 10 stated that if it were a website serving a different purpose, the same failures could be ignored.

Alternatives
The participants stated that there were many alternative websites with a similar purpose. For this reason, they implied that they would prefer an alternative e-commerce website if they faced an insecure situation. Participant 6 said, "Even if it were not a fraud, I would prefer to shop at another e-commerce website". Nevertheless, some views show that the price of the products on the website might affect user behavior. Examples include comments such as "I can give it one more chance next time but just once", "I can give it one more chance, but I would buy cheap things" and "If the prices are really lower than any other website, I could give a few more chances". In the light of these comments, it can be said that prices play an important role in users' tolerance of website faults.

Quality
Participants emphasized that the faults on the website were related to the quality of the website and its lack of usability. "More attention should be paid to monetary transactions", "The website seems like an amateur site" and "It does not look nice in terms of professionalism" are some of the comments which need to be considered regarding any kind of website. In addition, some participants questioned the trustworthiness of the website.

Discussion and Conclusion
In this study, we conducted a test to detect differences in users' behaviors and perceptions and to investigate users' tolerance when encountering a faulty web interaction on a manipulated mobile e-commerce website. Instead of detecting the weaknesses of an application with a usability test, the aim was to have users interact with those weaknesses. A shopping scenario with 11 tasks, including five different faulty tasks, was created for the test. In this way, the changes in users' behaviors, tolerance and perception when encountering a faulty interaction were inspected using the Tolerance Evaluation Scale (TES), which the authors created for this research.
The descriptive analyses show that the participants' behaviors differ for each type of error. The interaction duration is longer for the NR error than for the others because the system made the participants wait for a while for a response. The FRD value of E2 participants is higher than that of E1 participants (Table 3), but E2 participants behaved more impatiently with regard to avoiding the errors when GUD values are considered (except for the NR error). This finding suggests that internet users are more intolerant in their natural environment. Additionally, Spearman's test results for the CT error also revealed that the participants acted impatiently when they were made to wait for any reason. We believe that this behavior is not limited to a particular error but can be generalized to any situation that results in having to wait. On the other hand, it was discovered that the participants were also surprised when they realized that they had been redirected to the main page unexpectedly instead of reaching the cart page (the RE error).
Consistent with the findings of Nah (2004), feedback had a positive effect on the participants' tolerance in this study. When we compare the BP and NF errors, we can see that whilst the BP error shows a blank page, the NF error shows an informative "Not Found" message. Although their FRD values are similar, their NOR and RTD values differ considerably. For this reason, it can be said that the feedback message plays an important role in improving user experience. Additionally, we believe that feedback should be given for any kind of faulty situation, and that it might also be given for processes that make users wait.
For the first two questions of the study, we also performed significance tests. One of the important findings is that there is no difference between the tolerance levels of the participants in the two environments, which is the answer to our first question. For the second research question of the study, the important points are summarized below:
• The BP and RE errors are the errors that the participants struggled with the most.
• The BP and NR errors are the errors that took the participants the longest to recognize as errors.
• The NR error is the error to which the participants showed the slowest reaction (except for the comparison with RE).
• NF is the most quickly recognized error thanks to the effect of feedback.
In the quantitative results, the measured values are considered high because the participants were aware of being tested. In other words, it is predicted that users might have lower tolerance if they encounter errors similar to those in this study in daily life. Even so, in our opinion, the behavioral differences between the error types will be similar in real-life experiences. For this reason, the error types and their different effects can be considered valid. Mahajan et al. (2016) stated that the service quality and the trustworthiness of a website can be negatively affected by the presence of failures. In the qualitative findings of the present study, similar conclusions were reached. The participants pointed out the importance of the quality of the e-commerce website. On the other hand, they especially emphasized that it does not matter what purpose a website serves; the errors give an amateur impression by damaging usability and professionalism. Even though some of the participants declared that reasonable prices might lead them to give an e-commerce website one or more extra chances, it is clear that the likelihood of this action is very low.
As a result of this study, it can be stated that creating positive experiences for users depends on knowing how users behave when encountering any type of error. On the basis of the findings of the study, our suggestions are as follows:
• Any system can automatically direct faulty situations to a page that gives feedback.
• In the case of a long process, feedback can be given at certain intervals while the transaction is being performed.
• The types of errors examined in this study might be difficult to track one by one in heavily loaded systems. In order to facilitate this process, web mining can be used to detect related or similar error types.
• Another piece of qualitative information obtained in this study was the participants' price evaluations. It can be suggested that a newly opened e-commerce site might sell at lower prices than its competitors in order not to lose users if errors occur during the period in which it is gaining recognition.
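The first two suggestions can be sketched in a few lines. This is a minimal, framework-independent illustration under stated assumptions: the message texts, the function names `feedback_for` and `run_with_progress`, and the two-second interval are all hypothetical choices, and only the error codes come from the study.

```python
import time

# Hypothetical user-facing messages, keyed by the error codes used in the study.
FEEDBACK = {
    "NR": "The page is taking longer than expected. Please wait or try again.",
    "BP": "Something went wrong while loading this page. Please try again.",
    "CT": "The connection timed out. Please check your connection and retry.",
    "NF": "The page you are looking for could not be found.",
    "RE": "This page has moved. You are being taken to the home page.",
}
DEFAULT_FEEDBACK = "An unexpected error occurred. Please try again later."


def feedback_for(error_code):
    """Return the feedback text for an error code, with a generic fallback."""
    return FEEDBACK.get(error_code, DEFAULT_FEEDBACK)


def run_with_progress(steps, notify, interval_s=2.0, clock=time.monotonic):
    """Run each step in order and notify the user at most once per interval.

    steps   -- list of zero-argument callables making up the long process
    notify  -- callable that shows a message to the user
    clock   -- time source, injectable so the behavior can be tested
    """
    last_notice = clock()
    for number, step in enumerate(steps, start=1):
        step()
        now = clock()
        if now - last_notice >= interval_s:
            notify(f"Still working: step {number} of {len(steps)} done")
            last_notice = now
    notify("Done")


print(feedback_for("NF"))
run_with_progress([lambda: None, lambda: None], print, interval_s=0.0)
```

Keeping the messages in one table means that an error type without a dedicated message still produces feedback instead of a blank page, which is the situation the BP findings warn against.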
Through this study, we observed how users behave when encountering different kinds of faulty web interactions. Even though we were not able to use eye-tracking glasses in this study for technical reasons, such devices would probably provide important clues about user behaviors. The findings presented in this study might be helpful for system designers and academics in future work.