Modelling COVID-19 Hotspot Using Bipartite Network Approach

COVID-19 has had a jarring impact on the livelihoods of people in Malaysia and globally. To prevent an outbreak in the community, it is important to identify the likely sources of infection (hotspots) of COVID-19. The goal of this study is to formulate a bipartite network model of COVID-19 transmission that incorporates patient mobility data, in order to address the population homogeneity assumption made in conventional models and to focus on indirect transmission. Two types of nodes, human and location, are the main concern in the research scenario. 21 location nodes and 31 human nodes are identified from a patient’s pre-processed mobility data. The parameters used in this study for location node and human node quantifications are the ventilation rate of a location and the environmental properties of the location that affect the stability of the virus, such as temperature and relative humidity. The summation rule is applied to quantify all nodes in the network and the link weight between the human node and the location node. The ranking of location and human nodes in this network is computed using a web search algorithm. The model is considered verified because the error between the benchmark model and the COVID-19 bipartite network model is small. A location with a higher ranking is denoted as a hotspot in this study, and a human node attached to that location is ranked higher in the human node ranking. Consequently, the hotspot has a higher risk of transmission than other locations. These findings are proposed to provide a framework for public health authorities to identify the sources of infection and high-risk groups of people in COVID-19 cases so that transmission can be controlled at the initial stage.


Introduction
The COVID-19 pandemic that shocked the world in early January 2020 saw many countries rushing to characterise the nature of the SARS-CoV-2 virus that causes the COVID-19 disease and its highly infectious transmission. Globally, we witnessed unprecedented use of digital technology and innovation in assisting in the detection and prevention of the disease. Major technology companies and numerous governments focused their attention on the rapid deployment of digital applications and platforms that utilise digital data such as geolocation, tracker, mobile reception signal and Bluetooth data in efforts to model and predict COVID-19 transmission (Whitelaw et al., 2020; Gasser et al., 2020). Identifying the sources of infection of COVID-19 is crucial for preventing community transmission and averting the formation of new transmission clusters. In order to detect the source of infection, the locations visited by an infectious person as well as the environmental properties of those locations need to be taken into consideration. In this study, the source of infection is referred to as a hotspot, which is a location that has a higher risk of COVID-19 transmission. This definition corresponds to one of the descriptions discussed by Lessler et al. (2017), namely an area that has a high efficiency of disease transmission. A hotspot can be identified by statistical approaches, which many studies have adopted in association with several factors. For instance, the daily number of new cases and the weekly incidence rate were used to perform time series forecasting and to calculate the Moran index for spatial analysis in Brazil (Gomes et al., 2020). The Moran index is frequently applied to obtain the autocorrelation of spatial data; it has been used to study the relationship between the spatial effect and COVID-19 transmission in China.
The temperature is used to determine the transmission rate in China at a macro level, by integrating a number of incidence and demographic factors (Xie et al., 2020).
Early studies of COVID-19 transmission focussed on the prediction of incidents (Shen et al., 2020; Wu et al., 2020). This is because COVID-19 is a newly emerging disease and the primary public health concern is to flatten the outbreak curve and reduce the burden on medical resources. Many of these studies used conventional approaches such as compartmental models, which assume that the population is well mixed (homogeneous) and thus provide simulations that inform the actions public health authorities need to take. The main issue with the homogeneity assumption made in the conventional approach is that it does not reflect the real world, since people's mobility is influenced by age, social activity, occupation and other factors. To overcome this problem, partial differential equations (PDEs) are used to integrate those factors and combine them with population dynamics (Viguerie et al., 2020; Wang et al., 2020). The spatial factor considered in those studies determined the diffusion of the infection at the macro level, resulting in a computationally complex model.
The network model is widely used in infectious disease studies for location detection and contact tracing (Block et al., 2020; Kok et al., 2017; Meyers et al., 2006; Weeden & Cornwell, 2020). The network model is developed based on network theory, which is a branch of graph theory that captures the heterogeneity of a real-world problem. Several studies have used the network model to simulate different scenarios of COVID-19. Block et al. (2020) use the network model to understand the effect of social distancing policy as a control measure, since the social interaction in different places is addressed by the model. Contact tracing is one of the important control measures adopted worldwide. The efficacy of contact tracing can be simulated using a network model whereby the nodes in the model act as hosts in the real world and the health status of each host changes over time after contact with other nodes in the network (Firth et al., 2020).
Interaction between different sub-populations is also described using the contact network model (Majumdar & Mehta, 2020; Weeden & Cornwell, 2020). Those models focus on the interaction between hosts in a specific area, but the influence of the locations visited by persons is excluded. As previously stated, the host's movement is crucial for identifying the source of infection. To integrate the host's mobility in a network, the bipartite network model (BNM) is proposed for this study. The relationship of the host, location and environment can be illustrated using an epidemiology triangle (Eze, 2013) as shown in Figure 1(a). The location component and the environmental properties are highly related, and the environmental properties of each location differ from those of others; thus, the location component can be treated as a component that includes the environmental properties. The epidemiology triangle is therefore modified into the basic building block depicted in Figure 1(b). As proposed by Eze et al. (2014), the two types of nodes shown in Figure 1(b) form a bipartite network when they are quantified according to the problem considered. Eze et al. (2014) applied the method successfully to identify the source of infection for malaria, and Kok et al. (2017) applied a similar approach to dengue. Kok et al. (2017) fully integrated the dengue patient mobility data into the network to form the location node. In the location node, several environmental properties such as temperature, humidity and precipitation are used to quantify the mosquitoes' life cycle and their survival rate at a particular location. In this study, the droplets produced by the COVID-19 patient carry the virus and are hypothesized to stay in the air and on the surfaces of nearby objects. The environmental properties will thus be quantified to estimate the stability of the virus in a location.
As mentioned above, the environmental properties are highly related to the location; thus, the basic building block depicted in Figure 1(b) is the main structure to connect the host (human) and location in this study. Therefore, the BNM is suitable to model the problem concerned with the two discrete entities.
This study aims to predict a hotspot of COVID-19 by formulating a contact network that includes the patient's movement history in the two weeks prior to testing positive. By identifying the high-risk locations visited by the patient, the people who visited those locations can be isolated beforehand to prevent the disease from spreading to more people. Even though no case of indirect transmission of COVID-19 has been reported in Malaysia, several cases reported from other countries suggest the possibility of infection in public places via fomite or aerosol transmission (Cai et al., 2020; Lauer et al., 2020; Miller et al., 2021). Those findings inspired this study to address indirect transmission in Malaysia by formulating a network that connects the host with the locations visited. These data are called patient mobility data and can be obtained via a mobile tracing application, such as the MySejahtera application used in Malaysia. The source of infection may be determined more accurately if mobility data from other patients can be included in the network. The output of this study is a proof-of-concept that the BNM can be employed to identify a hotspot of COVID-19 using a patient's past mobility data. The findings aim to help public health authorities in Malaysia to perform more effective contact tracing and reduce community transmission clusters.

Materials and Methods
The methodology adopted in this research is the Bipartite Network-Based Methodology Framework (BNB-RMF) (Liew, 2016). This methodology has been applied to model the habitat suitability of the Irrawaddy dolphin and to identify dengue hotspots (Liew et al., 2015; Kok et al., 2017). The methodology consists of three stages: (i) pre-processing of the data used; (ii) formulation of the bipartite graph structure and network; (iii) quantification and ranking of the nodes.

Data pre-processing
The dataset used in this study is a real-case contact tracing investigation form provided by the Sarawak Bintulu Division Health Office. Currently, contact tracing of COVID-19 is triggered once a person tests positive; thus, the dataset provided is based on a single case. The investigation form (MoH, 2020) consists of the locations visited by the patient with the visit time or period. The close contacts mentioned in the investigation form are defined as (i) a household member of the COVID-19 patient; (ii) a workplace colleague of the COVID-19 patient in close proximity; (iii) a social contact of the COVID-19 patient in close proximity; (iv) a person travelling with the COVID-19 patient in the same transportation. The contact tracing aims to trace all the close contacts of the patient in the 14 days before the point of diagnosis. All locations listed in the investigation form are converted into Global Positioning System (GPS) coordinates and any duplicated locations are removed. Each location is coded with an identifier (ID) and is referred to as a location node in this study. In order to generate an anonymized dataset, the names of the close contacts and the patient stated in the investigation form are replaced with unique IDs to conceal their personal information.
A total of 21 location nodes and 31 human nodes (which include the case patient) are identified from the investigation form and used as the input to formulate the network. 79 data points are formed, where each pairing of a person with a location that person visited is defined as a unique data point. Every data point consists of the GPS coordinates of the location visited, the average surface air temperature (T, in degrees Celsius, °C), the average relative humidity (RH, in percentage, %), the location type (Ltype, binary: 0 for outdoor or 1 for indoor), the ventilation rate of the location (Q, per hour, h⁻¹), the frequency of a person having visited the location (Fh), the total duration of stay of one person in the location (D, in hours, h), the frequency of one location visited by a patient (Fl) and the human status (Hs, binary: 1 for patient or 0 for close contact). The environmental data used in this study, T and RH, are obtained from the World Weather Online platform. The GPS coordinates of each location node and the date and time of the visit are used to retrieve T and RH for each location node.
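As a minimal sketch of how one such data point could be represented, the record below mirrors the fields listed above. The class name, field names, validation rules and the example values are all hypothetical and chosen only for illustration; they are not taken from the investigation form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisitRecord:
    """One data point: a single person paired with a single location visited."""
    human_id: str                 # anonymised ID, e.g. "H1" (hypothetical format)
    location_id: str              # anonymised ID, e.g. "L1" (hypothetical format)
    latitude: float               # GPS coordinates of the location
    longitude: float
    temperature_c: float          # T, average surface air temperature in deg C
    relative_humidity_pct: float  # RH, in percent
    indoor: bool                  # Ltype: True for indoor (1), False for outdoor (0)
    ventilation_rate_per_h: float # Q, per hour
    visits_by_person: int         # Fh
    duration_h: float             # D, total duration of stay in hours
    visits_by_patient: int        # Fl
    is_patient: bool              # Hs: True for patient (1), False for close contact (0)

    def __post_init__(self) -> None:
        # Basic range checks; these rules are assumptions, not part of the study.
        if not 0.0 <= self.relative_humidity_pct <= 100.0:
            raise ValueError("relative humidity must be between 0 and 100 percent")
        if self.duration_h < 0 or self.visits_by_person < 0 or self.visits_by_patient < 0:
            raise ValueError("durations and visit counts must be non-negative")


# Example with made-up values.
example = VisitRecord("H1", "L1", 3.17, 113.04, 28.0, 80.0, True, 2.0, 5, 40.0, 5, True)
```

Keeping the record immutable means that the 79 data points can be shared safely between the quantification and ranking steps.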

Modelling of bipartite network model
A hotspot is defined as a location that has a higher risk of COVID-19 transmission, and it is the location visited by an infectious person. To identify the hotspot, the parameters of human and location nodes are important so that the link weight can be quantified. A hotspot of COVID-19 in this study is determined by the ranking of the location according to its COVID-19 hotspot ranking value (CHR). Higher ranking indicates a higher risk of transmission. For the human node, the ranking value is termed the COVID-19 infection risk ranking (CRR) and used to predict the potential infectious person. The model formulation process is discussed below.

Formulation of bipartite graph structure
A bipartite COVID-19 contact (BCC) graph is formulated by employing the basic building block illustrated in Figure 1(b). The patient and the close contacts retrieved from the investigation form are substituted into the host vertex in the figure and defined as human nodes (H). The location vertex in the figure is replaced by the locations visited by the patient, which are referred to as location nodes (L). The edge connecting a human node and a location node in the network is denoted as a link (E). E represents the interaction between the human and the location and is formed when H visits L. The weight of the link implies the contact strength between the human node and the location node. Higher contact strength indicates a stronger relationship between the human and the given location. It also suggests that the location has a higher risk of spreading the virus and that the person has a higher possibility of being infected at that location.
There are 30 close contacts identified from the investigation form; hence, a total of 31 human nodes are listed and labelled as shown in Equation (2). There are 21 locations visited by the patient, labelled accordingly as in Equation (3). The edges between the human nodes and the location nodes are the 79 data points described in the data pre-processing section and are defined accordingly as in Equation (4). The BCC graph is therefore formulated and defined as in Equation (1).
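Equations (1) to (4) are not reproduced in this text. One form consistent with the description above is sketched below; the symbol $G_{BCC}$ and the set-builder notation for the edges are assumptions made for illustration and may differ from the original equations.

```latex
\begin{align}
G_{BCC} &= (H, L, E) \tag{1} \\
H &= \{H_1, H_2, \ldots, H_{31}\} \tag{2} \\
L &= \{L_1, L_2, \ldots, L_{21}\} \tag{3} \\
E &= \{(H_j, L_i) : H_j \text{ visited } L_i\}, \qquad |E| = 79 \tag{4}
\end{align}
```

Because every edge joins one human node to one location node, no edge connects two nodes of the same type, which is what makes the graph bipartite.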

Quantification of parameters for location node
There are three parameters chosen to quantify the location node, as shown in Figure 3. As mentioned earlier, the aim of the study is to incorporate the indirect transmission of the disease via aerosol. Thus, the first parameter derived for the location node is the stability of the virus in aerosol (Ka). Aerosols are generated when people talk, sneeze, and cough; hence, such actions by an infectious person may result in aerosols carrying the virus. Aerosol generated by COVID-19 patients can carry an active virus and remain viable for a certain period. Since the decay rate of the virus depends on the decay constant, the decay constant for the virus infectivity is used to quantify the stability of the virus in this study. The stability of the virus is affected by environmental properties such as temperature and humidity (Fears et al., 2020; Santarpia et al., 2020; van Doremalen et al., 2020). Hence, the relationship between the stability of the virus in aerosol and the environment needs to be quantified in this study.
To address the effect of temperature and humidity on the infectivity of the virus, Dabisch et al. (2021) conducted an experiment to analyse the virus infectivity under several conditions. The temperature was adjusted from 10°C to 40°C and the relative humidity was tested from 20% to 70%. They then performed a stepwise regression analysis to obtain a regression model. This regression model is utilized to obtain the equation for the stability of the virus in aerosol (Ka) as a function of temperature (T) and relative humidity (RH) for this study, as shown in Equation (5). Note that a higher decay constant indicates a more rapid decrease in the infectivity of the virus in aerosol in a location.
The stability of the virus on a surface is also considered in this study and used as the second of the derived parameters in the location node. Several studies claim that the virus can stay on various surfaces and remain infective from an hour to several days (Guo et al., 2020; Santarpia et al., 2020; van Doremalen et al., 2020). The virus generated by the patient via droplets can fall on the surface of an object and can then be touched by people. The infectivity of the virus on the surface is influenced by several factors such as temperature, relative humidity and surface type.
In order to obtain the relationship for the infectivity of the virus on a surface, an experiment was conducted by Biryukov et al. (2020) in which the temperature was controlled from 24°C to 35°C and the relative humidity was adjusted from 20% to 80%; the surfaces chosen were stainless steel and acrylonitrile butadiene styrene (ABS) plastic, which are the materials most commonly seen in public places. A linear regression analysis was performed to fit the experimental data, and the half-life of the virus was the quantity of interest in the experiment. Since the decay constant acts as the unit of measurement for virus stability, the half-life equation obtained from that experiment is converted into a decay constant format, as shown in Equation (6).
The third parameter is Fl, which is defined as the frequency of a location node being visited by a patient. The number of visits made by a close contact is not taken into account because only the patient is assumed to carry and spread the virus. Thus, Fl is quantified to integrate the impact of the patient's visits to a location. To quantify Fl, a link matrix is formed where Li is the location node i and Hj is the human node whose human status (Hs) is 1. Since there is only one patient in this network, j = 1, with $h_{1:i}$ representing the number of times the patient visited location i, where i ∈ {1, 2, 3, …, 21}. Therefore, Fl is mathematically defined as in Equation (7).
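The two conversions described above can be sketched as follows. The half-life conversion uses the standard first-order decay relation k = ln(2) / t½; the regression coefficients of Equations (5) and (6) are not reproduced here. The function names and the visit-log format are hypothetical.

```python
import math
from collections import Counter


def decay_constant_from_half_life(half_life: float) -> float:
    """Convert a half-life into a first-order decay constant, k = ln(2) / half-life.

    The result is in the inverse of the half-life's time unit
    (for example, per hour if the half-life is given in hours).
    """
    if half_life <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / half_life


def patient_visit_frequency(patient_visit_log, location_ids):
    """Count how many times the patient visited each location (Fl).

    patient_visit_log: one location ID per recorded visit by the patient.
    location_ids: all location nodes, so unvisited ones are reported as 0.
    """
    counts = Counter(patient_visit_log)
    return {loc: counts.get(loc, 0) for loc in location_ids}


# Made-up example: a shorter half-life gives a larger decay constant.
k_short = decay_constant_from_half_life(2.0)
k_long = decay_constant_from_half_life(10.0)
fl = patient_visit_frequency(["L1", "L1", "L3"], ["L1", "L2", "L3"])
```

A larger decay constant corresponds to a less stable virus, so the direction of each parameter needs to be kept in mind when the parameters are later combined into a link weight.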

Quantification of parameters for human node
There are also three parameters chosen to quantify the human node, as illustrated in Figure 3. The first parameter is the indoor infection risk (Pwr), which affects the close contacts only. Pwr is quantified using the Wells-Riley equation, which is based on the concept of the 'quantum of infection' proposed by Wells (1955) and formulated by Riley et al. (1978). Measles was the first infectious disease to which the Wells-Riley equation was applied for risk assessment. The 'quantum of infection' is a hypothetical infectious dose unit exhaled by an infectious person. The quantum generation rate, q, found in the Wells-Riley equation cannot be measured directly, but it can be estimated, as suggested by Dai & Zhao (2020), from epidemiological data such as the number of incidents and the basic reproduction number, R0. The equation to quantify q is defined by Dai & Zhao (2020) and shown in Equation (9):
$$q = A + B_1 R_0 + B_2 R_0^2 \qquad (9)$$
where A, B1 and B2 are constants equal to -30.27958, -44.81536 and 19.67934, respectively. The R0 for Malaysia is 3.08, as reported by Gill et al. (2020); therefore, q is found to be 18.38 h⁻¹. The pulmonary ventilation rate of a susceptible person, p, is estimated as 0.3 m³ per hour based on Duan (2013). The ventilation rate of a location (Q) found in the Wells-Riley equation is quantified based on the location type, in accordance with the values recommended in the American Society of Heating, Refrigerating and Air-Conditioning Engineers guideline (ASHRAE, 2013). I indicates the number of infectious persons in the location, and t refers to the total duration of stay (D) of the close contact. Putting all of this information together, the quantification of Pwr is defined as in Equation (10).
The parameter D mentioned earlier is the second parameter for the human node. It is defined as the total duration of a person's stay at one location and is measured in hours.
This parameter is calculated as the cumulative duration of an infectious person having visited a particular location. The final parameter required for the human node is the frequency of a person visiting a location. This is simply defined as the number of times person j visited location i, $Fh_{j:i}$, as shown mathematically in Equation (11).
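The indoor infection risk calculation can be illustrated with a short sketch. It uses the quadratic for q with the constants quoted above and the classic Wells-Riley form P = 1 − exp(−I·q·p·t / Q). Equation (10) itself is not reproduced in this text, so the exact form used in the study may differ; in particular, the sketch assumes Q is supplied as a volumetric flow rate in m³ per hour, whereas the data description lists Q per hour. The function names and the example inputs are hypothetical.

```python
import math

# Constants quoted in the text for the quantum generation rate (Dai & Zhao, 2020).
A, B1, B2 = -30.27958, -44.81536, 19.67934


def quantum_generation_rate(r0: float) -> float:
    """Estimate q (quanta per hour) from the basic reproduction number R0."""
    return A + B1 * r0 + B2 * r0 ** 2


def wells_riley_risk(infectors: int, q: float, p: float, t: float,
                     ventilation_m3_per_h: float) -> float:
    """Classic Wells-Riley infection probability, 1 - exp(-I*q*p*t / Q).

    Assumption: Q is a volumetric ventilation flow in m^3 per hour.
    """
    if ventilation_m3_per_h <= 0:
        raise ValueError("ventilation rate must be positive")
    return 1.0 - math.exp(-infectors * q * p * t / ventilation_m3_per_h)


q_malaysia = quantum_generation_rate(3.08)  # R0 reported for Malaysia
# Made-up scenario: one infector, 2 hours of exposure, 500 m^3/h of ventilation.
risk = wells_riley_risk(infectors=1, q=q_malaysia, p=0.3, t=2.0,
                        ventilation_m3_per_h=500.0)
```

With R0 = 3.08 the quadratic reproduces the value of q quoted in the text once rounded to two decimal places. The risk increases with the duration of stay and decreases as ventilation improves, which is why D and Q both enter the human node quantification.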

Quantification of link weight
In order to convert the BCC graph depicted in Figure 2 into a network, the weight of each edge needs to be quantified. The link weight is defined as the contact strength between the human node and the location node, and it is termed the COVID-19 contact strength (CCS). A higher value of the CCS indicates a stronger connection between the human node and the location node in the network. Liew et al. (2015) introduced four rules for combining parameters into a link weight: (i) summation, (ii) multiplication, (iii) summation of products, and (iv) multiplication of sums. The summation rule is used in dengue hotspot identification. Since this study shares the same objective as the dengue hotspot study, namely disease hotspot identification, the summation rule is applied for the link weight quantification at this stage. All of the parameter values are normalized before the link weight is quantified. The link weight quantification is formulated in Equation (12). The complete BCC network is illustrated in Figure 4, with the link weight value shown on all edges. To provide better visualization of the network, the human nodes from H2 to H31 are clustered according to the location at which they came into contact with H1. The link weight values are substituted into each data point of the link matrix as an input to generate the ranking of the human nodes and the location nodes.
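A minimal sketch of the normalise-then-sum step is given below. Equation (12) and the normalization scheme are not reproduced in this text, so min-max scaling and an unweighted sum are assumptions made for illustration; the function names and the toy parameter values are hypothetical.

```python
def min_max_normalise(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def summation_rule(parameter_columns):
    """Combine parameters into one link weight per data point.

    parameter_columns maps a parameter name to a list holding that
    parameter's value for every data point (all lists the same length).
    Each parameter is normalised separately and the results are summed.
    """
    lengths = {len(col) for col in parameter_columns.values()}
    if len(lengths) != 1:
        raise ValueError("every parameter needs one value per data point")
    normalised = [min_max_normalise(col) for col in parameter_columns.values()]
    return [sum(point) for point in zip(*normalised)]


# Toy example with three data points and two parameters (made-up values).
weights = summation_rule({"D": [1.0, 2.0, 3.0], "Fh": [10, 20, 30]})
```

Normalizing first keeps a parameter with a large numeric range, such as a duration in hours, from dominating the sum.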

Implementation of ranking algorithm
To rank the human nodes and the location nodes, the Hypertext-Induced Topic Search (HITS) algorithm is used in this study. HITS is a query-dependent web page ranking algorithm, which makes it more dynamic than other types of web search algorithms. It considers not only the link popularity or node degree but also the link weights of the network. This is one of the reasons for choosing HITS as the ranking algorithm in this study. Another reason is that HITS involves authorities and hubs, where the hub matrix points to the authority matrix; these correspond to the location nodes and human nodes in this study. HITS has been used to rank the hotspots of mosquito-borne diseases, namely malaria (Eze, 2013) and dengue (Kok, 2017), and the suitability of dolphin habitats (Liew, 2016). All of these studies used the BNM to identify hotspots and habitat suitability, which indicates that HITS is suitable for bipartite network analysis.
The CCS matrix computed by Equation (12) is used to generate a hub matrix and an authority matrix as the inputs for HITS to determine the corresponding eigenvalues and eigenvectors for each location node and human node. The R language is chosen to implement the HITS algorithm, using the power method to identify the principal eigenvalues and eigenvectors. The eigenvector of the location nodes is defined as CHR, while the eigenvector of the human nodes is defined as CRR. These CHR and CRR values range from 0 to 1, and the ranking is based on these values evaluated using Equation (13). The CRR values are computed using the same Equation (13), where the Eigenvector is the principal eigenvector generated from the HITS algorithm and n is the number of nodes, which is 21 for CHR and 31 for CRR.
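The power-method iteration behind HITS on a weighted bipartite network can be sketched as follows. The study implemented HITS in R; this illustration is in Python and is not the authors' code. It assumes that rows of the weight matrix are human nodes (treated as hubs) and columns are location nodes (treated as authorities), and it rescales each score vector by its maximum so that values lie between 0 and 1. That rescaling is an assumption and is not Equation (13), which is not reproduced in this text.

```python
def hits_bipartite(weights, max_iterations=1000, tolerance=1e-12):
    """Rank both node sets of a weighted bipartite network with HITS.

    weights[j][i] is the link weight between human j and location i
    (0 if there is no link). Returns (human_scores, location_scores).
    """
    n_humans, n_locations = len(weights), len(weights[0])
    hub = [1.0] * n_humans
    authority = [1.0] * n_locations
    for _ in range(max_iterations):
        # Authority update: a location is important if strongly linked humans visit it.
        new_authority = [sum(weights[j][i] * hub[j] for j in range(n_humans))
                         for i in range(n_locations)]
        scale = max(new_authority)
        if scale == 0:
            raise ValueError("weight matrix has no positive links")
        new_authority = [a / scale for a in new_authority]
        # Hub update: a human scores highly if linked to high-scoring locations.
        new_hub = [sum(weights[j][i] * new_authority[i] for i in range(n_locations))
                   for j in range(n_humans)]
        scale = max(new_hub)
        new_hub = [h / scale for h in new_hub]
        change = max(abs(x - y) for x, y in
                     zip(new_authority + new_hub, authority + hub))
        authority, hub = new_authority, new_hub
        if change < tolerance:
            break
    return hub, authority


# Toy network (made-up weights): the patient visits both locations,
# two close contacts visit only the first location.
toy_weights = [[3.0, 1.0],
               [1.0, 0.0],
               [1.0, 0.0]]
human_scores, location_scores = hits_bipartite(toy_weights)
```

In the toy network the first location and the patient receive the top scores, and the two close contacts tie, which mirrors the pattern reported for L1, H1 and the cluster H2 to H14.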

Results and Discussion
The BCC network is considered fully formulated when all the parameters defined earlier are quantified. As mentioned, there are three parameters for the human node and three for the location node. In terms of node degree, the highest one is location node L1 with a total of 14 links attached to human nodes H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12, H13 and H14. This is followed by L8, L11, L12, L17 and L19, sharing the same number of node degrees, which is 7.
The ranking results of the location nodes are listed in Table 1. As expected, L1 is placed first in the ranking results since it has the highest CHR value and the highest node degree, 14. However, L3 has a node degree of only 7 but is ranked higher than L11, L12, L17 and L19. According to the literature that used the same methodology, the node degree of a location is not a principal factor affecting the location node ranking (Ying et al., 2015). In fact, the CCS values of L3 with its linked human nodes are higher than the CCS values of L11, L12, L17 and L19. This implies that the other parameters of the human nodes and location nodes influence the ranking. Consistent with the literature, the ranking shown in Table 1 supports this statement. Thus, not only does the node degree of a location play a role in identifying the hotspot, but other factors also need to be taken into consideration. With respect to the aim of this study, as anticipated, location L1 is found to be a hotspot since it is the workplace of the infected person H1. Nevertheless, locations L8 and L3 should also be focussed on because both of them are ranked immediately after L1.
The ranking of the human nodes tabulated in Table 2 can be used as a potential indicator to predict the risk of infection for close contacts and to identify a 'super spreader' if there is more than one patient in the network. Since there is only one patient (H1) in the network, the human node with the highest CRR is, as expected, H1, implying that H1 is the person spreading the virus in this network. On the other hand, the close contacts H2 to H14 hold the same CRR values and are clustered into the same rank. They are the close contacts of H1 via location node L1, and since L1 is ranked first among the location nodes, this cluster of close contacts has a higher risk than the other human nodes in this network.
To ensure that the implementation of the HITS algorithm is correct, a benchmark verification is performed using UCINET to generate a benchmark model. UCINET is a network analysis software package that supports many types of network analysis, for instance centrality measurement, elementary network analysis and subgroup identification. One of the built-in functions of UCINET is network centrality analysis, where the input data are a hub matrix and an authority matrix and the output is the eigenvector of the network, termed CHRB for the location nodes and CRRB for the human nodes. Since the hub matrix and authority matrix are also used in this study to generate the ranking, UCINET is a suitable choice for the benchmark model. The root mean square error (RMSE) between CHRB and CHR as well as between CRRB and CRR is evaluated and used as an indicator to verify the model. The model is verified if the RMSE value is less than 0.05. The RMSE between CRRB and CRR is 0.000224 and the RMSE between CHRB and CHR is 0.001419. Since the RMSE values are less than 0.05 in both instances, the implementation process is verified.
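The verification check described above amounts to a standard RMSE computation followed by a threshold test. A minimal sketch is given below; the function names are hypothetical and the score vectors are made-up values, not the CHR or CHRB values from this study.

```python
import math

RMSE_THRESHOLD = 0.05  # verification threshold stated in the text


def rmse(model_values, benchmark_values):
    """Root mean square error between two equally long score vectors."""
    if len(model_values) != len(benchmark_values) or not model_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(m - b) ** 2 for m, b in zip(model_values, benchmark_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def is_verified(model_values, benchmark_values, threshold=RMSE_THRESHOLD):
    """Return True when the RMSE is below the verification threshold."""
    return rmse(model_values, benchmark_values) < threshold


# Made-up ranking values for three nodes.
model_scores = [1.000, 0.420, 0.310]
benchmark_scores = [1.000, 0.421, 0.309]
verified = is_verified(model_scores, benchmark_scores)
```

Note that a low RMSE only shows that the two implementations agree on the same input; it does not validate the model against observed infections.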
At the point of writing, there are no new cases reported among the close contacts (human nodes) of this formulated network. A possible explanation for this is that the patient in this study, H1, had probably not yet developed symptoms. According to Cevik et al. (2020), 5-6 days are required to develop symptoms once exposed to the virus, and the virus transmission capacity of COVID-19 is highest in the first week after symptoms have developed. The diagnosis results obtained by viral detection do not correspond to the infectivity of the patient (Cevik et al., 2020). Thus, the viral load of the patient (H1) in this network may not have been sufficient to infect other people before the positive diagnosis.
Another possible explanation for this is that the patient's exact onset date may be much earlier than the tracing period reported in the investigation form. The reverse transcription polymerase chain reaction (RT-PCR) is a common method for detecting the pathogen of COVID-19 in the human body. A swab sample from the upper respiratory tract, which is between the nose and the mouth, is taken for the RT-PCR test. The RT-PCR can capture the virus in the sample even 17 days after the symptoms appear. This shows that the patient's point of exposure is crucial for estimating the actual onset date.
The dataset used in this study did not state the point of exposure of H1, and this is difficult to estimate because H1 is the only patient in this network. Hence, there is a possibility that some close contacts are left untraced if the actual onset date of H1 is much earlier.
To create a fuller picture, it is suggested that the transmission ability of asymptomatic patients be included in the model. An asymptomatic patient is a patient who has not developed any symptoms of COVID-19 before or after diagnosis, and is mostly detected through close contact tracing. Several studies have claimed that the transmission ability of asymptomatic patients is lower than that of symptomatic and pre-symptomatic patients (Buitrago-Garcia et al., 2020; Qiu et al., 2021). A pre-symptomatic patient is a patient who has no symptoms at the point of diagnosis but develops them afterwards, while a symptomatic patient is a patient who has developed symptoms before the point of diagnosis; the symptoms can be further classified into three levels: (i) mild-moderate illness, (ii) severe illness, and (iii) critical illness. At the current stage, the parameters used in the model concentrate only on the effect of the environmental conditions of the location on virus transmission. The formulation does not consider the kinetic dynamics of the virus in the human body. Even though the transmission potential of the asymptomatic patient is low, it should not be neglected. It is proposed that this could be included in the human node quantification stage, so that the network obtained will be richer as more human and location nodes are added to it.

Conclusion
The main goal of this study is to identify hotspots of COVID-19 transmission by formulating a bipartite network model based on patient mobility data. There are 30 close contacts and 1 patient, with 21 locations visited by the patient, in the network. The stability of the virus on surfaces and in aerosols are the derived parameters of the location node in the network, as they determine the effect of the temperature and relative humidity of the location on the virus. The indoor infection risk is a derived parameter of the human node that evaluates the risk of infection by considering the ventilation rate of a particular location. The BCC network was formulated by quantifying the link weight between each pair of bipartite nodes and was verified using UCINET. Even though there is only one patient (H1) in this network model, the findings show that the BNM can identify the hotspots, which for this study were found to be L1, followed by L8 and L3. It is evident from the results that, in identifying disease hotspots, it is insufficient to rely only on the locations with a higher node degree; other factors must also be taken into consideration, such as location size, location activity type, and the density of people in the location. The limitation of this study is that the data considered are imbalanced: only one patient appears in the network and there are no subsequent cases linked with the patient, H1.
Due to this, the model validation is not included in this paper. Another issue that was not addressed in this study was the transmission ability of asymptomatic patients, which is hypothesized to affect the results because the attack rate of those patients is distinct from that of symptomatic patients. Therefore, further study is needed to analyse the difference between several sets of networks by using different sets of data. In addition, the influence of each parameter needs to be investigated by performing significance and sensitivity analyses. It is hoped that the outcome of this study will be able to assist public health authorities in increasing the efficacy of contact tracing and identifying the source of infection for every patient so that the outbreak of COVID-19 in Malaysia will be contained.