Data Analytics Approach for Short-term Sales Forecasts Using Limited Information in E-commerce Marketplace

E-commerce has become very important in our daily lives. Many business transactions are made easier on this platform. Sellers and consumers are the two main parties that gain a lot of benefits from it. Although many sellers are attracted to set up their businesses on this online platform, it also causes challenges such as a highly competitive business environment and unpredictable sales. Thus, we propose a data analytics approach for short-term sales forecasts using limited information in the e-commerce marketplace. Product details are scraped from the e-commerce marketplace using a content scraping tool. Since the information in the e-commerce marketplace is limited and essential, scraped product details are pre-processed and constructed into meaningful data. These data are used in the computation of the forecasting methods. Three types of quantitative forecasting methods are computed and compared. These are simple moving average, dynamic linear regression and exponential smoothing. Three different evaluation metrics, namely mean absolute deviation, mean absolute percentage error and mean squared error, are used for the performance evaluation in order to determine the most suitable forecasting method. In our experiment, we found that the simple moving average has the best forecasting accuracy among other forecasting methods. Therefore, the application of the simple moving average forecasting method is suitable and can be used in the e-commerce marketplace for sales forecasting.


Introduction
A marketplace is a particular place where buyers and sellers come together to interact directly and indirectly to exchange goods and services for money. A market involves a process which enables demand and supply to operate and determine prices agreed by both buyers and sellers. In the early stage before the Internet explosion, marketplaces were physical commercial locations such as shopping malls, teleshopping via TV, door-to-door, etc. Buyers and sellers typically met physically to promote products and services and negotiate prices with each other.
In the late 1990s, the Internet changed the way people conduct commerce. People can purchase their preferred products and services through the Internet. The main reason that people tend to shop online is that it saves time and energy compared to traditional shopping (Naseri et al., 2021). Besides, consumers can receive the purchased products directly at their door. This convenient purchasing method increases the usage of e-commerce platforms. High usage of these platforms has increased the volume of trade in the e-commerce marketplace exponentially from year to year (Najmi et al., 2015). Apart from that, it has been reported that 50 percent of Malaysians (16.53 million) are online shoppers (International Trade Administration, n.d.). These statistics show that e-commerce platforms have a significant impact on our daily lives.
The increasing number of sellers in the e-commerce marketplace has created a situation where the market is crowded with many privately labelled items of similar products. The volume and variety of products in the e-commerce marketplace increase greatly with the increasing number of sellers. In order to compete for sales, a seller may cut down revenue and sell the product at a lower price. By using this approach, the seller may attempt to get a lot of sales and gain market share, but the seller cannot earn enough profit to support the business in the long term. Some sellers even practise selling at a loss to gain market share. Therefore, it is important for a seller to choose the right product (e.g., a product in high demand with relatively little competition). This situation can be improved if the seller performs sales forecasting, such as a short-term forecast for a product, before selling it. This is helpful for a seller competing in a crowded marketplace, especially for new sellers.
Besides competition, lack of meaningful data on the platform interface is another issue. Although the digitalisation of the marketplace generates many data on product details such as product name, specifications, manufacturer, distributor, etc., these data are insufficient for sellers to identify profitable products. Sellers cannot use the data directly to make comparisons. One of the main reasons is that the data are considered raw data and meant for buyers instead of sellers. The data shown on the interface are not representative if used independently. Thus, such data are less useful and it is very difficult for sellers to make an informed decision in choosing the right product.
In order to solve these problems, product research tools have been introduced. A product research tool is a necessary tool for sellers venturing into the e-commerce marketplace. This tool can be used to filter and shortlist products in high demand, which helps sellers narrow down their options and find the most promising and suitable product. Examples of such product research tools include Jungle Scout, ZonGuru, Viral Launch and CamelCamelCamel. Although these tools are very useful, they are platform-specific and proprietary. For example, Jungle Scout is only applicable to Amazon.com. It would be helpful if there were an open model applicable to different platforms. To the best of the authors' knowledge, no other research paper in the literature focuses on the product research method.
We understand that short-term sales forecasts are necessary for highly competitive marketplaces. However, sufficient historical data are required to perform accurate forecasting. The accuracy of sales forecasting will be affected by the limited information in the digitalised marketplace. Therefore, limited information on each product (especially quantitative data) is studied to make good use of it and to ensure accurate forecasting results. In this study, we propose a data analytics approach for short-term sales forecasting using limited information in the e-commerce marketplace. We are aware that the product research tool is a complete system that can also be used to discover potential products. However, the core of the product research tool is the computational method used to discover potential products. Therefore, our focus is on the computational method, which is the quantitative forecasting approach.
In recent years, many quantitative forecasting models have been developed and each of them has its special use. To carry out a short-term forecast using limited information in the e-commerce marketplace, a forecasting model must be computationally simple and accurate. This is because sellers often require quick and accurate results from the forecast to plan ahead before making any important decisions. Hence, three types of quantitative forecasting methods, namely simple moving average (SMA), linear regression (LR) and exponential smoothing (ES), are investigated and compared in this study. In order to determine the forecasting model with the highest accuracy when using limited information, three evaluation metrics are applied: mean squared error (MSE), mean absolute deviation (MAD) and mean absolute percentage error (MAPE). The rest of this paper is organised as follows: Section 2 presents the literature review, which focuses on data mining and other research on quantitative forecasting. Section 3 discusses the proposed methodology in detail. Section 4 describes the experimental results. Finally, Section 5 highlights the contribution of our work and concludes the paper.

Literature review
Data mining is a process that extracts hidden information from a larger set of raw data. The data mining technique can help us study the trend and patterns of collected data. It gives us a big picture of the data that we might want to study. Thus, decision makers can make better decisions based on analysing the results of data mining instead of acting blindly. It is very common to see data mining applied in the area of marketing. For example, data mining is used to extract useful and hidden information to support marketing decisions (Bach & Alessa, 2014). According to Bach & Alessa (2014), credit card marketers make better decisions and target the right customers with the help of a system that employs the data mining technique. In another example, marketers from a lending organisation had integrated data mining to form a system called credit scoring. The system is used to overcome the complex and unstructured credit risk evaluation (Kambal et al., 2013). By using this tool, marketers can identify the credit risk and minimise non-performing loans to avoid unnecessary losses.
Data mining is also used in forecasting. Forecasting is a process that predicts a future occurrence or event using historical data (Hanggara, 2021). For example, sellers perform forecasting to foresee possible problems such as overstocking, stockout, demand and sales issues that happen frequently in many businesses. Forecasting can improve a company's stocking flexibility. Company owners can change their strategy and allocate resources accordingly. For example, they can reduce the stock of a product when they have forecasted that the product is not in high demand, and invest in other products in higher demand. With such a forecasting approach, the company can stay flexible and react to different possible risks in order to reduce unnecessary losses. This is because high profit and large flexibility of business operation often relate to precise forecasting (Ren et al., 2020). Therefore, forecasting is an essential technique to help a company be more operationally flexible and grow.
One of the challenges in sales forecasting is the rapidly changing and complicated business environment in the e-commerce marketplace (Zhao & Wang, 2017). Such a business environment affects the accuracy of sales forecasts and can cause wrong decisions to be made. In order to deal with it, a short-term sales forecast should be carried out. Although there is limited research on short-term sales forecasts using limited information in the e-commerce marketplace, short-term forecasts of passenger demand for on-demand ride services illustrate the importance of short-term forecasts in a rapidly changing environment. An accurate short-term passenger demand forecast is necessary to help taxi drivers and on-demand ride service platforms overcome unhealthy scenarios such as oversupply and excess demand (Ke et al., 2017). According to Ke et al. (2017), short-term forecasts are able to shorten passengers' waiting times while optimising the utilisation rate of the drivers. Such evidence suggests that short-term sales forecasts may be suitable for the rapidly changing e-commerce marketplace in order to optimise inventory management and sellers' profit.
Common forecasting used in businesses frequently relates to a time series quantitative forecasting model. This model can predict the future value based on trends and patterns of past data. A time series quantitative forecasting model is applicable in many fields as long as there exist past data in a time series. The application of different time series quantitative forecasting methods will be discussed.
The application of simple moving average (SMA) to forecast car demand in Indonesia during the outbreak of the COVID-19 pandemic helped the automotive industry reduce unnecessary overproduction of vehicles (Hanggara, 2021). According to Hanggara (2021), the automotive industry was affected when the public was instructed to stay at home during the pandemic. Nadhira, Gadisku & Peranginangin (2021) also claim that SMA is simple but efficient for determining trends in the marketplace. According to Nadhira et al. (2021), sales demand of the personal care product Softex 1400-M can be forecasted using SMA to prevent stockouts or overstocks.
Apart from SMA, linear regression is another type of time series quantitative forecasting model. A study from Bangladesh used linear regression in crime investigation (Awal et al., 2016). Previous crime data were used to train the linear regression model to forecast the future trend of crime. The authors forecasted the trend of crime, including dacoity, robbery, burglary and theft.
Exponential smoothing is another type of time series quantitative forecasting model. It smooths the time series data using a smoothing factor to reduce fluctuations in the forecast values. The smoothing factor is a value between 0 and 1. A higher smoothing factor results in less smoothing and higher responsiveness to variations in the historical data. In Sinaga & Irawati (2020), exponential smoothing was used to forecast medical disposable supply demand. Smoothing factors (α) of 0.1 and 0.3 were used for the computation of exponential smoothing. They used previous actual demand data as the input for the forecasting model. They achieved superior performance by using an α of 0.1. Besides, another study by Aini et al. (2017) used exponential smoothing to forecast the sales of 3kg LPG cylinders. The sales data were fed into the forecasting model with α values of 0.1, 0.5 and 0.9. The authors concluded that the α value of 0.9 had the least error in their experiment.
Accuracy evaluation is an important process which is used to measure and understand the quality of a forecasting model. It indicates the performance of a forecasting model in a specific area. Several common evaluation metrics are used to measure the accuracy, including mean squared error (MSE), mean absolute deviation (MAD) and mean absolute percentage error (MAPE) (Nadhira et al., 2021). MSE and MAD share similar properties as both metrics are means of accumulated errors. The only difference is that MSE is the mean of the squared errors while MAD is the mean of the absolute errors. A higher value of MSE or MAD indicates that the forecasting result is less accurate compared to a lower value. On the other hand, MAPE expresses the prediction error as a percentage of the actual value (Khair et al., 2017). Although using any of the evaluation metrics alone can achieve the purpose of evaluation, using a combination of them can often help cross-check the experiment results and reduce uncertainty. Such a combination can be found in Kilimci et al. (2019), where MAPE and MAD were combined for the second decision integration strategy to determine the accuracy of the forecasting models.
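For reference, the three metrics described above can be written in their usual textbook form. The symbols are assumptions introduced here for illustration and do not come from the cited works: $A_t$ is the actual value at step $t$, $F_t$ is the forecasted value and $n$ is the number of forecasted steps.

```latex
\mathrm{MAD}  = \frac{1}{n}\sum_{t=1}^{n} \lvert A_t - F_t \rvert
\qquad
\mathrm{MSE}  = \frac{1}{n}\sum_{t=1}^{n} (A_t - F_t)^2
\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n} \left\lvert \frac{A_t - F_t}{A_t} \right\rvert
```

Because MSE squares each error, it penalises large deviations more heavily than MAD, while MAPE is scale-free but undefined whenever $A_t = 0$.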

Methodology
The proposed methodology involves three main phases: data collection, data formation and quantitative forecasting. First, we use a content scraping tool to collect historical sold quantity data for each product from an e-commerce platform. After that, the raw data are pre-processed to yield the historical sales data during the data formation phase. The data formation phase consists of two sub-phases: data pre-processing and data construction. The historical sales data are then used in the quantitative forecasting phase. During this phase, different quantitative forecasting methods are involved in the computation to generate forecasted sales. Lastly, the forecasted results are computed. Figure 1 shows the general architecture of the proposed methodology.

Data collection
In this paper, the Shopee Malaysia e-commerce platform was selected among other existing e-commerce platforms. This is because the Shopee website provides useful raw data, namely the sold quantity of each product. The sold quantity records the accumulated sales quantity, which can be transformed into historical sales and used for forecasting. In order to extract these data, an automated scraping module was developed using Selenium Automation. During the scraping process, the module extracted the data from the HTML webpages and stored them in the database. Before the scraping process started, 300 products from each category were selected randomly. This random selection created a pool of 1800 products. Since not all 1800 products were relevant to our study, an exclusion process was required. Foreign products (outside Malaysia), live shows and payment services, which are not considered tangible products, were excluded. After the exclusion process, 100 products from each category were chosen randomly from the remaining valid products. Each of these selected products was used as the seed for the daily scraping processes. Therefore, a total of 600 products were scraped daily during the data collection phase. The data used for this paper were scraped from 10 January 2022 to 28 February 2022.

Data formation
The collected raw data were pre-processed and constructed during the data formation phase. There are two operations during the pre-processing phase. The first operation is to convert scraped raw data such as Sold Quantity to a number format. For example, '1.3k' is converted into the whole number 1300. The second operation is to handle invalid data. Essential information such as Sold Quantity is important for the forecasting and cannot be left empty. Thus, the data are pre-processed with an invalid data replacement procedure. In our proposed replacement procedure, each invalid value is replaced with the previous Sold Quantity. This replacement overcomes the issue of discontinuity in Sold Quantity. The workaround is reasonable because it mimics the situation where there are no Daily Sales for that particular day. Invalid data may occur during the data collection for several reasons. For example, the developers of the e-commerce website may periodically update their website, and this may cause some data to no longer exist. Other reasons include slow webpage loading and rendering, which will trigger a timeout of the scraping tool. Table 1 compares the errors of the raw and pre-processed data for a partial sample. It shows that the error due to discontinuity of Sold Quantity can be reduced if invalid data are replaced by the previous data. Since Sold Quantity is the accumulated daily sales quantity, we can derive the Daily Sales by calculating the difference between the current and previous Sold Quantity. This calculation is conducted during the data construction phase. Because the data collection during the scraping process is organised daily, we label each scraped data item as Scrape-k, where k is a running number. The calculated Daily Sales were used as the input for the computation of the forecasting model.
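The data formation steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and the sample values are hypothetical, and only the 'k' suffix mentioned in the text is handled.

```python
from typing import List, Optional


def parse_sold_quantity(raw: Optional[str]) -> Optional[int]:
    """Convert a scraped string such as '1.3k' or '875' into an integer.

    Returns None when the value is missing or cannot be parsed (invalid data).
    """
    if raw is None:
        return None
    text = raw.strip().lower().replace(",", "")
    if not text:
        return None
    multiplier = 1
    if text.endswith("k"):
        multiplier = 1000
        text = text[:-1]
    try:
        return int(round(float(text) * multiplier))
    except (ValueError, OverflowError):
        return None


def fill_invalid(values: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each invalid entry with the previous Sold Quantity."""
    filled = []
    previous = None
    for value in values:
        if value is None:
            value = previous  # stays None only if no valid value was seen yet
        filled.append(value)
        previous = value
    return filled


def daily_sales(sold_quantity: List[int]) -> List[int]:
    """Difference between the current and previous accumulated Sold Quantity."""
    return [cur - prev for prev, cur in zip(sold_quantity, sold_quantity[1:])]


# Hypothetical scrapes for one product; None marks a failed scrape.
raw_scrapes = ["1.3k", "1.3k", None, "1.4k", "1.5k"]
parsed = [parse_sold_quantity(r) for r in raw_scrapes]
cleaned = fill_invalid(parsed)
sales = daily_sales(cleaned)
```

Replacing a failed scrape with the previous value yields a Daily Sales of zero for that day, which matches the interpretation given in the text.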

Time series quantitative forecasting
From the previous phases, we know that Sold Quantity is the limited information available on the e-commerce platform. After extracting Daily Sales (time series data) from the Sold Quantity, we used different time series quantitative forecasting models to compute the forecasting results and identify the best model. The time series quantitative forecasting models are simple moving average (SMA), linear regression (LR) and exponential smoothing (ES). Besides, the forecasting model should have short-term forecasting characteristics in order to help sellers compete on the market. This means that older data should be excluded from the computation. Therefore, the moving window concept is applied to the existing LR forecasting model. The modified forecasting model that applies the moving window is dynamic linear regression (DLR). The moving window is a technique that takes a fixed-size subset of the data and, as it shifts across time, adds a new value to the subset while removing the oldest value. Research shows that a larger window size filters more noise and gives better smoothness (Raudys & Pabarskaite, 2018). A moving window of 10 days was therefore used throughout this experiment.

Simple moving average
Simple moving average (SMA) is a value-smoothing forecasting method that computes the average as the output from a group of input values (Pataropura et al., 2019). The formula for SMA is as follows:
$$\mathrm{SMA} = \frac{1}{w}\sum_{i=n-w+1}^{n} x_i$$
where $x_i$ is the data, $n$ is the total number of the data and $w$ is the window size.
The group of input values to the SMA calculation are those data enclosed within the moving window at the current step. As the moving window shifts forward, the first value within the window is excluded and a new one in the data list is included in the window. This moving window calculation is continued until the end of the list.
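The moving window calculation can be sketched in a few lines. This is an illustrative sketch rather than the authors' code; the function name and the sample Daily Sales are made up, and the average of each window is assumed to serve as the forecast for the step immediately after that window.

```python
from typing import List


def sma_forecasts(daily_sales: List[float], window: int = 10) -> List[float]:
    """One-step-ahead SMA forecasts.

    The i-th returned value is the mean of daily_sales[i : i + window] and is
    used as the forecast for position i + window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    forecasts = []
    for start in range(len(daily_sales) - window + 1):
        chunk = daily_sales[start:start + window]  # values inside the window
        forecasts.append(sum(chunk) / window)
    return forecasts


# Made-up Daily Sales; a window of 3 keeps the example short.
sales = [4, 6, 5, 7, 8, 6, 9]
forecasts = sma_forecasts(sales, window=3)
```

The last element of the result is computed from the most recent window and therefore corresponds to the next, not yet observed, scrape.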

Dynamic linear regression
Linear regression (LR) is a forecasting method that predicts the value of a dependent variable as a linear function of an independent variable (Awal et al., 2016). The formula of linear regression is as follows:
$$y = a + bx$$
where $y$ is the dependent variable, $x$ is the independent variable, $n$ is the number of data used, $b$ is the slope and $a$ is the y-intercept. $b$ and $a$ are obtained using the following equations:
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad a = \frac{\sum y - b\sum x}{n}$$
In this paper, the dependent variable refers to Daily Sales while the independent variable refers to Scrape number. LR uses the least squares method to fit a straight line to the data. The value of each point to form the straight line is computed using the past data. The estimated value can be determined by observing the value along the straight line.
Based on the concept of linear regression, the slope is constant across time. This implies that the rate of change of Daily Sales is always the same. However, time series data have fluctuation characteristics. Fluctuation means irregular rises and drops in the data values. Therefore, a dynamic approach is required to best fit the fluctuation characteristics. In order to achieve dynamic adaptability, a moving window mechanism is incorporated into the LR. This enhanced method is called dynamic linear regression (DLR). DLR has the same formula as LR with the exception that the inputs used for each computation are based on a moving window. When the moving window of DLR is shifted forward at every step, a localised slope of the regression is produced based on the data points within the moving window. These localised slopes generate fluctuating forecasting results that may match the fluctuating pattern of the time series. This is a desired characteristic as it can serve as a correction mechanism for over- and under-forecasting. Figure 2 shows a comparison of linear regression and dynamic linear regression. Based on Figure 2, the Daily Sales of LR increase at a steady rate while the Daily Sales of DLR show a fluctuating pattern. This is because DLR refits the line to the most recent actual data at every step. Therefore, DLR can adapt to the fluctuating pattern of time series data and forecast based on recent data. Table 2 shows the application of dynamic linear regression using the sample data.
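The following sketch shows one way the moving window can be combined with a least-squares fit. It is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, the Scrape numbers are taken to be 1-based, and each window's fitted line is assumed to be extrapolated one step ahead to obtain the forecast.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - b * sum_x) / n
    return a, b


def dlr_forecasts(daily_sales: List[float], window: int = 10) -> List[float]:
    """Refit the line on each moving window and extrapolate one step ahead."""
    forecasts = []
    for start in range(len(daily_sales) - window + 1):
        x = list(range(start + 1, start + window + 1))  # Scrape numbers in window
        y = daily_sales[start:start + window]           # Daily Sales in window
        a, b = fit_line(x, y)                           # localised intercept, slope
        forecasts.append(a + b * (start + window + 1))  # forecast for next scrape
    return forecasts


# Made-up, perfectly linear Daily Sales: every local slope equals 2.
forecasts = dlr_forecasts([2, 4, 6, 8, 10], window=3)
```

With real Daily Sales the slope differs from window to window, which produces the fluctuating forecasts described above, whereas a single LR fit over all data would keep one constant slope.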

Exponential smoothing
Exponential smoothing (ES) is a time series smoothing technique in which the weights given to past observations decrease exponentially over time (Sinaga & Irawati, 2020). The formula for ES is as follows:
$$F_t = \alpha A_{t-1} + (1 - \alpha) F_{t-1}$$
where $A_{t-1}$ is the previous Daily Sales, $F_{t-1}$ is the previous Forecasted Daily Sales and $\alpha$ is the smoothing factor.
The smoothing level of ES is determined by the α value. A smaller α value produces a higher smoothing level of ES. When a smaller α value is used, it causes ES to be less responsive to the variations in historical data. In this paper, ES with α values of 0.1, 0.5 and 0.9 was used. These three values were chosen to determine the effects of different smoothing factors on sales forecasting and were denoted as ES1, ES5 and ES9, respectively. Figure 3 shows the forecasting results using different smoothing factors. Table 3 shows an example of forecasting using exponential smoothing with an α value of 0.9 based on the sample data.
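The recursive nature of exponential smoothing can be sketched as below. The function name and sample values are hypothetical, and the choice to initialise the first forecast with the first observed Daily Sales is an assumption, since the initialisation used in the experiment is not stated here.

```python
from typing import List, Optional


def es_forecasts(daily_sales: List[float], alpha: float,
                 initial_forecast: Optional[float] = None) -> List[float]:
    """Exponential smoothing: F_t = alpha * A_{t-1} + (1 - alpha) * F_{t-1}.

    forecasts[t] is the forecast for step t; the final element is the forecast
    for the next, not yet observed, step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not daily_sales:
        return []
    # Assumption: seed the recursion with the first actual value.
    forecast = daily_sales[0] if initial_forecast is None else initial_forecast
    forecasts = [forecast]
    for previous_actual in daily_sales:
        forecast = alpha * previous_actual + (1.0 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


# Made-up Daily Sales, smoothed with the three factors named in the text.
sales = [10, 20, 30]
es1 = es_forecasts(sales, alpha=0.1)
es5 = es_forecasts(sales, alpha=0.5)
es9 = es_forecasts(sales, alpha=0.9)
```

On rising data such as this, the forecast with α = 0.9 tracks the latest observation closely while the forecast with α = 0.1 lags behind, which illustrates the trade-off between responsiveness and smoothing.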

Evaluation metrics
In order to evaluate the performance, we used MAD, MAPE and MSE. MAD gives the error directly in sales units. For example, a MAD of 5 means that the forecasted result deviates from the actual result by 5 units on average. The lower the error, the higher the accuracy. As discussed in Section 3.1, a total of 600 different products were selected for the experiment. The performance of each product was evaluated using MAD, MAPE and MSE. The evaluation results were generated for each scraping cycle. In order to obtain an overall result, the MAD, MAPE and MSE of the different products in the same scraping cycle were averaged. Therefore, the averaged MAD, averaged MAPE and averaged MSE were used as the final evaluation metrics in this study. Figure 4 illustrates the averaged MAD of the different forecasting methods from Scrape-15 to Scrape-50. From the figure, we can see that the averaged MAD of all the forecasting methods increases over time. Although there is a sudden spike at Scrape-20, this situation is not abnormal as long as all forecasting models rise and drop together. This is because Daily Sales can suddenly rise and drop due to special circumstances such as promotions. The sudden spike is possibly due to the upcoming Chinese New Year 2022 promotion. Figure 5 illustrates the averaged MAPE of the different forecasting methods from Scrape-15 to Scrape-50. The pattern of the lines in Figure 5 is similar to that in Figure 4 for each of the forecasting models. Based on Figure 4, ES1 and SMA have similar averaged MAD after Scrape-30, while Figure 5 shows that SMA has the least averaged MAPE after Scrape-30. In order to further verify that SMA has the least error, another evaluation metric was used. Figure 6 shows the averaged MSE of the different forecasting methods from Scrape-15 to Scrape-50.
Figure 6 shows that SMA outperformed the other forecasting models and recorded the least averaged mean squared errors throughout the experiment. Based on the results shown in Figures 4, 5 and 6, the performance of SMA is superior compared to the other forecasting models. Therefore, SMA is the proposed time series quantitative forecasting model for short-term sales forecast using limited information.
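The per-product metrics and their averaging across products can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and numbers are hypothetical, and steps with zero actual Daily Sales are skipped in MAPE because the percentage error is undefined there; how such steps were handled in the experiment is not stated.

```python
from typing import Dict, List, Sequence, Tuple


def mad(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute deviation between actual and forecasted values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean squared error between actual and forecasted values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumption: steps whose actual value is zero are skipped.
    """
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    if not ratios:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(ratios) / len(ratios)


def averaged_metrics(
    products: List[Tuple[Sequence[float], Sequence[float]]]
) -> Dict[str, float]:
    """Average each metric over all products (one (actual, forecast) pair each)."""
    count = len(products)
    return {
        "MAD": sum(mad(a, f) for a, f in products) / count,
        "MAPE": sum(mape(a, f) for a, f in products) / count,
        "MSE": sum(mse(a, f) for a, f in products) / count,
    }


# Two made-up products, each with actual and forecasted Daily Sales.
products = [([10, 20], [8, 25]), ([5, 5], [5, 7])]
result = averaged_metrics(products)
```

Running the same averaging for each forecasting model and each scraping cycle would give the kind of per-model curves that Figures 4, 5 and 6 report.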

Experimental results
To further illustrate the effectiveness of the proposed method, we randomly selected one product that was not tested in the above experiments to evaluate the forecasting result. The product is Macaron Stereo Wired Earphone. Since the process involves many data, we list the comparison from Scrape-15 to Scrape-20 only, for ease of presentation. Table 4 shows the Forecasted Sold Quantity and Actual Sold Quantity for Macaron Stereo Wired Earphone. Table 5 shows the Absolute Error and Relative Error for Macaron Stereo Wired Earphone. These errors were computed from the differences between the actual and forecasted Sold Quantity for each model in Table 4. Based on Table 5, we found that SMA has the least absolute and relative error. This result is also in line with the evaluation results using averaged MAD, averaged MAPE and averaged MSE in Figures 4, 5 and 6, respectively. The absolute error and relative error of SMA from Scrape-15 to Scrape-20 are 400 and 0.0067%, respectively.

Conclusion
In this paper, scraped quantitative data were pre-processed and constructed into appropriate data that could be used for a forecasting experiment. In order to identify a suitable forecasting model for the e-commerce marketplace, we conducted several experiments using different forecasting models: simple moving average, dynamic linear regression and exponential smoothing. The results indicate that simple moving average is the most suitable forecasting model because it showed the least averaged mean absolute deviation, averaged mean absolute percentage error and averaged mean squared error among the compared models.
The simple moving average forecasting model uses a direct averaging calculation that consumes less computational resources and can return the forecasting result quickly. Hence, it is more practical for use as a forecasting model as sellers often require quick information in the competitive e-commerce marketplace.
Additionally, the forecasting model using simple moving average is one of the main components of more comprehensive software for sellers, namely product research tools. In order to fully utilise the proposed forecasting model in the e-commerce marketplace, the proposed forecasting model could be developed as a plugin module for product research tools. Sellers can use the forecasting feature of the product research tool to find potential products to be added to their selling catalogue.
Without a forecasting module, a product research tool can only show static information such as product rating or specification. This is incomplete as the level of demand for the product is missing. Sellers of a low-demand product would find it difficult to compete in a crowded e-commerce marketplace.
On the other hand, sellers can identify the performance of a product on the market by using the forecasting feature of the product research tool. Sellers can forecast the Daily Sales of a product and use the forecasts to determine and compare the demand for different products. Therefore, it is useful to forecast with the simple moving average model as the quantitative approach before deciding which product to sell.
As future work, a larger dataset will be included to explore a big data approach. We also plan to include more data on product details such as reviews, ratings and prices to improve the forecasting model. This could increase the feature dimension and help produce a more accurate forecasting model. This improved forecasting model will be suitable for extension to a comprehensive product research tool. Besides more data, more variety in product categories within an e-commerce platform can also be considered. Increasing the data space will provide more learning attributes and will be beneficial to computational approaches such as machine learning and big data.