Predicting merchant future performance using privacy-safe network-based features
For our analyses, we construct a merchant network from customer co-purchase (edge) information extracted from the credit card transaction dataset. We then compute the previously introduced centrality and diversity features for each merchant (node) in the derived network. Next, we use our proposed performance evaluation metric to label merchants according to their rank among their counterparts in the same MCC, considering the rate of change in their revenue, number of transactions, and number of unique customers from one period to the next. Finally, using the extracted feature sets (e.g., network-based features) together with the performance labels, we apply several machine learning methods to evaluate how well individual features and feature sets predict the future performance of merchants.
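The network construction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the study: it assumes that two merchants are linked when at least one customer purchased from both, that the edge weight counts such shared customers, and that the input is a list of hypothetical `(customer_id, merchant_id)` records. The function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def build_copurchase_network(transactions):
    """Return {(merchant_a, merchant_b): shared_customer_count}.

    Assumption: an undirected edge links two merchants that share
    at least one customer; the weight counts the shared customers.
    """
    merchants_by_customer = defaultdict(set)
    for customer_id, merchant_id in transactions:
        merchants_by_customer[customer_id].add(merchant_id)

    edge_weights = defaultdict(int)
    for merchants in merchants_by_customer.values():
        # Every pair of merchants visited by the same customer gets one count.
        for a, b in combinations(sorted(merchants), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


def degree_centrality(edge_weights):
    """Normalized degree: number of neighbours divided by (n - 1).

    Only merchants that appear in at least one edge are counted in n.
    """
    neighbours = defaultdict(set)
    for a, b in edge_weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {m: len(nb) / (n - 1) for m, nb in neighbours.items()}


# Toy records (made up for illustration only).
transactions = [
    ("c1", "m1"), ("c1", "m2"),
    ("c2", "m2"), ("c2", "m3"),
    ("c3", "m1"), ("c3", "m2"),
]
edges = build_copurchase_network(transactions)
centrality = degree_centrality(edges)
```

In this toy input, merchant `m2` shares customers with both other merchants and therefore receives the highest degree centrality.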
Label analysis
Among the 1,977 merchants studied, 590 (29.84%) are labeled as well-performing, 818 (41.38%) as medium-performing, and 569 (28.78%) as poorly-performing. To ascertain whether the labels reliably indicate the future performance of merchants, and to assess the potential influence of merchant location and customers’ socio-demographic factors, such as income and wealth, on the assigned labels, we conduct the following analyses.
Label indication
To investigate whether the defined labels can distinguish between well-performing and poorly-performing merchants and provide insights into merchants’ longer-term performance, we compare merchants that have similar magnitudes of revenue, number of customers, and transaction counts in the first period but opposite labels (i.e., poorly-performing vs. well-performing), taking into account their performance in the second period.
To this end, we first convert the revenue, transaction count, and number of unique customers of merchants during the first period (first 6 months) into quartiles (i.e., Q1, Q2, Q3, and Q4). Table 3 illustrates an example of the resulting data table structure by providing a random sample of rows. Then we choose the merchant pairs from the same MCC and the same quartiles of revenue, transaction count, and the number of unique customers in the first period.
We keep only the pairs that are labeled oppositely based on their second-period performance indicators. The retained merchant pairs therefore satisfy two conditions: (1) both merchants fall into the same quartiles of merchant revenue, transaction count, and distinct customer count, and (2) one of them is labeled as well-performing and the other as poorly-performing. There are 11,813 pairs of merchants in our dataset that satisfy both conditions. Table 4 provides an example of the data table that results from using the information presented in Table 3 as input.
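The binning and pair selection described above can be sketched as follows. This is an illustrative sketch only: the rank-based quartile rule, the dictionary field names (`mcc`, `rev_q`, `txn_q`, `cust_q`, `label`), and the sample rows are assumptions introduced here, not taken from Tables 3 and 4.

```python
from itertools import combinations


def quartile_labels(values):
    """Map each value to 'Q1'..'Q4' by its rank position.

    Assumption: equal-sized rank bins; ties are broken by input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = "Q%d" % (rank * 4 // len(values) + 1)
    return labels


def opposite_label_pairs(merchants):
    """Return (well_id, poor_id) pairs sharing MCC and all three quartiles."""
    keys = ("mcc", "rev_q", "txn_q", "cust_q")
    pairs = []
    for a, b in combinations(merchants, 2):
        same_bucket = all(a[k] == b[k] for k in keys)
        if same_bucket and {a["label"], b["label"]} == {"well", "poor"}:
            well, poor = (a, b) if a["label"] == "well" else (b, a)
            pairs.append((well["id"], poor["id"]))
    return pairs


# Made-up rows for illustration; the MCC values are placeholders.
merchants = [
    {"id": "m1", "mcc": 5411, "rev_q": "Q2", "txn_q": "Q2", "cust_q": "Q2", "label": "well"},
    {"id": "m2", "mcc": 5411, "rev_q": "Q2", "txn_q": "Q2", "cust_q": "Q2", "label": "poor"},
    {"id": "m3", "mcc": 5411, "rev_q": "Q2", "txn_q": "Q2", "cust_q": "Q3", "label": "poor"},
    {"id": "m4", "mcc": 5812, "rev_q": "Q2", "txn_q": "Q2", "cust_q": "Q2", "label": "poor"},
    {"id": "m5", "mcc": 5411, "rev_q": "Q2", "txn_q": "Q2", "cust_q": "Q2", "label": "medium"},
]
pairs = opposite_label_pairs(merchants)
```

Only `m1` and `m2` form a valid pair here: `m3` differs in one quartile, `m4` has a different MCC, and `m5` is medium-performing.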
Next, using the ordinary least squares (OLS) method, we compute three fitted-line slopes for each merchant, taking their monthly revenue, transaction count, and number of unique customers as dependent variables and the month number as the independent variable. These slopes provide an overall indication of a merchant’s performance with respect to each variable over 12 consecutive months. Equation (12) specifies the regression model, where \(\beta _1\) denotes the value we use as the fitted-line slope in our analysis.
$$\begin{aligned}&Y = \beta _0 + \beta _1 X + \epsilon \\&Y = (Y_1,\ldots ,Y_{12})^\intercal ,\; X = (1,\ldots ,12)^\intercal ,\; \text {and} \; \epsilon = (\epsilon _1,\ldots ,\epsilon _{12})^\intercal \end{aligned}$$
(12)
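For a single predictor, the OLS estimate of \(\beta _1\) in Eq. (12) is the ratio of the sample covariance of X and Y to the sample variance of X. A minimal sketch, assuming the monthly values are supplied as a plain list ordered from month 1 onward (the function name and sample series are illustrative):

```python
def ols_slope(y):
    """Slope beta_1 of the least-squares line through (1, y_1), ..., (n, y_n)."""
    n = len(y)
    months = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    # beta_1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in zip(months, y))
    sxx = sum((x - x_mean) ** 2 for x in months)
    return sxy / sxx


# Made-up series that grows by 5 units per month over 12 months.
monthly_revenue = [100 + 5 * m for m in range(1, 13)]
slope = ols_slope(monthly_revenue)
```

For this perfectly linear series, the fitted slope recovers the monthly increment of 5 up to floating-point error.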
Figure 2 shows the time series plots for the three instances provided in Table 4. Each sub-figure (i.e., 2a, b, and c) depicts a merchant pair’s monthly revenues (i.e., 2a\(_\text {R}\), 2b\(_\text {R}\), and 2c\(_\text {R}\)), monthly transaction counts (i.e., 2a\(_\text {N}\), 2b\(_\text {N}\), and 2c\(_\text {N}\)), and monthly distinct customer counts (i.e., 2a\(_\text {C}\), 2b\(_\text {C}\), and 2c\(_\text {C}\)) over 12 months of historical credit card transaction records. The vertical blue line in each plot splits the time frame into the two 6-month periods used for labeling. All pairs are chosen from the same quartile of customer count, transaction count, and revenue in the first period and may have similar trends during the first 6 months. In contrast, the two merchants in each pair have opposite labels (i.e., well-performing vs. poorly-performing) according to the values of their performance indicators in the subsequent period. As Figure 2 illustrates, the merchants in each pair perform comparably during the first period but differ markedly during the second. The lines in each chart are the OLS fitted lines, whose slopes equal \(\beta _1\) in Eq. (12); for visualization purposes, the fitted lines for merchants labeled as well-performing are drawn in green and those for merchants labeled as poorly-performing in red. These lines indicate each merchant’s overall performance trend in revenue, transaction count, and distinct customer count over the course of 12 months.
Figure 2. Each sub-figure (i.e., a, b, and c) depicts three time series plots for a different merchant pair with the same quartiles of revenue, transaction count, and unique customer count in the first period, but opposite labels. In each sub-figure, the time series plots show monthly revenue (top: R), monthly transaction count (middle: N), and monthly unique customer count (bottom: C), including the OLS regression line for each merchant over the course of 1 year. In the figure legend, 0 corresponds to merchants labeled as poorly-performing, and 2 corresponds to those labeled as well-performing.
In the subsequent step, we compare the \(\beta _1\) values for each pair of merchants. For each slope pair, if the slope of the merchant labeled as well-performing is greater than or equal to that of the merchant labeled as poorly-performing, we assign 1 to the slope indicator; otherwise, we assign 0. We then sum the three resulting indicators. If the well-performing merchant has the higher \(\beta _1\) value for all three indicators, the sum equals 3. Conversely, if the well-performing merchant has the lower value for all three indicators, the sum equals 0. These calculations are defined formally in Eqs. (13–16).
$$\begin{aligned} I_{\beta _{1}}^{R} = {\left\{ \begin{array}{ll} 1 &amp; \text {if}\; \beta _{1}^{R_{\text {well-performing}}} \ge \beta _{1}^{R_{\text {poorly-performing}}} \\ 0 &amp; \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)
$$\begin{aligned} I_{\beta _{1}}^{C} = {\left\{ \begin{array}{ll} 1 &amp; \text {if}\; \beta _{1}^{C_{\text {well-performing}}} \ge \beta _{1}^{C_{\text {poorly-performing}}} \\ 0 &amp; \text {otherwise} \end{array}\right. } \end{aligned}$$
(14)
$$\begin{aligned} I_{\beta _{1}}^{N} = {\left\{ \begin{array}{ll} 1 &amp; \text {if}\; \beta _{1}^{N_{\text {well-performing}}} \ge \beta _{1}^{N_{\text {poorly-performing}}} \\ 0 &amp; \text {otherwise} \end{array}\right. } \end{aligned}$$
(15)
$$\begin{aligned} I_{\beta _{1}}^{S} = I_{\beta _{1}}^{R} + I_{\beta _{1}}^{C} + I_{\beta _{1}}^{N} \end{aligned}$$
(16)
Table 5 summarizes the result of this analysis. It is evident that in more than 90% of the pairs, the merchant labeled as well-performing demonstrates superior long-term and overall performance across all three factors: revenue, transaction count, and the number of unique customers. These results affirm the robustness and high accuracy of our proposed labeling approach in effectively distinguishing merchants based on their comprehensive performance measured by three distinct objective criteria.
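The indicator sum in Eqs. (13–16) and its tabulation over all pairs can be sketched as follows. The dictionary keys `R`, `C`, and `N` mirror the superscripts in the equations; the slope values and function names are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def slope_indicator_sum(well_slopes, poor_slopes):
    """Eq. (16): number of variables where the well-performing slope >= the poorly-performing one."""
    return sum(
        1 if well_slopes[k] >= poor_slopes[k] else 0
        for k in ("R", "C", "N")
    )


def share_by_sum(slope_pairs):
    """Fraction of merchant pairs obtaining each possible sum (0, 1, 2, 3)."""
    counts = Counter(slope_indicator_sum(w, p) for w, p in slope_pairs)
    return {s: counts.get(s, 0) / len(slope_pairs) for s in range(4)}


# Two made-up pairs: (well-performing slopes, poorly-performing slopes).
slope_pairs = [
    ({"R": 2.0, "C": 0.5, "N": 0.4}, {"R": -1.0, "C": 0.1, "N": 0.3}),
    ({"R": 2.0, "C": 0.5, "N": -0.1}, {"R": -1.0, "C": 0.5, "N": 0.3}),
]
shares = share_by_sum(slope_pairs)
```

A summary such as Table 5 corresponds to the output of `share_by_sum` evaluated over all retained pairs.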
District level analyses
In this section, we conduct three statistical analyses to explore potential relationships between merchants’ performance and their geographical location (i.e., district).
Correlation analysis
First, we calculate the share of each performance group (i.e., well-performing, medium-performing, and poorly-performing) within each district based on the assigned labels. These shares represent the probability of a merchant belonging to each performance class within a particular district. We also compute the relative ratios of performance class pairs for each district. We then conduct a correlation analysis to examine potential associations between these probabilities and relative ratios on the one hand and the population and average household income of the corresponding districts on the other. The results reveal no correlations between the performance class of merchants and the population size or income level of the residents in the districts where they are located (the correlation table is provided in “Supplementary Information”).
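The share computation and the correlation step can be sketched as follows. The text does not state which correlation coefficient is used, so the Pearson coefficient below is an assumption, as are the function names and the toy `(district, label)` records.

```python
from collections import Counter, defaultdict
from math import sqrt

LABELS = ("well", "medium", "poor")


def label_shares(records):
    """records: iterable of (district, label). Returns {district: {label: share}}."""
    counts = defaultdict(Counter)
    for district, label in records:
        counts[district][label] += 1
    return {
        d: {lab: c[lab] / sum(c.values()) for lab in LABELS}
        for d, c in counts.items()
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up records for two hypothetical districts.
records = [
    ("A", "well"), ("A", "poor"), ("A", "well"), ("A", "medium"),
    ("B", "medium"), ("B", "medium"),
]
shares = label_shares(records)
```

In practice, the per-district share of one class (one value per district) would be passed to `pearson` together with the matching district populations or average household incomes.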
Chi-squared test
This test is performed between the district IDs, treated as a categorical variable (identification numbers), and the performance class probabilities converted into categorical variables. This type of test is valid here as the sample size is small (33 districts) and the contingency table test is reasonable. Based on the results (\(\chi ^{2}(2560, \, n=99) \, = \, 2629, \, p = 0.167\)), we observe no dependency between the district IDs and the performance class probability distributions or relative ratios, indicating that, within our dataset and based on the defined labels, a merchant’s performance is independent of the district in which it is situated.
Label distribution inequality analysis
For this analysis, we use the formulation presented in Eq. (17) to measure the inequality [52] in the distributions of performance class labels within and among districts. This equation formally represents the inequality of the label distribution in district i (denoted by \(Inequality_L^i\)) as a function of the shares of the three labels in that district (denoted by \(p_{L_{k}}^{i}\)).
$$\begin{aligned}&Inequality_L^i = \frac{3}{4} \times \sum _{k=1}^{3} \left| p_{L_{k}}^{i} - \frac{1}{3}\right| \\&(Inequality_L^i \in [0, 1]) \end{aligned}$$
(17)
Since by definition there are 3 performance labels (\(k \in \{1, 2, 3\}\)), the classes in a totally equal distribution have the same share of the merchants in a district, that is, one third (i.e., \(\frac{1}{3}\)) for each class, which results in 0 inequality. Moreover, a completely unbalanced distribution occurs when the share of one particular class is equal to 1 and the other two classes have 0 shares. This distribution yields the highest possible inequality value, which is equal to 1. Table 6 provides the basic statistics of the label inequality measures at the district level, while Fig. 3 illustrates the corresponding histogram.
While based on Eq. (17) the inequality value can range from 0 (perfectly equal) to 1 (fully unequal), the histogram and the distribution range presented in Table 6 reveal that, in district-level distributions, the inequality scores tend to concentrate towards lower values, suggesting a trend towards reduced inequality. It is important to note that the maximum inequality index value obtained here (i.e., 0.388) is less than 40% of the maximum possible value. The other statistical measures, such as the median, mean, and interquartile range, further support the observed trend of the distribution favoring low inequality values.
In conclusion, the findings obtained from the three aforementioned analyses validate that there is no discernible dependency or substantial association between the defined performance labels for merchants and the districts in which they are situated.
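As a quick check of Eq. (17), the index can be evaluated at its two extremes and for the overall label counts reported earlier in this section (590, 818, and 569 out of 1,977). The sketch below is illustrative; the function name is an assumption, and applying the index to the overall counts rather than to a single district is done here only as an example.

```python
def label_inequality(shares):
    """Eq. (17): 3/4 * sum_k |p_k - 1/3| for three label shares that sum to 1."""
    if len(shares) != 3 or abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("expected three shares that sum to 1")
    return 0.75 * sum(abs(p - 1 / 3) for p in shares)


# Perfectly balanced labels give the minimum value.
balanced = label_inequality([1 / 3, 1 / 3, 1 / 3])

# All merchants in one class give the maximum value.
concentrated = label_inequality([1.0, 0.0, 0.0])

# Overall label counts reported for the 1,977 studied merchants.
total = 590 + 818 + 569
overall = label_inequality([590 / total, 818 / total, 569 / total])
```

The factor 3/4 normalizes the index: in the fully concentrated case the absolute deviations sum to 2/3 + 1/3 + 1/3 = 4/3, so the product equals 1.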
Customer level analysis
To explore potential dependencies between a merchant’s performance and the income level of its customers, we categorize the mean and median income of all merchants’ customers into quartiles based on their respective distributions. This results in four categories (i.e., quartiles) for merchants based on the mean and median income of their customers. Subsequently, we conduct two chi-squared tests to examine the relationship between a merchant’s performance label and the quartiles of its customers’ mean and median income. The results of these tests indicate no significant dependencies between a merchant’s performance class and the mean (\(\chi ^{2}(6, \, n=1977) \, = \, 5.2951, \, p = 0.506\)) or median (\(\chi ^{2}(6, \, n=1977) \, = \, 2.8613, \, p = 0.826\)) income of its customers. Additional analysis and the corresponding results regarding the customer features and their role in generating the merchant network of study are presented in the “Supplementary Information”.
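The test statistic behind these results is Pearson’s chi-squared statistic for a contingency table. With 3 performance labels and 4 income quartiles, the degrees of freedom are (3 - 1) × (4 - 1) = 6, matching the values reported above. A minimal sketch with made-up counts (not the study’s data); computing the p-value additionally requires the chi-squared survival function, which is omitted here:

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table: list of rows of observed counts; every row and column
    total must be positive so that expected counts are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Made-up 3 x 4 table (labels x income quartiles) with identical
# label proportions in every quartile, i.e. exact independence.
independent_table = [
    [10, 10, 10, 10],
    [20, 20, 20, 20],
    [5, 5, 5, 5],
]
stat, dof = chi_squared(independent_table)

# Made-up 2 x 2 table with a clear association.
stat_dep, dof_dep = chi_squared([[10, 20], [20, 10]])
```

Libraries such as SciPy provide the same computation together with the p-value (e.g., `scipy.stats.chi2_contingency`).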
Based on the findings obtained from both the district-level and customer-level analyses, it can be concluded that the defined labels are independent of and unbiased towards the locations of merchants or the socio-economic status of their customers. This reaffirms the robustness and reliability of the labeling approach proposed and used in the study.
Predictive analysis
For each merchant, we gather 25 features to serve as inputs for the machine learning models. Out of these 25 features, 5 are based on the merchant’s information, 3 are related to revenue, 11 are obtained from customer data, and 6 are extracted from the proposed network structure. The obtained merchant network consists of 2,011 nodes and 217,422 edges as one strongly connected component. “Supplementary Information” provides additional details concerning the network and node features. In addition to the 6 features directly obtained from the network structure, we generate a feature vector with 128 dimensions for each merchant (node) using the node2vec method.
Then we use different combinations of the feature sets as inputs for the four selected machine learning models. Table 7 summarizes the results for the classifiers we include in our analyses. Among those, Random Forest (RF) performs better than the other classifiers in most cases. This is in line with previous findings showing that tree-based models, and in particular RF, perform better in multi-class performance prediction tasks [53,54,55].
The computational results shown in Table 7 indicate that the performance of the network-based feature set is comparable to and almost as good as other feature sets with all classifiers. Moreover, in certain scenarios, the models utilizing feature vector representation (i.e., node2vec) of merchants within the suggested network demonstrate superior performance compared to those that solely use customer-based features. Nevertheless, the node2vec features exhibit no improvement over the network-based features, potentially due to information loss during the embedding process.
It is worth emphasizing that the inclusion of network-based features alongside the conventional feature sets enhances prediction accuracy. This suggests that certain signals, which are not captured by revenue-based and customer-based information, can be captured through features associated with a merchant’s position within the proposed merchant network.
To demonstrate the impact of each feature on prediction accuracy, we perform a feature importance analysis using the mean decrease in accuracy, which measures the reduction in accuracy that occurs when the information carried by a particular feature is removed by randomly permuting its values. Figure 4 displays the feature importance rankings for the RF classifier according to this measure. Financial features such as transaction count, revenue, and the number of unique customers occupy the top ranks, indicating their role in accurate predictions. Among the network-based features, degree centrality, followed by betweenness and closeness centrality scores, causes the greatest deficits in accuracy. Notably, four of the top ten variables by mean decrease in accuracy are network-based features.
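The permutation-based mean decrease in accuracy can be sketched in a model-agnostic way. The sketch below is an illustration under assumptions: it evaluates on a given labeled set rather than on the out-of-bag samples typically used with random forests, and the toy `predict` rule and data stand in for the trained RF classifier and the merchant features.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after randomly permuting each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled copy.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is irrelevant.
X = [[i / 40, (i * 7 % 40) / 40] for i in range(40)]
y = [int(row[0] > 0.5) for row in X]


def predict(row):
    # Stand-in for a trained classifier.
    return int(row[0] > 0.5)


importances = permutation_importance(predict, X, y)
```

Permuting the irrelevant feature leaves accuracy unchanged, whereas permuting the informative feature lowers it, which is the behaviour the ranking in Fig. 4 relies on.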
The results shown in Table 7 and Fig. 4 indicate that, in some prediction settings and in the feature importance ranking, features from different feature sets perform at similar levels. This similarity can result from high correlation between features from different feature sets, which indicates that those features may carry similar signals and information about the merchants, obtained in different ways. For instance, features such as node degree, revenue, and the number of customers, originating from different feature sets, demonstrate significant pairwise correlations and closely align in their importance ranking. The correlation table of the features used in this study is available in “Supplementary Information”.
In addition, we employ two methods to further explore the relationship between the labels and the extracted features. First, we conduct principal component analysis (PCA) [56] on the originally extracted features, as presented in Table 2. Second, we apply t-distributed stochastic neighbor embedding (t-SNE) [57] to the node2vec features. Both methods are used to visualize the placement of merchants in a two-dimensional (2D) space. The resulting 2D plots do not exhibit any discernible distribution or clustering patterns for merchants based on their performance classes. For more comprehensive information, including detailed plots, please refer to the “Supplementary Information”.
Privacy implications
In a highly competitive environment, the proprietary financial information of merchants is particularly sensitive. Consequently, the custodians of such data (e.g., banks) exercise caution in divulging it to external entities. However, financial institutions and investors need to evaluate the stability of merchants and predict their future performance in order to make informed decisions regarding business loans and investments, so access to this confidential information becomes imperative for these stakeholders. In this data-sharing context, safeguarding merchants’ revenue- and customer-related information from unauthorized disclosure is of paramount importance.
Our methodology offers a twofold advantage. Firstly, the network-based features generated through our approach exhibit prediction accuracy on par with conventional feature sets. Secondly, these features are presented in a tabular format, enhancing privacy beyond that offered by merely anonymizing raw financial records. This unique property enables the establishment of a more secure and expeditious data-sharing process with third-party organizations. Such a mechanism holds immense potential for enhancing data privacy while facilitating efficient collaboration and knowledge exchange.
Network-based features primarily reveal the interconnectedness within the merchant network. This makes it difficult to directly infer or estimate customer- and revenue-related information in the absence of raw data or relevant statistical measures (e.g., average spending per customer transaction), which are never released by the financial institution that owns the raw data. Nevertheless, due to the significant correlation between certain financial or customer-related variables (e.g., revenue and the number of customers) and network-based features (e.g., node degree), there is a possibility of inferring sensitive information, such as revenue ranking and comparison.
To address this issue, the network data owner has the option to generate node2vec features locally and share them with third parties for the downstream task, specifically merchant performance classification. This approach is justifiable, as demonstrated by the outcomes presented in Table 7, wherein node2vec features exhibit comparable results to network-based features while ensuring a higher level of privacy.
While there exist adversarial attack methods aimed at reconstructing the graph from the node embeddings of the original graph, it is important to note two significant limitations. Firstly, these methods are unable to fully recover the original graph, thereby compromising the reliability of the reconstructed network in providing accurate insights about merchants. Secondly, data holders can effectively mitigate privacy risks associated with integrating node embeddings for downstream analysis by employing suitable defense mechanisms.
Two widely employed defense mechanisms are perturbing the node2vec matrix and generating embeddings in lower dimensions [58,59]. While these methods effectively reduce the accuracy of inference attacks, they do so at the expense of degrading the accuracy of merchant performance prediction. To tackle this issue, one can employ a tentative defense mechanism that iteratively removes the least significant feature vector (i.e., column) from the node2vec matrix for as long as no substantial change is observed in the classification accuracy. This technique is straightforward and ensures that our proposed approach does not compromise the accuracy of the downstream classification task.
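The iterative column-removal idea can be sketched as a greedy loop. This is one possible reading of the procedure, not a confirmed implementation: the stopping rule (stop once removing the next column would lower accuracy by more than a `tolerance`), the `evaluate` and `rank_columns` callables, and the toy accuracy model are all assumptions introduced for illustration.

```python
def prune_embedding_columns(columns, evaluate, rank_columns, tolerance=0.01):
    """Greedily drop the least significant embedding column.

    columns:       list of column indices of the embedding matrix
    evaluate:      callable(cols) -> classification accuracy using only cols
    rank_columns:  callable(cols) -> cols sorted from least to most significant
    tolerance:     largest accuracy loss (vs. the full matrix) still accepted
    """
    kept = list(columns)
    baseline = evaluate(kept)
    while len(kept) > 1:
        candidate = rank_columns(kept)[0]
        trial = [c for c in kept if c != candidate]
        if baseline - evaluate(trial) > tolerance:
            break  # removing this column would change accuracy substantially
        kept = trial
    return kept


# Toy setup: only columns 0, 1 and 2 carry signal (hypothetical weights).
WEIGHTS = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.01, 4: 0.02, 5: 0.03}


def evaluate(cols):
    # Stand-in for retraining and scoring the downstream classifier.
    return 0.6 + 0.1 * len(set(cols) & {0, 1, 2})


def rank_columns(cols):
    # Stand-in for a significance ranking, e.g. from feature importance.
    return sorted(cols, key=lambda c: WEIGHTS[c])


kept = prune_embedding_columns(list(range(6)), evaluate, rank_columns)
```

In this toy run the three uninformative columns are removed and the loop stops before discarding a column that carries signal, so fewer embedding dimensions are shared while the downstream accuracy is preserved.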