August 7, 2022
Data mining is a complex topic that is often difficult to convey in layman’s terms. This in-depth research paper mines a data set about banknotes to help determine whether forgeries are present. The final result is that the model built on the data set accurately identifies forgeries based on features extracted from the images.
This project uses the banknotes.csv file to build an appropriate model to predict forgery successfully. The dataset was uploaded into SAS for mining and analysis. Summary statistics and a graphical summary of histograms are provided, along with a discussion of key aspects of that data. The Winsorization approach, commonly referred to as percentile capping, is used for anomaly detection, and associations are identified using correlation analysis. The hierarchical clustering technique is used in this project to analyze the variables, and the results are explained. A classification model to predict forgeries and reasoning for its use are provided. Finally, the model (isolation forest) is evaluated, and impressions on model fit are reviewed.
Dataset Upload to SAS
Summary Statistics and Histograms of Banknotes Dataset
The summary statistics and histograms of the banknotes dataset are shown below.
Summary Statistics of Continuous Variables Separated by Class
Summary Statistics of Continuous Variables not Separated by Class
Histogram With Classes in V1 = Variance
Histogram Without Classes in V1
Histogram With Classes in V2 = Skewness
Histogram Without Classes in V2
Histogram With Classes in V3 = Kurtosis
Histogram Without Classes in V3
Histogram With Classes in V4 = Entropy
Histogram Without Classes in V4
Key Aspects of Statistical Data
There are two classes identified in the dataset: 1 means the observation is genuine, and 2 means it is forged. Basic statistics are computed for these two classes to look more closely at legitimate and counterfeit items. The data shows there are 762 genuine entries and 610 forged entries. The Wavelet Transform tool extracts features from the images, which are represented as numeric values in four variables:
- V1 = variance of the Wavelet Transformed image, which measures the spread of the values in the dataset (Sturdivant et al., 2016).
- V2 = skewness of the Wavelet Transformed image, which measures the asymmetry of the distribution of the image data features.
- V3 = kurtosis of the Wavelet Transformed image, which describes the heaviness of the tails and the sharpness of the peak of a frequency distribution (Sturdivant et al., 2016).
- V4 = entropy of the image, which measures the degree of randomness in the image features (Thum, 1984).
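To make the four measures concrete, the sketch below computes variance, skewness, kurtosis, and Shannon entropy for a short list of values. It is only an illustration under stated assumptions: the function names `describe` and `shannon_entropy` and the sample values are hypothetical, population (not sample) moments are used, and the features in banknotes.csv were already extracted from the images, so this is not necessarily how they were produced.

```python
import math


def describe(values):
    """Return mean, population variance, skewness, and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,       # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2 - 3.0,   # excess kurtosis; 0 for a normal curve
    }


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical intensity values, not taken from banknotes.csv.
pixels = [2, 4, 4, 4, 5, 5, 7, 9]
stats = describe(pixels)
entropy = shannon_entropy(pixels)
print(stats, entropy)
```

For this sample, the positive skewness reflects the longer right tail created by the values 7 and 9.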
Summary statistics and histograms show left skewness in genuine entries (class = 1) for variance, skewness, and entropy, while kurtosis is right-skewed and appears to be multimodal. Forgeries (class = 2) show right skewness in variance and kurtosis and left skewness in entropy. This suggests higher variance, skewness, and entropy in the images. When classes are not separated, variance is multimodal but roughly normally distributed, skewness is left-skewed and multimodal, kurtosis is right-skewed and bimodal, and entropy is left-skewed. Items with larger entropy and kurtosis values tend to be class = 2 (forgery), whereas items with higher variance and skewness values tend to be class = 1 (genuine).
In this test, the Winsorization approach, commonly referred to as percentile capping, was applied. This approach flags values below the 1st percentile and values above the 99th percentile. Rather than deleting these outliers, it caps them: values below the 1st percentile are replaced with the value at the 1st percentile, and values above the 99th percentile are replaced with the value at the 99th percentile (Kumar, 2020). The results displayed below reveal outliers in the dataset used for this analysis.
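The capping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS code used in the project: the helper names `percentile` and `winsorize` and the sample values are hypothetical, and the linear-interpolation percentile used here is an assumption that may differ slightly from the percentile definition SAS applies.

```python
def percentile(sorted_values, p):
    """Percentile (0 <= p <= 100) of a pre-sorted list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Cap values outside the given percentiles and report which were flagged."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    capped = [min(max(v, low), high) for v in values]
    flagged = [v for v in values if v < low or v > high]
    return capped, flagged


# Hypothetical sample: the integers 1..100 plus one extreme value.
sample = list(range(1, 101)) + [500]
capped, flagged = winsorize(sample)
print(flagged)
print(min(capped), max(capped))
```

The flagged list holds the values that fall outside the 1st and 99th percentiles, and the capped list keeps every observation while pulling those extremes back to the percentile boundaries.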
Anomaly Testing with PROC Univariate Code
Anomaly Detection V1 Results with Extreme Observations
Anomaly Detection V2 Results with Extreme Observations
Anomaly Detection V3 Results with Extreme Observations
Anomaly Detection V4 Results with Extreme Observations
In this project, the analyst used the correlation coefficient, denoted r, to determine the association between the variables. This coefficient, commonly referred to as Pearson’s r, measures the strength of the linear relationship between two variables (Sturdivant et al., 2016) and ranges from +1 to –1. A correlation between two variables is either positive or negative: it is positive when one variable rises as the other rises, and negative when one decreases while the other increases (Hayes, 2021). A correlation of zero means that there is no linear relationship. Correlation analysis in this instance was carried out to check for multicollinearity, which occurs when independent variables are strongly correlated with one another (Sturdivant et al., 2016). There is no dependent variable in cluster analysis, so all variables are treated as independent variables. As a result, variables that are strongly correlated with one another effectively receive a higher weight in clustering.
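As a small worked example of the coefficient described above, the following sketch computes Pearson’s r from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample lists are hypothetical and are not drawn from banknotes.csv.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values chosen to show the three cases in the text.
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # falls as x rises
r_zero = pearson_r(x, [2, 1, 0, 1, 2])        # symmetric, no linear trend
print(r_positive, r_negative, r_zero)
```

The last case is a reminder that r = 0 rules out only a linear relationship: the second list still depends on x, just not along a straight line.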
The findings show that V1 and V2 variables are negatively correlated with the Class variable, whereas V3 is positively correlated with the class variable. V1 and V3 positively correlate with V4, whereas V2 negatively correlates with V4. The results are depicted below.
SAS Code for Correlation Analysis Between Variables
Correlation Analysis Results
Clustering Technique for Variable Analysis
In unlabeled data sets, clustering is a common technique for detecting outliers. After the data are grouped, the analysis can focus on the clusters with the fewest records and determine how many fraud occurrences those clusters contain. A hierarchical clustering approach was applied, as shown below. Results from clustering with five and ten clusters were analyzed and evaluated first, and twenty clusters were then used as the final input to the method. It was determined that around 1,000 anomalous observations were needed to compare the clustering results with the findings of the isolation forest model.
A total of 149 observations were identified as fraudulent among the 1,000 observations in the most anomalous clusters. This is much better than the default isolation forest results but only 6% better than the best isolation forest results for the data set. The anomaly score is one area where the isolation forest outperforms standard clustering because it provides more clarity: the isolation forest ranks every observation in the complete data set, whereas the clustering approach relies on more subjective judgment to rank observations within the various clusters.
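The idea of treating the smallest clusters as candidate anomalies can be sketched as follows. This is an illustration only, not the SAS procedure used in the project: the function name `agglomerative`, the choice of average linkage with Euclidean distance, and the sample points are all assumptions.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points: two tight groups and one isolated observation.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
clusters = agglomerative(points, n_clusters=3)
smallest = min(clusters, key=len)  # the cluster with the fewest records
print(clusters, smallest)
```

Here the isolated point ends up alone in the smallest cluster, which is the kind of low-count group the analysis inspects for fraud.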
SAS Code of Clustering Technique for Variable Analysis
Sample of Clustering Results
An important task in data analysis is to look for outliers or anomalies in the dataset, that is, objects that do not conform to the typical patterns of behavior in the data (Tan et al., 2018). The method used in this paper is the isolation forest, which, according to Gillespie (2019), is a common approach to detecting anomalies. At each split in an isolation tree, one input variable is randomly selected. If the variable is numeric, it is split at a random value between the lowest and highest values of the observations in that node. If the variable is nominal, each level of the variable is randomly allocated to a branch. Because anomalous observations are easier to separate from the rest, they are more likely to have a shorter path from the root node to a leaf node than non-anomalous observations. The following formula is used to calculate the anomaly score of observation x, S(x):
$$S(x) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
Here h(x) is the path length of observation x, E(h(x)) is its average over the trees in the forest, and n is the number of observations used to build each tree. H(i) is the harmonic number, estimated by ln(i) + 0.5772156649 (Euler’s constant), and c(n) is the average of h(x) given n, which normalizes h(x).
In all cases, the anomaly score ranges from 0 to 1, with values closer to 1 indicating a greater likelihood that the observation is an abnormality.
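The scoring procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the SAS implementation: all function names and the sample data are hypothetical, every tree is built on the full data set rather than a subsample, only numeric variables are handled, and c(n) is taken as 2H(n-1) - 2(n-1)/n, with H(i) estimated by ln(i) + 0.5772156649.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler's constant, as quoted in the text


def harmonic(i):
    """H(i), estimated as ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Average path length for n observations; used to normalize h(x)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # common convention for exactly two observations
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Grow one isolation tree using a random variable and a random split value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:  # simplification: stop when the chosen variable is constant
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(x, node, depth=0):
    """h(x): edges from root to leaf, plus c(size) for leaves left unsplit."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(rows, n_trees=100, seed=0):
    """Return S(x) = 2 ** (-E(h(x)) / c(n)) for every row."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    scores = []
    for x in rows:
        mean_path = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_path / c(n)))
    return scores


# Hypothetical data: a 7 x 7 grid of similar points plus one far-away point.
inliers = [(0.1 * (i % 7), 0.1 * (i // 7)) for i in range(49)]
data = inliers + [(10.0, 10.0)]
scores = anomaly_scores(data)
outlier_score = scores[-1]
print(outlier_score, max(scores[:-1]))
```

Because the far-away point is usually separated by the very first split, its average path is short and its score sits closer to 1 than the score of any grid point.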
Isolation forests are built from decision trees, which are a helpful tool for predicting forgery. A decision tree classification model uses a series of questions about the attributes of an observation or instance to determine its class, and these questions are organized into a hierarchical structure (Tan et al., 2018). The advancement of digitization with scan and print techniques has led to serious counterfeiting issues, making it challenging to identify forgery with the naked eye (Upadhyaya et al., 2018). Therefore, a decision tree model can examine the banknote dataset using the images’ variance, skewness, kurtosis, and entropy, making it a good model for this task.
Summary statistics and histograms help with visualizing the data in the banknotes dataset. After reviewing the documentation and analyzing these summaries, I determined that an isolation forest model was the best fit for predicting forgery by asking specific questions about the banknote images’ variance, skewness, kurtosis, and entropy. Anomaly detection, clustering, and correlation analysis helped detect outliers, group similar items, and identify correlations between variables. Overall, the model fit the data well, and I did not see any reason or way to improve it, as it successfully identified forged items.
Gillespie, R. (2019). Detecting Fraud and Other Anomalies Using Isolation Forests. In Proceedings of SAS Global Forum 2019 (pp. 1-6).
Hayes, A. (2021, December 4). Positive correlation. Investopedia. https://www.investopedia.com/terms/p/positive-correlation
Kumar, P. (2020, February 4). How to handle outliers. LinkedIn Pulse. https://www.linkedin.com/pulse/how-handle-outliers-piyush-kumar/
Sturdivant, R., Pardoe, I., Berrier, I., & Watts, K. (2016). Statistics for Data Analytics. zyBook [online].
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2018). Introduction to data mining (2nd ed.). Pearson.
Thum, C. (1984). Measurement of entropy of an image with application to image focusing. Optica Acta: International Journal of Optics, 31(2), 203-211. DOI: 10.1080/713821475
Upadhyaya, A., Shokeen, V., & Srivastava, G. (2018). Decision tree model for classification of fake and genuine banknotes using SPSS. World Review of Entrepreneurship, Management and Sustainable Development, 14(6), 683-693. DOI: 10.1504/WREMSD.2018.097696