Resources for the "Phishing Kits Source Code Similarity Distribution: A Case Study" paper


Introduction

This website presents additional material for our paper "Phishing Kits Source Code Similarity Distribution: A Case Study".

Paper abstract

Attackers (“phishers”) typically deploy source code on some host website to impersonate a brand or, more generally, to create a situation in which a user is expected to provide personal information of interest to phishers (e.g. credentials, credit card numbers, etc.). Phishing kits are ready-to-deploy sets of files that can simply be copied onto a web server and used almost as they are. In this paper, we consider the static similarity analysis of the source code of 20 871 phishing kits totalling over 182 million lines of PHP, JavaScript and HTML code, which have been collected during phishing attacks and recovered by forensics teams. Reported experimental results show that as much as 90% of the analyzed kits share 90% or more of their source code with at least one other kit. Differences are small, less than about 1000 programming words (identifiers, constants, strings and so on) in 40% of cases. A plausible lineage of phishing kits is presented by connecting together kits with the highest similarity. Obtained results show a very different reconstructed lineage for phishing kits when compared to a publicly available application such as WordPress. The observed kit similarity distribution is consistent with the assumed hypothesis that kit propagation is often based on identical or near-identical copies with low-cost changes. The proposed approach may help classify new incoming phishing kits as “near-copies” of, or “intellectual leaps” from, known and already encountered kits. This could facilitate the identification and classification of new kits as derived from older known kits.

WordPress validation

As stated in the paper, 235 versions of the open-source software WordPress have been used to evaluate our multi-language approach. Results can be found in the material linked below. Notably, all x.y.0 releases are correctly ordered (version 2.4.0 does not exist). Since we use multidimensional scaling (MDS) for the visualisation, Euclidean distances between points in the figure are approximately proportional to the actual distances between versions. As a result, versions are not equally spread in space in the visualisation: patch releases are clustered next to the most similar minor version. Comparing our predicted lineage, which links the most similar versions to one another, with the ground truth, 73% of the edges are correct. Because patch versions are very similar to each other, they can easily be permuted, which leads to inaccurate edges even though minor versions are well ordered.
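The edge comparison described above can be illustrated with a small sketch. It is not the code used in the paper: the token counts and version names are made up, the lineage is assumed here to be a minimum spanning tree over L1 distances between token-count vectors, and the score is assumed to be the fraction of ground-truth edges that are recovered. All function names are hypothetical.

```python
from collections import Counter


def l1_distance(a: Counter, b: Counter) -> int:
    """L1 (Manhattan) distance between two token-count vectors."""
    return sum(abs(a[t] - b[t]) for t in a.keys() | b.keys())


def distance_matrix(profiles):
    """All pairwise L1 distances, keyed by (name, name)."""
    return {(u, v): l1_distance(profiles[u], profiles[v])
            for u in profiles for v in profiles}


def lineage_edges(names, dist):
    """Assumed lineage: minimum spanning tree built with Prim's algorithm."""
    names = list(names)
    in_tree = {names[0]}
    edges = set()
    while len(in_tree) < len(names):
        # Cheapest edge leaving the current tree; ties broken by name.
        u, v = min(((u, v) for u in in_tree for v in names if v not in in_tree),
                   key=lambda e: (dist[e], e))
        edges.add(frozenset((u, v)))
        in_tree.add(v)
    return edges


def edge_accuracy(predicted, truth):
    """Fraction of ground-truth edges present in the predicted lineage."""
    return len(predicted & truth) / len(truth)


# Toy token counts (made up); "1.1.1" is a patch release of "1.1".
profiles = {
    "1.0": Counter({"echo": 3, "if": 2}),
    "1.1": Counter({"echo": 5, "if": 4, "for": 2}),
    "1.1.1": Counter({"echo": 5, "if": 4, "for": 3}),
    "1.2": Counter({"echo": 8, "if": 6, "for": 6}),
}
# Toy ground truth (assumed): 1.2 is derived from 1.1, not from the patch.
truth = {frozenset(e) for e in [("1.0", "1.1"), ("1.1", "1.1.1"), ("1.1", "1.2")]}

dist = distance_matrix(profiles)
predicted = lineage_edges(profiles, dist)
accuracy = edge_accuracy(predicted, truth)
print(f"{accuracy:.2f}")  # prints 0.67
```

In this toy case the minor versions are placed in the right order, but 1.2 is attached to the patch release 1.1.1 rather than to 1.1, so one of the three edges is counted as incorrect. This is the kind of permutation among near-identical patch versions mentioned above.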

Links to the WordPress material:
Visualisation of WordPress multi-language lineage
WordPress indexes
WordPress multi-language distance (L1) matrix
WordPress multi-language lineage edges
Actual WordPress evolution

Kit single-language analysis

Results of the single-language analysis of phishing kits that are not included in the paper are available below:
Results of the JavaScript single-language analysis
Results of the HTML single-language analysis

Kit multi-language analysis

More details on the multi-language analysis approach, as well as other examples of focused lineage, are available here:
Details on the multi-language approach
Example of focused lineage (1)
Example of focused lineage (2)
Example of focused lineage (3)