Detect phishing with web scraping

Unfortunately, phishing is profitable, difficult to detect, and relatively easy to carry out. With digital transformations accelerating around the world, phishing will experience continued explosive growth.

According to Phishlabs, the number of phishing attempts in the first quarter of 2021 increased by almost 50%. There is also no reason to believe that it will stop climbing.

This means higher levels of digital damage and risk. To counter such an increase, new phishing detection approaches must be tested or improved upon. One way to improve on existing approaches is to use web scraping.

Identity impersonation

Phishers would have a hard time fully replicating the original website. Placing every URL identically, replicating all the images, faking domain age, and so on would require more effort than most attackers are willing to put in.

Also, a perfect spoof would likely have a lower success rate, since the target could wander off the page by clicking an unrelated URL. Finally, as with any other scam, you don't need to fool everyone, so a perfect replica would be wasted effort in most cases.

However, phishers are not stupid, or at least the successful ones aren't. They do their best to create a credible replica with as little effort as possible. A quick copy may not fool the tech-savvy, but then even a perfect replica may not fool the distrustful. In short, phishing is about being "good enough."

Therefore, due to the nature of the business, there are always one or two obvious holes to discover. Two good ways to start are to look for similarities between frequently impersonated websites (e.g., fintech, SaaS) and suspected phishing websites, or to collect known attack patterns and work from there.

Unfortunately, with the volume of phishing websites popping up daily, aimed mostly at less tech-savvy people, solving the problem by hand is not as easy as it seems. As is often the case, the answer is automation.

Searching for phishing

Many detection methods have been developed over the years. A 2018 ScienceDirect review article lists URL-based detection, layout recognition, and content-based detection. The first often lags behind phishers, since databases update more slowly than new websites appear. Layout recognition relies on human heuristics and is therefore more prone to failure. Content-based detection is computationally heavy.

We'll pay a little more attention to layout recognition and content-based detection, as these are complicated processes that benefit greatly from web scraping. Back in 2007, a group of researchers created a framework for detecting phishing websites called CANTINA. It was a content-based approach that checked signals such as TF-IDF scores, domain age, suspicious URLs, punctuation abuse, and more. However, the study was published at a time when the possibilities for automation were limited.

Web scraping can significantly improve that framework. Rather than manually hunting for outliers, automated applications can crawl websites and download the relevant content from them. Important details such as those described above can then be extracted from the content, analyzed, and evaluated.
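As a rough illustration, a scraper-driven version of those checks might fetch a page and pull out a handful of the signals CANTINA relied on. The sketch below is a minimal example, assuming the requests, beautifulsoup4, and python-whois packages; the exact feature set and names are illustrative rather than a reference implementation.

```python
# Minimal sketch: fetch a suspect page and extract a few CANTINA-style signals.
# Assumes requests, beautifulsoup4 and python-whois are installed; the feature
# choices below are illustrative only.
import re
from datetime import datetime, timezone
from urllib.parse import urlparse

import requests
import whois  # python-whois; WHOIS data can be missing or inconsistent per TLD
from bs4 import BeautifulSoup


def extract_features(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    parsed = urlparse(url)

    # Domain age: a very young domain is a common, though not conclusive, red flag.
    created = whois.whois(parsed.hostname).creation_date
    if isinstance(created, list):  # python-whois sometimes returns a list of dates
        created = created[0]
    age_days = (
        (datetime.now(timezone.utc) - created.replace(tzinfo=timezone.utc)).days
        if created
        else None
    )

    return {
        "text": soup.get_text(separator=" ", strip=True),  # plain text for TF-IDF
        "title": soup.title.string if soup.title else "",
        "num_links": len(soup.find_all("a")),
        "domain_age_days": age_days,
        # Punctuation abuse in the hostname, e.g. "secure-login.bank.example.xyz"
        "url_hyphens_and_dots": len(re.findall(r"[-.]", parsed.netloc)),
    }
```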

Build a network

CANTINA, as developed by the researchers, had a drawback: it was only used to test a hypothesis. For that purpose, a database of legitimate and phishing websites was compiled, so the status of every site was known in advance.

Such methods are suitable for testing a hypothesis. They are not so good in practice when we don't know the state of the websites in advance. Practical applications of projects similar to CANTINA would require a significant amount of manual effort. At some point, these applications would no longer be considered "practical".

In theory, however, content-based recognition appears to be a strong contender. Phishing websites must reproduce the content almost identically to the original. Any inconsistency, such as a misplaced image, a misspelling, or a missing piece of text, can raise suspicion. Phishing pages can never stray too far from the original, which means that metrics like TF-IDF must, by necessity, be similar.

The downside of content-based recognition has been the time-consuming and expensive manual work it involves. Web scraping, however, automates most of that effort. In other words, it lets us apply existing detection methods on a much larger scale.

First, instead of manually collecting URLs or pulling them from an existing database, scraping can quickly build a database of our own. Candidate URLs can be collected from any content that hyperlinks to suspected phishing websites in any way.
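For instance, a crawler could harvest candidate URLs from pages that are known to link out to suspected phishing sites, such as abuse reports or spam traps. A minimal sketch, assuming requests and beautifulsoup4; the seed page is a placeholder for whatever sources a project actually uses.

```python
# Minimal sketch: collect candidate URLs from seed pages that link to
# suspected phishing sites. The seed list below is a hypothetical placeholder.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED_PAGES = ["https://example.com/reported-links"]  # hypothetical source


def harvest_candidate_urls(seed_pages=SEED_PAGES) -> set:
    candidates = set()
    for page in seed_pages:
        soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            url = urljoin(page, anchor["href"])
            if urlparse(url).scheme in ("http", "https"):
                candidates.add(url)
    return candidates
```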

Second, a scraper can crawl a collection of URLs faster than any human. Manual review has its advantages, such as seeing the structure and content of a website as rendered, rather than as raw HTML code.

Visual representations, however, are of little use when we rely on mathematical detection methods such as link depth and TF-IDF. They can even be a distraction, drawing our attention away from important details through heuristic bias.

Parsing itself also becomes a detection signal. Parsers frequently fail when a website's design or layout changes. If running the same parsing process against a suspect site throws unusual errors compared with the legitimate websites, those errors can serve as an indication of a phishing attempt.

Ultimately, web scraping doesn't produce entirely new methods, at least as far as I know, but it does enable older ones. It offers a way to scale methods that might otherwise be too expensive to implement.

Cast a net

With a proper web scraping infrastructure, millions of websites can be accessed daily. As the scraper collects the source HTML, all the textual content ends up stored wherever we want it. After a little parsing, the plain-text content can be used to calculate TF-IDF. A project would probably start by collecting the important metrics from popular phishing targets and move on to detection from there.
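As a sketch of that comparison, scikit-learn's TfidfVectorizer can turn the scraped plain text of a legitimate page and a suspect page into vectors and measure how closely they match. The function name and input texts below are placeholders.

```python
# Minimal sketch: compare scraped page text with TF-IDF and cosine similarity.
# Assumes scikit-learn is installed; legit_text and suspect_text would be the
# plain-text content extracted by the scraper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def content_similarity(legit_text: str, suspect_text: str) -> float:
    vectors = TfidfVectorizer(stop_words="english").fit_transform(
        [legit_text, suspect_text]
    )
    # A similarity close to 1.0 means the suspect page mirrors the original's
    # wording almost exactly -- typical of a cloned login page.
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])
```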

In addition, there is a lot of interesting information that we can extract from the source. All internal links can be visited and stored in an index to create a representation of the overall link depth.

It is possible to detect phishing attempts by building a website tree through crawling and indexing. Most phishing websites will be shallow, for the reasons outlined above, while the well-established companies they imitate will have considerable link depth. That shallowness alone can be an indicator of a phishing attempt.
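A breadth-first crawl over internal links is enough to estimate that depth. The sketch below assumes requests and beautifulsoup4, and the max_pages limit is an arbitrary safety cap.

```python
# Minimal sketch: breadth-first crawl of internal links to estimate how deep a
# site's link tree goes. Phishing clones tend to bottom out after a page or two.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def max_link_depth(start_url: str, max_pages: int = 200) -> int:
    root = urlparse(start_url).netloc
    seen, deepest = {start_url}, 0
    queue = deque([(start_url, 0)])

    while queue and len(seen) <= max_pages:
        url, depth = queue.popleft()
        deepest = max(deepest, depth)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue  # unreachable pages simply don't add depth
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            # Follow only internal links so the tree stays within one site.
            if urlparse(link).netloc == root and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return deepest
```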

The collected data can then be used to compare TF-IDF, keywords, link depth, domain age, and so on against the metrics of legitimate websites. A mismatch would be grounds for suspicion.

There is one caveat that has to be decided on the fly: how large a difference warrants investigation? A line has to be drawn in the sand somewhere and, at least initially, it is going to be fairly arbitrary.
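To make that concrete, a first-pass rule might look like the toy check below. The similarity, depth, and domain-age thresholds are exactly the kind of arbitrary lines described above and would need tuning against real data.

```python
# A deliberately arbitrary first-pass rule: flag pages whose text closely
# mirrors a known brand while the site itself looks young or shallow.
# All three thresholds are placeholders to be tuned against real data.
def looks_suspicious(similarity: float, link_depth: int, domain_age_days) -> bool:
    too_similar = similarity > 0.8
    too_shallow = link_depth <= 2
    too_young = domain_age_days is not None and domain_age_days < 30
    return too_similar and (too_shallow or too_young)
```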

There is also an important consideration around IP addresses and locations. Some content on a phishing website may only be visible to IP addresses from a specific geographic location (or hidden from a specific one). Under normal circumstances these restrictions are difficult to get around, but proxies offer an easy solution.

Since a proxy always has a location and IP address associated with it, a large enough pool will provide global coverage. Whenever a geo-block is encountered, a simple proxy change is enough to remove the hurdle.
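With the requests library, for example, routing each attempt through a different proxy takes only a few lines. The proxy addresses below are placeholders for whatever pool a project actually uses.

```python
# Minimal sketch: rotate through a proxy pool to get around geo-blocks.
# The proxy URLs are placeholders for a real pool with global coverage.
import itertools

import requests

PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy-us.example.com:8080",
    "http://user:pass@proxy-de.example.com:8080",
    "http://user:pass@proxy-jp.example.com:8080",
])


def fetch_with_rotation(url: str, attempts: int = 3):
    for _ in range(attempts):
        proxy = next(PROXY_POOL)
        try:
            response = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if response.ok:
                return response.text
        except requests.RequestException:
            continue  # blocked or unreachable -- try the next location
    return None
```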

Finally, web scraping, by its nature, lets you gather a large amount of data on a specific topic. Most of it is unstructured, which is usually fixed by parsing, and unlabeled, which is usually fixed by humans. Structured, labeled data provides an excellent foundation for machine learning models.

Stop phishing

Building an automated phishing detector through web scraping produces a lot of data to evaluate. Once evaluated, the data would generally lose its value. However, as with recycling, this information can be reused with some adjustments.

Machine learning models have the disadvantage of requiring large amounts of data before they start making predictions of acceptable quality. However, if phishing detection algorithms were to start using web scraping, that volume of data would accumulate naturally. Of course, labeling might still be needed, which would take considerable manual effort.

Regardless, the information would already be structured in a way that produces acceptable results. Although machine learning models are often called black boxes, they are not entirely opaque: we can predict that data structured and labeled in a certain way will produce certain results.
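As an illustration of how that scraped, labeled data could feed a model, the sketch below trains a simple classifier on features like the ones discussed earlier. It assumes scikit-learn; the feature columns, the toy data, and the choice of a random forest are illustrative, not the method of any particular study.

```python
# Minimal sketch: train a classifier on scraped, labeled pages. Each row holds
# features discussed above; labels mark known phishing (1) vs. legitimate (0).
from sklearn.ensemble import RandomForestClassifier

# Hypothetical rows: [tfidf_similarity, link_depth, domain_age_days, url_punctuation]
X = [
    [0.95, 1, 12, 7],     # shallow, young, near-identical text -> phishing
    [0.97, 2, 5, 9],
    [0.30, 40, 4100, 2],  # deep, old, distinct content -> legitimate
    [0.25, 35, 2900, 1],
    # ...in practice, thousands of scraped and labeled rows
]
y = [1, 1, 0, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[0.92, 1, 20, 8]]))  # -> [1], i.e. flagged as phishing
```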

For clarity, machine learning models could be compared to the application of mathematics in physics. Some mathematical models fit natural phenomena such as gravity exceptionally well: gravitational attraction can be calculated by multiplying the gravitational constant by the masses of the two objects and dividing the result by the square of the distance between them. Yet knowing how to plug in the required data gives us no real idea of what gravity itself is.
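In symbols, that is Newton's law of universal gravitation:

$$F = \frac{G \, m_1 m_2}{r^2}$$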

Machine learning models are much the same. Data structured in a certain way produces the expected results, even if it is not clear how the models arrive at their predictions. As long as every stage behaves as expected, the "black box" character does not, outside of marginal cases, greatly harm the results.

Additionally, machine learning models appear to be among the most effective methods for detecting phishing. According to research published on SpringerLink, some automated crawlers with ML implementations can achieve 99% accuracy.

The future of web scraping

Web scraping seems to be the perfect complement to current anti-phishing solutions. After all, most of cybersecurity comes down to gathering enough data to make the right protection decisions, and phishing is no different, at least through the lens of cybersecurity.

There seems to be a holy trinity in cybersecurity just waiting to be harnessed to its full potential: analytics, web scraping, and machine learning. There have been some attempts to combine two of the three. However, I have yet to see all three exploited to their full potential.