Data Analysis

Abstract

As networks go, the Dark Web is notoriously difficult to traverse. Accessing it requires specialized software or configurations, and hidden services are reachable only with a priori knowledge of their addresses: without an address beforehand, a user cannot visit the associated hidden service. In addition, the Dark Web relies on encrypted peer-to-peer connections, primarily through the Tor protocol, meaning that content is highly anonymized and several layers of abstraction separate users from hidden services.

These factors make the Dark Web host to a myriad of services whose express purpose is hosting illegal content. When the Dark Web comes up, many are quick to point out that it harbors hacking forums, drug marketplaces, hitmen-for-hire, and child pornography. While comprehensive knowledge of the Dark Web is valuable to law enforcement, actually acquiring a complete index of it is difficult.

This analysis will discuss Dargle (the Dark Web Grouping and Listing Engine), our methodology for indexing the Dark Web, and the results of our work.

Methodology

In order to comprehensively catalog hidden services on the Dark Web, we ingested the Common Crawl Corpus (CCC), a monthly snapshot produced by crawling a large swath of the public internet. We systematically parsed the CCC to collect every .onion address present on the Clear Web, gathering over 40,000 unique addresses in total. These were compiled into a single file, with each unique address paired with a count of how many times it appeared on the internet. We likewise compiled a list of the Clear Web pages that referenced .onion addresses, along with the number of .onion addresses each referenced.
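
As a rough illustration of this extraction step, the sketch below scans a gzipped plain-text WET file from Common Crawl for v2 (16-character) and v3 (56-character) .onion addresses and tallies occurrences. The file name and record handling here are simplifications for illustration, not our exact pipeline.

```python
import gzip
import re
from collections import Counter

# Matches 16-character (v2) and 56-character (v3) onion addresses.
ONION_RE = re.compile(r"\b[a-z2-7]{16}(?:[a-z2-7]{40})?\.onion\b")

def count_onions(wet_path: str) -> Counter:
    """Tally every .onion address found in a gzipped Common Crawl WET file."""
    hits = Counter()
    with gzip.open(wet_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            hits.update(ONION_RE.findall(line.lower()))
    return hits

if __name__ == "__main__":
    # Hypothetical local copy of a single WET segment.
    counts = count_onions("CC-MAIN-example.warc.wet.gz")
    for address, n in counts.most_common(10):
        print(f"{n:6d}  {address}")
```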

With this list of addresses in hand, we proceeded to verify the status of each one. We iteratively connected to each address via Python's requests library and collected a status, timestamp, and page title from each of the .onion addresses. We then utilized an object-relational mapper, SQLAlchemy, to store this information in a database.
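
A minimal sketch of one such probe is shown below. It assumes a local Tor client exposing a SOCKS proxy on the default port 9050 and requests installed with SOCKS support (pip install requests[socks]); the regex-based title extraction is a stand-in for whatever parsing our script actually performs.

```python
import re
from datetime import datetime, timezone

import requests

# Route both schemes through a local Tor SOCKS proxy (assumed default port).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def probe(onion: str, timeout: float = 30.0) -> dict:
    """Fetch one hidden service, recording status, timestamp, and title."""
    record = {"address": onion, "timestamp": datetime.now(timezone.utc)}
    try:
        resp = requests.get(f"http://{onion}", proxies=TOR_PROXIES, timeout=timeout)
        match = TITLE_RE.search(resp.text)
        record["status"] = str(resp.status_code)
        record["title"] = match.group(1).strip() if match else None
    except requests.RequestException as exc:
        # Timeouts, SSL errors, and connection failures are logged by type.
        record["status"] = type(exc).__name__
        record["title"] = None
    return record
```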

Once in the database, we displayed our results through a public, outward-facing web app, Dargle. We built three separate tables from this information: one with the .onion addresses, one with the timestamp and status of each request to a .onion, and one with the Clear Web sources for the .onion addresses. We also added search functionality so users can parse through these tables themselves.
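
The schema below is an illustrative sketch of how these three tables might be declared with SQLAlchemy; the actual column names and types in Dargle's database may differ.

```python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Onion(Base):
    """One row per unique .onion address, with its Clear Web hit count."""
    __tablename__ = "onions"
    id = Column(Integer, primary_key=True)
    address = Column(String, unique=True, nullable=False)
    hits = Column(Integer, default=0)

class Status(Base):
    """One row per probe: timestamp and status of a request to an onion."""
    __tablename__ = "statuses"
    id = Column(Integer, primary_key=True)
    onion_id = Column(Integer, ForeignKey("onions.id"), nullable=False)
    timestamp = Column(DateTime, nullable=False)
    status = Column(String, nullable=False)

class Source(Base):
    """One row per Clear Web page that referenced at least one onion."""
    __tablename__ = "sources"
    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True, nullable=False)
    onion_count = Column(Integer, default=0)

engine = create_engine("sqlite:///dargle.db")  # hypothetical backing store
Base.metadata.create_all(engine)
```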

Analysis

From our data, we can examine exactly which services on the Dark Web are most popular on the Clear Web. Taking a look at the top 10 sites, we see the following figure.

Hits per Domain

This figure shows that of the top 10 domains, 4 of them are links to Hydra, a Russian drug marketplace. What is unique about Hydra is its decentralized nature: many different vendors pay rent to a central operator, and each paying vendor gains a license to operate as a Hydra affiliate, with vendors that pay extra becoming "trusted sellers" with extra privileges. So long as vendors maintain their monthly payments and abide by the rules set forth by the central hub, they can continue to operate under the Hydra umbrella.[1]

Facebook's .onion address holds the second place slot. Facebook has maintained this hidden service since 2014 to allow people in countries that block Facebook to use the site. Facebook has reported that its Dark Web service draws over 1 million users monthly.[2]

In the fourth slot is DuckDuckGo's address. DuckDuckGo is a privacy-oriented search engine that allows users to navigate the internet without fear of personal information being leaked.

Next is PROBIV, a Russian information vendor often used by cybercriminals to gain the personal information of their victims.[3]

The sixth slot returned a now-defunct .onion address.

In the seventh slot is a deprecated address for the Pirate Bay's Dark Web service.[4] The Pirate Bay was one of the largest torrent hosts on the internet and continues to hold one of the top slots in this field. It has since switched to another domain at piratebayztemzmv.onion.

Finally, we have TORCH, a Tor search engine, taking up the tenth slot.

We can do the same with the sites that house references to .onion domains. Doing this, we get the following figure.

Hits per Source

The source with the most references by far is a directory of .onion addresses, oniondir.biz. This address accounts for over 14,000 hits, more than double the hits of any single domain ranked 3 through 10.

In second place, we have Pastebin, a popular site that hosts text files of all kinds. Since Pastebin is frequently used by hackers, programmers, and information sharers of all varieties, its place in the number 2 slot is understandable.

In third, we see the domain for CyberBunker, a former hosting service for Dark Web domains. After several raids by German authorities, the service was taken down.[5] The address now redirects to the gab.com account of Sven Olaf Kamphuis, a hacker and former owner of CyberBunker.[6] Gab.com is an alternative to Twitter marketed toward alt-right users.

The fourth slot is held by Micropaste, a service for sharing code, texts, and links, similar to Pastebin.

Next is hosting.danwin. This site is run by Daniel Winzen, who hosts thousands of addresses on the Dark Web. However, in March 2020 his service was attacked and most of the sites he hosted were taken down.[7]

Sec4ever is an Arabic hacking forum and comes in the sixth slot of domains with the most unique .onion addresses within their contents. This comes as some surprise, since Arabic-speaking nations are not typically associated with the forefront of hacking and cybersecurity. Nevertheless, Iran constitutes a major source of cyberattacks, and Arabic-speaking nations make up a notable portion of the world's population.

In the seventh slot is www.escrita.com.br, a Brazilian library/archive site. According to its translated landing page, the site allows writers to submit content for publication, covering a wide variety of material including news, essays, poetry, and reviews.

Next is onionlandsearchengine, a search engine for the Dark Web. Interestingly, despite being a search engine, onionland accounts for fewer than 4,000 unique domains on its site and is dwarfed by the listings of even hosting sites.

Daniel Winzen's domain appears again here, this time as a listing of the onions hosted on his site.

Finally, we see coinexchange.io, which is a domain dedicated to BTC trading. With Bitcoin being the de facto currency of the Dark Web, it makes sense that an exchange would link to a large number of .onion addresses.

Ranking by Hits

We can also plot the ranking trends to see exactly how the hits are distributed across our dataset. Doing this, we get the following figure.

Rankings by Hits
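
For the curious, a sketch of how such a rank-versus-hits plot could be generated with matplotlib is below; it assumes the hit counts have already been loaded into a list, and the log-log axes are our choice here to expose the heavy-tailed shape.

```python
import matplotlib.pyplot as plt

def plot_rank_vs_hits(hit_counts: list[int], title: str) -> None:
    """Plot hits against rank on log-log axes to expose the heavy tail."""
    ranked = sorted(hit_counts, reverse=True)
    plt.figure()
    plt.loglog(range(1, len(ranked) + 1), ranked)
    plt.xlabel("Rank")
    plt.ylabel("Hits")
    plt.title(title)
    plt.show()

# Hypothetical usage with counts pulled from our database:
# plot_rank_vs_hits(onion_hit_counts, "Rankings by Hits (.onion addresses)")
```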

Looking at these diagrams, one thing stands out: the vast majority of hits come from the upper portion of both sources and .onion addresses. In fact, based on some cursory SQL queries, we determined that for our .onion table, the top 10% of addresses accounted for nearly 50% of the hits. Additionally, over 90% of .onion hits were contained within the top 10% of sources.
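
Queries of that sort are easy to reproduce. The sketch below computes the top-decile share with SQLAlchemy, reusing the hypothetical onions table and hits column from the schema sketch above.

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///dargle.db")  # hypothetical backing store

with engine.connect() as conn:
    # Hit counts for every address, largest first.
    counts = conn.execute(
        text("SELECT hits FROM onions ORDER BY hits DESC")
    ).scalars().all()

top_decile = counts[: max(1, len(counts) // 10)]
share = sum(top_decile) / sum(counts)
print(f"Top 10% of addresses account for {share:.1%} of all hits")
```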

There are several implications that one could derive from this:

  1. By monitoring only a small subset of sites on the clear web, one can find a large proportion of the hidden services on the Dark Web.
  2. By monitoring only a small subset of sites on the Dark Web, one can track the majority of the traffic on the Dark Web.
  3. While the numbers do not exactly match the Pareto Principle, a threat analyst can use the following rule of thumb: 20% of the Dark Web is liable to cause 80% of your problems.

Breakdown of .onion Statuses and Responses

We can now observe the timestamps and statuses of the most recent round of requests we performed on our list of .onion addresses. From this set of data, we can see all of the requests grouped by response code, e.g. HTTP 200, 300, 400, and 500.

We grouped these requests into the following status groups (a sketch of this grouping in code follows the list):

  • 200 : The request returned a valid response, and information from the address was likely successfully logged.
  • 300 : The request returned a response indicating the service has moved to another location or redirects elsewhere.
  • 400 : The request returned a client error, e.g. bad request, unauthorized, forbidden, etc.
  • 500 : The request returned a server error, e.g. service unavailable.
  • Connect Timeout : The request failed to receive a response within the allotted time limit.
  • SSLError : The SSL certificate for the service is untrusted, indicating potential for malicious activity.
  • Attribute Error : Likely a failure to parse information from the service in our script.
  • Read Timeout : The connection succeeded, but the server timed out while returning its response.
  • Connection Error : An error indicating a connection failure, possibly because the address is not hosting a webserver on a standard port.
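
As referenced above, a minimal sketch of this grouping logic follows; it assumes probe results shaped like those produced by the probing sketch in the Methodology section, with error names mapping onto requests' own exception classes.

```python
import requests.exceptions as rex

# More specific classes first: ConnectTimeout and SSLError are both
# subclasses of requests' ConnectionError, so order matters here.
ERROR_GROUPS = [
    (rex.ConnectTimeout, "Connect Timeout"),
    (rex.ReadTimeout, "Read Timeout"),
    (rex.SSLError, "SSLError"),
    (rex.ConnectionError, "Connection Error"),
]

def classify(result) -> str:
    """Bucket one probe result (an int status code or a caught exception)
    into the status groups described above."""
    if isinstance(result, AttributeError):
        return "Attribute Error"
    for exc_type, group in ERROR_GROUPS:
        if isinstance(result, exc_type):
            return group
    if isinstance(result, int):
        return str(result // 100 * 100)  # 200, 300, 400, or 500
    return "Other"
```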

Of these, the most interesting status was the Connection Error. On an initial reading, the response is nondescript. However, it can indicate a service being hosted on a nonstandard port, meaning a service may very well exist, hidden from prying eyes.
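
One way to follow up on these addresses, sketched below, is to probe a handful of common alternate ports through the Tor SOCKS proxy using PySocks. The port list and timeout are illustrative guesses, not part of our current pipeline.

```python
import socks  # provided by the PySocks package

# Ports worth probing besides 80/443 (an illustrative guess, not exhaustive).
CANDIDATE_PORTS = [8080, 8000, 5222, 6667, 22]

def open_ports(onion: str, ports=CANDIDATE_PORTS, timeout: float = 30.0) -> list[int]:
    """Return the candidate ports that accept a TCP connection via Tor."""
    reachable = []
    for port in ports:
        sock = socks.socksocket()
        # rdns=True makes the Tor proxy resolve the .onion name itself.
        sock.set_proxy(socks.SOCKS5, "127.0.0.1", 9050, rdns=True)
        sock.settimeout(timeout)
        try:
            sock.connect((onion, port))
            reachable.append(port)
        except (socks.ProxyError, OSError):
            pass  # closed, filtered, or unreachable through Tor
        finally:
            sock.close()
    return reachable
```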

Graphing our results, we get

Response Counts

In the first chart, we included the Connect Timeout errors. As can be seen in this chart, timeouts accounted for the majority of responses for our complete list of .onion addresses, with over 34,000 addresses returning this response. Looking forward, we will likely rerun these addresses through our script with an increased timeout limit, which we anticipate will improve the completeness of our database.
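
In requests, that change is a one-line tweak. The snippet below assumes the probe() sketch from the Methodology section; the 120-second limit is a guess to be tuned empirically.

```python
# Placeholder standing in for addresses that previously timed out.
timed_out = ["exampleonionaddr.onion"]

# requests applies a float timeout to both the connect and read phases, so a
# larger value gives slow Tor circuits more time to establish.
retried = [probe(onion, timeout=120.0) for onion in timed_out]
```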

Controlling for Connect Timeout responses, we see the second chart. The next most common response was a valid HTTP 200. Over 3,000 addresses could be connected to, just under 10% of the addresses we gathered. Though that seems like a low number, it is still a powerful insight into the Dark Web, albeit an improvable one.

We also see just over 1,700 Connection Errors. While some of these may be legitimate errors, it is likely that a sizeable chunk of these addresses are running services on nonstandard ports. In the future, we hope to investigate these particular statuses further in order to determine exactly what these services are.

Formatting this information into a more human-readable graphic, we get the following.

Response Count Pie Chart

Out of all the statuses aside from 200 OK and Connect Timeout, we see that Connection Errors account for three quarters of the remaining statuses.

Another 16% of the remainder, the Read Timeouts and Attribute Errors, can likely return a valid response after we tweak our codebase to account for these errors. We hope to apply these improvements in the near future to enhance our database.

References

[1] https://www.vice.com/en_us/article/g5x3zj/hydra-russia-drug-cartel-dark-web

[2] https://www.businessinsider.com/facebook-tor-connect-1-million-2016-4

[3] https://www.digitalshadows.com/blog-and-research/probiv-the-missing-pieces-to-a-cybercriminals-puzzle/

[4] https://torrentfreak.com/the-pirate-bay-moves-to-a-brand-new-onion-domain-191206/

[5] https://krebsonsecurity.com/tag/cyberbunker/

[6] https://gab.com/cb3rob

[7] https://www.zdnet.com/article/dark-web-hosting-provider-hacked-again-7600-sites-down/