August 2021 Survey Report

August Survey:

Our August survey represents the first official release of our crawling research. These reports will serve as a way for us not only to discuss the results of our crawls but also the technical process of building this technology. Working on this project has been an incredible adventure so far, and we look forward to you taking this journey with us!

Survey Results:

High Level Survey Reports:

At a high level, our first HTTP crawl has yielded a great deal of interesting statistics about the adoption rates of many technologies across the internet. It is important to note that only the HTTP crawler is currently deployed in production, so the secondary crawl results below are limited to HTTP; ports without a production crawler are marked Pending Dev.

Port    Ping Responses    Crawl Responses
21      13016656          Pending Dev
80      60773485          19387484
443     53606644          2083897
7878    3385803           27962
8008    9079938           81633
8080    12263264          2291785
8888    5602088           1154780
9200    3868575           Pending Dev
9981    4798149           Pending Dev
11001   4908024           Pending Dev
47808   3204075           Pending Dev
50001   8383239           Pending Dev

Server Type Adoption Rates

Adoption rates always offer an interesting look at the state of the internet. Technologies are rapidly replaced and made obsolete, and the landscape is massively dynamic.

Server Type Count
nginx 3813748
unknown 3765476
Apache 3660319
Microsoft-IIS/10.0 733627
Microsoft-IIS/8.5 580386
DNVRS-Webs 557033
Microsoft-IIS/7.5 435336
Webs 403379
Apache/2.4.29 (Ubuntu) 399222
nginx/1.18.0 324084
Apache/2.4.41 (Ubuntu) 317789
nginx/1.14.0 (Ubuntu) 292996
SonicWALL 282804
micro_httpd 262466
Apache/2.4.18 (Ubuntu) 260325
nginx/1.16.1 259588
nginx/1.18.0 (Ubuntu) 254281
Apache-Coyote/1.1 212352
Nginx Microsoft-HTTPAPI/2.0 194509
Boa/0.94.14rc21 193325
LiteSpeed 182980
Apache/2.4.38 (Debian) 173570
Apache/2 161466
Apache/2.2.15 (CentOS) 160465
Apache/2.4.25 (Debian) 158716
IdeaWebServer/3.0.0 155090
nginx/1.10.3 (Ubuntu) 151313
nginx/1.14.2 121512
GoAhead-Webs 121471
nginx/1.20.0 120744
web 118241
nginx/1.20.1 114829
Apache/2.4.7 (Ubuntu) 101990

Responsible Disclosure for August

Our first crawl has yielded over 1000 vulnerable high-value targets, which have been responsibly disclosed to government agencies and the end customers. For the time being these disclosures will remain unpublished until we can ensure that all vulnerable machines are patched.

Technical Challenges

Since our technology is, in general, still under heavy development, this month has exposed a variety of interesting challenges. To start, our stack is primarily Golang, Bash, ArangoDB, and NGINX. This has allowed us to scale at a tremendous rate, albeit with a variety of growing pains.

Zmap JSON Bug

This bug forced us to throw away all of our results after we discovered a major flaw in the zmap JSON output module. Anyone using this module in their research will unfortunately need to discard their results as well, since the IPs returned are not real responses. The bug has been reported to the zmap team, but it is unclear how or when it will be patched. We have fallen back on the CSV output module, which still appears to report responding IPs correctly. Thankfully, this only required a small rework of our distributed zmap worker system.
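For context, the fallback on the worker side looks roughly like the sketch below: shell out to zmap with the CSV output module and keep only the lines that parse as IP addresses. The runZmapCSV helper, the package name, the flag values, and the error handling are illustrative, not our exact production code.

package worker

import (
        "fmt"
        "net"
        "os/exec"
        "strconv"
        "strings"
)

// runZmapCSV runs zmap against a single port and CIDR block using the CSV
// output module and returns the addresses that responded to the ping scan.
func runZmapCSV(port int, cidr string) ([]string, error) {
        cmd := exec.Command("zmap",
                "-p", strconv.Itoa(port),
                "--output-module=csv",
                "--output-fields=saddr",
                "-o", "-", // write results to stdout
                cidr,
        )
        out, err := cmd.Output()
        if err != nil {
                return nil, fmt.Errorf("zmap failed: %w", err)
        }

        var ips []string
        for _, line := range strings.Split(string(out), "\n") {
                line = strings.TrimSpace(line)
                // Skip the CSV header and blank lines; keep only valid IPs.
                if net.ParseIP(line) == nil {
                        continue
                }
                ips = append(ips, line)
        }
        return ips, nil
}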

Distributed Locking at Scale

So far this has been our most challenging issue. We use ArangoDB as a command-and-control surface for all our workers. For the most part this is a highly efficient way to manage thousands of jobs distributed across hundreds of workers and sub-processes. However, we have had some cases where workers, for whatever reason, become perfectly synchronized: a job is acquired at the exact same time by two different workers before it can be locked. This is mostly my own oversight in not realizing that a synchronized API would be absolutely necessary when scaling past a certain point. The problem has been mostly mitigated by randomizing the wait time before a worker gets a job, but this is only a bandage.
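To make the bandage concrete: each worker simply sleeps for a random duration before asking ArangoDB for its next job. A minimal sketch of that helper is below; the package name and the way it is wired into the worker loop are illustrative.

package worker

import (
        "math/rand"
        "time"
)

// jitter sleeps for a random duration up to max. Each worker calls this right
// before requesting its next job so that two workers are unlikely to hit the
// job queue at the exact same instant. Seed math/rand once at worker startup.
func jitter(max time.Duration) {
        time.Sleep(time.Duration(rand.Int63n(int64(max))))
}

A worker would call something like jitter(2 * time.Second) immediately before its job-claim request; the two-second bound is illustrative, not our production value.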

To fix this, a few options are possible. A Foxx API may be the answer, but I'm not convinced that it will totally solve the issue at larger scale. Another option may be better logic on the worker end to ensure that a job is never locked in tandem by two workers. Overall this will be an interesting challenge and a chance to maximize the power of ArangoDB.
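As one example of what better worker-side logic could look like, the find-a-job and lock-it steps could be collapsed into a single AQL query so the claim happens server-side in one operation; if two workers still race, one should simply get a write conflict and retry. The sketch below assumes the official ArangoDB Go driver (github.com/arangodb/go-driver); the collection and field names (jobs, status, worker) are placeholders, and this is an idea rather than our implemented fix.

package worker

import (
        "context"

        driver "github.com/arangodb/go-driver"
)

// claimQuery finds one pending job and locks it for the given worker in a
// single server-side operation.
const claimQuery = `
FOR j IN jobs
    FILTER j.status == "pending"
    LIMIT 1
    UPDATE j WITH { status: "locked", worker: @worker } IN jobs
    RETURN NEW`

// claimJob runs the claim query and returns the locked job document, or nil
// if no job is currently pending.
func claimJob(ctx context.Context, db driver.Database, workerID string) (map[string]interface{}, error) {
        cursor, err := db.Query(ctx, claimQuery, map[string]interface{}{"worker": workerID})
        if err != nil {
                return nil, err
        }
        defer cursor.Close()

        var job map[string]interface{}
        if _, err := cursor.ReadDocument(ctx, &job); err != nil {
                if driver.IsNoMoreDocuments(err) {
                        return nil, nil // nothing pending right now
                }
                return nil, err
        }
        return job, nil
}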

Colly Deadlocking Issue (Potential Golang Source Bug)

During crawls we often deal with honeypots and anti-crawl mechanisms. For some reason this has become an extreme issue on port 8008. My current guess is that honeypots create endless connections that deadlock a few jobs indefinitely. I have set very aggressive timeouts, which you can see below, but even so, in a typical job of 10k IPs, 4-15 will fully deadlock and c.Wait() never completes. We have enabled a variety of deadlines to help kill these jobs, but even at the level of Go's low-level net/http package the workers are never able to close.

This issue has been reported to the Colly developers, but again it is unclear what the fix will be. For the time being we have resorted to crashing workers after a timeout and skipping the failed blocks of IPs. While this solution is not optimal, it has allowed our crawlers to recover and move past failed blocks; a sketch of this watchdog pattern follows the collector configuration below.

// Create the Colly collector used by each crawl worker. The timeouts here are
// intentionally aggressive so that unresponsive hosts are dropped quickly.
c := colly.NewCollector(
        colly.MaxBodySize(10e9), // allow very large response bodies (up to ~10 GB)
        colly.DetectCharset(),
        colly.Async(true), // run requests asynchronously
)

// Cap concurrency across all domains.
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 500})

// Per-request timeout.
c.SetRequestTimeout(5 * time.Second)

// Custom transport with short dial, idle, TLS, and expect-continue timeouts.
c.WithTransport(&http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
                Timeout:   5 * time.Second,
                KeepAlive: 5 * time.Second,
                DualStack: true,
        }).DialContext,
        MaxIdleConns:          500,
        IdleConnTimeout:       5 * time.Second,
        TLSHandshakeTimeout:   5 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
})
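The recovery path when c.Wait() hangs is essentially a watchdog around the whole block: run c.Wait() in a goroutine, give the block a hard deadline, and abandon it if the deadline passes so the worker can be restarted and the block skipped. The sketch below shows the shape of that pattern; the crawlBlock helper, the package name, the colly/v2 import path, and the error handling are illustrative rather than our exact production code.

package worker

import (
        "fmt"
        "time"

        "github.com/gocolly/colly/v2"
)

// crawlBlock visits one block of IPs with a collector configured as above and
// enforces a hard deadline on c.Wait(). If the deadline passes, the block is
// abandoned so a supervisor can restart the worker and skip these IPs.
func crawlBlock(c *colly.Collector, ips []string, deadline time.Duration) error {
        for _, ip := range ips {
                // Per-target Visit errors are non-fatal for the block; the
                // responses themselves are handled by the collector callbacks.
                _ = c.Visit("http://" + ip)
        }

        done := make(chan struct{})
        go func() {
                c.Wait() // may never return if a connection deadlocks
                close(done)
        }()

        select {
        case <-done:
                return nil
        case <-time.After(deadline):
                return fmt.Errorf("watchdog fired after %s; abandoning block", deadline)
        }
}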