August 2021 Survey Report

August Survey:
Our August survey represents the first official release of our crawling research. These reports will serve as a way for us not only to discuss the results of our crawls but also the technical process of building this technology. Working on this project has been an incredible adventure so far, and we look forward to you joining us on this journey!
Survey Results:
High-Level Survey Results:
At a high level, our first HTTP crawl has yielded a great deal of interesting statistics about the adoption rates of many technologies across the internet. It is important to note that only the HTTP crawler is currently deployed in production, so the crawl-response results below are limited to HTTP services; crawlers for the remaining ports are marked Pending Dev.
Port | Ping Response | Crawl Response |
---|---|---|
21 | 13016656 | Pending Dev |
80 | 60773485 | 19387484 |
443 | 53606644 | 2083897 |
7878 | 3385803 | 27962 |
8008 | 9079938 | 81633 |
8080 | 12263264 | 2291785 |
8888 | 5602088 | 1154780 |
9200 | 3868575 | Pending Dev |
9981 | 4798149 | Pending Dev |
11001 | 4908024 | Pending Dev |
47808 | 3204075 | Pending Dev |
50001 | 8383239 | Pending Dev |
Server Type Adoption Rates
Adoption rates offer an interesting look at the state of the internet: technologies are rapidly replaced and made obsolete, and the landscape is massively dynamic.
Server Type | Count |
---|---|
nginx | 3813748 |
unknown | 3765476 |
Apache | 3660319 |
Microsoft-IIS/10.0 | 733627 |
Microsoft-IIS/8.5 | 580386 |
DNVRS-Webs | 557033 |
Microsoft-IIS/7.5 | 435336 |
Webs | 403379 |
Apache/2.4.29 (Ubuntu) | 399222 |
nginx/1.18.0 | 324084 |
Apache/2.4.41 (Ubuntu) | 317789 |
nginx/1.14.0 (Ubuntu) | 292996 |
SonicWALL | 282804 |
micro_httpd | 262466 |
Apache/2.4.18 (Ubuntu) | 260325 |
nginx/1.16.1 | 259588 |
nginx/1.18.0 (Ubuntu) | 254281 |
Apache-Coyote/1.1 | 212352 |
Nginx Microsoft-HTTPAPI/2.0 | 194509 |
Boa/0.94.14rc21 | 193325 |
LiteSpeed | 182980 |
Apache/2.4.38 (Debian) | 173570 |
Apache/2 | 161466 |
Apache/2.2.15 (CentOS) | 160465 |
Apache/2.4.25 (Debian) | 158716 |
IdeaWebServer/3.0.0 | 155090 |
nginx/1.10.3 (Ubuntu) | 151313 |
nginx/1.14.2 | 121512 |
GoAhead-Webs | 121471 |
nginx/1.20.0 | 120744 |
web | 118241 |
nginx/1.20.1 | 114829 |
Apache/2.4.7 (Ubuntu) | 101990 |
Responsible Disclosure for August
Our first crawl yielded over 1000 vulnerable, high-value targets, which have been responsibly disclosed to government agencies and the end customers. For the time being these disclosures will not be released publicly until we can ensure that all vulnerable machines are patched.
Technical Challenges
Since our technology is still under heavy development, this month exposed a variety of interesting challenges. To start, our stack is primarily Golang, Bash, ArangoDB, and NGINX. This has allowed us to scale at a tremendous rate, though not without a variety of growing pains.
Zmap JSON Bug
This bug resulted in us needing to throw away all of our results after discovering a major flaw in the zmap JSON output module: the IPs it returns are not real responses. Anyone using it in their research will likewise need to discard their results. The bug has been reported to the zmap team, but it is unclear how or when it will be patched. We have fallen back on the CSV output module, which still appears to report IP responses correctly. Thankfully, this only required a small rework of our distributed zmap worker system.
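For illustration, the fallback looks roughly like the helper below: the worker shells out to zmap with the CSV output module and reads back the responding addresses. This is a minimal sketch rather than our production worker code; the `runZmapCSV` helper, package name, field selection, and file handling are assumptions, and the zmap flags reflect our understanding of its CLI and may differ between versions.

```go
package worker

import (
	"bufio"
	"os"
	"os/exec"
)

// runZmapCSV shells out to zmap using the CSV output module and returns the
// responding source addresses for one port/subnet block. Flag names and the
// header handling are based on our reading of the zmap CLI and may vary.
func runZmapCSV(port, subnet, outFile string) ([]string, error) {
	cmd := exec.Command("zmap",
		"-p", port, // target port to probe
		"-O", "csv", // CSV output module instead of the broken JSON module
		"--output-fields=saddr", // only the responding source address is needed
		"-o", outFile,
		subnet,
	)
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return nil, err
	}

	f, err := os.Open(outFile)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var ips []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" || line == "saddr" { // skip blanks and a possible CSV header row
			continue
		}
		ips = append(ips, line)
	}
	return ips, scanner.Err()
}
```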
Distributed Locking at Scale
This has so far been our most challenging issue. We use ArangoDB as a command-and-control surface for all our workers. For the most part this is a highly efficient way to manage thousands of jobs distributed across hundreds of workers and sub-processes. However, we have had cases where workers, for whatever reason, become perfectly synchronized: a job is acquired at the exact same time by two different workers before it can be locked. This is mostly my own oversight in not realizing that a synchronized API would be absolutely necessary when scaling past a certain point. It has been mostly mitigated by randomizing the wait time before a worker grabs a job, but this is only a band-aid.
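As a rough illustration of that band-aid, each worker sleeps for a small random interval before asking ArangoDB for its next job. The sketch below assumes a hypothetical `worker` package, and the 0-5 second window is illustrative rather than our production bounds.

```go
package worker

import (
	"math/rand"
	"time"
)

// rng is seeded per worker so that different workers produce independent
// jitter sequences instead of sharing one deterministic sequence.
var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// jitterBeforeClaim sleeps for a random interval before the worker asks
// ArangoDB for its next job, so that workers which restart together do not
// request jobs at exactly the same moment.
func jitterBeforeClaim() {
	time.Sleep(time.Duration(rng.Intn(5000)) * time.Millisecond)
}
```

A worker would call this immediately before each acquisition attempt; it reduces the collision rate but, as noted above, does not eliminate it.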
To fix this, a few options are possible. A Foxx API may be the answer, but I'm not convinced that it will totally solve the issue at larger scale. Another possible option is better logic on the worker end to ensure that a job has not been locked in tandem by another worker. Overall this will be an interesting challenge and a chance to maximise the power of ArangoDB.
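To make the worker-end idea concrete, the sketch below claims a pending job with a single AQL update and then re-reads the document to confirm the lock was not overwritten by a perfectly synchronized peer. It uses the ArangoDB Go driver (github.com/arangodb/go-driver); the `claimJob` helper, collection name, field names, and verification delay are illustrative assumptions rather than our production schema.

```go
package worker

import (
	"context"
	"fmt"
	"time"

	driver "github.com/arangodb/go-driver"
)

// Job mirrors the job-document fields that matter for locking (illustrative).
type Job struct {
	Key    string `json:"_key"`
	Status string `json:"status"`
	Worker string `json:"worker"`
}

// claimJob locks one pending job for workerID with a single AQL update, then
// re-reads the document after a short pause to confirm the lock was not
// overwritten by a perfectly synchronized worker. It returns (nil, nil) when
// nothing is pending and an error when the claim was lost.
func claimJob(ctx context.Context, db driver.Database, workerID string) (*Job, error) {
	const q = `
		FOR j IN jobs
			FILTER j.status == "pending"
			LIMIT 1
			UPDATE j WITH { status: "locked", worker: @worker, lockedAt: DATE_NOW() } IN jobs
			RETURN NEW`

	cursor, err := db.Query(ctx, q, map[string]interface{}{"worker": workerID})
	if err != nil {
		return nil, err
	}
	defer cursor.Close()

	var job Job
	if _, err := cursor.ReadDocument(ctx, &job); driver.IsNoMoreDocuments(err) {
		return nil, nil // nothing pending right now
	} else if err != nil {
		return nil, err
	}

	// Verification: wait briefly, then confirm our worker ID is still on the
	// document. If a peer claimed it in tandem, back off and let them keep it.
	time.Sleep(250 * time.Millisecond)
	jobs, err := db.Collection(ctx, "jobs")
	if err != nil {
		return nil, err
	}
	var check Job
	if _, err := jobs.ReadDocument(ctx, job.Key, &check); err != nil {
		return nil, err
	}
	if check.Worker != workerID {
		return nil, fmt.Errorf("job %s was locked in tandem by %s; backing off", job.Key, check.Worker)
	}
	return &job, nil
}
```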
Colly Deadlocking Issue (Potential Golang Source Bug)
Often during crawls we deal with honeypots and anti-crawl mechanisms. For some reason this has become an extreme issue on port 8008. My current guess is that honeypots create endless connections that deadlock a handful of jobs indefinitely. I have set very aggressive timeouts, which you can see below, but even so, in a typical job of 10k IPs, 4-15 will fully deadlock and c.Wait() never completes. We have enabled a variety of deadlines to help kill these jobs, but even down at the level of Golang's low-level HTTP package, the workers are never able to close.
This issue has been reported to the Colly developers, but again it is unclear what the fix will be. For the time being we have resorted to crashing workers after a timeout and skipping the failed blocks of IPs. While this solution is not optimal, it has allowed our crawlers to recover and move past the failed blocks; a sketch of that watchdog pattern follows the collector configuration below.
```go
// Create Colly Collector with aggressive timeouts at every layer.
c := colly.NewCollector(
	colly.MaxBodySize(10e9),
	colly.DetectCharset(),
	colly.Async(true),
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 500})
c.SetRequestTimeout(5 * time.Second)
c.WithTransport(&http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 5 * time.Second,
		DualStack: true,
	}).DialContext,
	MaxIdleConns:          500,
	IdleConnTimeout:       5 * time.Second,
	TLSHandshakeTimeout:   5 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
})
```
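The watchdog itself can be as simple as racing c.Wait() against a hard deadline. The sketch below illustrates that recovery approach rather than our exact worker code: the `waitOrDie` name, the exit-based restart, the supervisor assumption, and the colly v2 import path are all assumptions.

```go
package worker

import (
	"log"
	"os"
	"time"

	"github.com/gocolly/colly/v2"
)

// waitOrDie races c.Wait() against a hard deadline. If a block of IPs has not
// finished in time, the worker abandons the block and exits so its supervisor
// can restart it and move on to the next block. This recovers from the
// deadlock; it does not fix it.
func waitOrDie(c *colly.Collector, deadline time.Duration) {
	done := make(chan struct{})
	go func() {
		c.Wait() // may never return when honeypot connections deadlock the collector
		close(done)
	}()

	select {
	case <-done:
		log.Println("block completed normally")
	case <-time.After(deadline):
		log.Println("block deadlocked; flagging it as failed and restarting the worker")
		os.Exit(1)
	}
}
```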