Intermittent DNS resolution failures
Incident Report for Papertrail
Postmortem

A detailed explanation of what happened and how we're preventing it from affecting Papertrail again is on blog.papertrailapp.com.

Posted Dec 03, 2014 - 12:04 PST

Resolved
This incident has been resolved.
Posted Dec 02, 2014 - 01:07 PST
Update
Our monitoring shows DNS resolution is responding normally now.
Posted Dec 01, 2014 - 23:50 PST
Update
We're continuing to monitor progress as new nodes come online. In the next 48 hours, we'll post a detailed explanation and changes to prevent outages of a single authoritative DNS network from affecting Papertrail.
Posted Dec 01, 2014 - 19:10 PST
Update
DNS is still responding slowly or timing out for about 50% of resolvers outside of North America. It is functioning normally for the remaining half outside of North America and nearly all resolvers in North America.

We're monitoring and our provider is working on bringing the rest of their anycast nodes back online. We'll update again at 7 PM Pacific/2 AM UTC or sooner with news.
Posted Dec 01, 2014 - 18:04 PST
Monitoring
As of ~10 minutes ago, we're seeing slow but eventually successful resolution from most of the Internet. We'll update in 1 hour or when we see a material change in performance.
Posted Dec 01, 2014 - 16:57 PST
Update
To reach Papertrail's Web site by IP, browse to 67.214.223.202 with HTTPS. Acknowledge the SSL certificate mismatch (expected, because the certificate is issued for the hostname rather than the IP address) and you'll be able to log in.

Our DNS service provider is still working to mitigate the attack; no news. They're as disappointed in themselves as we and the hundreds of thousands of other sites that depend on them are. We'll update again about mitigation at 5 PM Pacific / 1 AM UTC or as soon as we see improvement.
Posted Dec 01, 2014 - 16:11 PST
Update
Our DNS provider estimates it will be 20 minutes before DDoS attack traffic is filtered and resolution functions again. We'll update at 3:30 PM Pacific/11:30 PM UTC or sooner if our external monitors report an improvement.

We're disappointed in our almost nonexistent ability to mitigate a provider-specific problem, especially one that can be architected around. While we work through this incident, we're also taking the first steps to implement redundant DNS providers. A provider-specific DDoS should not materially affect Papertrail's DNS resolution, and in the near future, it won't.
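
As a rough illustration of why redundant providers help, here is a minimal sketch of a resolver that tries each authoritative nameserver in turn. It is a simulation under stated assumptions, not Papertrail's implementation: the nameserver names, the hostname, and the `query` callable are all hypothetical stand-ins for real DNS queries.

```python
def resolve_with_fallback(hostname, nameservers, query):
    """Try each authoritative nameserver in order and return the first answer.

    `query(nameserver, hostname)` is a hypothetical callable that returns an
    address or raises TimeoutError, standing in for a real DNS query.
    """
    last_error = None
    for nameserver in nameservers:
        try:
            return query(nameserver, hostname)
        except TimeoutError as exc:
            # This nameserver is unreachable; move on to the next one.
            last_error = exc
    raise TimeoutError(
        f"all {len(nameservers)} nameservers timed out for {hostname}"
    ) from last_error


def simulated_query(nameserver, hostname):
    """Hypothetical network: every provider-a nameserver is under attack."""
    if nameserver.endswith("provider-a.example"):
        raise TimeoutError(f"{nameserver} timed out")
    return "192.0.2.10"  # documentation-range address, not a real server
```

With only `ns1.provider-a.example` and `ns2.provider-a.example` listed, every lookup in this simulation times out. Adding `ns1.provider-b.example` to the list lets the same lookup succeed, at the cost of waiting for the earlier timeouts.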
Posted Dec 01, 2014 - 15:07 PST
Identified
Our DNS provider is still working to mitigate the DDoS attack. We'll post an update as soon as we know more or in 1 hour (3 PM Pacific/11 PM UTC).

Separately, after this outage has been resolved, we'll be making Papertrail's status site reachable when authoritative DNS resolution is down and investigating practical ways to make Papertrail's DNS more redundant.
Posted Dec 01, 2014 - 13:59 PST
Update
Our DNS provider is continuing to work to mitigate the distributed denial of service attack by adding capacity. We'll post more as soon as we know or in 1 hour (2 PM Pacific/10 PM UTC).
Posted Dec 01, 2014 - 13:02 PST
Update
Our DNS provider is midway through mitigating a distributed denial of service attack. We'll post more as soon as we know or in 45 minutes (1 PM Pacific/9 PM UTC).
Posted Dec 01, 2014 - 12:17 PST
Investigating
As of 11:19 AM PST (19:19 UTC), our DNS provider is seeing intermittent query timeouts.

Due to Papertrail's DNS TTL and most syslog senders' resolution cache behavior, query timeouts won't affect most existing loggers immediately. Query timeouts may affect browsing sessions and other clients without cached records.
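
The caching behavior described above can be sketched roughly as follows. This is an illustrative model only: the class and function names, the hostname, and the 300-second TTL are assumptions made for the example, not Papertrail's actual values or any particular sender's implementation.

```python
import time


class TtlCache:
    """Minimal resolver-style cache: an entry is served until its TTL expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # hostname -> (address, expiry time)

    def put(self, hostname, address, ttl_seconds):
        self._entries[hostname] = (address, self._clock() + ttl_seconds)

    def get(self, hostname):
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # The record has expired, so the next lookup must go upstream.
            del self._entries[hostname]
            return None
        return address


def resolve(hostname, cache, upstream_lookup, ttl_seconds):
    """Serve from cache when possible; otherwise ask upstream and cache the answer.

    `upstream_lookup(hostname)` is a hypothetical callable that returns an
    address or raises TimeoutError, standing in for a real DNS query.
    """
    cached = cache.get(hostname)
    if cached is not None:
        return cached
    address = upstream_lookup(hostname)  # may raise TimeoutError
    cache.put(hostname, address, ttl_seconds)
    return address
```

In this model, a sender that resolved the hostname before the outage keeps getting the cached address until the TTL runs out, while a client with an empty cache hits the timeout on its first lookup.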

We'll update this incident as soon as we have news or in 30 minutes, whichever is sooner.
Posted Dec 01, 2014 - 11:36 PST