Systems running in the cloud are like primitive multicellular creatures: they cannot survive unless you take care of them. Understanding the health of a cloud system depends directly on understanding its logs and metrics. Log querying and monitoring are among the most critical services that cloud providers offer. In AWS, CloudFront, ALB, CloudTrail, Lambda, and EC2 are the best-known services that provide very useful logs. CloudWatch Logs Insights and S3 integration let us query these logs and benefit from the evidence. Route 53 (R53) is one of the newer services to share detailed logs about DNS resolution in private networks. Today, I will share my work to show how much information we get from
A private hosted zone produces a log of every DNS resolution event in CloudWatch log groups, with rich fields such as the domain name, source information, and resolution time. This includes queries for AWS domains. In huge microservice systems, tracking how microservices depend on each other and on AWS services cannot be done manually because of the sheer number of microservices. We can solve this problem by monitoring the dependencies of all microservices in real time, with nothing more than following the R53 logs.
Monitoring R53 logs provides information not only about system architecture and AWS service dependencies but also about NXDOMAIN DNS queries. This helps us solve our "It's always DNS" incidents and bring MTTR down to reasonable values when combined with a good alerting solution like Opsgenie.
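As a minimal sketch of the NXDOMAIN idea, the functions below count failed lookups per source instance and queried name. The sketch assumes each log event has already been parsed from JSON into a dict and that it carries the `rcode`, `query_name`, and `srcids.instance` fields of the Resolver query log format. The function names and the threshold value are hypothetical, and forwarding the result to an alerting tool is left out.

```python
from collections import Counter


def count_nxdomain(records):
    """Count NXDOMAIN responses per (source instance, queried name).

    `records` is an iterable of dicts parsed from query log events
    (assumption: JSON parsing happens upstream).
    """
    counts = Counter()
    for record in records:
        if record.get("rcode") != "NXDOMAIN":
            continue
        # srcids may be absent when the query did not come from an instance.
        instance = record.get("srcids", {}).get("instance", "unknown")
        counts[(instance, record.get("query_name", ""))] += 1
    return counts


def over_threshold(counts, threshold=10):
    """Return the (instance, name) pairs seen at least `threshold` times."""
    return sorted(key for key, n in counts.items() if n >= threshold)
```

The pairs returned by `over_threshold` are the candidates worth alerting on: an instance that repeatedly asks for a name that does not exist usually points at a typo in configuration or a service that has been removed.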
Example
To understand the R53 private DNS logs, we need a microservices setup whose services connect to each other through private DNS. Therefore, I provide a simple CloudFormation template that generates the internal microservices for us.
The CloudFormation template provides a simple microservice deployment in which the services talk to each other via R53 private DNS. In this example, we have 4 different services deployed with their load balancers and private domain names, and they can send requests to each other.
In our example, we expect a service tag on each resource, such as the ALB and the EC2 instance, so that we can understand which instance is connected to which ALB. First, we run the following query to retrieve the 20 most recent log entries.
"fields query_name, srcids.instance | filter query_name like /(?i)"+re.escape(dns_prefix)+"/ | sort @timestamp desc | limit 20"
We then query the EC2 instances and ALBs to determine the sender and receiver, respectively. The remaining part is simply to put them into a graph and show the service architecture.
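The query and graph steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the r53-analyzer implementation: `run_query` expects a boto3 CloudWatch Logs client to be passed in (its `start_query` and `get_query_results` calls are real), while `build_query`, `build_edges`, the one-hour window, and the two lookup dicts are hypothetical. The dicts stand in for the tag lookups: one maps instance IDs to a service tag, the other maps private DNS names to the service tag of the load balancer behind them.

```python
import re
import time
from collections import Counter


def build_query(dns_prefix):
    """Build the Logs Insights query for a given DNS prefix."""
    return (
        "fields query_name, srcids.instance"
        " | filter query_name like /(?i)" + re.escape(dns_prefix) + "/"
        " | sort @timestamp desc | limit 20"
    )


def run_query(logs_client, log_group, query, window_seconds=3600, poll_seconds=1.0):
    """Start the query and poll until it finishes, returning rows as dicts."""
    now = int(time.time())
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - window_seconds,
        endTime=now,
        queryString=query,
    )
    while True:
        response = logs_client.get_query_results(queryId=started["queryId"])
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError("query ended with status " + response["status"])
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{c["field"]: c["value"] for c in row} for row in response["results"]]


def build_edges(rows, instance_service, dns_service):
    """Count sender -> receiver service edges from the query rows."""
    edges = Counter()
    for row in rows:
        sender = instance_service.get(row.get("srcids.instance"))
        # Logged names are assumed to carry a trailing dot; normalize it away.
        name = row.get("query_name", "").rstrip(".").lower()
        receiver = dns_service.get(name)
        if sender and receiver:
            edges[(sender, receiver)] += 1
    return edges
```

With boto3 installed and credentials configured, `run_query(boto3.client("logs"), log_group, build_query(dns_prefix))` would return the rows to feed into `build_edges`. The resulting counter can be handed to any graph library to draw the architecture.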
r53-analyzer is the repository I created to show how important DNS query logs are and to start a discussion on how we can combine more logs. We can go further and analyze which services make which DNS queries, and so show the external dependencies of each service. Moreover, combining these logs with ALB and CloudFront logs can help us understand how external requests are distributed across internal services. For example, we can understand how many external requests are affected by the S3 service incident here.
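The external-dependency idea could look like the sketch below. It is an illustration only and assumes the same row shape as the Logs Insights query used earlier (`query_name` and `srcids.instance`), a dict from instance IDs to service tags, and a single private zone suffix. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def external_dependencies(rows, instance_service, private_suffix):
    """Map each service to the DNS names it queries outside the private zone."""
    suffix = private_suffix.rstrip(".").lower()
    deps = defaultdict(set)
    for row in rows:
        service = instance_service.get(row.get("srcids.instance"))
        name = row.get("query_name", "").rstrip(".").lower()
        if not service or not name:
            continue
        # A name is private if it is the zone itself or any name under it.
        is_private = name == suffix or name.endswith("." + suffix)
        if not is_private:
            deps[service].add(name)
    return {service: sorted(names) for service, names in deps.items()}
```

Names ending in `amazonaws.com` in the output would then point at AWS service dependencies, which is the starting point for estimating the impact of an AWS-side incident.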