• Expect failure at any time
• Automation is key
• Dashboards are essential
• Backups are good only if you can restore them
• If it's not monitored, it's not in production
• If a protocol has an acronym, you need to learn it
• The most important skill you need to master is problem solving
• You need at least 2 of everything in production
• Keep your systems secure
• Logging is your best friend
• You need to know a scripting language
• Document everything
• Always try to be a leader
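The point above that backups are good only if you can restore them can be made concrete with a small restore check. This is a minimal sketch, not a real backup tool: it assumes a backup is a single file, and the names sha256_of and verify_restore are hypothetical. The "restore" is a plain file copy standing in for whatever restore procedure is actually in use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it to `original`."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        # Assumption: copying the file stands in for a real restore procedure.
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(original)
```

Running a check like this on a schedule, and alerting when it returns False, treats the restore path as something that is monitored rather than assumed to work.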
Step 1. Configure a good monitoring and alerting system
Step 2. Configure a good resource graphing system
Step 3. Dashboards, dashboards, dashboards
Step 4. Correlate errors with resource state and capacity
Step 5. Expect failures and recover quickly and gracefully
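Steps 1 and 5 above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a replacement for a real monitoring system: check_threshold and retry_with_backoff are hypothetical names, the OK/WARNING/CRITICAL labels are only one common convention, and the thresholds and delays are placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def check_threshold(value: float, warning: float, critical: float) -> str:
    """Classify a metric reading against warning and critical thresholds."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def retry_with_backoff(operation: Callable[[], T], attempts: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying failures with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure so alerting sees it
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ... by default
    raise ValueError("attempts must be at least 1")
```

For example, check_threshold(85, warning=80, critical=90) returns "WARNING". The sleep parameter is injectable so the retry logic can be exercised without real waiting.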
• Test-infected vs. monitoring-infected
• Adding tests vs. adding monitoring checks
• Ignoring broken tests vs. ignoring monitoring alerts
• Improving test coverage vs. improving monitoring coverage
• Measure and graph everything
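The parallel between adding tests and adding monitoring checks can be illustrated by writing checks the way one writes unit tests: small named functions that either pass or fail. The sketch below is hypothetical throughout (the CHECKS registry, the monitoring_check decorator, run_checks, and the 90 percent disk threshold are all assumptions made for illustration); only shutil.disk_usage is a standard library call.

```python
import shutil
from typing import Callable, Dict, List

# Registry of named checks, filled in by the decorator below.
CHECKS: Dict[str, Callable[[], bool]] = {}


def monitoring_check(name: str):
    """Register a function as a named monitoring check."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = func
        return func
    return register


def run_checks() -> List[str]:
    """Run every registered check and return the names of those that failed."""
    failed = []
    for name, check in CHECKS.items():
        try:
            passed = check()
        except Exception:
            passed = False  # a check that crashes counts as a failure
        if not passed:
            failed.append(name)
    return failed


@monitoring_check("root_disk_below_90_percent")
def root_disk_has_space() -> bool:
    # Assumed threshold: alert once the root filesystem is 90% full.
    usage = shutil.disk_usage("/")
    return usage.used / usage.total < 0.90
```

As with a test suite, a non-empty result from run_checks() is something to fix rather than ignore, and coverage grows by registering one more check each time a new failure mode is discovered.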