Sharing

Tuesday, September 25, 2012

Notes on the CAP and BASE Theorems


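This article gives a first-pass explanation of the CAP theorem. I recommend it as a starting point for anyone who is completely unfamiliar with CAP.
This article explains the CAP paper in more depth and also covers the BASE theory derived from it.
Eric Brewer explains that every design has some gaps in one of the three dimensions (Consistency/Availability/Partition Tolerance), but that over the past twelve years different compensating measures have been developed for those gaps. I think it is well worth reading. Some problems look unsolvable, and severe, when viewed purely from an engineer's perspective, yet once administrative measures are added alongside the technical ones they no longer seem so serious. So it helps to step outside the engineer's mindset now and then.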



http://pl.atyp.us/wordpress/index.php/2009/11/availability-and-partition-tolerance/
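The author tries to explain CAP with a simple example, though I still feel that something about it is off.
The comments below the post are well worth reading.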

Comment 1:
Quorum is to avoid “split brain syndrome” in which two mutually isolated sets of nodes continue running independently after a network partition. The key observation is that only one such set can contain N/2+1 or more nodes. Therefore, if you can reach at least that many nodes (including yourself) then you’re part of the quorum majority and it’s safe to keep running. If you can’t, then you’re part of the minority and it’s not safe so you shut down. Since the quorum members can see or assume that non-quorum members are no longer alive, they can break locks instead of waiting for operations that require them. It’s a simple and completely reasonable approach in a local environment where you’ve done everything you can to prevent partitions (e.g. redundant networks or even signaling through storage), but in a distributed environment where partitions are inevitable it can mean that all but one site becomes unusable.
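The majority rule described in this comment can be sketched in a few lines. This is only an illustration under simple assumptions (a fixed, known cluster size and one vote per node); the function names `has_quorum` and `partition_outcome` are hypothetical and not taken from the linked post.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this side of a partition holds a strict majority.

    reachable_nodes counts the local node itself, as in the comment above.
    """
    return reachable_nodes >= cluster_size // 2 + 1


def partition_outcome(cluster_size: int, side_sizes: list[int]) -> list[bool]:
    """For each side of a partition, report whether it may keep running."""
    return [has_quorum(size, cluster_size) for size in side_sizes]


# A 5-node cluster split 3/2: only the 3-node side keeps running.
print(partition_outcome(5, [3, 2]))
# A 4-node cluster split 2/2: neither side has 3 nodes, so both shut down.
print(partition_outcome(4, [2, 2]))
```

Because at most one side can reach `cluster_size // 2 + 1` nodes, two sides can never both believe they are the majority, which is what prevents split brain. The even split also shows the cost: no side survives.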

Comment 2:

Thank you Jeff, for perfect and quick elaboration. Now it makes so much more sense to me.
I would like to clarify one more thing. Under the figure you have three bullets. The first bullet describes a scenario where we have an available but not partition-tolerant system. In the description you say that Z would be blocked because X is unreachable. Which I understand. However, I do not understand why this is called available system because when I compare it to the second bullet, and look at both situation as black-box from availability prospective, they appear to me completely similar. In both cases one node is working while the other is blocked. Yes, it is for two different reasons, which I understand, but what I don’t is why one is called available while the other is not. To me it appears that both situations are actually not-available (for two different reasons, though).

Reply:
CP without A – The distributed service will always return correct results, though some or all of the nodes may not respond at all when there are problems. If the network gets partitioned this will continue to work.
AP without C – Any non-failed node will always respond to requests, though the data may be stale. If the network gets partitioned this will continue to work.
CA without P – The distributed service will return consistent results from all non-failed nodes, all the time. Since there is no “Partition-Tolerance” the distributed systems network must be perfect – never failing. If the network does get partitioned, all bets are off. You’ll lose consistency, availability, or both. Choosing to forgo “Partition-Tolerance” means that your system will not tolerate network partitions.


Reply:
It is impossible to have a “consistent” and “available” distributed system, unless you can guarantee there will be no network partitions. But it doesn’t mean there can’t be “almost CA systems” out there that are available most of the time and are consistent most of the time.


After reading these, I think the comment highlighted in red is the correct one. In fact, going back to the InfoQ article, Eric Brewer has already explained:

  • CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare
  • designers need to choose between consistency and availability when partitions are present
  • a period when the program must make a fundamental decision-the partition decision:
    • cancel the operation and thus decrease availability (CP)
    • proceed with the operation and thus risk inconsistency. (AP)

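A distributed system generally has to satisfy P (the service must not become completely unavailable because some nodes fail or the network is interrupted), so what remains is deciding which of the following you want:
1. Sacrifice some users, who get no service, but guarantee that everyone who is served receives consistent data (C)
2. Sacrifice no one, so that everyone is served, but some may receive inconsistent data (A)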
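The two choices above can be illustrated with a toy replica that faces the partition decision on every write. This is a minimal sketch, not how any particular database implements it; `Mode`, `Replica`, `Unavailable`, and the `pending` list are all hypothetical names introduced for illustration.

```python
from enum import Enum


class Mode(Enum):
    CP = "CP"  # choice 1: refuse service rather than risk inconsistency
    AP = "AP"  # choice 2: keep serving and reconcile later


class Unavailable(Exception):
    """Raised when a CP replica cancels an operation during a partition."""


class Replica:
    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while the peer is unreachable
        self.pending = []         # writes the peer has not seen yet

    def write(self, value) -> str:
        if self.partitioned:
            if self.mode is Mode.CP:
                # Cancel the operation: availability drops, data stays consistent.
                raise Unavailable("peer unreachable, write rejected")
            # Proceed with the operation: the peer may now serve stale data.
            self.pending.append(value)
        self.value = value
        return "ok"


ap = Replica(Mode.AP)
ap.partitioned = True
print(ap.write("x"), ap.pending)  # the write succeeds but still has to be replicated

cp = Replica(Mode.CP)
cp.partitioned = True
try:
    cp.write("x")
except Unavailable as err:
    print("rejected:", err)
```

In the AP case the `pending` list is the inconsistency that has to be repaired once the partition heals; in the CP case there is nothing to repair, but the caller gets no service.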

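As for systems that claim to satisfy both C and A, once some nodes fail (or become unavailable because of a network interruption), the whole system stalls and no one gets any service. I think a single point of failure is a good illustration of this. When we design a high-availability service, a complete outage is the last thing we want to see, which is why I said above that most distributed systems today take P as a prerequisite and only then decide whether to give up C or A. Some NoSQL designs even let you choose freely between CP and AP, depending on the user's needs.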
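One common way such a choice is exposed is through per-request replica counts: with N replicas, a write waits for W acknowledgements and a read consults R replicas. The post does not name a specific system, so the sketch below is only an assumption about the general quorum-style approach, and the function names are hypothetical.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """With n replicas, any r-replica read overlaps any w-replica write
    exactly when r + w > n, so the read includes at least one fresh copy."""
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """How many replicas may be unreachable before reads or writes block."""
    return {"reads": n - r, "writes": n - w}


# Leaning toward consistency: read and write sets always overlap.
print(read_sees_latest_write(3, 2, 2), tolerated_failures(3, 2, 2))
# Leaning toward availability: any single replica can answer, possibly stale.
print(read_sees_latest_write(3, 1, 1), tolerated_failures(3, 1, 1))
```

Raising R and W buys consistency at the price of blocking sooner when replicas are unreachable; lowering them keeps the service answering but allows stale reads, which is the same C versus A trade-off expressed as two numbers.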

Monday, September 24, 2012

Some good articles about testing and monitoring


  • Expect failure at any time
  • Automation is key
  • Dashboards are essential

  • Backups are good only if you can restore them
  • If it's not monitored, it's not in production
  • If a protocol has an acronym, you need to learn it
  • The most important skill you need to master is problem solving
  • You need at least 2 of everything in production
  • Keep your systems secure
  • Logging is your best friend
  • You need to know a scripting language
  • Document everything
  • Always try to be a leader

Step 1. Configure a good monitoring and alerting system
Step 2. Configure a good resource graphing system
Step 3. Dashboards, dashboards, dashboards
Step 4. Correlate errors with resource state and capacity
Step 5. Expect failures and recover quickly and gracefully


  • Test-infected vs. monitoring-infected
  • Adding tests vs. adding monitoring checks
  • Ignoring broken tests vs. ignoring monitoring alerts
  • Improving test coverage vs. improving monitoring coverage
  • Measure and graph everything