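1. Transactions: runtime transactions guarantee strong consistency when there is no partition. mnesia supports several transaction types:
a) Lock-free, non-transactional dirty write: one asynchronous phase;
b) Locking asynchronous transaction: one synchronous locking phase, then a transaction with one synchronous phase and one asynchronous phase;
c) Locking synchronous transaction: one synchronous locking phase, then a two-phase synchronous transaction;
d) Locking majority transaction: one synchronous locking phase, then a transaction with two synchronous phases and one asynchronous phase;
e) Locking schema transaction: one synchronous locking phase, then a three-phase synchronous transaction; it is a majority transaction that also carries schema operations;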
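2. Recovery: recovery at restart guarantees eventual consistency when a partition has occurred. On restart, mnesia performs the following distributed negotiation:
a) Node discovery;
b) Node protocol version negotiation;
c) Node schema merge;
d) Node transaction decision merge:
i. If the remote node's outcome is abort and the local outcome is commit, there is a conflict: {inconsistent_database, bad_decision, Node} is reported and the local outcome is changed to abort;
ii. If the remote node's outcome is commit and the local outcome is abort, the local outcome stays abort; the remote node will then change its own outcome and report the conflict;
iii. If the remote node's outcome is unclear and the local outcome is not unclear, the transaction outcome is the local outcome and the remote node changes its own;
iv. If both the remote and the local outcome are unclear, wait for another node that knows the outcome to start, and adopt its outcome;
v. If the outcome is unclear on all nodes, the transaction outcome is unclear;
vi. A transaction decision does not actually affect the stored data;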
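e) Node table data merge:
i. If the local node is a master node, it loads the table data from disk;
ii. If the local node has local tables, it loads the local table data from disk;
iii. If a remote node is alive, the table data is pulled from the remote node;
iv. If no remote node is alive and the local node was the last one to shut down, it loads the table data from disk;
v. If no remote node is alive and the local node was not the last one to shut down, it waits for a remote node to start and load the table data, then pulls the table data from that node; until a remote node has started and loaded the table, the table is not accessible;
vi. If the table data has already been loaded, it is not pulled from a remote node again;
vii. From the cluster's point of view:
1. If another node restarts and initiates a new distributed negotiation, the local node adds it to its cluster topology view;
2. If a node in the cluster goes down (shut down or partitioned), the local node removes it from its cluster topology view;
3. No distributed negotiation takes place when a partition heals; nodes from other partitions cannot join the cluster topology view, and each partition stays partitioned;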
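3. Inconsistency detection: by monitoring, both at runtime and at restart, the up and down state of remote nodes and their transaction decisions, mnesia detects whether a network partition has ever occurred. If one has, the partitions are potentially inconsistent, and the application is notified with an inconsistent_database event:
a) At runtime, the up and down history of remote nodes is monitored; if both sides consider the other to have been down, then as soon as the remote node comes up again the node reports {inconsistent_database, running_partitioned_network,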
    end,
    add_remote_decisions(Node, Tail, State);
add_remote_decisions(_Node, [], State) ->
    State.
add_remote_decision(Node, NewD, State) ->
    Tid = NewD#decision.tid,
    OldD = decision(Tid),
    %% Merge the decisions according to the merge policy. In the only
    %% conflicting case (the receiving node committed the transaction while
    %% the sending node aborted it), the receiving node also chooses to
    %% abort; the state of the transaction itself is rebuilt from the
    %% checkpoint and the redo log.
    D = merge_decisions(Node, OldD, NewD),
    %% Log the merged result
    do_log_decision(D, false, undefined),
    Outcome = D#decision.outcome,
    if
        OldD == no_decision -> ignore;
        Outcome == unclear -> ignore;
        true ->
            case lists:member(node(), NewD#decision.disc_nodes) or
                lists:member(node(), NewD#decision.ram_nodes) of
                true ->
                    %% Tell the other node the decision merged on this node
                    tell_im_certain([Node], D);
                false -> ignore
            end
    end,
    case State#state.unclear_decision of
        U when U#decision.tid == Tid ->
            WaitFor = State#state.unclear_waitfor -- [Node],
            if
                Outcome == unclear, WaitFor == [] ->
                    %% Everybody are uncertain, lets abort
                    %% Every participant of the unclear transaction has been
                    %% asked and none of them could supply the outcome, so
                    %% the transaction is aborted
                    NewOutcome = aborted,
                    CertainD = D#decision{outcome = NewOutcome,
                                          disc_nodes = [],
                                          ram_nodes = []},
                    tell_im_certain(D#decision.disc_nodes, CertainD),
                    tell_im_certain(D#decision.ram_nodes, CertainD),
                    do_log_decision(CertainD, false, undefined),
                    verbose("Decided to abort transaction ~p "
                            "since everybody are uncertain ~p~n",
                            [Tid, CertainD]),
                    gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}),
                    State#state{unclear_pid = undefined,
                                unclear_decision = undefined,
                                unclear_waitfor = undefined};
                Outcome /= unclear ->
                    %% The sending node knows the outcome; pass it on
                    verbose("~p told us that transaction ~p was ~p~n",
                            [Node, Tid, Outcome]),
                    gen_server:reply(State#state.unclear_pid, {ok, Outcome}),
                    State#state{unclear_pid = undefined,
                                unclear_decision = undefined,
                                unclear_waitfor = undefined};
                Outcome == unclear ->
                    %% The sending node does not know the outcome either;
                    %% keep waiting
                    State#state{unclear_waitfor = WaitFor}
            end;
        _ ->
            State
    end.
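The second half of add_remote_decision is a small state machine over the set of peers still being waited for. The following is a minimal sketch of that logic alone, assuming the merged outcome is already known; the module unclear_sketch and the function on_reply/3 are hypothetical names used for illustration and are not part of mnesia.

```erlang
-module(unclear_sketch).
-export([on_reply/3]).

%% Hypothetical helper, not mnesia code: models how a node that is
%% waiting on an unclear transaction reacts to the reply of one peer.
%%   Node     :: the peer that just answered
%%   Outcome  :: committed | aborted | unclear (the merged outcome)
%%   WaitFor0 :: peers that had not answered before this reply
%% Returns {decided, FinalOutcome} or {waiting, RemainingPeers}.
on_reply(Node, Outcome, WaitFor0) ->
    WaitFor = WaitFor0 -- [Node],
    case {Outcome, WaitFor} of
        {unclear, []} -> {decided, aborted};  % everybody is uncertain: abort
        {unclear, _}  -> {waiting, WaitFor};  % keep waiting for the others
        {_, _}        -> {decided, Outcome}   % a peer knew the outcome
    end.
```

For example, unclear_sketch:on_reply(b, unclear, [b, c]) keeps waiting for c, while the same reply with [b] as the wait list ends in an abort.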
Merge policy:
merge_decisions(Node, D, NewD0) ->
    NewD = filter_aborted(NewD0),
    if
        D == no_decision, node() /= Node ->
            %% We did not know anything about this txn
            NewD#decision{disc_nodes = []};
        D == no_decision ->
            NewD;
        is_record(D, decision) ->
            DiscNs = D#decision.disc_nodes -- ([node(), Node]),
            OldD = filter_aborted(D#decision{disc_nodes = DiscNs}),
            if
                OldD#decision.outcome == unclear,
                NewD#decision.outcome == unclear ->
                    D;
                OldD#decision.outcome == NewD#decision.outcome ->
                    %% We have come to the same decision
                    OldD;
                OldD#decision.outcome == committed,
                NewD#decision.outcome == aborted ->
                    %% The only case in which the sending and the receiving
                    %% node conflict: the receiving node committed the
                    %% transaction while the sending node aborted it.
                    %% Abort is still chosen.
                    Msg = {inconsistent_database, bad_decision, Node},
                    mnesia_lib:report_system_event(Msg),
                    OldD#decision{outcome = aborted};
                OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
                NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
                OldD#decision.outcome == committed,
                NewD#decision.outcome == unclear -> OldD#decision{outcome = committed};
                OldD#decision.outcome == unclear,
                NewD#decision.outcome == committed -> OldD#decision{outcome = committed}
            end
    end.
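Stripped of record handling and logging, the inner `if` of merge_decisions is a table over the local and the remote outcome. The following is a minimal sketch of that table as a pure function; the module merge_sketch, the function merge_outcome/2 and the atoms none and bad_decision in the second tuple element are assumptions made for illustration, not mnesia's API.

```erlang
-module(merge_sketch).
-export([merge_outcome/2]).

%% Hypothetical helper, not mnesia code.
%% merge_outcome(Local, Remote) -> {MergedOutcome, Event}
%% Outcomes are committed | aborted | unclear. Event is bad_decision
%% when the merge would report an inconsistent_database event,
%% otherwise none.
merge_outcome(Same, Same)         -> {Same, none};            % same decision on both sides
merge_outcome(committed, aborted) -> {aborted, bad_decision}; % the only reported conflict
merge_outcome(aborted, _)         -> {aborted, none};         % local abort always wins
merge_outcome(_, aborted)         -> {aborted, none};         % local unclear, remote abort
merge_outcome(committed, unclear) -> {committed, none};
merge_outcome(unclear, committed) -> {committed, none}.
```

The clause order matters: the conflict clause has to come before the two generic abort clauses, in the same way the corresponding guard precedes them in merge_decisions.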
2. Node discovery, cluster traversal
mnesia_controller.erl
merge_schema() ->
    AllNodes = mnesia_lib:all_nodes(),
    %% Try to merge the schema; once it is merged, notify all former
    %% cluster nodes so that they transfer data with this node
    case try_merge_schema(AllNodes, [node()], fun default_merge/1) of
        ok ->
            %% After the schema has been merged successfully, the table
            %% data is merged
            schema_is_merged();
        {aborted, {throw, Str}} when is_list(Str) ->
            fatal("Failed to merge schema: ~s~n", [Str]);
        Else ->
            fatal("Failed to merge schema: ~p~n", [Else])
    end.
try_merge_schema(Nodes, Told0, UserFun) ->
    %% Start the cluster traversal by launching a schema merge transaction
    case mnesia_schema:merge_schema(UserFun) of
        {atomic, not_merged} ->
            %% No more nodes that we need to merge the schema with
            %% Ensure we have told everybody that we are running
            case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of
                [] -> ok;
                Tell ->
                    im_running(Tell, [node()]),
                    ok
            end;
        {atomic, {merged, OldFriends, NewFriends}} ->
            %% Check if new nodes has been added to the schema
            Diff = mnesia_lib:all_nodes() -- [node() | Nodes],
            mnesia_recover:connect_nodes(Diff),
            %% Tell everybody to adopt orphan tables
            %% Notify all cluster nodes that this node has started and
            %% request the data merge
            im_running(OldFriends, NewFriends),
            im_running(NewFriends, OldFriends),
            Told = case lists:member(node(), NewFriends) of
                       true -> Told0 ++ OldFriends;
                       false -> Told0 ++ NewFriends
                   end,
            try_merge_schema(Nodes, Told, UserFun);
        {atomic, {"Cannot get cstructs", Node, Reason}} ->
            dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]),
            timer:sleep(300), % Avoid a endless loop look alike
            try_merge_schema(Nodes, Told0, UserFun);
        {aborted, {shutdown, _}} -> %% One of the nodes is going down
            timer:sleep(300), % Avoid a endless loop look alike
            try_merge_schema(Nodes, Told0, UserFun);
        Other ->
            Other
    end.
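try_merge_schema repeats the merge until nothing is left to merge with, which makes the traversal a fixpoint computation. The following is a deliberately simplified model of that loop, assuming each reachable node can report the nodes it considers running; the module discover_sketch, the function discover/3 and the Peers map are hypothetical, and locking, retries and the im_running notifications are left out.

```erlang
-module(discover_sketch).
-export([discover/3]).

%% Hypothetical model, not mnesia code.
%%   Connected :: nodes this node can currently reach
%%   Running   :: nodes already merged into the local view
%%   Peers     :: map of Node => nodes that Node reports as running
%% Returns the sorted view once no connected node remains unmerged.
discover(Connected, Running, Peers) ->
    case Connected -- Running of
        [] ->
            %% Corresponds to the not_merged result: nothing left to do
            lists:usort(Running);
        [Node | _] ->
            %% Merge with one unmerged node and adopt what it knows
            RemoteRunning = maps:get(Node, Peers, []),
            Running1 = lists:usort([Node | Running ++ RemoteRunning]),
            Connected1 = lists:usort(Connected ++ RemoteRunning),
            discover(Connected1, Running1, Peers)
    end.
```

Each round adds at least the chosen node to the running set, so the loop terminates on a finite cluster. For instance, discover([b], [a], #{b => [b, c]}) returns [a, b, c].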
mnesia_schema.erl
merge_schema() ->
    schema_transaction(fun() -> do_merge_schema([]) end).
merge_schema(UserFun) ->
    schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).
As can be seen, merge_schema also runs inside a mnesia metadata (schema) transaction. The main operations of this transaction are:
{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}
{op, merge_schema, CstructList}
This process negotiates the schema with the transaction nodes in the cluster and checks whether the schemas are compatible.
do_merge_schema(LockTabs0) ->
    %% Lock the schema table
    {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write),
    LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0],
    [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs],
    Connected = val(recover_nodes),
    Running = val({current, db_nodes}),
    Store = Ts#tidstore.store,
    %% Verify that all nodes are locked that might not be the
    %% case, if this trans where queued when new nodes where added.
    case Running -- ets:lookup_element(Store, nodes, 2) of
        [] -> ok; %% All known nodes are locked
        Miss -> %% Abort! We don't want the sideeffects below to be executed
            mnesia:abort({bad_commit, {missing_lock, Miss}})
    end,
    %% Connected is the set of nodes this node is connected to, usually the
    %% nodes of the current cluster with a compatible communication protocol.
    %% Running is this node's current db_nodes, usually the nodes of the
    %% current cluster that are consistent with this node.
    case Connected -- Running of
        %% Nodes that are connected but have not yet gone through decision
        %% merging need protocol negotiation followed by decision
        %% negotiation. This is essentially node discovery over the global
        %% topology (a traversal algorithm), initiated by one of the nodes.
        [Node | _] = OtherNodes ->
            %% Time for a schema merging party!
            mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]),
            [mnesia_locker:wlock_no_exist(
               Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes))
             || {T,Ns} <- LockTabs],
            %% Fetch from the remote node Node the cstructs of the tables it
            %% holds, together with its db_nodes RemoteRunning1
            case fetch_cstructs(Node) of
                {cstructs, Cstructs, RemoteRunning1} ->
            ignore; %% from do_merge_schema
        true ->
            %% If a node has restarted it may still linger in db_nodes,
            %% but have been removed from recover_nodes
            Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]),
            NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current,
            mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
            announce_im_running(NewNodes, SchemaCs)
    end,
    {false, optional};
This shows that during the prepare phase of announce_im_running the node negotiates with remote nodes that are not yet connected; once the negotiation succeeds, those nodes join this node's set of transaction nodes.
Conversely, if the schema operation aborts, mnesia_tm performs the undo action:
mnesia_tm.erl
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
    ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
    case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
        {Modified, C = #commit{}, DumperMode} ->
            %% If we can not find any local unclear decision
            %% we should presume abort at startup recovery
            case lists:member(node(), DiscNs) of
                false ->
                    ignore;
                true ->
                    case Modified of
                        false -> mnesia_log:log(Bin);
                        true -> mnesia_log:log(C)
                    end
            end,
            ?eval_debug_fun({?MODULE, commit_participant, vote_yes},
                            [{tid, Tid}]),
            reply(Coord, {vote_yes, Tid, self()}),
            receive
                {Tid, pre_commit} ->
                    …
                    receive
                        {Tid, committed} ->
                            …
                        {Tid, {do_abort, _Reason}} ->