Table of Contents

1.   Symptoms and Causes
2.   How mnesia Works
3.   Common Questions and Caveats
4.   Source Code Analysis
     1.   How mnesia:create_schema/1 works
          1.   Overall flow
          2.   First half: the work done by mnesia:create_schema/1
          3.   Second half: the work done by mnesia:start/0
     4.   How mnesia:change_table_majority/2 works
          1.   Calling interface
          2.   Transaction operations
          3.   Schema transaction commit interface
          4.   Schema transaction protocol
          5.   Remote node transaction manager: first-phase prepare response
          6.   Remote node transaction participant: second-phase precommit response
          7.   Requesting node (transaction initiator): receiving the second-phase precommit acknowledgements
          8.   Remote node transaction participant: third-phase commit response
          9.   Local commit during the third-phase commit
     5.   Majority transaction handling
     6.   Recovery
          1.   Protocol version check between nodes + decision broadcast and merge
          2.   Node discovery and cluster traversal
          3.   Schema merge between nodes
          4.   Node data merge, part 1: loading tables from remote nodes
          5.   Node data merge, part 2: loading tables from local disk
          6.   Node data merge, part 3: table loading complete
     7.   Partition detection
          1.   Synchronous detection during the lock phase
          2.   Synchronous detection during transactions
          3.   Asynchronous detection of node down
          4.   Asynchronous detection of node up
     8.   Miscellaneous




The code analyzed is from Erlang/OTP R15B03.




1. Symptoms and Causes

Symptom: after a network partition, different data is written to each partition, so the partitions end up in inconsistent states, and mnesia remains inconsistent even after the partition heals. If any one partition is restarted, the restarted partition pulls its data from the surviving partition and its own previous data is lost.

Cause: distributed systems are constrained by the CAP theorem (a system that tolerates network partitions cannot provide both availability and consistency at the same time). To preserve availability, some distributed stores give up strong consistency in favor of eventual consistency. mnesia is likewise an eventually consistent distributed database: while there is no partition it is strongly consistent, but during a partition it still accepts writes, so the partitions diverge. After the partition is gone it is up to the application to resolve the inconsistency. A simple recovery is to restart the partition that is to be discarded, so that it pulls the data again from the partition that is kept; a more elaborate recovery requires writing a data-reconciliation program and applying it.
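A minimal sketch of the simple recovery path described above, run on a node of the partition being discarded (the function name, table list and timeout are placeholders, not part of mnesia):

```erlang
%% Sketch: restart mnesia on this node so that its tables are reloaded
%% from the surviving partition; the local divergent data is lost.
recover_discarded_partition(Tabs) ->
    stopped = mnesia:stop(),
    ok = mnesia:start(),
    %% Block until the tables have been loaded again from remote replicas.
    case mnesia:wait_for_tables(Tabs, 60000) of
        ok    -> ok;
        Other -> {error, Other}
    end.
```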




2. How mnesia Works

State diagram of mnesia's operation; the transaction flow shown uses majority transactions, i.e. writes are allowed only when a majority of the nodes are present in the cluster:




How mnesia operates, explained:
1. Transactions: run-time transactions provide strong consistency while there is no partition.
   mnesia supports several transaction types (a minimal sketch of the corresponding call forms
   follows this list):

  a)      lock-free, transaction-free dirty writes: one asynchronous phase;

  b)      locked asynchronous transactions: one synchronous lock phase, plus a transaction with
          one synchronous and one asynchronous phase;

  c)      locked synchronous transactions: one synchronous lock phase, plus a two-phase
          synchronous transaction;

  d)      locked majority transactions: one synchronous lock phase, plus a transaction with two
          synchronous phases and one asynchronous phase;

  e)      locked schema transactions: one synchronous lock phase, plus a three-phase synchronous
          transaction; in effect a majority transaction that carries schema operations;

2. Recovery: recovery at restart provides eventual consistency once a partition has occurred.
   When mnesia restarts it performs the following distributed negotiation:

  a)      node discovery;

  b)      protocol version negotiation between nodes;

  c)      schema merge between nodes;

  d)      merge of transaction decisions between nodes;

        i.    if the remote decision is abort and the local decision is commit, there is a
              conflict: {inconsistent_database, bad_decision, Node} is reported and the local
              decision is changed to abort;

       ii.    if the remote decision is commit and the local decision is abort, the local
              decision stays abort; the remote node will amend its decision and report the
              conflict;

      iii.    if the remote decision is unclear and the local decision is not, the local
              decision wins and the remote node amends its own;

       iv.    if both the remote and the local decisions are unclear, wait until some node that
              knows the outcome has started, and adopt its result;

        v.    if every node's decision is unclear, the outcome remains unclear;

       vi.    transaction decisions do not actually affect the stored data;

  e)      merge of table data between nodes:

        i.    if the local node is a master node, it loads the table data from disk;

       ii.    if the local node has local_content tables, it loads those tables from disk;

      iii.    if a remote replica node is alive, the table data is pulled from that remote node;

       iv.    if no remote replica is alive and the local node was the last one to shut down,
              it loads the table data from disk;

        v.    if no remote replica is alive and the local node was not the last one to shut
              down, it waits until some remote node has started and loaded the table, and then
              pulls the data from it; until a remote node has loaded the table, the table is
              not accessible;

       vi.    once a table has been loaded, it is not pulled from a remote node again;

      vii.    from the cluster's point of view:

              1.   if another node restarts and initiates a new distributed negotiation, the
                   local node adds it to its view of the cluster topology;

              2.   if a node in the cluster goes down (it shut down or a partition occurred),
                   the local node removes it from its view of the cluster topology;

              3.   when a partition heals, no distributed negotiation takes place: nodes of the
                   other partition are not added back to the topology view, and the partitions
                   remain as they were;

3. Inconsistency detection: by monitoring, at run time and at restart, the up/down history of
   remote nodes and the decisions they took for transactions, mnesia detects whether a network
   partition has ever occurred. If it has, there is a potential partition inconsistency, and an
   inconsistent_database event is reported to the application:

  a)      at run time the up/down history of remote nodes is monitored; if both sides have seen
          the other go down, then when the remote node comes up again
          {inconsistent_database, running_partitioned_network, Node} is reported;

  b)      at restart the up/down history of remote nodes is monitored; if both sides have seen
          the other go down, then when the remote node comes up again
          {inconsistent_database, starting_partitioned_network, Node} is reported;

  c)      at run time and at restart transaction decisions are exchanged with remote nodes; if
          the remote side aborted a transaction that the local side committed,
          {inconsistent_database, bad_decision, Node} is reported;
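As referenced in item 1 above, a minimal sketch of the call forms behind these transaction types (table name `t` and the record are placeholders; the majority and schema variants go through the same mnesia:transaction/1 interface, the commit protocol is chosen internally):

```erlang
%% Dirty write: no lock, no transaction, one asynchronous phase.
dirty_demo() -> mnesia:dirty_write({t, key, value}).

%% Locked asynchronous (default) transaction.
async_demo() -> mnesia:transaction(fun() -> mnesia:write({t, key, value}) end).

%% Locked synchronous transaction: returns only after all active replicas
%% have committed.
sync_demo()  -> mnesia:sync_transaction(fun() -> mnesia:write({t, key, value}) end).

%% A table created with {majority, true} (or a schema operation) is still
%% written through mnesia:transaction/1; mnesia_tm internally switches the
%% commit protocol to asym_trans, as analyzed in the source sections below.
```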




3. Common Questions and Caveats

Wherever the questions below touch on transactions, only majority transactions are discussed,
since they are somewhat more complete than the synchronous and asynchronous variants; schema
operations are not covered here.

fail_safe state: the state, after a network partition, in which the minority partition cannot
accept writes.

Common questions:

1. After a partition the local node ends up in the minority partition. After the partition
   heals, if the local node is not restarted, does it stay in the fail_safe state forever?

   If other nodes keep starting up, negotiating with this node and joining its cluster until
   the cluster becomes a majority, the cluster becomes writable again; if no other node ever
   starts up, this node stays in the fail_safe state.

2. After a brief network interruption, a write is made in the majority partition and does not
   reach the minority partition. After the partition heals, a write is attempted in the
   minority partition. How does the minority end up in the fail_safe state?

   mnesia relies on the Erlang VM to detect nodes going down. When the majority writes, its
   Erlang VM detects the minority nodes as down, and the minority's Erlang VM likewise sees the
   majority nodes as down. Since both sides have seen the other go down, the majority stays
   writable while the minority enters the fail_safe state.

3. For a cluster A, B, C, a partition separates A from B and C; data is written on B and C,
   then the partition heals. What happens if A is restarted? What happens if B and C are
   restarted?

   Experiments show:

   a)   if A is restarted, the records written on B and C are correctly visible on A; this
        follows from the negotiation during A's startup, in which A requests the table data
        from B and C;

   b)   if B and C are restarted, the records written earlier are no longer visible on B and C;
        this follows from the negotiation during B's and C's startup, in which B and C request
        the table data from A;

Caveats:

1. Under a partition mnesia is eventually consistent, not strongly consistent. To keep strong
   consistency a master node can be designated to arbitrate the final data, but that introduces
   a single point of failure;

2. By the time you subscribe to the system events mnesia produces (including the
   inconsistent_database event), mnesia is already running, so some events may already have
   been emitted and can be missed (a minimal subscription sketch follows this list);

3. Subscriptions to mnesia events are not persistent; after mnesia restarts you must subscribe
   again;

4. A majority transaction has two synchronous phases and one asynchronous phase; during the
   commit each participating node also has to spawn a process to handle the transaction, on top
   of the usual ets store and one synchronous lock round, which may further reduce performance;

5. Majority transactions place no constraint on the recovery process, and recovery prefers the
   tables of live remote nodes as the basis for restoring the local tables;

6. mnesia's detection and reporting of inconsistent_database uses a rather strong condition and
   may produce false positives;

7. When the end of a split-brain is detected, it is best to raise an alert and let a human
   resolve it rather than resolving it automatically;
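As referenced in caveat 2, a minimal sketch of subscribing to mnesia system events and alerting on partition reports (the function names are placeholders; mnesia must already be running, and the subscription must be renewed after every mnesia restart, per caveat 3):

```erlang
%% Sketch: subscribe to system events and alert on partition reports.
watch_partitions() ->
    {ok, _Node} = mnesia:subscribe(system),
    watch_loop().

watch_loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network,
            %% starting_partitioned_network or bad_decision.
            error_logger:error_msg("mnesia partition suspected: ~p from ~p~n",
                                   [Context, Node]),
            watch_loop();
        {mnesia_system_event, _Other} ->
            watch_loop()
    end.
```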
4. Source Code Analysis

Topics covered:

1. A disc replica of a mnesia table requires the schema to have a disc replica as well, so we
   need to look at how mnesia:create_schema/1 works;

2. Because majority transactions are used for the explanation, we must look at how
   mnesia:change_table_majority/2 works; this is itself a schema transaction, which gives a
   more detailed and complete view of majority transactions;

3. Majority transaction handling is then explained specifically, as a weakened form of the
   schema transaction model;

4. The recovery part analyses the main work done at mnesia startup, the distributed
   negotiation, and the loading of disc tables;

5. The partition-detection part analyses how mnesia detects the various inconsistent_database
   events;




1. How mnesia:create_schema/1 works

1. Overall flow

Installing the schema must be done while mnesia is stopped; afterwards mnesia is started.

Adding the schema is essentially a two-phase commit. On the node initiating the schema change:

1. Ask every participating node whether it already has a schema copy

2. Take the global lock {mnesia_table_lock, schema}

3. Spawn a mnesia_fallback process on each participating node

4. Phase one: broadcast a {start, Header, Schema2} message to each node's mnesia_fallback
   process, telling it to save a backup of the newly generated schema file

5. Phase two: broadcast a swap message to each node's mnesia_fallback process, telling it to
   finish the commit and create the real "FALLBACK.BUP" file

6. Finally broadcast a stop message to each node's mnesia_fallback process, completing the
   change
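For reference, a sketch of the typical usage pattern that triggers this flow (node and table names are placeholders; create_schema/1 must be called while mnesia is stopped on all listed nodes):

```erlang
%% Sketch: create a disc schema on two nodes and start mnesia everywhere.
setup_cluster() ->
    Nodes = ['a@host1', 'b@host2'],                 % placeholder node names
    ok = mnesia:create_schema(Nodes),               % two-phase FALLBACK.BUP install
    {_Replies, []} = rpc:multicall(Nodes, mnesia, start, []),
    {atomic, ok} = mnesia:create_table(t, [{disc_copies, Nodes},
                                           {majority, true}]).
```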



2. First half: the work done by mnesia:create_schema/1


mnesia.erl
create_schema(Ns) ->
     mnesia_bup:create_schema(Ns).

mnesia_bup.erl
create_schema([]) ->
     create_schema([node()]);
create_schema(Ns) when is_list(Ns) ->
     case is_set(Ns) of
          true ->
                create_schema(Ns, mnesia_schema:ensure_no_schema(Ns));
          false ->
                {error, {combine_error, Ns}}
     end;
create_schema(Ns) ->
     {error, {badarg, Ns}}.

mnesia_schema.erl
ensure_no_schema([H|T]) when is_atom(H) ->
    case rpc:call(H, ?MODULE, remote_read_schema, []) of
          {badrpc, Reason} ->
               {H, {"All nodes not running", H, Reason}};
          {ok,Source, _} when Source /= default ->
               {H, {already_exists, H}};
          _ ->
               ensure_no_schema(T)
    end;
ensure_no_schema([H|_]) ->
    {error,{badarg, H}};
ensure_no_schema([]) ->
    ok.
remote_read_schema() ->
    case mnesia_lib:ensure_loaded(?APPLICATION) of
    ok ->
case mnesia_monitor:get_env(schema_location) of
         opt_disc ->
              read_schema(false);
         _ ->
              read_schema(false)
         end;
    {error, Reason} ->
         {error, Reason}
    end.

All the other nodes are queried to check that they are running and whether they already have a
mnesia schema. The check only passes when every node that is about to get a mnesia schema is
running and none of them has a schema copy yet.



Back in mnesia_bup.erl:

mnesia_bup.erl
create_schema(Ns, ok) ->
     case mnesia_lib:ensure_loaded(?APPLICATION) of
          ok ->
               case mnesia_monitor:get_env(schema_location) of
                    ram ->
                         {error, {has_no_disc, node()}};
                    _ ->
                         case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of
                              {error, What} ->
                                    {error, What};
                              ok ->
                                    Mod = mnesia_backup,
                                    Str = mk_str(),
                                    File = mnesia_lib:dir(Str),
                                    file:delete(File),
                                    case catch make_initial_backup(Ns, File, Mod) of
                                          {ok, _Res} ->
                                               case do_install_fallback(File, Mod) of
                                                     ok ->
                                                          file:delete(File),
                                                          ok;
                                                     {error, Reason} ->
                                                          {error, Reason}
                                               end;
                                          {error, Reason} ->
                                               {error, Reason}
                                    end
end
              end;
         {error, Reason} ->
              {error, Reason}
     end;
create_schema(_Ns, {error, Reason}) ->
     {error, Reason};
create_schema(_Ns, Reason) ->
     {error, Reason}.

mnesia_bup:make_initial_backup creates a description file for the new schema on the local node,
and mnesia_bup:do_install_fallback then pushes that description file through the restore
machinery to change the schema:


make_initial_backup(Ns, Opaque, Mod) ->
    Orig = mnesia_schema:get_initial_schema(disc_copies, Ns),
    Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)),
    Schema = [{schema, schema, Modded}],
    O2 = do_apply(Mod, open_write, [Opaque], Opaque),
    O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2),
    O4 = do_apply(Mod, write, [O3, Schema], O3),
    O5 = do_apply(Mod, commit_write, [O4], O4),
   {ok, O5}.

This creates the new-schema description file on the local node; note that the new schema's
majority property is not included in the backup.

mnesia_schema.erl
get_initial_schema(SchemaStorage, Nodes) ->
     Cs = #cstruct{name = schema,
              record_name = schema,
              attributes = [table, cstruct]},
     Cs2 =
     case SchemaStorage of
           ram_copies -> Cs#cstruct{ram_copies = Nodes};
           disc_copies -> Cs#cstruct{disc_copies = Nodes}
     end,
     cs2list(Cs2).

mnesia_bup.erl
do_install_fallback(Opaque, Mod) when is_atom(Mod) ->
    do_install_fallback(Opaque, [{module, Mod}]);
do_install_fallback(Opaque, Args) when is_list(Args) ->
    case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
{ok, FA} ->
               do_install_fallback(FA);
         {error, Reason} ->
               {error, Reason}
    end;
do_install_fallback(_Opaque, Args) ->
    {error, {badarg, Args}}.

The installation arguments are checked and packed into a fallback_args record; the checking and
construction happen in check_fallback_arg_type/2, after which the installation proceeds:


check_fallback_args([Arg | Tail], FA) ->
    case catch check_fallback_arg_type(Arg, FA) of
          {'EXIT', _Reason} ->
               {error, {badarg, Arg}};
          FA2 ->
               check_fallback_args(Tail, FA2)
    end;
check_fallback_args([], FA) ->
    {ok, FA}.
check_fallback_arg_type(Arg, FA) ->
    case Arg of
          {scope, global} ->
               FA#fallback_args{scope = global};
          {scope, local} ->
               FA#fallback_args{scope = local};
          {module, Mod} ->
               Mod2 = mnesia_monitor:do_check_type(backup_module, Mod),
               FA#fallback_args{module = Mod2};
          {mnesia_dir, Dir} ->
               FA#fallback_args{mnesia_dir = Dir,
                                      use_default_dir = false};
          {keep_tables, Tabs} ->
               atom_list(Tabs),
               FA#fallback_args{keep_tables = Tabs};
          {skip_tables, Tabs} ->
               atom_list(Tabs),
               FA#fallback_args{skip_tables = Tabs};
          {default_op, keep_tables} ->
               FA#fallback_args{default_op = keep_tables};
          {default_op, skip_tables} ->
               FA#fallback_args{default_op = skip_tables}
    end.
The construction here records the module argument, which is mnesia_backup, and the opaque
argument, which is the file name of the newly created schema file.


do_install_fallback(FA) ->
     Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]),
     Res =
            receive
                 {'EXIT', Pid, Reason} -> % if appl has trapped exit
                       {error, {'EXIT', Reason}};
                 {Pid, Res2} ->
                       case Res2 of
                             {ok, _} ->
                                  ok;
                             {error, Reason} ->
                                  {error, {"Cannot install fallback", Reason}}
                       end
            end,
     Res.
install_fallback_master(ClientPid, FA) ->
     process_flag(trap_exit, true),
     State = {start, FA},
     Opaque = FA#fallback_args.opaque,
     Mod = FA#fallback_args.module,
     Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)),
     unlink(ClientPid),
     ClientPid ! {self(), Res},
     exit(shutdown).

The newly created schema file is iterated over and restored to the local node and the whole
cluster. Here Mod is mnesia_backup, Opaque is the new schema file's name, and State carries the
given fallback_args record, all fields at their default values.



Default definition of fallback_args:

-record(fallback_args, {opaque,
                             scope = global,
                             module = mnesia_monitor:get_env(backup_module),
                             use_default_dir = true,
                             mnesia_dir,
                             fallback_bup,
                             fallback_tmp,
                             skip_tables = [],
keep_tables = [],
                               default_op = keep_tables
                             }).

iterate(Mod, Fun, Opaque, Acc) ->
      R = #restore{bup_module = Mod, bup_data = Opaque},
      case catch read_schema_section(R) of
            {error, Reason} ->
                  {error, Reason};
            {R2, {Header, Schema, Rest}} ->
                  case catch iter(R2, Header, Schema, Fun, Acc, Rest) of
                        {ok, R3, Res} ->
                             catch safe_apply(R3, close_read, [R3#restore.bup_data]),
                             {ok, Res};
                        {error, Reason} ->
                             catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                             {error, Reason};
                        {'EXIT', Pid, Reason} ->
                             catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                             {error, {'EXIT', Pid, Reason}};
                        {'EXIT', Reason} ->
                             catch safe_apply(R2, close_read, [R2#restore.bup_data]),
                             {error, {'EXIT', Reason}}
                  end
     end.
iter(R, Header, Schema, Fun, Acc, []) ->
      case safe_apply(R, read, [R#restore.bup_data]) of
            {R2, []} ->
                  Res = Fun([], Header, Schema, Acc),
                  {ok, R2, Res};
            {R2, BupItems} ->
                  iter(R2, Header, Schema, Fun, Acc, BupItems)
      end;
iter(R, Header, Schema, Fun, Acc, BupItems) ->
      Acc2 = Fun(BupItems, Header, Schema, Acc),
      iter(R, Header, Schema, Fun, Acc2, []).

read_schema_section reads the contents of the new schema file, obtains the file header,
assembles the schema structure, and passes the schema to a callback function; the callback here
is mnesia_bup's restore_recs/4:

restore_recs(Recs, Header, Schema, {start, FA}) ->
     %% No records in backup
     Schema2 = convert_schema(Header#log_header.log_version, Schema),
     CreateList = lookup_schema(schema, Schema2),
case catch mnesia_schema:list2cs(CreateList) of
         {'EXIT', Reason} ->
               throw({error, {"Bad schema in restore_recs", Reason}});
         Cs ->
               Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies),
               global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity),
               Args = [self(), FA],
               Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns],
               send_fallback(Pids, {start, Header, Schema2}),
               Res = restore_recs(Recs, Header, Schema2, Pids),
               global:del_lock({{mnesia_table_lock, schema}, self()}, Ns),
               Res
    end;

A typical schema structure looks like this:

[{schema,schema,
          [{name,schema},
            {type,set},
            {ram_copies,[]},
            {disc_copies,['rds_la_dev@10.232.64.77']},
            {disc_only_copies,[]},
            {load_order,0},
            {access_mode,read_write},
            {index,[]},
            {snmp,[]},
            {local_content,false},
            {record_name,schema},
            {attributes,[table,cstruct]},
            {user_properties,[]},
            {frag_properties,[]},
            {cookie,{{1358,676768,107058},'rds_la_dev@10.232.64.77'}},
            {version,{{2,0},[]}}]}]

It forms a {schema, schema, CreateList} tuple, and mnesia_schema:list2cs(CreateList) is called
to turn CreateList back into the schema's cstruct record.

mnesia_bup.erl
restore_recs(Recs, Header, Schema, {start, FA}) ->
     %% No records in backup
     Schema2 = convert_schema(Header#log_header.log_version, Schema),
     CreateList = lookup_schema(schema, Schema2),
     case catch mnesia_schema:list2cs(CreateList) of
          {'EXIT', Reason} ->
               throw({error, {"Bad schema in restore_recs", Reason}});
Cs ->
                    Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies),
                    global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity),
                    Args = [self(), FA],
                    Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns],
                    send_fallback(Pids, {start, Header, Schema2}),
                    Res = restore_recs(Recs, Header, Schema2, Pids),
                    global:del_lock({{mnesia_table_lock, schema}, self()}, Ns),
                    Res
     end;

get_fallback_nodes obtains the fallback nodes among the nodes participating in the schema
creation, normally all of them.

The creation takes the cluster-wide global lock {mnesia_table_lock, schema}.

On each participating node a fallback_receiver process is spawned to handle the schema change.

A {start, Header, Schema2} message is broadcast to the fallback_receiver processes on these
nodes and their replies are awaited.

Once every node's fallback_receiver has answered the start message, the next step begins:

restore_recs([], _Header, _Schema, Pids) ->
     send_fallback(Pids, swap),
     send_fallback(Pids, stop),
     stop;

restore_recs then broadcasts the subsequent swap and stop messages to every node's
fallback_receiver process, completing the whole schema change, and finally releases the global
lock {mnesia_table_lock, schema}.



Inside the fallback_receiver process:

fallback_receiver(Master, FA) ->
      process_flag(trap_exit, true),

     case catch register(mnesia_fallback, self()) of
          {'EXIT', _} ->
               Reason = {already_exists, node()},
               local_fallback_error(Master, Reason);
          true ->
               FA2 = check_fallback_dir(Master, FA),
               Bup = FA2#fallback_args.fallback_bup,
               case mnesia_lib:exists(Bup) of
true ->
                          Reason2 = {already_exists, node()},
                          local_fallback_error(Master, Reason2);
                    false ->
                          Mod = mnesia_backup,
                          Tmp = FA2#fallback_args.fallback_tmp,
                          R = #restore{mode = replace,
                                            bup_module = Mod,
                                            bup_data = Tmp},
                          file:delete(Tmp),
                          case catch fallback_receiver_loop(Master, R, FA2, schema) of
                                {error, Reason} ->
                                     local_fallback_error(Master, Reason);
                                Other ->
                                     exit(Other)
                          end
              end
    end.

It registers itself on its own node under the name mnesia_fallback, builds the initial state,
and enters fallback_receiver_loop to handle messages from the node that initiated the schema
change.

fallback_receiver_loop(Master, R, FA, State) ->
      receive
           {Master, {start, Header, Schema}} when State =:= schema ->
               Dir = FA#fallback_args.mnesia_dir,
               throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)),
               R2 = safe_apply(R, open_write, [R#restore.bup_data]),
               R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]),
               BupSchema = [schema2bup(S) || S <- Schema],
               R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]),
               Master ! {self(), ok},
               fallback_receiver_loop(Master, R4, FA, records);
           …
      end.

A temporary schema file is also created locally, and the header and new schema built on the
initiating node are received into it.

fallback_receiver_loop(Master, R, FA, State) ->
      receive
           …
           {Master, swap} when State =/= schema ->
               ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
safe_apply(R, commit_write, [R#restore.bup_data]),
               Bup = FA#fallback_args.fallback_bup,
               Tmp = FA#fallback_args.fallback_tmp,
               throw_bad_res(ok, file:rename(Tmp, Bup)),
               catch mnesia_lib:set(active_fallback, true),
               ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []),
               Master ! {self(), ok},
               fallback_receiver_loop(Master, R, FA, stop);
           …
    end.

mnesia_backup.erl
commit_write(OpaqueData) ->
   B = OpaqueData,
   case disk_log:sync(B#backup.file_desc) of
        ok ->
             case disk_log:close(B#backup.file_desc) of
                  ok ->
             case file:rename(B#backup.tmp_file, B#backup.file) of
                 ok ->
                  {ok, B#backup.file};
                 {error, Reason} ->
                  {error, Reason}
             end;
                  {error, Reason} ->
             {error, Reason}
             end;
        {error, Reason} ->
             {error, Reason}
   end.

Committing the change: while the new schema file is being written on this node, its name
carries a trailing ".BUPTMP" to mark it as a temporary, uncommitted file. At commit time the
new schema file is synced to disk, closed, and renamed to the real new schema file name,
dropping the trailing ".BUPTMP".


fallback_receiver_loop(Master, R, FA, State) ->
      receive
           …
           {Master, swap} when State =/= schema ->
               ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
               safe_apply(R, commit_write, [R#restore.bup_data]),
               Bup = FA#fallback_args.fallback_bup,
Tmp = FA#fallback_args.fallback_tmp,
               throw_bad_res(ok, file:rename(Tmp, Bup)),
               catch mnesia_lib:set(active_fallback, true),
               ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []),
               Master ! {self(), ok},
               fallback_receiver_loop(Master, R, FA, stop);
          …
end.

On this participating node the new schema file is renamed to "FALLBACK.BUP", and the local
node's active_fallback flag is set, marking it as an active fallback node.


fallback_receiver_loop(Master, R, FA, State) ->
      receive
           …
           {Master, stop} when State =:= stop ->
               stopped;
           …
      end.

After receiving the stop message, the mnesia_fallback process exits.




3. Second half: the work done by mnesia:start/0


When mnesia starts, the transaction manager mnesia_tm automatically calls
mnesia_bup:tm_fallback_start(IgnoreFallback) to build the schema into a dets table:

mnesia_bup.erl
tm_fallback_start(IgnoreFallback) ->
    mnesia_schema:lock_schema(),
    Res = do_fallback_start(fallback_exists(), IgnoreFallback),
    mnesia_schema: unlock_schema(),
    case Res of
         ok -> ok;
         {error, Reason} -> exit(Reason)
    end.

The schema table is locked, the schema is restored/created from the "FALLBACK.BUP" file, and
the schema table lock is finally released.
do_fallback_start(true, false) ->
    verbose("Starting from fallback...~n", []),

    BupFile = fallback_bup(),
    Mod = mnesia_backup,
    LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
    case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
         …
    end.

根据"FALLBACK.BUP"文件,调用 restore_tables 函数进行恢复

restore_tables(Recs, Header, Schema, {start, LocalTabs}) ->
     Dir = mnesia_lib:dir(),
     OldDir = filename:join([Dir, "OLD_DIR"]),
     mnesia_schema:purge_dir(OldDir, []),
     mnesia_schema:purge_dir(Dir, [fallback_name()]),
     init_dat_files(Schema, LocalTabs),
     State = {new, LocalTabs},
     restore_tables(Recs, Header, Schema, State);

init_dat_files(Schema, LocalTabs) ->
     TmpFile = mnesia_lib:tab2tmp(schema),
     Args = [{file, TmpFile}, {keypos, 2}, {type, set}],
     case dets:open_file(schema, Args) of % Assume schema lock
          {ok, _} ->
                 create_dat_files(Schema, LocalTabs),
                 ok = dets:close(schema),
                 LocalTab = #local_tab{
                                              name              = schema,
                                              storage_type = disc_copies,
                                              open             = undefined,
                                              add              = undefined,
                                              close         = undefined,
                                              swap             = undefined,
                                              record_name = schema,
                                              opened = false},
                 ?ets_insert(LocalTabs, LocalTab);
          {error, Reason} ->
                 throw({error, {"Cannot open file", schema, Args, Reason}})
     end.

The schema dets table is created under the file name schema.TMP, and the metadata of each table
found in "FALLBACK.BUP" is restored into this new schema dets table.

create_dat_files builds the Open/Add/Close/Swap functions for the other tables' metadata on
this node; these are then invoked to persist the other tables' metadata into the schema table.

restore_tables(Recs, Header, Schema, {start, LocalTabs}) ->
     Dir = mnesia_lib:dir(),
     OldDir = filename:join([Dir, "OLD_DIR"]),
     mnesia_schema:purge_dir(OldDir, []),
     mnesia_schema:purge_dir(Dir, [fallback_name()]),
     init_dat_files(Schema, LocalTabs),
     State = {new, LocalTabs},
     restore_tables(Recs, Header, Schema, State);

Building the Open/Add/Close/Swap functions for the other tables' local metadata:

restore_tables(All=[Rec | Recs], Header, Schema, {new, LocalTabs}) ->
     Tab = element(1, Rec),
     case ?ets_lookup(LocalTabs, Tab) of
          [] ->
                State = {not_local, LocalTabs, Tab},
                restore_tables(Recs, Header, Schema, State);
          [LT] when is_record(LT, local_tab) ->
          State = {local, LocalTabs, LT},
          case LT#local_tab.opened of
          true -> ignore;
          false ->
                (LT#local_tab.open)(Tab, LT),
                ?ets_insert(LocalTabs,LT#local_tab{opened=true})
          end,
                restore_tables(All, Header, Schema, State)
     end;

The table is opened; each record is checked to see whether its table is local, and if so the
restore/add step is performed:

restore_tables(All=[Rec | Recs], Header, Schema, State={local, LocalTabs, LT}) ->
     Tab = element(1, Rec),
     if
          Tab =:= LT#local_tab.name ->
               Key = element(2, Rec),
               (LT#local_tab.add)(Tab, Key, Rec, LT),
               restore_tables(Recs, Header, Schema, State);
          true ->
               NewState = {new, LocalTabs},
               restore_tables(All, Header, Schema, NewState)
     end;

The Add function mainly writes the table into the schema table; at this point it writes into
the temporary schema without a real commit. Once all tables have been restored, the real commit
takes place:

do_fallback_start(true, false) ->
    verbose("Starting from fallback...~n", []),

     BupFile = fallback_bup(),
     Mod = mnesia_backup,
     LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]),
     case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of
          {ok, _Res} ->
               catch dets:close(schema),
               TmpSchema = mnesia_lib:tab2tmp(schema),
               DatSchema = mnesia_lib:tab2dat(schema),
               AllLT = ?ets_match_object(LocalTabs, '_'),
               ?ets_delete_table(LocalTabs),
               case file:rename(TmpSchema, DatSchema) of
                     ok ->
                          [(LT#local_tab.swap)(LT#local_tab.name, LT) ||
                               LT <- AllLT, LT#local_tab.name =/= schema],
                          file:delete(BupFile),
                          ok;
                     {error, Reason} ->
                          file:delete(TmpSchema),
                          {error, {"Cannot start from fallback. Rename error.", Reason}}
               end;
          {error, Reason} ->
               {error, {"Cannot start from fallback", Reason}};
          {'EXIT', Reason} ->
               {error, {"Cannot start from fallback", Reason}}
     end.

schema.TMP is renamed to schema.DAT, which puts the persistent schema into use and commits the
schema table change.

At the same time the swap functions created in create_dat_files are called to commit each
table: for ram_copies tables nothing extra happens; for disc_only_copies tables this mainly
commits the file name of the corresponding dets table; for disc_copies tables it mainly records
the redo log and then commits the file name of the corresponding dets table.

When all of this is done the schema table is a persistent dets table and the "FALLBACK.BUP"
file is deleted.



After the transaction manager has finished building the schema dets table, it initializes
mnesia_schema:

mnesia_schema.erl
init(IgnoreFallback) ->
      Res = read_schema(true, IgnoreFallback),
      {ok, Source, _CreateList} = exit_on_error(Res),
      verbose("Schema initiated from: ~p~n", [Source]),
      set({schema, tables}, []),
      set({schema, local_tables}, []),
      Tabs = set_schema(?ets_first(schema)),
      lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs),
      set({schema, where_to_read}, node()),
      set({schema, load_node}, node()),
      set({schema, load_reason}, initial),
      mnesia_controller:add_active_replica(schema, node()).

This checks where the schema table was restored from, initializes the schema's basic
information in the global-state ets table mnesia_gvar, and registers the local node as the
initial active replica of the schema table.

If a node is an active replica of a table, the table's where_to_commit and where_to_write
attributes must both include that node.




4. How mnesia:change_table_majority/2 works


A mnesia table can be given a majority property when it is created, or the property can be
changed later with mnesia:change_table_majority/2.

With this property set, mnesia checks at transaction time whether the nodes taking part in the
transaction form a majority of the table's commit nodes. During a network partition this keeps
the majority side available while preserving consistency across the whole network; the minority
side becomes unavailable. This is one possible trade-off under the CAP theorem.
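A minimal usage sketch in the Erlang shell (table name t and the replica list are placeholders):

```erlang
%% Set majority at creation time, or flip it later; the latter runs as the
%% schema transaction analyzed below.
1> mnesia:create_table(t, [{disc_copies, [node() | nodes()]}, {majority, true}]).
{atomic,ok}
2> mnesia:change_table_majority(t, false).
{atomic,ok}
```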




1. Calling interface


mnesia.erl
change_table_majority(T, M) ->
    mnesia_schema:change_table_majority(T, M).
mnesia_schema.erl
change_table_majority(Tab, Majority) when is_boolean(Majority) ->
    schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end).
schema_transaction(Fun) ->
    case get(mnesia_activity_state) of
    undefined ->
         Args = [self(), Fun, whereis(mnesia_controller)],
         Pid = spawn_link(?MODULE, schema_coordinator, Args),
         receive
         {transaction_done, Res, Pid} -> Res;
         {'EXIT', Pid, R} -> {aborted, {transaction_crashed, R}}
         end;
    _ ->
              {aborted, nested_transaction}
    end.

A schema_coordinator process is spawned as the coordinator of the schema transaction.

schema_coordinator(Client, Fun, Controller) when is_pid(Controller) ->
    link(Controller),
    unlink(Client),
    Res = mnesia:transaction(Fun),
    Client ! {transaction_done, Res, self()},
    unlink(Controller),           % Avoids spurious exit message
    unlink(whereis(mnesia_tm)), % Avoids spurious exit message
    exit(normal).

Unlike an ordinary transaction, the schema_coordinator process used by a schema transaction
links not to the caller's process but to the mnesia_controller process.

It then starts an ordinary mnesia transaction whose fun is
fun() -> do_change_table_majority(Tab, Majority) end.




2. Transaction operations


do_change_table_majority(schema, _Majority) ->
    mnesia:abort({bad_type, schema});
do_change_table_majority(Tab, Majority) ->
    TidTs = get_tid_ts_and_lock(schema, write),
    get_tid_ts_and_lock(Tab, none),
    insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
As can be seen, the majority property of the schema table itself cannot be changed.

A write lock is explicitly requested on the schema table, while no lock is requested on the
table whose majority property is being changed:

get_tid_ts_and_lock(Tab, Intent) ->
     TidTs = get(mnesia_activity_state),
     case TidTs of
     {_Mod, Tid, Ts} when is_record(Ts, tidstore)->
          Store = Ts#tidstore.store,
          case Intent of
          read -> mnesia_locker:rlock_table(Tid, Store, Tab);
          write -> mnesia_locker:wlock_table(Tid, Store, Tab);
          none -> ignore
          end,
          TidTs;
     _ ->
          mnesia:abort(no_transaction)
     end.

Locking: a table lock is requested directly from the lock manager mnesia_locker.

do_change_table_majority(Tab, Majority) ->
    TidTs = get_tid_ts_and_lock(schema, write),
    get_tid_ts_and_lock(Tab, none),
    insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).

Now the actual modification of the majority property:

make_change_table_majority(Tab, Majority) ->
    ensure_writable(schema),
    Cs = incr_version(val({Tab, cstruct})),
    ensure_active(Cs),
    OldMajority = Cs#cstruct.majority,
    Cs2 = Cs#cstruct{majority = Majority},
    FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of
            {_, Tab} ->
                  FragNames = mnesia_frag:frag_names(Tab) -- [Tab],
                  lists:map(
               fun(T) ->
                     get_tid_ts_and_lock(Tab, none),
                     CsT = incr_version(val({T, cstruct})),
                     ensure_active(CsT),
                     CsT2 = CsT#cstruct{majority = Majority},
                     verify_cstruct(CsT2),
                     {op, change_table_majority, vsn_cs2list(CsT2),
                       OldMajority, Majority}
end, FragNames);
            false     -> [];
            {_, _}   -> mnesia:abort({bad_type, Tab})
            end,
    verify_cstruct(Cs2),
    [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].

ensure_writable checks whether the schema table's where_to_write attribute is [], i.e. whether
there is any node holding a persistent schema.

incr_version bumps the table's version number.

ensure_active checks that all of the table's replica nodes are alive, i.e. it confirms the
table's global view with the replica nodes.



Updating the table's metadata version:

incr_version(Cs) ->
     {{Major, Minor}, _} = Cs#cstruct.version,
     Nodes = mnesia_lib:intersect(val({schema, disc_copies}),
                                          mnesia_lib:cs_to_nodes(Cs)),
     V=
          case Nodes -- val({Cs#cstruct.name, active_replicas}) of
               [] -> {Major + 1, 0};    % All replicas are active
               _ -> {Major, Minor + 1} % Some replicas are inactive
          end,
     Cs#cstruct{version = {V, {node(), now()}}}.
mnesia_lib.erl
cs_to_nodes(Cs) ->
     Cs#cstruct.disc_only_copies ++
     Cs#cstruct.disc_copies ++
     Cs#cstruct.ram_copies.

The table's metadata version is recomputed. Since this is a schema change, the version is
computed from the nodes holding a persistent schema and the nodes holding a replica of the
table: if every node in the intersection of these two sets is alive, the major version can be
incremented; otherwise only the minor version is incremented. A new version descriptor is also
generated for the table's cstruct, made up of three parts: {new version number, {node
initiating the change, time of the change}}, effectively a space-time sequence plus a
monotonically increasing sequence. The version computation is similar to NDB's.



Checking the table's global view:
ensure_active(Cs) ->
     ensure_active(Cs, active_replicas).
ensure_active(Cs, What) ->
     Tab = Cs#cstruct.name,
     W = {Tab, What},
     ensure_non_empty(W),
     Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)),
     case Nodes -- val(W) of
          [] ->
                ok;
          Ns ->
                Expl = "All replicas on diskfull nodes are not active yet",
                case val({Tab, local_content}) of
                      true ->
                case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of
                {Replies, []} ->
                      check_active(Replies, Expl, Tab);
                {_Replies, BadNs} ->
                      mnesia:abort({not_active, Expl, Tab, BadNs})
                            end;
                      false ->
                            mnesia:abort({not_active, Expl, Tab, Ns})
                end
     end.
is_remote_member(Key) ->
     IsActive = lists:member(node(), val(Key)),
     {IsActive, node()}.

To prevent inconsistent state, confirmation is required from any "undetermined" node, that is,
a node that is not currently an active replica of the table, yet is a replica node of the table
and also a persistent-schema node. The confirmation asks that node, via is_remote_member,
whether it already considers itself an active replica of the table. This avoids the
undetermined node and the requesting node having different views of the changed state.


make_change_table_majority(Tab, Majority) ->
    ensure_writable(schema),
    Cs = incr_version(val({Tab, cstruct})),
    ensure_active(Cs),
    OldMajority = Cs#cstruct.majority,
    Cs2 = Cs#cstruct{majority = Majority},
    FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of
            {_, Tab} ->
FragNames = mnesia_frag:frag_names(Tab) -- [Tab],
                  lists:map(
               fun(T) ->
                     get_tid_ts_and_lock(Tab, none),
                     CsT = incr_version(val({T, cstruct})),
                     ensure_active(CsT),
                     CsT2 = CsT#cstruct{majority = Majority},
                     verify_cstruct(CsT2),
                     {op, change_table_majority, vsn_cs2list(CsT2),
                       OldMajority, Majority}
               end, FragNames);
            false       -> [];
            {_, _}     -> mnesia:abort({bad_type, Tab})
            end,
    verify_cstruct(Cs2),
    [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps].

The majority field in the table's cstruct is changed, and the newly built cstruct is verified,
mainly checking that the type and contents of each cstruct field are valid.

vsn_cs2list converts the cstruct into a proplist whose keys are the record field names and
whose values are the field values.

A change_table_majority operation is produced for insert_schema_ops to consume.


do_change_table_majority(Tab, Majority) ->
    TidTs = get_tid_ts_and_lock(schema, write),
    get_tid_ts_and_lock(Tab, none),
    insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).

At this point the operation produced by make_change_table_majority is
[{op, change_table_majority, proplist of the table's new cstruct, OldMajority, Majority}].


insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) ->
     do_insert_schema_ops(Ts#tidstore.store, SchemaIOps).
do_insert_schema_ops(Store, [Head | Tail]) ->
     ?ets_insert(Store, Head),
     do_insert_schema_ops(Store, Tail);
do_insert_schema_ops(_Store, []) ->
     ok.

As can be seen, the insert step merely records the make_change_table_majority operation in the
current transaction's temporary ets store.

Once this temporary insert is done, mnesia starts the commit. Unlike an ordinary table
transaction, the operation key is op, which marks this as a schema transaction; the transaction
manager handles it specially and uses a different commit procedure.



3. Schema transaction commit interface


mnesia_tm.erl
t_commit(Type) ->
    {_Mod, Tid, Ts} = get(mnesia_activity_state),
    Store = Ts#tidstore.store,
    if
    Ts#tidstore.level == 1 ->
         intercept_friends(Tid, Ts),
         case arrange(Tid, Store, Type) of
         {N, Prep} when N > 0 ->
            multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store);
         {0, Prep} ->
            multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store)
         end;
    true ->
         %% nested commit
         Level = Ts#tidstore.level,
         [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores,
         req({del_store, Tid, Store, Obsolete, false}),
         NewTs = Ts#tidstore{store = Store,
                    up_stores = Tail,
                    level = Level - 1},
         NewTidTs = {OldMod, Tid, NewTs},
         put(mnesia_activity_state, NewTidTs),
         do_commit_nested
    end.

The check first happens while the operations are being arranged:

arrange(Tid, Store, Type) ->
     %% The local node is always included
     Nodes = get_elements(nodes,Store),
     Recs = prep_recs(Nodes, []),
     Key = ?ets_first(Store),
     N = 0,
     Prep =
     case Type of
          async -> #prep{protocol = sym_trans, records = Recs};
sync -> #prep{protocol = sync_sym_trans, records = Recs}
    end,
    case catch do_arrange(Tid, Store, Key, Prep, N) of
    {'EXIT', Reason} ->
         dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]),
         case Reason of
         {aborted, R} ->
               mnesia:abort(R);
         _ ->
               mnesia:abort(Reason)
         end;
    {New, Prepared} ->
         {New, Prepared#prep{records = reverse(Prepared#prep.records)}}
    end.

The Key argument is the key of the first operation inserted into the temporary ets store, which
here is op.

do_arrange(Tid, Store, {Tab, Key}, Prep, N) ->
    Oid = {Tab, Key},
    Items = ?ets_lookup(Store, Oid), %% Store is a bag
    P2 = prepare_items(Tid, Tab, Key, Items, Prep),
    do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1);
do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op ->
    Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag
    P2 = prepare_schema_items(Tid, Items, Prep),
    do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1);

As can be seen, an ordinary table's key is {Tab, Key}, while the schema operation's key is op.
The Items obtained are [{op, change_table_majority, proplist of the table's new cstruct,
OldMajority, Majority}], which makes this transaction use a different commit protocol:

prepare_schema_items(Tid, Items, Prep) ->
    Types = [{N, schema_ops} || N <- val({current, db_nodes})],
    Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema),
    Prep#prep{protocol = asym_trans, records = Recs}.

prepare_node records the schema operations in the schema_ops field of the commit records in
Recs, and the commit protocol of the prep record is set to asym_trans.

prepare_node(_Node, _Storage, Items, Rec, Kind)
  when Kind == schema, Rec#commit.schema_ops == [] ->
    Rec#commit{schema_ops = Items};

t_commit(Type) ->
{_Mod, Tid, Ts} = get(mnesia_activity_state),
    Store = Ts#tidstore.store,
    if
    Ts#tidstore.level == 1 ->
         intercept_friends(Tid, Ts),
         case arrange(Tid, Store, Type) of
         {N, Prep} when N > 0 ->
            multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store);
         {0, Prep} ->
            multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store)
         end;
    true ->
         %% nested commit
         Level = Ts#tidstore.level,
         [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores,
         req({del_store, Tid, Store, Obsolete, false}),
         NewTs = Ts#tidstore{store = Store,
                    up_stores = Tail,
                    level = Level - 1},
         NewTidTs = {OldMod, Tid, NewTs},
         put(mnesia_activity_state, NewTidTs),
         do_commit_nested
    end.

The commit uses asym_trans. This protocol is used mainly for: schema operations, operations on
tables with the majority property, the recover_coordinator procedure, and restore_op
operations.




4. Schema transaction protocol


multi_commit(asym_trans, Majority, Tid, CR, Store) ->
     D = #decision{tid = Tid, outcome = presume_abort},
     {D2, CR2} = commit_decision(D, CR, [], []),
     DiscNs = D2#decision.disc_nodes,
     RamNs = D2#decision.ram_nodes,
     case have_majority(Majority, DiscNs ++ RamNs) of
    ok -> ok;
    {error, Tab} -> mnesia:abort({no_majority, Tab})
     end,
     Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
     ?ets_insert(Store, Pending),
     {WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),
   {Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []),
   ?eval_debug_fun({?MODULE, multi_commit_asym_got_votes}, [{tid, Tid}, {votes, Votes}]),
   case Votes of
   do_commit ->
        case SchemaPrep of
        {_Modified, C = #commit{}, DumperMode} ->
             mnesia_log:log(C), % C is not a binary
             ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec}, [{tid, Tid}]),
             D3 = C#commit.decision,
             D4 = D3#decision{outcome = unclear},
             mnesia_recover:log_decision(D4),
             ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec}, [{tid, Tid}]),
             tell_participants(Pids, {Tid, pre_commit}),
             rec_acc_pre_commit(Pids, Tid, Store, {C,Local}, do_commit, DumperMode, [], []);
        {'EXIT', Reason} ->
             mnesia_recover:note_decision(Tid, aborted),
             ?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit}, [{tid, Tid}]),
             tell_participants(Pids, {Tid, {do_abort, Reason}}),
             do_abort(Tid, Local),
             {do_abort, Reason}
        end;
   {do_abort, Reason} ->
        mnesia_recover:note_decision(Tid, aborted),
        ?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]),
        tell_participants(Pids, {Tid, {do_abort, Reason}}),
        do_abort(Tid, Local),
        {do_abort, Reason}
   end.


Transaction processing starts at mnesia_tm:t_commit/1 and proceeds as follows (a small sketch
of the majority rule used in step 1 follows this list):

1. The initiating node checks the majority condition: the number of live replica nodes of the
   table must be greater than half of the table's disc plus ram replica nodes; exactly half
   does not satisfy it

2. The initiating node calls mnesia_checkpoint:tm_enter_pending, producing a checkpoint entry

3. The initiating node sends the first-phase ask_commit to the transaction manager of each
   participating node; note that the protocol type is asym_trans

4. Each participant's transaction manager spawns a commit_participant process, which is
   responsible for the rest of the commit

   Note that operations on majority tables and on the schema table need this extra helper
   process for the commit, which may lower performance

5. The participant's commit_participant process runs the prepare step for the local schema
   operations; for change_table_majority there is nothing to prepare

6. The participant's commit_participant process agrees to commit and replies vote_yes to the
   initiator

7. The initiator receives the agreement messages from all participants

8. The initiator runs the prepare step for its own local schema operations; again, for
   change_table_majority there is nothing to prepare

9. After receiving vote_yes from all participants, the initiator logs the operations to be
   committed

10. The initiator writes the first-stage recovery decision presume_abort

11. The initiator writes the second-stage recovery decision unclear

12. The initiator sends the second-phase pre_commit to each participant's commit_participant
    process

13. The participant's commit_participant process receives pre_commit and performs the
    pre-commit

14. The participant writes the first-stage recovery decision presume_abort

15. The participant writes the second-stage recovery decision unclear

16. The participant's commit_participant process accepts the pre-commit and replies
    acc_pre_commit to the initiator

17. After receiving acc_pre_commit from all participants, the initiator records which
    schema-operation participants it must wait for, to be used in crash recovery

18. The initiator sends the third-phase committed message to each participant's
    commit_participant process

19. a. Right after telling the participants to commit, the initiator writes the second-stage
       recovery decision committed

    b. On receiving committed, the participant's commit_participant process commits, first
       writing the second-stage recovery decision committed

20. a. After writing that decision, the initiator performs its local commit via do_commit

    b. After writing that decision, the participant's commit_participant process performs its
       local commit via do_commit

21. a. After its local commit, if there are schema operations the initiator waits synchronously
       for the participants' schema commit results

    b. After its local commit, if there are schema operations the participant's
       commit_participant process replies schema_commit to the initiator

22. a. After receiving schema_commit from all participants, the initiator releases its locks
       and transaction resources

    b. The participant's commit_participant process releases its locks and transaction
       resources
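As referenced in step 1, a small hypothetical sketch of the majority rule (names are placeholders; the real check is performed by have_majority over the table's disc and ram replica nodes):

```erlang
%% Sketch of the rule: strictly more than half of the table's replica nodes
%% must take part in the transaction; exactly half is not enough.
has_majority(AllReplicas, AliveNodes) ->
    Present = [N || N <- AllReplicas, lists:member(N, AliveNodes)],
    Missing = AllReplicas -- Present,
    length(Present) > length(Missing).
```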




5. Remote node transaction manager: first-phase prepare response


When a participant's transaction manager receives the first-phase commit message:

mnesia_tm.erl
doit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) ->
…
     {From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} ->
         ?eval_debug_fun({?MODULE, doit_ask_commit},
                    [{tid, Tid}, {prot, Protocol}]),
         mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs),
         Pid =
         case Protocol of
               asym_trans when node(Tid#tid.pid) /= node() ->
               Args = [tmpid(From), Tid, Commit, DiscNs, RamNs],
               spawn_link(?MODULE, commit_participant, Args);
               _ when node(Tid#tid.pid) /= node() -> %% *_sym_trans
               reply(From, {vote_yes, Tid}),
               nopid
         end,
         P = #participant{tid = Tid,
pid = Pid,
                    commit = Commit,
                    disc_nodes = DiscNs,
                    ram_nodes = RamNs,
                    protocol = Protocol},
         State2 = State#state{participants = gb_trees:insert(Tid,P,Participants)},
         doit_loop(State2);
…

A commit_participant process is spawned with the arguments [initiating process id, transaction
id, commit content, disc node list, ram node list] to assist the transaction commit:

commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) ->
   process_flag(trap_exit, true),
   Commit = binary_to_term(Bin),
   commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs);
commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) ->
   process_flag(trap_exit, true),
   commit_participant(Coord, Tid, C, C, DiscNs, RamNs).

commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
   ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
   case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of

     {Modified, C = #commit{}, DumperMode} ->

         case lists:member(node(), DiscNs) of
         false ->
               ignore;
         true ->
               case Modified of
               false -> mnesia_log:log(Bin);
               true -> mnesia_log:log(C)
               end
         end,
         ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]),
         reply(Coord, {vote_yes, Tid, self()}),
    …

Right after it is spawned, the participant's commit_participant process has to do the local
prepare work for the schema operations:

mnesia_schema.erl
prepare_commit(Tid, Commit, WaitFor) ->
    case Commit#commit.schema_ops of
    [] ->
          {false, Commit, optional};
OrigOps ->
       {Modified, Ops, DumperMode} =
       prepare_ops(Tid, OrigOps, WaitFor, false, [], optional),
       InitBy = schema_prepare,
       GoodRes = {Modified,
                 Commit#commit{schema_ops = lists:reverse(Ops)}, DumperMode},
       case DumperMode of
       optional ->
              dbg_out("Transaction log dump skipped (~p): ~w~n", [DumperMode, InitBy]);
       mandatory ->
              case mnesia_controller:sync_dump_log(InitBy) of
             dumped -> GoodRes;
             {error, Reason} -> mnesia:abort(Reason)
              end
       end,
       case Ops of
       [] -> ignore;
       _ -> mnesia_controller:wait_for_schema_commit_lock()
       end,
       GoodRes
  end.

Note that there are three main branches here:

1. If the operations contain no schema operation at all, nothing is done and {false, the
   original Commit, optional} is returned; this applies to operations on majority tables

2. If, after classification by prepare_ops, the operations include any of: rec,
   announce_im_running, sync_trans, create_table, delete_table, add_table_copy, del_table_copy,
   change_table_copy_type, dump_table, add_snmp, transform, merge_schema, then a prepare step
   may (but need not) be required; the prepare consists of some bookkeeping specific to each
   operation plus syncing the log; this applies when those operations are present

3. If, after classification by prepare_ops, only other kinds of operations remain, nothing is
   done and {true, the original Commit, optional} is returned; this applies to lighter schema
   operations, and change_table_majority here falls into this category
6. Remote node transaction participant: second-phase precommit response


commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
…
        receive
        {Tid, pre_commit} ->
              D = C#commit.decision,
              mnesia_recover:log_decision(D#decision{outcome = unclear}),
              ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]),
              Expect_schema_ack = C#commit.schema_ops /= [],
              reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}),
              receive
              {Tid, committed} ->
                    mnesia_recover:log_decision(D#decision{outcome = committed}),
                    ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]),
                    do_commit(Tid, C, DumperMode),
                    case Expect_schema_ack of
                    false -> ignore;
                    true -> reply(Coord, {schema_commit, Tid, self()})
                    end,
                    ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]);
              …
              end;
…

When the participant's commit_participant process receives the pre-commit message, it likewise
writes the second-stage recovery decision unclear and replies acc_pre_commit.




7. Requesting node (transaction initiator): receiving the second-phase precommit acknowledgements


After the initiator has received the acc_pre_commit messages from all participants:

rec_acc_pre_commit([], Tid, Store,       {Commit,OrigC},    Res,   DumperMode,      GoodPids,
SchemaAckPids) ->
     D = Commit#commit.decision,
     case Res of
     do_commit ->
prepare_sync_schema_commit(Store, SchemaAckPids),
         tell_participants(GoodPids, {Tid, committed}),
         D2 = D#decision{outcome = committed},
         mnesia_recover:log_decision(D2),
              ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]),
         do_commit(Tid, Commit, DumperMode),
              ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]),
         sync_schema_commit(Tid, Store, SchemaAckPids),
         mnesia_locker:release_tid(Tid),
         ?MODULE ! {delete_transaction, Tid};
    {do_abort, Reason} ->
         tell_participants(GoodPids, {Tid, {do_abort, Reason}}),
         D2 = D#decision{outcome = aborted},
         mnesia_recover:log_decision(D2),
              ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]),
         do_abort(Tid, OrigC),
         ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}])
    end,
    Res.

prepare_sync_schema_commit(_Store, []) ->
    ok;
prepare_sync_schema_commit(Store, [Pid | Pids]) ->
    ?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}),
    prepare_sync_schema_commit(Store, Pids).

The initiator locally records the nodes taking part in the schema operations, for use in crash
recovery, then sends committed to all participants' commit_participant processes, telling them
to do the final commit. At this point the initiator can commit locally: it writes the
second-stage recovery decision committed and performs the local commit via do_commit, and then
waits synchronously for the participants' schema commit results. If there were no schema
operations it could return immediately, but here it has to wait:

sync_schema_commit(_Tid, _Store, []) ->
    ok;
sync_schema_commit(Tid, Store, [Pid | Tail]) ->
    receive
    {?MODULE, _, {schema_commit, Tid, Pid}} ->
         ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}),
         sync_schema_commit(Tid, Store, Tail);
    {mnesia_down, Node} when Node == node(Pid) ->
         ?ets_match_delete(Store, {waiting_for_commit_ack, Node}),
         sync_schema_commit(Tid, Store, Tail)
    end.
8. Remote node transaction participant: third-phase commit response


When the participant's commit_participant process receives the commit message:

commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
…
        receive
        {Tid, pre_commit} ->
              D = C#commit.decision,
              mnesia_recover:log_decision(D#decision{outcome = unclear}),
              ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]),
              Expect_schema_ack = C#commit.schema_ops /= [],
              reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}),
              receive
              {Tid, committed} ->
                    mnesia_recover:log_decision(D#decision{outcome = committed}),
                    ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]),
                    do_commit(Tid, C, DumperMode),
                    case Expect_schema_ack of
                    false -> ignore;
                    true -> reply(Coord, {schema_commit, Tid, self()})
                    end,
                    ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]);
              …
              end;
…

When the participant's commit_participant process receives the committed message, it likewise
writes the second-stage recovery decision committed and performs the local commit via
do_commit; if there are schema operations it then replies schema_commit to the initiator,
otherwise the transaction is complete.




9. Local commit during the third-phase commit


do_commit(Tid, C, DumperMode) ->
    mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode),
    R = do_snmp(Tid, C#commit.snmp),
R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R),
    R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2),
    R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3),
    mnesia_subscr:report_activity(Tid),
    R4.

Here we only look at the updates to the schema table; note that these updates happen on both
the initiating node and the participating nodes.

The schema table updates consist of:

1. in the mnesia global-variable ets table mnesia_gvar, recording the table's majority property
   with the newly set value, and updating the table's where_to_wlock attribute

2. in the mnesia global-variable ets table mnesia_gvar, recording the table's cstruct and the
   attributes derived from the cstruct

3. recording the table's cstruct in the schema ets table

4. recording the table's cstruct in the schema dets table



The update proceeds as follows:

mnesia_dumper.erl
update(_Tid, [], _DumperMode) ->
    dumped;
update(Tid, SchemaOps, DumperMode) ->
    UseDir = mnesia_monitor:use_dir(),
    Res = perform_update(Tid, SchemaOps, DumperMode, UseDir),
    mnesia_controller:release_schema_commit_lock(),
    Res.

perform_update(_Tid, _SchemaOps, mandatory, true) ->
     InitBy = schema_update,
     ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
     opt_dump_log(InitBy);
perform_update(Tid, SchemaOps, _DumperMode, _UseDir) ->
     InitBy = fast_schema_update,
     InPlace = mnesia_monitor:get_env(dump_log_update_in_place),
     ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]),
     case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
mnesia_log:version()) of
     {'EXIT', Reason} ->
          Error = {error, {"Schema update error", Reason}},
          close_files(InPlace, Error, InitBy),
          fatal("Schema update error ~p ~p", [Reason, SchemaOps]);
     _ ->
          ?eval_debug_fun({?MODULE, post_dump}, [InitBy]),
          close_files(InPlace, ok, InitBy),
          ok
     end.

insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) ->    ok;
insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3"->
     insert_op(Tid, Storage, Op, InPlace, InitBy),
     ok;
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3"->
     insert_op(Tid, Storage, Op, InPlace, InitBy),
     insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver);
insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" ->
     insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver),
     insert_op(Tid, Storage, Op, InPlace, InitBy).

…
insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) ->
     Cs = mnesia_schema:list2cs(TabDef),
     case InitBy of
     startup -> ignore;
     _ -> mnesia_controller:change_table_majority(Cs)
     end,
     insert_cstruct(Tid, Cs, true, InPlace, InitBy);
…

The change_table_majority operation itself has the form:

{op, change_table_majority, TabDef (the table's new cstruct as a proplist), OldMajority, Majority}

Here the proplist form of the cstruct is converted back into the record form, and then the
actual settings are applied:


mnesia_controller.erl
change_table_majority(Cs) ->
    W = fun() ->
         Tab = Cs#cstruct.name,
         set({Tab, majority}, Cs#cstruct.majority),
         update_where_to_wlock(Tab)
        end,
    update(W).
update_where_to_wlock(Tab) ->
    WNodes = val({Tab, where_to_write}),
    Majority = case catch val({Tab, majority}) of
             true -> true;
             _      -> false
             end,
    set({Tab, where_to_wlock}, {WNodes, Majority}).

The updates performed here are: in the mnesia global-variable ets table mnesia_gvar, record
the table's majority attribute with the configured value, and update the table's
where_to_wlock attribute, resetting its majority part.
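For illustration only, the resulting entries can be read directly with ets on the node;
mnesia_gvar is an internal named ets table, so treat this purely as a debugging aid. The table
name t and the node names are assumptions:

%% Assumes mnesia is running locally and t is a hypothetical table.
ets:lookup(mnesia_gvar, {t, majority}).
%% e.g. [{{t, majority}, true}]
ets:lookup(mnesia_gvar, {t, where_to_wlock}).
%% e.g. [{{t, where_to_wlock}, {['a@host1', 'b@host2'], true}}]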


mnesia_dumper.erl
…
insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) ->
     Cs = mnesia_schema:list2cs(TabDef),
     case InitBy of
     startup -> ignore;
     _ -> mnesia_controller:change_table_majority(Cs)
     end,
     insert_cstruct(Tid, Cs, true, InPlace, InitBy);
…
insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) ->
     Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts),
     {schema, Tab, _} = Val,
     S = val({schema, storage_type}),
     disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy),
     Tab.

Besides updating the table's where_to_wlock attribute in the mnesia global-variable ets table
mnesia_gvar, the table's cstruct attribute and the attributes derived from it are updated as
well; in addition, the table's cstruct recorded in the schema ets table is updated.

mnesia_schema.erl
insert_cstruct(Tid, Cs, KeepWhereabouts) ->
     Tab = Cs#cstruct.name,
     TabDef = cs2list(Cs),
     Val = {schema, Tab, TabDef},
     mnesia_checkpoint:tm_retain(Tid, schema, Tab, write),
     mnesia_subscr:report_table_event(schema, Tid, Val, write),
     Active = val({Tab, active_replicas}),
     case KeepWhereabouts of
          true -> ignore;
          false when Active == [] -> clear_whereabouts(Tab);
          false -> ignore
    end,
    set({Tab, cstruct}, Cs),
    ?ets_insert(schema, Val),
    do_set_schema(Tab, Cs),
    Val.
do_set_schema(Tab) ->
    List = get_create_list(Tab),
    Cs = list2cs(List),
    do_set_schema(Tab, Cs).
do_set_schema(Tab, Cs) ->
    Type = Cs#cstruct.type,
    set({Tab, setorbag}, Type),
    set({Tab, local_content}, Cs#cstruct.local_content),
    set({Tab, ram_copies}, Cs#cstruct.ram_copies),
    set({Tab, disc_copies}, Cs#cstruct.disc_copies),
    set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies),
    set({Tab, load_order}, Cs#cstruct.load_order),
    set({Tab, access_mode}, Cs#cstruct.access_mode),
    set({Tab, majority}, Cs#cstruct.majority),
    set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)),
    set({Tab, snmp}, Cs#cstruct.snmp),
    set({Tab, user_properties}, Cs#cstruct.user_properties),
    [set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties],
    set({Tab, frag_properties}, Cs#cstruct.frag_properties),
    mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties),
    set({Tab, storage_properties}, Cs#cstruct.storage_properties),
    set({Tab, attributes}, Cs#cstruct.attributes),
    Arity = length(Cs#cstruct.attributes) + 1,
    set({Tab, arity}, Arity),
    RecName = Cs#cstruct.record_name,
    set({Tab, record_name}, RecName),
    set({Tab, record_validation}, {RecName, Arity, Type}),
    set({Tab, wild_pattern}, wild(RecName, Arity)),
    set({Tab, index}, Cs#cstruct.index),
    %% create actual index tabs later
    set({Tab, cookie}, Cs#cstruct.cookie),
    set({Tab, version}, Cs#cstruct.version),
    set({Tab, cstruct}, Cs),
    Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs),
    set({Tab, storage_type}, Storage),
    mnesia_lib:add({schema, tables}, Tab),
    Ns = mnesia_lib:cs_to_nodes(Cs),
    case lists:member(node(), Ns) of
         true ->
               mnesia_lib:add({schema, local_tables}, Tab);
         false when Tab == schema ->
               mnesia_lib:add({schema, local_tables}, Tab);
         false ->
               ignore
    end.

do_set_schema updates every attribute derived from the cstruct, such as the version, the
cookie, and so on.


mnesia_dumper.erl
insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) ->
     Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts),
     {schema, Tab, _} = Val,
     S = val({schema, storage_type}),
     disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy),
     Tab.
disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) ->
     case open_files(Tab, Storage, InPlace, InitBy) of
     true ->
           case Storage of
           disc_copies when Tab /= schema ->
                mnesia_log:append({?MODULE,Tab}, {{Tab, Key}, Val, Op}),
                ok;
           _ ->
                dets_insert(Op,Tab,Key,Val)
           end;
     false ->
           ignore
     end.
dets_insert(Op,Tab,Key,Val) ->
     case Op of
     write ->
           dets_updated(Tab,Key),
           ok = dets:insert(Tab, Val);
     …
     end.
dets_updated(Tab,Key) ->
     case get(mnesia_dumper_dets) of
     undefined ->
           Empty = gb_trees:empty(),
            Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty),
        put(mnesia_dumper_dets, Tree);
   Tree ->
        case gb_trees:lookup(Tab,Tree) of
        {value, cleared} -> ignore;
        {value, Set} ->
             T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree),
             put(mnesia_dumper_dets, T);
        none ->
             T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree),
             put(mnesia_dumper_dets, T)
        end
   end.

This updates the table's cstruct recorded in the schema dets table.
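With mnesia stopped, the persisted cstruct can be read straight out of the schema dets file.
The sketch below is illustrative only; the Mnesia directory name and the table name t are
assumptions:

%% Run with mnesia stopped on this node; adjust the Mnesia directory as needed.
{ok, schema} = dets:open_file(schema, [{file, "Mnesia.a@host1/schema.DAT"}, {keypos, 2}]),
[{schema, t, TabDef}] = dets:lookup(schema, t),   %% TabDef is the cstruct as a proplist
Majority = proplists:get_value(majority, TabDef),
ok = dets:close(schema).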



To sum up, a change to the schema table, or a transaction on a majority table, is committed
with a three-phase protocol and comes with good crash-recovery detection.

A schema-table change updates several places:

1. In the mnesia global-variable ets table mnesia_gvar, the attribute in question is recorded
    with the configured value

2. In the mnesia global-variable ets table mnesia_gvar, the table's cstruct is recorded,
    together with every attribute derived from the cstruct

3. In the schema ets table, the table's cstruct is recorded

4. In the schema dets table, the table's cstruct is recorded




5. majority transaction handling

A majority transaction is handled essentially the same way as a schema transaction. The only
difference is that during the commit in mnesia_tm:multi_commit there are no schema operations,
so mnesia_schema:prepare_commit/3 and mnesia_tm:prepare_sync_schema_commit/2 do not modify the
schema table, and mnesia_tm:sync_schema_commit does not have to wait for the third,
synchronous commit phase to finish.
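For completeness, a minimal usage sketch of a majority table follows; the node names and the
table name are assumptions. When this node can reach only a minority of the table's replica
nodes, write transactions on the table are aborted instead of committed.

%% Hypothetical three-node cluster, mnesia already started on all nodes.
Nodes = ['a@host1', 'b@host2', 'c@host3'],
{atomic, ok} = mnesia:create_table(t, [{disc_copies, Nodes}, {majority, true}]),
%% For an existing table the property can be toggled with:
%%   {atomic, ok} = mnesia:change_table_majority(t, true),
{atomic, ok} = mnesia:transaction(fun() -> mnesia:write({t, some_key, some_value}) end).
%% With only a minority of replicas reachable, the same call returns {aborted, Reason}.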
6. Recovery


mnesia's connection negotiation is used at startup to exchange state information between
nodes.

The whole negotiation consists of the following steps (a short usage sketch follows the list):

1. Node discovery, cluster traversal

2. Protocol version check between nodes

3. Schema merge between nodes

4. Announcement and merging of transaction decisions

5. Reloading and merging of table data
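The same negotiation is also triggered when extra nodes are attached to an already running
system; a minimal sketch with hypothetical node names (mnesia:change_config/2 with
extra_db_nodes is the documented entry point for this):

%% On a running mnesia node, connect to and negotiate with another node:
{ok, _NowConnected} = mnesia:change_config(extra_db_nodes, ['b@host2']),
mnesia:system_info(running_db_nodes).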




1. Protocol version check + announcement and merging of decisions between nodes


mnesia_recover.erl
connect_nodes(Ns) ->

    %% Ns is the list of nodes to check

    call({connect_nodes, Ns}).
handle_call({connect_nodes, Ns}, From, State) ->
    %% Determine which nodes we should try to connect
    AlreadyConnected = val(recover_nodes),
    {_, Nodes} = mnesia_lib:search_delete(node(), Ns),
    Check = Nodes -- AlreadyConnected,

    %% Start protocol version negotiation

    case mnesia_monitor:negotiate_protocol(Check) of
    busy ->
          %% monitor is disconnecting some nodes retry
          %% the req (to avoid deadlock).
          erlang:send_after(2, self(), {connect_nodes,Ns,From}),
          {noreply, State};
    [] ->
          %% No good noodes to connect to!
          %% We can't use reply here because this function can be
          %% called from handle_info
         gen_server:reply(From, {[], AlreadyConnected}),
         {noreply, State};
     GoodNodes ->

            %% GoodNodes are the nodes that passed the negotiation

            %% Now we have agreed upon a protocol with some new nodes
            %% and we may use them when we recover transactions
            mnesia_lib:add_list(recover_nodes, GoodNodes),

            %% After the protocol versions agree, tell these nodes about this node's
            %% historical transaction decisions

            cast({announce_all, GoodNodes}),
            case get_master_nodes(schema) of
            [] ->
                  Context = starting_partitioned_network,

                %% Check whether a partition has ever occurred with these nodes

                 mnesia_monitor:detect_inconcistency(GoodNodes, Context);
            _ -> %% If master_nodes is set ignore old inconsistencies
                 ignore
            end,
            gen_server:reply(From, {GoodNodes, AlreadyConnected}),
            {noreply,State}
     end;

handle_cast({announce_all, Nodes}, State) ->
    announce_all(Nodes),
    {noreply, State};
announce_all([]) ->
    ok;
announce_all(ToNodes) ->
    Tid = trans_tid_serial(),
    announce(ToNodes, [{trans_tid,serial,Tid}], [], false).
announce(ToNodes, [Head | Tail], Acc, ForceSend) ->
    Acc2 = arrange(ToNodes, Head, Acc, ForceSend),
    announce(ToNodes, Tail, Acc2, ForceSend);
announce(_ToNodes, [], Acc, _ForceSend) ->
    send_decisions(Acc).
send_decisions([{Node, Decisions} | Tail]) ->

     %% Note that the decision merge is an asynchronous process

    abcast([Node], {decisions, node(), Decisions}),
    send_decisions(Tail);
send_decisions([]) ->
ok.

Iterate over all nodes that passed the negotiation and tell them about this node's historical
transaction decisions.
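As the get_master_nodes(schema) branch above shows, configuring master nodes makes old
inconsistencies be ignored during this negotiation. Designating master nodes goes through the
public API mnesia:set_master_nodes/1,2; a sketch with hypothetical names (t is an example
table):

%% Make a@host1 authoritative when tables are loaded after a partition:
ok = mnesia:set_master_nodes(['a@host1']),        %% for all tables
ok = mnesia:set_master_nodes(t, ['a@host1']).     %% or for a single table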



The flow below runs on the remote node; the remote node will be called the receiving node, and
the local node the sending node.

handle_cast({decisions, Node, Decisions}, State) ->
    mnesia_lib:add(recover_nodes, Node),
    State2 = add_remote_decisions(Node, Decisions, State),
    {noreply, State2};

After the receiving node's mnesia_recover receives these broadcast decisions, it compares and
merges them.



Decisions come in several kinds; the ones used for transaction commit are the decision record
and the transient_decision record.

add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) ->
    State2 = add_remote_decision(Node, D, State),
    add_remote_decisions(Node, Tail, State2);
add_remote_decisions(Node, [C | Tail], State)
         when is_record(C, transient_decision) ->
    D = #decision{tid = C#transient_decision.tid,
           outcome = C#transient_decision.outcome,
           disc_nodes = [],
           ram_nodes = []},
    State2 = add_remote_decision(Node, D, State),
    add_remote_decisions(Node, Tail, State2);
add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) ->
    add_remote_decisions(Node, Tail, State);
add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) ->

    %% For unresolved (unclear) transactions reported by the sending node, the
    %% receiving node has to keep asking other nodes

    sync_trans_tid_serial(Serial),
    case State#state.unclear_decision of
    undefined -> ignored;
    D ->
         case lists:member(Node, D#decision.ram_nodes) of
         true -> ignore;
         false ->

                %% If the sender of the unclear decision is not a ram-copy node of that
                %% transaction, the receiving node asks it for the real outcome

                abcast([Node], {what_decision, node(), D})
          end
end,
    add_remote_decisions(Node, Tail, State);
add_remote_decisions(_Node, [], State) ->
    State.

add_remote_decision(Node, NewD, State) ->
    Tid = NewD#decision.tid,
    OldD = decision(Tid),

    %% Merge the decisions according to the merge policy. In the only real conflict
    %% case, where the receiving node committed the transaction but the sending node
    %% aborted it, the receiving node also chooses abort; the transaction's state is
    %% then reconstructed from checkpoints and the redo log

    D = merge_decisions(Node, OldD, NewD),

    %% Log the merged decision

    do_log_decision(D, false, undefined),
    Outcome = D#decision.outcome,
    if
    OldD == no_decision -> ignore;
    Outcome == unclear -> ignore;
    true ->
         case lists:member(node(), NewD#decision.disc_nodes) or
           lists:member(node(), NewD#decision.ram_nodes) of
         true ->

               %% Tell the other node this node's merged decision result

               tell_im_certain([Node], D);
         false -> ignore
         end
    end,
    case State#state.unclear_decision of
    U when U#decision.tid == Tid ->
         WaitFor = State#state.unclear_waitfor -- [Node],
         if
         Outcome == unclear, WaitFor == [] ->
              %% Everybody are uncertain, lets abort

              %% All participants of the unresolved transaction have been asked and none of
              %% them can provide the commit outcome, so decide to abort the transaction

              NewOutcome = aborted,
              CertainD = D#decision{outcome = NewOutcome,
disc_nodes = [],
                             ram_nodes = []},
               tell_im_certain(D#decision.disc_nodes, CertainD),
               tell_im_certain(D#decision.ram_nodes, CertainD),
               do_log_decision(CertainD, false, undefined),
               verbose("Decided to abort transaction ~p "
                     "since everybody are uncertain ~p~n",
                     [Tid, CertainD]),
               gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}),
               State#state{unclear_pid = undefined,
                     unclear_decision = undefined,
                     unclear_waitfor = undefined};
           Outcome /= unclear ->

                  %% The sending node knows the outcome; report it

               verbose("~p told us that transaction ~p was ~p~n",
                    [Node, Tid, Outcome]),
               gen_server:reply(State#state.unclear_pid, {ok, Outcome}),
               State#state{unclear_pid = undefined,
                    unclear_decision = undefined,
                    unclear_waitfor = undefined};
           Outcome == unclear ->

                  %% The sending node does not know the outcome either; keep waiting

                  State#state{unclear_waitfor = WaitFor}
           end;
    _ ->
           State
    end.


The merge policy:

merge_decisions(Node, D, NewD0) ->
    NewD = filter_aborted(NewD0),
    if
    D == no_decision, node() /= Node ->
         %% We did not know anything about this txn
         NewD#decision{disc_nodes = []};
    D == no_decision ->
         NewD;
    is_record(D, decision) ->
         DiscNs = D#decision.disc_nodes -- ([node(), Node]),
         OldD = filter_aborted(D#decision{disc_nodes = DiscNs}),
         if
            OldD#decision.outcome == unclear,
           NewD#decision.outcome == unclear ->
               D;
           OldD#decision.outcome == NewD#decision.outcome ->
               %% We have come to the same decision
               OldD;
           OldD#decision.outcome == committed,
           NewD#decision.outcome == aborted ->

               %% The only real conflict between sender and receiver: the receiving node
               %% committed the transaction while the sending node aborted it; abort is
               %% still chosen here

               Msg = {inconsistent_database, bad_decision, Node},
               mnesia_lib:report_system_event(Msg),
               OldD#decision{outcome = aborted};
           OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
           NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted};
           OldD#decision.outcome == committed,
           NewD#decision.outcome == unclear -> OldD#decision{outcome = committed};
           OldD#decision.outcome == unclear,
           NewD#decision.outcome == committed -> OldD#decision{outcome = committed}
           end
    end.
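The bad_decision branch above is also reported to the application as a system event. A small
sketch of observing it follows; the receive pattern uses mnesia's documented
{mnesia_system_event, Event} message format, the rest is illustrative:

{ok, _} = mnesia:subscribe(system),
receive
    {mnesia_system_event, {inconsistent_database, Context, Node}} ->
        %% Context is bad_decision, running_partitioned_network or
        %% starting_partitioned_network; Node is the node we conflict with.
        io:format("possible split brain: ~p on ~p~n", [Context, Node])
after 0 ->
    no_event_yet
end.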




2. Node discovery, cluster traversal


mnesia_controller.erl
merge_schema() ->
    AllNodes = mnesia_lib:all_nodes(),

    %% Try to merge the schema; after the merge, tell all former cluster nodes to
    %% transfer data with this node

    case try_merge_schema(AllNodes, [node()], fun default_merge/1) of
    ok ->

           %% After the schema merge succeeds, data merging follows

          schema_is_merged();
    {aborted, {throw, Str}} when is_list(Str) ->
          fatal("Failed to merge schema: ~s~n", [Str]);
    Else ->
          fatal("Failed to merge schema: ~p~n", [Else])
    end.
try_merge_schema(Nodes, Told0, UserFun) ->

    %% Start the cluster traversal: run a schema-merge transaction

    case mnesia_schema:merge_schema(UserFun) of
    {atomic, not_merged} ->
         %% No more nodes that we need to merge the schema with
         %% Ensure we have told everybody that we are running
         case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of
         [] -> ok;
         Tell ->
               im_running(Tell, [node()]),
               ok
         end;
    {atomic, {merged, OldFriends, NewFriends}} ->
         %% Check if new nodes has been added to the schema
         Diff = mnesia_lib:all_nodes() -- [node() | Nodes],
         mnesia_recover:connect_nodes(Diff),
         %% Tell everybody to adopt orphan tables

         %% Tell all cluster nodes that this node has started, requesting data merging

         im_running(OldFriends, NewFriends),
         im_running(NewFriends, OldFriends),
         Told = case lists:member(node(), NewFriends) of
                  true -> Told0 ++ OldFriends;
                  false -> Told0 ++ NewFriends
             end,
         try_merge_schema(Nodes, Told, UserFun);
    {atomic, {"Cannot get cstructs", Node, Reason}} ->
         dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]),
         timer:sleep(300), % Avoid a endless loop look alike
         try_merge_schema(Nodes, Told0, UserFun);
    {aborted, {shutdown, _}} -> %% One of the nodes is going down
         timer:sleep(300), % Avoid a endless loop look alike
         try_merge_schema(Nodes, Told0, UserFun);
    Other ->
         Other
    end.

mnesia_schema.erl
merge_schema() ->
    schema_transaction(fun() -> do_merge_schema([]) end).
merge_schema(UserFun) ->
    schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end).

As can be seen, the merge_schema procedure also runs inside a mnesia metadata (schema)
transaction. The main operations of this transaction are:

{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}
{op, merge_schema, CstructList}

During this process the node negotiates the schema with the cluster's transaction nodes and
checks that the schemas are compatible.


do_merge_schema(LockTabs0) ->

    %% Lock the schema table

    {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write),
    LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0],
    [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs],
    Connected = val(recover_nodes),
    Running = val({current, db_nodes}),
    Store = Ts#tidstore.store,
    %% Verify that all nodes are locked that might not be the
    %% case, if this trans where queued when new nodes where added.
    case Running -- ets:lookup_element(Store, nodes, 2) of
    [] -> ok; %% All known nodes are locked
    Miss -> %% Abort! We don't want the sideeffects below to be executed
          mnesia:abort({bad_commit, {missing_lock, Miss}})
    end,

    %% Connected is the set of nodes this node is connected to, normally the
    %% protocol-compatible nodes of the current cluster; Running is this node's
    %% current db_nodes, normally the cluster nodes that are consistent with this node

    case Connected -- Running of

    %% For nodes that are connected but have not yet exchanged decisions, the
    %% protocol version must be negotiated first and then the decisions. This is
    %% essentially a node-discovery (traversal) process over the global topology,
    %% initiated by one node.

    [Node | _] = OtherNodes ->
        %% Time for a schema merging party!
        mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]),
             [mnesia_locker:wlock_no_exist(
                  Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes))
               || {T,Ns} <- LockTabs],

         %% Fetch from the remote node Node the cstructs of the tables it owns, plus its
         %% db_nodes, RemoteRunning1

         case fetch_cstructs(Node) of
         {cstructs, Cstructs, RemoteRunning1} ->
             LockedAlready = Running ++ [Node],

             %% After obtaining the cstructs, negotiate via mnesia_recover:connect_nodes with
             %% every node of remote node Node's cluster; the negotiation mainly checks both
             %% sides' protocol versions and whether a partition ever occurred with those
             %% nodes in the past

             {New, Old} = mnesia_recover:connect_nodes(RemoteRunning1),

              %% New are the protocol-compatible new nodes from RemoteRunning1; Old are this
              %% node's previously known live cluster nodes, taken from recover_nodes

             RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1),
             if

             %% RemoteRunning = (New ∪ Old) ∩ RemoteRunning1
             %% RemoteRunning /= RemoteRunning1 means that (New ∪ Old) ∩ RemoteRunning1 is a
             %% proper subset of RemoteRunning1, i.e. some nodes of RemoteRunning1 (the
             %% cluster of remote node Node, the target cluster of this probe) cannot be
             %% reached from this node

               RemoteRunning /= RemoteRunning1 ->
                    mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n",
                               [node(), RemoteRunning1 -- RemoteRunning]),
                    mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning});
               true -> ok
               end,
               NeedsLock = RemoteRunning -- LockedAlready,
               mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock),
               [mnesia_locker:wlock_no_exist(Tid, Store, T, mnesia_lib:intersect(Ns, NeedsLock))
                || {T,Ns} <- LockTabs],

             NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready),
             {value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs),
             SchemaDef = cs2list(NeedsConversion, SchemaCs),
             %% Announce that Node is running

             %% Start the announce_im_running step: announce to the cluster's transaction
             %% nodes that this node is joining the cluster, and record on this node that
             %% those nodes will merge schemas with it within this transaction

             A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
             do_insert_schema_ops(Store, A),
                 %% Introduce remote tables to local node

                 %% make_merge_schema builds a series of merge_schema operations that are
                 %% applied by mnesia_dumper once the commit succeeds

                 do_insert_schema_ops(Store,      make_merge_schema(Node,    NeedsConversion,
Cstructs)),
                 %% Introduce local tables to remote nodes
                 Tabs = val({schema, tables}),
                 Ops = [{op, merge_schema, get_create_list(T)}
                    || T <- Tabs,
                         not lists:keymember(T, #cstruct.name, Cstructs)],
                 do_insert_schema_ops(Store, Ops),

                 %%Ensure that the txn will be committed on all nodes

                 %% Announce to every node of the other reachable cluster that this node is joining

                  NewNodes = RemoteRunning -- Running,
                  mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
                  announce_im_running(NewNodes, SchemaCs),
                  {merged, Running, RemoteRunning};
             {error, Reason} ->
                  {"Cannot get cstructs", Node, Reason};
             {badrpc, Reason} ->
                  {"Cannot get cstructs", Node, {badrpc, Reason}}
             end;
     [] ->
             %% No more nodes to merge schema with
             not_merged
     end.

announce_im_running([N | Ns], SchemaCs) ->

     %% Negotiate with the nodes of the newly reachable cluster

     {L1, L2} = mnesia_recover:connect_nodes([N]),
     case lists:member(N, L1) or lists:member(N, L2) of
     true ->

     %% If the negotiation succeeds, these nodes become transaction nodes of this node.
     %% Note that this change takes effect immediately rather than being deferred until
     %% the transaction commits

             mnesia_lib:add({current, db_nodes}, N),
             mnesia_controller:add_active_replica(schema, N, SchemaCs);
     false ->

          %% If the negotiation fails, abort the transaction; the undo action of
          %% announce_im_running then strips away all newly added transaction nodes

         mnesia_lib:error("Mnesia on ~p could not connect to node ~p~n",
                   [node(), N]),
         mnesia:abort({node_not_running, N})
    end,
    announce_im_running(Ns, SchemaCs);
announce_im_running([], _) ->
    [].


During the three-phase commit of a schema operation, mnesia_tm first has to prepare:

mnesia_tm.erl
multi_commit(asym_trans, Majority, Tid, CR, Store) ->
…
     SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})),
…
mnesia_schema.erl
prepare_commit(Tid, Commit, WaitFor) ->
     case Commit#commit.schema_ops of
    [] ->
          {false, Commit, optional};
    OrigOps ->
          {Modified, Ops, DumperMode} =
          prepare_ops(Tid, OrigOps, WaitFor, false, [], optional),
     …
    end.
prepare_ops(Tid, [Op | Ops], WaitFor, Changed, Acc, DumperMode) ->
     case prepare_op(Tid, Op, WaitFor) of
          …
    {false, optional} ->
          prepare_ops(Tid, Ops, WaitFor, true, Acc, DumperMode)
     end;
prepare_ops(_Tid, [], _WaitFor, Changed, Acc, DumperMode) ->
    {Changed, Acc, DumperMode}.

prepare_op(_Tid, {op, announce_im_running, Node, SchemaDef, Running, RemoteRunning},
_WaitFor) ->
    SchemaCs = list2cs(SchemaDef),
    if
    Node == node() -> %% Announce has already run on local node
          ignore;        %% from do_merge_schema
    true ->
          %% If a node has restarted it may still linger in db_nodes,
          %% but have been removed from recover_nodes
          Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]),
          NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current,
          mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}),
          announce_im_running(NewNodes, SchemaCs)
    end,
    {false, optional};

Here we can see that during the prepare of announce_im_running, the node negotiates with
remote nodes it is not yet connected to; once the negotiation succeeds, those previously
unconnected nodes join this node's transaction-node cluster.



Conversely, once the schema operation is aborted, mnesia_tm performs the undo action:

mnesia_tm.erl
commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) ->
   ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]),
   case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of
   {Modified, C = #commit{}, DumperMode} ->
        %% If we can not find any local unclear decision
        %% we should presume abort at startup recovery
        case lists:member(node(), DiscNs) of
        false ->
              ignore;
        true ->
              case Modified of
              false -> mnesia_log:log(Bin);
              true -> mnesia_log:log(C)
              end
        end,
        ?eval_debug_fun({?MODULE, commit_participant, vote_yes},
                    [{tid, Tid}]),
        reply(Coord, {vote_yes, Tid, self()}),

         receive
         {Tid, pre_commit} ->
               …
               receive
               {Tid, committed} ->
                     …
               {Tid, {do_abort, _Reason}} ->
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述
mnesia脑裂问题综述

Contenu connexe

Tendances

Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...
Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...
Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...Ludovico Caldara
 
Oracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IOracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IAnju Garg
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Anju Garg
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101MongoDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j InternalsTobias Lindaaker
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...xKinAnx
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Postgresql 12 streaming replication hol
Postgresql 12 streaming replication holPostgresql 12 streaming replication hol
Postgresql 12 streaming replication holVijay Kumar N
 

Tendances (20)

Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...
Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...
Oracle Active Data Guard 12c: Far Sync Instance, Real-Time Cascade and Other ...
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Oracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-IOracle ACFS High Availability NFS Services (HANFS) Part-I
Oracle ACFS High Availability NFS Services (HANFS) Part-I
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)Oracle ACFS High Availability NFS Services (HANFS)
Oracle ACFS High Availability NFS Services (HANFS)
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Hadoop
HadoopHadoop
Hadoop
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Convert single instance to RAC
Convert single instance to RACConvert single instance to RAC
Convert single instance to RAC
 
Unix ppt
Unix pptUnix ppt
Unix ppt
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j Internals
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Postgresql 12 streaming replication hol
Postgresql 12 streaming replication holPostgresql 12 streaming replication hol
Postgresql 12 streaming replication hol
 

En vedette

了解网络
了解网络了解网络
了解网络Feng Yu
 
了解集群
了解集群了解集群
了解集群Feng Yu
 
我为什么要选择RabbitMQ
我为什么要选择RabbitMQ我为什么要选择RabbitMQ
我为什么要选择RabbitMQFeng Yu
 
高性能集群服务器(Erlang解决方案)
高性能集群服务器(Erlang解决方案)高性能集群服务器(Erlang解决方案)
高性能集群服务器(Erlang解决方案)Feng Yu
 
利用新硬件提升数据库性能
利用新硬件提升数据库性能利用新硬件提升数据库性能
利用新硬件提升数据库性能Feng Yu
 
Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Feng Yu
 
了解IO设备
了解IO设备了解IO设备
了解IO设备Feng Yu
 
MySQL和IO(上)
MySQL和IO(上)MySQL和IO(上)
MySQL和IO(上)Feng Yu
 
了解内存
了解内存了解内存
了解内存Feng Yu
 
Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Feng Yu
 
了解IO协议栈
了解IO协议栈了解IO协议栈
了解IO协议栈Feng Yu
 
Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践Feng Yu
 
了解应用服务器
了解应用服务器了解应用服务器
了解应用服务器Feng Yu
 
低成本和高性能MySQL云架构探索
低成本和高性能MySQL云架构探索低成本和高性能MySQL云架构探索
低成本和高性能MySQL云架构探索Feng Yu
 

En vedette (14)

了解网络
了解网络了解网络
了解网络
 
了解集群
了解集群了解集群
了解集群
 
我为什么要选择RabbitMQ
我为什么要选择RabbitMQ我为什么要选择RabbitMQ
我为什么要选择RabbitMQ
 
高性能集群服务器(Erlang解决方案)
高性能集群服务器(Erlang解决方案)高性能集群服务器(Erlang解决方案)
高性能集群服务器(Erlang解决方案)
 
利用新硬件提升数据库性能
利用新硬件提升数据库性能利用新硬件提升数据库性能
利用新硬件提升数据库性能
 
Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告
 
了解IO设备
了解IO设备了解IO设备
了解IO设备
 
MySQL和IO(上)
MySQL和IO(上)MySQL和IO(上)
MySQL和IO(上)
 
了解内存
了解内存了解内存
了解内存
 
Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言
 
了解IO协议栈
了解IO协议栈了解IO协议栈
了解IO协议栈
 
Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践
 
了解应用服务器
了解应用服务器了解应用服务器
了解应用服务器
 
低成本和高性能MySQL云架构探索
低成本和高性能MySQL云架构探索低成本和高性能MySQL云架构探索
低成本和高性能MySQL云架构探索
 

Similaire à mnesia脑裂问题综述

基于MHA的MySQL高可用方案
基于MHA的MySQL高可用方案基于MHA的MySQL高可用方案
基于MHA的MySQL高可用方案Louis liu
 
Glibc内存管理ptmalloc源代码分析4
Glibc内存管理ptmalloc源代码分析4Glibc内存管理ptmalloc源代码分析4
Glibc内存管理ptmalloc源代码分析4hans511002
 
Glibc memory management
Glibc memory managementGlibc memory management
Glibc memory managementdddsf3562
 
6 事务和并发控制
6 事务和并发控制6 事务和并发控制
6 事务和并发控制Zelin Wang
 
CloudTao技术白皮书
CloudTao技术白皮书CloudTao技术白皮书
CloudTao技术白皮书FIT2CLOUD
 
Ubuntu手册(中文版)
Ubuntu手册(中文版)Ubuntu手册(中文版)
Ubuntu手册(中文版)byp2011
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装feng lee
 
Cassandra运维之道(office2003)
Cassandra运维之道(office2003)Cassandra运维之道(office2003)
Cassandra运维之道(office2003)haiyuan ning
 
丁原:海量数据迁移方案
丁原:海量数据迁移方案丁原:海量数据迁移方案
丁原:海量数据迁移方案YANGL *
 
Csdn Emag(Oracle)第三期
Csdn Emag(Oracle)第三期Csdn Emag(Oracle)第三期
Csdn Emag(Oracle)第三期yiditushe
 
一次Web性能测试小结
一次Web性能测试小结一次Web性能测试小结
一次Web性能测试小结beiyu95
 
Csdn Emag(Oracle)第四期
Csdn Emag(Oracle)第四期Csdn Emag(Oracle)第四期
Csdn Emag(Oracle)第四期yiditushe
 
GCC_Porting_on_MiniSystem
GCC_Porting_on_MiniSystemGCC_Porting_on_MiniSystem
GCC_Porting_on_MiniSystemXiaojing Ma
 
Showinnodbstatus公开
Showinnodbstatus公开Showinnodbstatus公开
Showinnodbstatus公开longxibendi
 
Mnesia用户手册
Mnesia用户手册Mnesia用户手册
Mnesia用户手册shu beta
 
Heartbeat v2 安装和配置原理
Heartbeat v2 安装和配置原理Heartbeat v2 安装和配置原理
Heartbeat v2 安装和配置原理Pickup Li
 
嵌入式inux應用專題文件-智慧家庭系統
嵌入式inux應用專題文件-智慧家庭系統嵌入式inux應用專題文件-智慧家庭系統
嵌入式inux應用專題文件-智慧家庭系統艾鍗科技
 
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算TAAZE 讀冊生活
 

Similaire à mnesia脑裂问题综述 (20)

基于MHA的MySQL高可用方案
基于MHA的MySQL高可用方案基于MHA的MySQL高可用方案
基于MHA的MySQL高可用方案
 
Glibc内存管理ptmalloc源代码分析4
Glibc内存管理ptmalloc源代码分析4Glibc内存管理ptmalloc源代码分析4
Glibc内存管理ptmalloc源代码分析4
 
Glibc memory management
Glibc memory managementGlibc memory management
Glibc memory management
 
6 事务和并发控制
6 事务和并发控制6 事务和并发控制
6 事务和并发控制
 
CloudTao技术白皮书
CloudTao技术白皮书CloudTao技术白皮书
CloudTao技术白皮书
 
Ubuntu手册(中文版)
Ubuntu手册(中文版)Ubuntu手册(中文版)
Ubuntu手册(中文版)
 
Hadoop 安装
Hadoop 安装Hadoop 安装
Hadoop 安装
 
DR planning
DR planningDR planning
DR planning
 
Rack
RackRack
Rack
 
Cassandra运维之道(office2003)
Cassandra运维之道(office2003)Cassandra运维之道(office2003)
Cassandra运维之道(office2003)
 
丁原:海量数据迁移方案
丁原:海量数据迁移方案丁原:海量数据迁移方案
丁原:海量数据迁移方案
 
Csdn Emag(Oracle)第三期
Csdn Emag(Oracle)第三期Csdn Emag(Oracle)第三期
Csdn Emag(Oracle)第三期
 
一次Web性能测试小结
一次Web性能测试小结一次Web性能测试小结
一次Web性能测试小结
 
Csdn Emag(Oracle)第四期
Csdn Emag(Oracle)第四期Csdn Emag(Oracle)第四期
Csdn Emag(Oracle)第四期
 
GCC_Porting_on_MiniSystem
GCC_Porting_on_MiniSystemGCC_Porting_on_MiniSystem
GCC_Porting_on_MiniSystem
 
Showinnodbstatus公开
Showinnodbstatus公开Showinnodbstatus公开
Showinnodbstatus公开
 
Mnesia用户手册
Mnesia用户手册Mnesia用户手册
Mnesia用户手册
 
Heartbeat v2 安装和配置原理
Heartbeat v2 安装和配置原理Heartbeat v2 安装和配置原理
Heartbeat v2 安装和配置原理
 
嵌入式inux應用專題文件-智慧家庭系統
嵌入式inux應用專題文件-智慧家庭系統嵌入式inux應用專題文件-智慧家庭系統
嵌入式inux應用專題文件-智慧家庭系統
 
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算
跨領域物理視算:力學與電磁系統之視覺化、模擬與平行運算
 

Plus de Feng Yu

Cpu高效编程技术
Cpu高效编程技术Cpu高效编程技术
Cpu高效编程技术Feng Yu
 
Erlang开发实践
Erlang开发实践Erlang开发实践
Erlang开发实践Feng Yu
 
MySQL和IO(下)
MySQL和IO(下)MySQL和IO(下)
MySQL和IO(下)Feng Yu
 
了解Cpu
了解Cpu了解Cpu
了解CpuFeng Yu
 
SSD在淘宝的应用实践
SSD在淘宝的应用实践SSD在淘宝的应用实践
SSD在淘宝的应用实践Feng Yu
 
淘宝商品库MySQL优化实践
淘宝商品库MySQL优化实践淘宝商品库MySQL优化实践
淘宝商品库MySQL优化实践Feng Yu
 
开源混合存储方案(Flashcache)
开源混合存储方案(Flashcache)开源混合存储方案(Flashcache)
开源混合存储方案(Flashcache)Feng Yu
 
Erlang low cost_clound_computing
Erlang low cost_clound_computingErlang low cost_clound_computing
Erlang low cost_clound_computingFeng Yu
 
Systemtap
SystemtapSystemtap
SystemtapFeng Yu
 
Oprofile linux
Oprofile linuxOprofile linux
Oprofile linuxFeng Yu
 
C1000K高性能服务器构建技术
C1000K高性能服务器构建技术C1000K高性能服务器构建技术
C1000K高性能服务器构建技术Feng Yu
 
Erlang全接触
Erlang全接触Erlang全接触
Erlang全接触Feng Yu
 
Tsung 压力测试工具
Tsung 压力测试工具Tsung 压力测试工具
Tsung 压力测试工具Feng Yu
 
Inside Erlang Vm II
Inside Erlang Vm IIInside Erlang Vm II
Inside Erlang Vm IIFeng Yu
 

Plus de Feng Yu (16)

Cpu高效编程技术
Cpu高效编程技术Cpu高效编程技术
Cpu高效编程技术
 
Erlang开发实践
Erlang开发实践Erlang开发实践
Erlang开发实践
 
MySQL和IO(下)
MySQL和IO(下)MySQL和IO(下)
MySQL和IO(下)
 
了解Cpu
了解Cpu了解Cpu
了解Cpu
 
SSD在淘宝的应用实践
SSD在淘宝的应用实践SSD在淘宝的应用实践
SSD在淘宝的应用实践
 
淘宝商品库MySQL优化实践
淘宝商品库MySQL优化实践淘宝商品库MySQL优化实践
淘宝商品库MySQL优化实践
 
开源混合存储方案(Flashcache)
开源混合存储方案(Flashcache)开源混合存储方案(Flashcache)
开源混合存储方案(Flashcache)
 
Erlang low cost_clound_computing
Erlang low cost_clound_computingErlang low cost_clound_computing
Erlang low cost_clound_computing
 
Systemtap
SystemtapSystemtap
Systemtap
 
Oprofile linux
Oprofile linuxOprofile linux
Oprofile linux
 
Go
GoGo
Go
 
C1000K高性能服务器构建技术
C1000K高性能服务器构建技术C1000K高性能服务器构建技术
C1000K高性能服务器构建技术
 
Erlang全接触
Erlang全接触Erlang全接触
Erlang全接触
 
Tsung 压力测试工具
Tsung 压力测试工具Tsung 压力测试工具
Tsung 压力测试工具
 
Inside Erlang Vm II
Inside Erlang Vm IIInside Erlang Vm II
Inside Erlang Vm II
 
Go Lang
Go LangGo Lang
Go Lang
 

mnesia脑裂问题综述

  • 1. 目录 1. 现象与成因............................................................................................................................... 2 2. mnesia 运行机制 ...................................................................................................................... 3 3. 常见问题与注意事项............................................................................................................... 6 4. 源码分析................................................................................................................................... 8 1. mnesia:create_schema/1 的工作过程 ............................................................................. 8 1. 主体过程................................................................................................................... 8 2. 前半部分 mnesia:create_schema/1 做的工作 ........................................................ 9 3. 后半部分 mnesia:start/0 做的工作 ....................................................................... 19 4. mnesia:change_table_majority/2 的工作过程 .............................................................. 23 1. 调用接口................................................................................................................. 23 2. 事务操作................................................................................................................. 24 3. schema 事务提交接口 ........................................................................................... 29 4. schema 事务协议过程 ........................................................................................... 31 5. 远程节点事务管理器第一阶段提交 prepare 响应 .............................................. 34 6. 远程节点事务参与者第二阶段提交 precommit 响应 ......................................... 37 7. 请求节点事务发起者收到第二阶段提交 precommit 确认 ................................. 37 8. 远程节点事务参与者第三阶段提交 commit 响应............................................... 39 9. 第三阶段提交 commit 的本地提交过程 .............................................................. 39 5. majority 事务处理 .......................................................................................................... 45 6. 恢复................................................................................................................................. 46
  • 2. 1. 节点协议版本检查+节点 decision 通告与合并.................................................... 46 2. 节点发现,集群遍历 ............................................................................................. 51 3. 节点 schema 合并 .................................................................................................. 60 4. 节点数据合并部分 1,从远程节点加载表 .......................................................... 62 5. 节点数据合并部分 2,从本地磁盘加载表 .......................................................... 66 6. 节点数据合并部分 2,表加载完成 ...................................................................... 71 7. 分区检测......................................................................................................................... 73 1. 锁过程中的同步检测 ............................................................................................. 73 2. 事务过程中的同步检测 ......................................................................................... 75 3. 节点 down 异步检测.............................................................................................. 80 4. 节点 up 异步检测................................................................................................... 92 8. 其它................................................................................................................................. 95 分析代码版本为 erlang 版本 R15B03。 1. 现象与成因 现象:mnesia 在出现网络分区后,向各个分区写入不同数据,各个分区产生不一致的状态, 分区恢复后,mnesia 仍然呈现不一致的状态。若重启任意的分区,则重启的分区将从存活 的分区中拉取数据,自身原先的数据丢失。 原理:分布式系统受 CAP 原理(若系统能够容忍网络分区,则不能同时满足可用性和一致
  • 3. 性)约束,一些分布式存储系统为保证可用性,放弃强一致转而追求最终一致。 mnesia 也是最终一致的分布式数据库,在没有分区的时候,mnesia 为强一致的,而出现分 区后,mnesia 仍然允许写入,因此将呈现不一致的状态。分区消除后,需要应用者处理不 一致的状态。简单的恢复过程如重启被放弃的分区,令其重新从保留的分区拉取数据,复杂 的恢复过程则需要编写数据订正程序,应用订正程序进行恢复。 2. mnesia 运行机制 mnesia 运行机制状态图,事务过程采用 majority 事务,即当大多数节点在集群中时,才允 许写: mnesia 运行机制解释:
  • 4. 1. 事务:通过运行时事务保证无分区时的强一致性,mnesia 支持多种事务类型: a) 无锁无事务脏写,一阶段异步; b) 有锁异步事务,一阶段同步锁,一阶段同步一阶段异步事务; c) 有锁同步事务,一阶段同步锁,两阶段同步事务; d) 有锁 majority 事务,一阶段同步锁,两阶段同步一阶段异步事务; e) 有锁 schema 事务,一阶段同步锁,三阶段同步事务,是附带 schema 操作的 majority 事务; 2. 恢复:通过重启时恢复保证有分区时的最终一致性,mnesia 重启时进行如下分布式协商 工作: a) 节点发现; b) 节点协议版本协商; c) 节点 schema 合并; d) 节点事务 decision 合并; i. 若远程节点事务结果 abort,本节点事务结果 commit,则出现冲突,报告 {inconsistent_database, bad_decision, Node},本节点事务结果改为 abort; ii. 若远程节点事务结果 commit,本节点事务结果 abort,本节点事务结果仍为 abort,此时远程节点将进行修改和通报; iii. 若远程节点事务结果 unclear,本节点事务结果非 unclear,则事务结果为本节 点事务结果,远程节点进行修改; iv. 若远程节点事务结果 unclear,本节点事务结果 unclear,则等待其它直到事务 结果的节点启动,并按照其结果作为事务结果; v. 若所有节点事务结果均 unclear,则事务结果为 unclear;
  • 5. vi. 事务 decision 并不真正影响实际的数据内容; e) 节点表数据合并: i. 若本节点为 master 节点,则本节点从磁盘加载表数据; ii. 若本节点有 local 表,则本节点从磁盘加载 local 表数据; iii. 若远程节点存活,则从远程节点拉取表数据; iv. 若远程节点未存活,本节点为最后一个关闭的节点,则本节点从磁盘加载表数 据; v. 若远程节点未存活,本节点非最后一个关闭的节点,则等待其它远程节点启动 加载表数据后,再从远程节点拉取表数据,远程节点未启动加载时,表不可访 问; vi. 若表数据已经加载,则不会再从远程节点拉取表数据; vii. 从集群角度看: 1. 若有其它节点重启时发起新的分布式协商,本节点将其加入集群拓扑视图; 2. 若有集群中节点 down(关闭或者产生分区) 本节点将其移出集群拓扑视 , 图; 3. 分区恢复时不进行分布式协商,其它分区的节点不能加入集群拓扑视图, 各个分区依旧保持分区状态; 3. 不一致状态检测:通过运行时和重启时监控远程节点的 up 和 down 状态、远程节点对 事务的决议结果,检测是否曾经发生过程网络分区,若出现过,则意味着潜在的分区不 一致,此时将通告应用者一个 inconsistent_database 事件: a) 运行时监控远程节点的 up 和 down 历史状态,若彼此都认为对方 down 过,则在 远程节点重新 up 时,即通告{inconsistent_database, running_partitioned_network,
  • 6. Node}; b) 重启时监控远程节点的 up 和 down 历史状态,若彼此都认为对方 down 过,则在 远程节点重新 up 时,即通告{inconsistent_database, starting_partitioned_network, Node}; c) 运行时和重启时与远程节点交换事务 decision,若发现对方 abort 而本身 commit 事务,即通告{inconsistent_database, bad_decision, Node}; 3. 常见问题与注意事项 此处的所有问题涉及事务的部分仅讨论 majority 事务,因为此类事务比同步和异步事务要完 备一些,也不包含 schema 操作。 fail_safe 状态:出现网络分区后,minority 分区不能写入的状态。 常见问题: 1. 出现分区后,本节点节点为 minority 分区,分区恢复后,若本节点不重启,能否一直保 持 fail_safe 状态? 若不断有其它节点启动,然后与本节点进行协商,加入本集群,使得集群变为 majority, 此时集群变为可写; 若没有任何其他节点启动,则本节点一致保持 fail_safe 状态; 2. 出现网络瞬断后,在 majority 分区写入,此写入不能到达 minority 分区,分区恢复后, 在 minority 分区写入,此时 minority 如何进入 fail_safe 状态? mnesia 依赖于 erlang 虚拟机检测是否有节点 down,在 majority 写入时,erlang 虚拟机 将检测到 minority 节点 down, minority 的 erlang 虚拟机也会发现 majority 节点 down。 而
  • 7. 双方都能发现对方 down,majority 继续可写,而 minority 进入 fail_safe 状态; 3. 对于集群 A、B、C,产生分区 A 与 B、C,在 B、C 写入数据,恢复分区。此时若重启 A, 有什么效果?重启 B、C 有什么效果? 经过试验得出: a) 若重启 A,则在 A 中能正确发现 B、 写入的记录, C 这依赖于 A 启动时的协商过程, A 向 B、C 请求表数据; b) 若重启 B、C,则在 B、C 中不能发现原先写入的记录,这依赖于 B、C 启动时的协 商过程,B、C 向 A 请求表数据; 注意事项: 1. mnesia 在出现分区时是最终一致的而非强一致的,要保证强一致,可以指定一个 master 节点,由他来仲裁最终的数据结果,但这样也会引入单点问题; 2. 订阅 mnesia 产生的 system 事件(包括 inconsistent_database 事件) mnesia 已经启动, 时 一些事件可能已经被 mnesia 发出,因此可能会遗漏一些事件; 3. 订阅 mnesia 事件是非持久的,mnesia 重启时需要重新订阅; 4. majority 事务是二次同步一次异步事务,提交过程中还需要参与节点创建一个进程进行 事务处理,加上原本的一个 ets 表和一次同步锁,可能进一步降低性能; 5. majority 事务不能约束恢复过程,而恢复过程将优先选择存活远程节点的表作为本节点 表的恢复依据; 6. mnesia 对 inconsistent_database 的检查与报告是一个较强的条件,可能产生误报; 7. 发现脑裂结束后,最好应采取告警,由人力介入来解决,而非自动解决;
  • 8. 4. 源码分析 主题包括: 1. mnesia 磁盘表副本要求 schema 也有磁盘表副本,因此需要参考 mnesia:create_schema/1 的工作过程; 2. 此处使用 majority 事务进行解释, 必须参考 mnesia:change_table_majority/2 的工作过程, 且此过程是 schema 事务,可以更详细全面的理解 majority 事务; 3. majority 事务处理将弱化 schema 事务模型,进行特定的解释; 4. 恢复过程分析 mnesia 启动时的主要工作、分布式协商过程、磁盘表加载; 5. 分区检查分析 mnesia 如何检查各类 inconsistent_database 事件; 1. mnesia:create_schema/1 的工作过程 1. 主体过程 安装 schema 的过程必须要在 mnesia 停机的条件下进行,此后,mnesia 启动。 schema 添加的过程本质上是一个两阶段提交过程: schema 变更发起节点 1. 询问各个参与节点是否已经由 schema 副本 2. 上全局锁{mnesia_table_lock, schema} 3. 在各个参与节点上建立 mnesia_fallback 进程 4. 第一阶段向各个节点的 mnesia_fallback 进程广播{start, Header, Schema2}消息,通知其保 存新生成的 schema 文件备份
  • 9. 5. 第二阶段向各个节点的 mnesia_fallback 进程广播 swap 消息,通知其完成提交过程,创 建真正的"FALLBACK.BUP"文件 6. 最终向各个节点的 mnesia_fallback 进程广播 stop 消息,完成变更 2. 前半部分 mnesia:create_schema/1 做的工作 mnesia.erl create_schema(Ns) -> mnesia_bup:create_schema(Ns). mnesia_bup.erl create_schema([]) -> create_schema([node()]); create_schema(Ns) when is_list(Ns) -> case is_set(Ns) of true -> create_schema(Ns, mnesia_schema:ensure_no_schema(Ns)); false -> {error, {combine_error, Ns}} end; create_schema(Ns) -> {error, {badarg, Ns}}. mnesia_schema.erl ensure_no_schema([H|T]) when is_atom(H) -> case rpc:call(H, ?MODULE, remote_read_schema, []) of {badrpc, Reason} -> {H, {"All nodes not running", H, Reason}}; {ok,Source, _} when Source /= default -> {H, {already_exists, H}}; _ -> ensure_no_schema(T) end; ensure_no_schema([H|_]) -> {error,{badarg, H}}; ensure_no_schema([]) -> ok. remote_read_schema() -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok ->
  • 10. case mnesia_monitor:get_env(schema_location) of opt_disc -> read_schema(false); _ -> read_schema(false) end; {error, Reason} -> {error, Reason} end. 询问其它所有节点,检查其是否启动,并检查其是否已经具备了 mnesia 的 schema,仅当所 有预备建立 mnesia schema 的节点全部启动,且没有 schema 副本,该检查才成立。 回到 mnesia_bup.erl mnesia_bup.erl create_schema(Ns, ok) -> case mnesia_lib:ensure_loaded(?APPLICATION) of ok -> case mnesia_monitor:get_env(schema_location) of ram -> {error, {has_no_disc, node()}}; _ -> case mnesia_schema:opt_create_dir(true, mnesia_lib:dir()) of {error, What} -> {error, What}; ok -> Mod = mnesia_backup, Str = mk_str(), File = mnesia_lib:dir(Str), file:delete(File), case catch make_initial_backup(Ns, File, Mod) of {ok, _Res} -> case do_install_fallback(File, Mod) of ok -> file:delete(File), ok; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end
  • 11. end end; {error, Reason} -> {error, Reason} end; create_schema(_Ns, {error, Reason}) -> {error, Reason}; create_schema(_Ns, Reason) -> {error, Reason}. 通过 mnesia_bup:make_initial_backup 创建一个本地节点的新 schema 的描述文件,然后再通 过 mnesia_bup:do_install_fallback 将新 schema 描述文件通过恢复过程,变更 schema: make_initial_backup(Ns, Opaque, Mod) -> Orig = mnesia_schema:get_initial_schema(disc_copies, Ns), Modded = proplists:delete(storage_properties, proplists:delete(majority, Orig)), Schema = [{schema, schema, Modded}], O2 = do_apply(Mod, open_write, [Opaque], Opaque), O3 = do_apply(Mod, write, [O2, [mnesia_log:backup_log_header()]], O2), O4 = do_apply(Mod, write, [O3, Schema], O3), O5 = do_apply(Mod, commit_write, [O4], O4), {ok, O5}. 创建一个本地节点的新 schema 的描述文件,注意,新 schema 的 majority 属性没有在备份 中。 mnesia_schema.erl get_initial_schema(SchemaStorage, Nodes) -> Cs = #cstruct{name = schema, record_name = schema, attributes = [table, cstruct]}, Cs2 = case SchemaStorage of ram_copies -> Cs#cstruct{ram_copies = Nodes}; disc_copies -> Cs#cstruct{disc_copies = Nodes} end, cs2list(Cs2). mnesia_bup.erl do_install_fallback(Opaque, Mod) when is_atom(Mod) -> do_install_fallback(Opaque, [{module, Mod}]); do_install_fallback(Opaque, Args) when is_list(Args) -> case check_fallback_args(Args, #fallback_args{opaque = Opaque}) of
  • 12. {ok, FA} -> do_install_fallback(FA); {error, Reason} -> {error, Reason} end; do_install_fallback(_Opaque, Args) -> {error, {badarg, Args}}. 检 查 安 装 参 数 , 将 参 数 装 入 一 个 fallback_args 结 构 , 参 数 检 查 及 构 造 过 程 在 check_fallback_arg_type/2 中,然后进行安装 check_fallback_args([Arg | Tail], FA) -> case catch check_fallback_arg_type(Arg, FA) of {'EXIT', _Reason} -> {error, {badarg, Arg}}; FA2 -> check_fallback_args(Tail, FA2) end; check_fallback_args([], FA) -> {ok, FA}. check_fallback_arg_type(Arg, FA) -> case Arg of {scope, global} -> FA#fallback_args{scope = global}; {scope, local} -> FA#fallback_args{scope = local}; {module, Mod} -> Mod2 = mnesia_monitor:do_check_type(backup_module, Mod), FA#fallback_args{module = Mod2}; {mnesia_dir, Dir} -> FA#fallback_args{mnesia_dir = Dir, use_default_dir = false}; {keep_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{keep_tables = Tabs}; {skip_tables, Tabs} -> atom_list(Tabs), FA#fallback_args{skip_tables = Tabs}; {default_op, keep_tables} -> FA#fallback_args{default_op = keep_tables}; {default_op, skip_tables} -> FA#fallback_args{default_op = skip_tables} end.
  • 13. 此处的构造过程记录 module 参数, mnesia_backup, 为 同时记录 opaque 参数, 为新建 schema 文件的文件名。 do_install_fallback(FA) -> Pid = spawn_link(?MODULE, install_fallback_master, [self(), FA]), Res = receive {'EXIT', Pid, Reason} -> % if appl has trapped exit {error, {'EXIT', Reason}}; {Pid, Res2} -> case Res2 of {ok, _} -> ok; {error, Reason} -> {error, {"Cannot install fallback", Reason}} end end, Res. install_fallback_master(ClientPid, FA) -> process_flag(trap_exit, true), State = {start, FA}, Opaque = FA#fallback_args.opaque, Mod = FA#fallback_args.module, Res = (catch iterate(Mod, fun restore_recs/4, Opaque, State)), unlink(ClientPid), ClientPid ! {self(), Res}, exit(shutdown). 从新建的 schema 文件中迭代恢复到本地节点和全局集群,此时 Mod 为 mnesia_backup, Opaque 为新建 schema 文件的文件名,State 为给出的 fallback_args 参数,均为默认值。 fallback_args 默认定义: -record(fallback_args, {opaque, scope = global, module = mnesia_monitor:get_env(backup_module), use_default_dir = true, mnesia_dir, fallback_bup, fallback_tmp, skip_tables = [],
  • 14. keep_tables = [], default_op = keep_tables }). iterate(Mod, Fun, Opaque, Acc) -> R = #restore{bup_module = Mod, bup_data = Opaque}, case catch read_schema_section(R) of {error, Reason} -> {error, Reason}; {R2, {Header, Schema, Rest}} -> case catch iter(R2, Header, Schema, Fun, Acc, Rest) of {ok, R3, Res} -> catch safe_apply(R3, close_read, [R3#restore.bup_data]), {ok, Res}; {error, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, Reason}; {'EXIT', Pid, Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {'EXIT', Pid, Reason}}; {'EXIT', Reason} -> catch safe_apply(R2, close_read, [R2#restore.bup_data]), {error, {'EXIT', Reason}} end end. iter(R, Header, Schema, Fun, Acc, []) -> case safe_apply(R, read, [R#restore.bup_data]) of {R2, []} -> Res = Fun([], Header, Schema, Acc), {ok, R2, Res}; {R2, BupItems} -> iter(R2, Header, Schema, Fun, Acc, BupItems) end; iter(R, Header, Schema, Fun, Acc, BupItems) -> Acc2 = Fun(BupItems, Header, Schema, Acc), iter(R, Header, Schema, Fun, Acc2, []). read_schema_section 将读出新建 schema 文件的内容,得到文件头部,并组装出 schema 结 构,将 schema 应用回调函数,此处回调函数为 mnesia_bup 的 restore_recs/4 函数: restore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2),
  • 15. case catch mnesia_schema:list2cs(CreateList) of {'EXIT', Reason} -> throw({error, {"Bad schema in restore_recs", Reason}}); Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end; 一个典型的 schema 结构如下: [{schema,schema, [{name,schema}, {type,set}, {ram_copies,[]}, {disc_copies,['rds_la_dev@10.232.64.77']}, {disc_only_copies,[]}, {load_order,0}, {access_mode,read_write}, {index,[]}, {snmp,[]}, {local_content,false}, {record_name,schema}, {attributes,[table,cstruct]}, {user_properties,[]}, {frag_properties,[]}, {cookie,{{1358,676768,107058},'rds_la_dev@10.232.64.77'}}, {version,{{2,0},[]}}]}] 构成一个{schema, schema, CreateList}的元组,同时调用 mnesia_schema:list2cs(CreateList),将 CreateList 还原回 schema 的 cstruct 结构。 mnesia_bup.erl restore_recs(Recs, Header, Schema, {start, FA}) -> %% No records in backup Schema2 = convert_schema(Header#log_header.log_version, Schema), CreateList = lookup_schema(schema, Schema2), case catch mnesia_schema:list2cs(CreateList) of {'EXIT', Reason} -> throw({error, {"Bad schema in restore_recs", Reason}});
  • 16. Cs -> Ns = get_fallback_nodes(FA, Cs#cstruct.disc_copies), global:set_lock({{mnesia_table_lock, schema}, self()}, Ns, infinity), Args = [self(), FA], Pids = [spawn_link(N, ?MODULE, fallback_receiver, Args) || N <- Ns], send_fallback(Pids, {start, Header, Schema2}), Res = restore_recs(Recs, Header, Schema2, Pids), global:del_lock({{mnesia_table_lock, schema}, self()}, Ns), Res end; get_fallback_nodes 将得到参与 schema 构建节点的 fallback 节点,通常为所有参与 schema 构建的节点。 构建过程要加入集群的全局锁{mnesia_table_lock, schema}。 在各个参与 schema 构建的节点上,均创建一个 fallback_receiver 进程, 处理 schema 的变更。 向这些节点的 fallback_receiver 进程广播{start, Header, Schema2}消息,并等待其返回结果。 所有节点的 fallback_receiver 进程对 start 消息响应后,进入下一个过程: restore_recs([], _Header, _Schema, Pids) -> send_fallback(Pids, swap), send_fallback(Pids, stop), stop; restore_recs 向所有节点的 fallback_receiver 进程广播后续的 swap 消息和 stop 消息,完成整 个 schema 变更过程,然后释放全局锁{mnesia_table_lock, schema}。 进入 fallback_receiver 进程的处理过程: fallback_receiver(Master, FA) -> process_flag(trap_exit, true), case catch register(mnesia_fallback, self()) of {'EXIT', _} -> Reason = {already_exists, node()}, local_fallback_error(Master, Reason); true -> FA2 = check_fallback_dir(Master, FA), Bup = FA2#fallback_args.fallback_bup, case mnesia_lib:exists(Bup) of
  • 17. true -> Reason2 = {already_exists, node()}, local_fallback_error(Master, Reason2); false -> Mod = mnesia_backup, Tmp = FA2#fallback_args.fallback_tmp, R = #restore{mode = replace, bup_module = Mod, bup_data = Tmp}, file:delete(Tmp), case catch fallback_receiver_loop(Master, R, FA2, schema) of {error, Reason} -> local_fallback_error(Master, Reason); Other -> exit(Other) end end end. 在自身的节点上注册进程名字为 mnesia_fallback。 构建初始化状态。 进入 fallback_receiver_loop 循环处理来自 schema 变更发起节点的消息。 fallback_receiver_loop(Master, R, FA, State) -> receive {Master, {start, Header, Schema}} when State =:= schema -> Dir = FA#fallback_args.mnesia_dir, throw_bad_res(ok, mnesia_schema:opt_create_dir(true, Dir)), R2 = safe_apply(R, open_write, [R#restore.bup_data]), R3 = safe_apply(R2, write, [R2#restore.bup_data, [Header]]), BupSchema = [schema2bup(S) || S <- Schema], R4 = safe_apply(R3, write, [R3#restore.bup_data, BupSchema]), Master ! {self(), ok}, fallback_receiver_loop(Master, R4, FA, records); … end. 在本地也创建一个 schema 临时文件, 接收来自变更发起节点构建的 header 部分和新 schema。 fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []),
  • 18. safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup, Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); … end. mnesia_backup.erl commit_write(OpaqueData) -> B = OpaqueData, case disk_log:sync(B#backup.file_desc) of ok -> case disk_log:close(B#backup.file_desc) of ok -> case file:rename(B#backup.tmp_file, B#backup.file) of ok -> {ok, B#backup.file}; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end; {error, Reason} -> {error, Reason} end. 变更提交过程,新建的 schema 文件在写入到本节点时,为文件名后跟".BUPTMP"表明是一 个临时未提交的文件,此处进行提交时,sync 新建的 schema 文件到磁盘后关闭,并重命名 为真正的新建的 schema 文件名,消除最后的".BUPTMP" fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, swap} when State =/= schema -> ?eval_debug_fun({?MODULE, fallback_receiver_loop, pre_swap}, []), safe_apply(R, commit_write, [R#restore.bup_data]), Bup = FA#fallback_args.fallback_bup,
  • 19. Tmp = FA#fallback_args.fallback_tmp, throw_bad_res(ok, file:rename(Tmp, Bup)), catch mnesia_lib:set(active_fallback, true), ?eval_debug_fun({?MODULE, fallback_receiver_loop, post_swap}, []), Master ! {self(), ok}, fallback_receiver_loop(Master, R, FA, stop); … end. 在这个参与节点上,将新建 schema 文件命名为"FALLBACK.BUP",同时激活本地节点的 active_fallback 属性,表明称为一个活动 fallback 节点。 fallback_receiver_loop(Master, R, FA, State) -> receive … {Master, stop} when State =:= stop -> stopped; … end. 收到 stop 消息后,mnesia_fallback 进程退出。 3. 后半部分 mnesia:start/0 做的工作 mnesia 启 动 , 则 可 以 自 动 通 过 事 务 管 理 器 mnesia_tm 调 用 mnesia_bup:tm_fallback_start(IgnoreFallback)将 schema 建立到 dets 表中: mnesia_bup.erl tm_fallback_start(IgnoreFallback) -> mnesia_schema:lock_schema(), Res = do_fallback_start(fallback_exists(), IgnoreFallback), mnesia_schema: unlock_schema(), case Res of ok -> ok; {error, Reason} -> exit(Reason) end. 锁住 schema 表,然后通过"FALLBACK.BUP"文件进行 schema 恢复创建,最后释放 schema 表 锁
  • 20. do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of … end. 根据"FALLBACK.BUP"文件,调用 restore_tables 函数进行恢复 restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State); init_dat_files(Schema, LocalTabs) -> TmpFile = mnesia_lib:tab2tmp(schema), Args = [{file, TmpFile}, {keypos, 2}, {type, set}], case dets:open_file(schema, Args) of % Assume schema lock {ok, _} -> create_dat_files(Schema, LocalTabs), ok = dets:close(schema), LocalTab = #local_tab{ name = schema, storage_type = disc_copies, open = undefined, add = undefined, close = undefined, swap = undefined, record_name = schema, opened = false}, ?ets_insert(LocalTabs, LocalTab); {error, Reason} -> throw({error, {"Cannot open file", schema, Args, Reason}}) end. 创建 schema 的 dets 表,文件名为 schema.TMP,根据"FALLBACK.BUP"文件,将各个表的元数 据恢复到新建的 schema 的 dets 表中。
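上面对 schema 的 dets 表的操作,核心就是 dets 的 open_file/insert/close 这组接口。下面是一个剥离出来的示意片段(表名、文件名与记录内容均为笔者假设,仅演示 {keypos, 2} 与临时文件的用法):
%% 示意:以 {keypos, 2} 打开一个 set 类型的 dets 文件,再写入一条 {schema, 表名, 表定义} 形式的记录
Tmp = "schema.TMP",
{ok, Ref} = dets:open_file(schema_demo, [{file, Tmp}, {keypos, 2}, {type, set}]),
ok = dets:insert(Ref, {schema, my_tab, [{type, set}]}),   %% 以元组第 2 个元素 my_tab 作为 key
ok = dets:close(Ref).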
  • 21. 调用 create_dat_files 构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数,然后 调用之,将其它表的元数据持久化到 schema 表中。 restore_tables(Recs, Header, Schema, {start, LocalTabs}) -> Dir = mnesia_lib:dir(), OldDir = filename:join([Dir, "OLD_DIR"]), mnesia_schema:purge_dir(OldDir, []), mnesia_schema:purge_dir(Dir, [fallback_name()]), init_dat_files(Schema, LocalTabs), State = {new, LocalTabs}, restore_tables(Recs, Header, Schema, State); 构建其它表在本节点的元数据信息的 Open/Add/Close/Swap 函数 restore_tables(All=[Rec | Recs], Header, Schema, {new, LocalTabs}) -> Tab = element(1, Rec), case ?ets_lookup(LocalTabs, Tab) of [] -> State = {not_local, LocalTabs, Tab}, restore_tables(Recs, Header, Schema, State); [LT] when is_record(LT, local_tab) -> State = {local, LocalTabs, LT}, case LT#local_tab.opened of true -> ignore; false -> (LT#local_tab.open)(Tab, LT), ?ets_insert(LocalTabs,LT#local_tab{opened=true}) end, restore_tables(All, Header, Schema, State) end; 打开表,不断检查表是否位于本地,若是则进行恢复添加过程: restore_tables(All=[Rec | Recs], Header, Schema, State={local, LocalTabs, LT}) -> Tab = element(1, Rec), if Tab =:= LT#local_tab.name -> Key = element(2, Rec), (LT#local_tab.add)(Tab, Key, Rec, LT), restore_tables(Recs, Header, Schema, State); true -> NewState = {new, LocalTabs}, restore_tables(All, Header, Schema, NewState) end; Add 函数主要为将表记录入 schema 表,此处是写入临时 schema,而未真正提交
• 22. 待所有表恢复完成后,进行真正的提交工作: do_fallback_start(true, false) -> verbose("Starting from fallback...~n", []), BupFile = fallback_bup(), Mod = mnesia_backup, LocalTabs = ?ets_new_table(mnesia_local_tables, [set, public, {keypos, 2}]), case catch iterate(Mod, fun restore_tables/4, BupFile, {start, LocalTabs}) of {ok, _Res} -> catch dets:close(schema), TmpSchema = mnesia_lib:tab2tmp(schema), DatSchema = mnesia_lib:tab2dat(schema), AllLT = ?ets_match_object(LocalTabs, '_'), ?ets_delete_table(LocalTabs), case file:rename(TmpSchema, DatSchema) of ok -> [(LT#local_tab.swap)(LT#local_tab.name, LT) || LT <- AllLT, LT#local_tab.name =/= schema], file:delete(BupFile), ok; {error, Reason} -> file:delete(TmpSchema), {error, {"Cannot start from fallback. Rename error.", Reason}} end; {error, Reason} -> {error, {"Cannot start from fallback", Reason}}; {'EXIT', Reason} -> {error, {"Cannot start from fallback", Reason}} end. 将 schema.TMP 变更为 schema.DAT,正式启用持久 schema,提交 schema 表的变更,同时调用在 create_dat_files 函数中创建的转换函数,进行各个表的提交工作,对于 ram_copies 表,没有什么额外动作,对于 disc_only_copies 表,主要为提交其对应 dets 表的文件名,对于 disc_copies 表,主要为记录 redo 日志,然后提交其对应 dets 表的文件名。 全部完成后,schema 表将成为持久的 dets 表,"FALLBACK.BUP"文件也将被删除。 事务管理器在完成 schema 的 dets 表的构建后,将初始化 mnesia_schema: mnesia_schema.erl
  • 23. init(IgnoreFallback) -> Res = read_schema(true, IgnoreFallback), {ok, Source, _CreateList} = exit_on_error(Res), verbose("Schema initiated from: ~p~n", [Source]), set({schema, tables}, []), set({schema, local_tables}, []), Tabs = set_schema(?ets_first(schema)), lists:foreach(fun(Tab) -> clear_whereabouts(Tab) end, Tabs), set({schema, where_to_read}, node()), set({schema, load_node}, node()), set({schema, load_reason}, initial), mnesia_controller:add_active_replica(schema, node()). 检查 schema 表从何处恢复,在 mnesia_gvar 这个全局状态 ets 表中,初始化 schema 的原始 信息,并将本节点作为 schema 表的初始活动副本 若某个节点作为一个表的活动副本,则表的 where_to_commit 和 where_to_write 属性必须 同时包含该节点。 4. mnesia:change_table_majority/2 的工作过程 mnesia 表可以在建立时,设置一个 majority 的属性,也可以在建立表之后通过 mnesia: change_table_majority/2 更改此属性。 该属性可以要求 mnesia 在进行事务时,检查所有参与事务的节点是否为表的提交节点的大 多数,这样可以在出现网络分区时,保证 majority 节点的可用性,同时也能保证整个网络的 一致性,minority 节点将不可用,这也是 CAP 理论的一个折中。 1. 调用接口 mnesia.erl change_table_majority(T, M) -> mnesia_schema:change_table_majority(T, M).
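在深入内部实现之前,先给出这个接口的一个示意用法(表名与副本节点均为假设):
%% 示意:先建一张多副本表,再把 majority 属性打开
{atomic, ok} = mnesia:create_table(my_tab, [{disc_copies, [node() | nodes()]}, {attributes, [key, val]}]),
{atomic, ok} = mnesia:change_table_majority(my_tab, true).
%% 打开后,写事务只有在表的多数副本节点参与时才能提交,少数派一侧会以 {no_majority, my_tab} 中止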
  • 24. mnesia_schema.erl change_table_majority(Tab, Majority) when is_boolean(Majority) -> schema_transaction(fun() -> do_change_table_majority(Tab, Majority) end). schema_transaction(Fun) -> case get(mnesia_activity_state) of undefined -> Args = [self(), Fun, whereis(mnesia_controller)], Pid = spawn_link(?MODULE, schema_coordinator, Args), receive {transaction_done, Res, Pid} -> Res; {'EXIT', Pid, R} -> {aborted, {transaction_crashed, R}} end; _ -> {aborted, nested_transaction} end. 启动一个 schema 事务的协调者 schema_coordinator 进程。 schema_coordinator(Client, Fun, Controller) when is_pid(Controller) -> link(Controller), unlink(Client), Res = mnesia:transaction(Fun), Client ! {transaction_done, Res, self()}, unlink(Controller), % Avoids spurious exit message unlink(whereis(mnesia_tm)), % Avoids spurious exit message exit(normal). 与普通事务不同, schema 事务使用的 schema_coordinator 进程 link 到的不是请求者的进程, 而是 mnesia_controller 进程。 启动一个 mnesia 事务,函数为 fun() -> do_change_table_majority(Tab, Majority) end。 2. 事务操作 do_change_table_majority(schema, _Majority) -> mnesia:abort({bad_type, schema}); do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)).
  • 25. 可以看出,不能修改 schema 表的 majority 属性。 对 schema 表主动申请写锁,而不对需要更改 majority 属性的表申请锁 get_tid_ts_and_lock(Tab, Intent) -> TidTs = get(mnesia_activity_state), case TidTs of {_Mod, Tid, Ts} when is_record(Ts, tidstore)-> Store = Ts#tidstore.store, case Intent of read -> mnesia_locker:rlock_table(Tid, Store, Tab); write -> mnesia_locker:wlock_table(Tid, Store, Tab); none -> ignore end, TidTs; _ -> mnesia:abort(no_transaction) end. 上锁的过程:直接向锁管理器 mnesia_locker 请求表锁。 do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)). 关注实际的 majority 属性的修改动作: make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} -> FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority}
  • 26. end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps]. 通过 ensure_writable 检查 schema 表的 where_to_write 属性是否为[],即是否有持久化的 schema 节点。 通过 incr_version 更新表的版本号。 通过 ensure_active 检查所有表的副本节点是否存活, 即与副本节点进行表的全局视图确认。 修改表的元数据版本号: incr_version(Cs) -> {{Major, Minor}, _} = Cs#cstruct.version, Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), V= case Nodes -- val({Cs#cstruct.name, active_replicas}) of [] -> {Major + 1, 0}; % All replicas are active _ -> {Major, Minor + 1} % Some replicas are inactive end, Cs#cstruct{version = {V, {node(), now()}}}. mnesia_lib.erl cs_to_nodes(Cs) -> Cs#cstruct.disc_only_copies ++ Cs#cstruct.disc_copies ++ Cs#cstruct.ram_copies. 重新计算表的元数据版本号,由于这是一个 schema 表的变更,需要参考有持久 schema 的 节点以及持有该表副本的节点的信息而计算表的版本号,若这二类节点的交集全部存活,则 主版本可以增加,否则仅能增加副版本,同时为表的 cstruct 结构生成一个新的版本描述符, 这个版本描述符包括三个部分:{新的版本号,{发起变更的节点,发起变更的时间}},相当 于时空序列+单调递增序列。版本号的计算类似于 NDB。 检查表的全局视图:
  • 27. ensure_active(Cs) -> ensure_active(Cs, active_replicas). ensure_active(Cs, What) -> Tab = Cs#cstruct.name, W = {Tab, What}, ensure_non_empty(W), Nodes = mnesia_lib:intersect(val({schema, disc_copies}), mnesia_lib:cs_to_nodes(Cs)), case Nodes -- val(W) of [] -> ok; Ns -> Expl = "All replicas on diskfull nodes are not active yet", case val({Tab, local_content}) of true -> case rpc:multicall(Ns, ?MODULE, is_remote_member, [W]) of {Replies, []} -> check_active(Replies, Expl, Tab); {_Replies, BadNs} -> mnesia:abort({not_active, Expl, Tab, BadNs}) end; false -> mnesia:abort({not_active, Expl, Tab, Ns}) end end. is_remote_member(Key) -> IsActive = lists:member(node(), val(Key)), {IsActive, node()}. 为了防止不一致的状态,需要向这样的未明节点进行确认:该节点不是表的活动副本节点, 却是表的副本节点,也是持久 schema 节点。确认的内容是:通过 is_remote_member 询问 该节点,其是否已经是该表的活动副本节点。这样就避免了未明节点与请求节点对改变状态 的不一致认知。 make_change_table_majority(Tab, Majority) -> ensure_writable(schema), Cs = incr_version(val({Tab, cstruct})), ensure_active(Cs), OldMajority = Cs#cstruct.majority, Cs2 = Cs#cstruct{majority = Majority}, FragOps = case lists:keyfind(base_table, 1, Cs#cstruct.frag_properties) of {_, Tab} ->
• 28. FragNames = mnesia_frag:frag_names(Tab) -- [Tab], lists:map( fun(T) -> get_tid_ts_and_lock(Tab, none), CsT = incr_version(val({T, cstruct})), ensure_active(CsT), CsT2 = CsT#cstruct{majority = Majority}, verify_cstruct(CsT2), {op, change_table_majority, vsn_cs2list(CsT2), OldMajority, Majority} end, FragNames); false -> []; {_, _} -> mnesia:abort({bad_type, Tab}) end, verify_cstruct(Cs2), [{op, change_table_majority, vsn_cs2list(Cs2), OldMajority, Majority} | FragOps]. 变更表的 cstruct 中对 majority 属性的记录,检查表的新建 cstruct,主要检查 cstruct 的各项成员的类型、内容是否合乎要求。 vsn_cs2list 将 cstruct 转换为一个 proplist,key 为 record 成员名,value 为 record 成员值。 生成一个 change_table_majority 动作,供给 insert_schema_ops 使用。 do_change_table_majority(Tab, Majority) -> TidTs = get_tid_ts_and_lock(schema, write), get_tid_ts_and_lock(Tab, none), insert_schema_ops(TidTs, make_change_table_majority(Tab, Majority)). 此时 make_change_table_majority 生成的动作为[{op, change_table_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}] insert_schema_ops({_Mod, _Tid, Ts}, SchemaIOps) -> do_insert_schema_ops(Ts#tidstore.store, SchemaIOps). do_insert_schema_ops(Store, [Head | Tail]) -> ?ets_insert(Store, Head), do_insert_schema_ops(Store, Tail); do_insert_schema_ops(_Store, []) -> ok. 可以看到,插入过程仅仅将 make_change_table_majority 生成的操作记入当前事务的临时 ets 表中。 这个临时插入动作完成后,mnesia 将开始执行提交过程,与普通表事务不同,由于操作是
  • 29. op 开头,表明这是一个 schema 事务,事务管理器需要额外的处理,使用不同的事务提交过 程。 3. schema 事务提交接口 mnesia_tm.erl t_commit(Type) -> {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end. 首先在操作重排时进行检查: arrange(Tid, Store, Type) -> %% The local node is always included Nodes = get_elements(nodes,Store), Recs = prep_recs(Nodes, []), Key = ?ets_first(Store), N = 0, Prep = case Type of async -> #prep{protocol = sym_trans, records = Recs};
• 30. sync -> #prep{protocol = sync_sym_trans, records = Recs} end, case catch do_arrange(Tid, Store, Key, Prep, N) of {'EXIT', Reason} -> dbg_out("do_arrange failed ~p ~p~n", [Reason, Tid]), case Reason of {aborted, R} -> mnesia:abort(R); _ -> mnesia:abort(Reason) end; {New, Prepared} -> {New, Prepared#prep{records = reverse(Prepared#prep.records)}} end. Key 参数即为插入临时 ets 表的第一个操作,此处将为 op。 do_arrange(Tid, Store, {Tab, Key}, Prep, N) -> Oid = {Tab, Key}, Items = ?ets_lookup(Store, Oid), %% Store is a bag P2 = prepare_items(Tid, Tab, Key, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, Oid), P2, N + 1); do_arrange(Tid, Store, SchemaKey, Prep, N) when SchemaKey == op -> Items = ?ets_lookup(Store, SchemaKey), %% Store is a bag P2 = prepare_schema_items(Tid, Items, Prep), do_arrange(Tid, Store, ?ets_next(Store, SchemaKey), P2, N + 1); 可以看出,普通表的 key 为{Tab, Key},而 schema 表的 key 为 op,取得的 Items 为[{op, change_table_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority}],这导致本次事务使用不同的提交协议: prepare_schema_items(Tid, Items, Prep) -> Types = [{N, schema_ops} || N <- val({current, db_nodes})], Recs = prepare_nodes(Tid, Types, Items, Prep#prep.records, schema), Prep#prep{protocol = asym_trans, records = Recs}. prepare_schema_items 通过 prepare_nodes(其中的 prepare_node 子句如下)在 Recs 的 schema_ops 成员中记录 schema 表的操作,同时将本次事务的提交协议设置为 asym_trans。 prepare_node(_Node, _Storage, Items, Rec, Kind) when Kind == schema, Rec#commit.schema_ops == [] -> Rec#commit{schema_ops = Items}; t_commit(Type) ->
  • 31. {_Mod, Tid, Ts} = get(mnesia_activity_state), Store = Ts#tidstore.store, if Ts#tidstore.level == 1 -> intercept_friends(Tid, Ts), case arrange(Tid, Store, Type) of {N, Prep} when N > 0 -> multi_commit(Prep#prep.protocol,majority_attr(Prep),Tid,Prep#prep.records,Store); {0, Prep} -> multi_commit(read_only, majority_attr(Prep), Tid, Prep#prep.records, Store) end; true -> %% nested commit Level = Ts#tidstore.level, [{OldMod,Obsolete} | Tail] = Ts#tidstore.up_stores, req({del_store, Tid, Store, Obsolete, false}), NewTs = Ts#tidstore{store = Store, up_stores = Tail, level = Level - 1}, NewTidTs = {OldMod, Tid, NewTs}, put(mnesia_activity_state, NewTidTs), do_commit_nested end. 提交过程使用 asym_trans,这个协议主要用于:schema 操作, majority 属性的表的操作, 有 recover_coordinator 过程,restore_op 操作。 4. schema 事务协议过程 multi_commit(asym_trans, Majority, Tid, CR, Store) -> D = #decision{tid = Tid, outcome = presume_abort}, {D2, CR2} = commit_decision(D, CR, [], []), DiscNs = D2#decision.disc_nodes, RamNs = D2#decision.ram_nodes, case have_majority(Majority, DiscNs ++ RamNs) of ok -> ok; {error, Tab} -> mnesia:abort({no_majority, Tab}) end, Pending = mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), ?ets_insert(Store, Pending), {WaitFor, Local} = ask_commit(asym_trans, Tid, CR2, DiscNs, RamNs),
• 32. SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})), {Votes, Pids} = rec_all(WaitFor, Tid, do_commit, []), ?eval_debug_fun({?MODULE, multi_commit_asym_got_votes}, [{tid, Tid}, {votes, Votes}]), case Votes of do_commit -> case SchemaPrep of {_Modified, C = #commit{}, DumperMode} -> mnesia_log:log(C), % C is not a binary ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_rec}, [{tid, Tid}]), D3 = C#commit.decision, D4 = D3#decision{outcome = unclear}, mnesia_recover:log_decision(D4), ?eval_debug_fun({?MODULE, multi_commit_asym_log_commit_dec}, [{tid, Tid}]), tell_participants(Pids, {Tid, pre_commit}), rec_acc_pre_commit(Pids, Tid, Store, {C,Local}, do_commit, DumperMode, [], []); {'EXIT', Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_prepare_exit}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end; {do_abort, Reason} -> mnesia_recover:note_decision(Tid, aborted), ?eval_debug_fun({?MODULE, multi_commit_asym_do_abort}, [{tid, Tid}]), tell_participants(Pids, {Tid, {do_abort, Reason}}), do_abort(Tid, Local), {do_abort, Reason} end. 事务处理过程从 mnesia_tm:t_commit/1 开始,流程如下: 1. 发起节点检查 majority 条件是否满足,即表的存活副本节点数必须大于表的磁盘和内存 副本节点数的一半,等于一半时亦不满足 2. 发起节点调用 mnesia_checkpoint:tm_enter_pending,产生检查点 3. 发起节点向各个参与节点的事务管理器发起第一阶段提交过程 ask_commit,注意此时协 议类型为 asym_trans 4. 参与节点事务管理器创建一个 commit_participant 进程,该进程将负责接下来的提
• 33. 交过程 注意 majority 表和 schema 表操作,需要额外创建一个进程辅助提交,可能导致性能变低 5. 参与节点 commit_participant 进程进行本地 schema 操作的 prepare 过程,对于 change_table_majority,没有什么需要 prepare 的 6. 参与节点 commit_participant 进程同意提交,向发起节点返回 vote_yes 7. 发起节点收到所有参与节点的同意提交消息 8. 发起节点进行本地 schema 操作的 prepare 过程,对于 change_table_majority,同样没有 什么需要 prepare 的 9. 发起节点收到所有参与节点的 vote_yes 后,记录需要提交的操作的日志 10. 发起节点记录第一阶段恢复日志 presume_abort; 11. 发起节点记录第二阶段恢复日志 unclear 12. 发起节点向各个参与节点的 commit_participant 进程发起第二阶段提交过程 pre_commit 13. 参与节点 commit_participant 进程收到 pre_commit 进行预提交 14. 参与节点记录第一阶段恢复日志 presume_abort 15. 参与节点记录第二阶段恢复日志 unclear 16. 参与节点 commit_participant 进程同意预提交,向发起节点返回 acc_pre_commit 17. 发起节点收到所有参与节点的 acc_pre_commit 后,记录需要等待的 schema 操作参与节 点,用于崩溃恢复过程 18. 发起节点向各个参与节点的 commit_participant 进程发起第三阶段提交过程 committed 19. a.发起节点通知完参与节点进行 committed 后,立即记录第二阶段恢复日志 committed b.参与节点 commit_participant 进程收到 committed 后进行提交,立即记录第二阶段恢复
  • 34. 日志 committed 20. a.发起节点记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成 b.参与节点 commit_participant 进程记录完第二阶段恢复日志后,进行本地提交,通过 do_commit 完成 21. a.发起节点本地提交完成后,若有 schema 操作,则同步等待参与节点 commit_participant 进程的 schema 操作的提交结果 b.参与节点 commit_participant 进程本地提交完成后,若有 schema 操作,则向发起节点 返回 schema_commit 22. a.发起节点本地收到所有参与节点的 schema_commit 后,释放锁和事务资源 b.参与节点 commit_participant 进程释放锁和事务资源 5. 远程节点事务管理器第一阶段提交 prepare 响应 参与节点事务管理器收到第一阶段提交的消息后: mnesia.erl doit_loop(#state{coordinators=Coordinators,participants=Participants,supervisor=Sup}=State) -> … {From, {ask_commit, Protocol, Tid, Commit, DiscNs, RamNs}} -> ?eval_debug_fun({?MODULE, doit_ask_commit}, [{tid, Tid}, {prot, Protocol}]), mnesia_checkpoint:tm_enter_pending(Tid, DiscNs, RamNs), Pid = case Protocol of asym_trans when node(Tid#tid.pid) /= node() -> Args = [tmpid(From), Tid, Commit, DiscNs, RamNs], spawn_link(?MODULE, commit_participant, Args); _ when node(Tid#tid.pid) /= node() -> %% *_sym_trans reply(From, {vote_yes, Tid}), nopid end, P = #participant{tid = Tid,
  • 35. pid = Pid, commit = Commit, disc_nodes = DiscNs, ram_nodes = RamNs, protocol = Protocol}, State2 = State#state{participants = gb_trees:insert(Tid,P,Participants)}, doit_loop(State2); … 创建一个 commit_participant 进程,参数包括[发起节点的进程号,事务 id,提交内容,磁盘 节点列表,内存节点列表],辅助事务提交: commit_participant(Coord, Tid, Bin, DiscNs, RamNs) when is_binary(Bin) -> process_flag(trap_exit, true), Commit = binary_to_term(Bin), commit_participant(Coord, Tid, Bin, Commit, DiscNs, RamNs); commit_participant(Coord, Tid, C = #commit{}, DiscNs, RamNs) -> process_flag(trap_exit, true), commit_participant(Coord, Tid, C, C, DiscNs, RamNs). commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} ->、 case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), … 参与节点的 commit_participant 进程在创建初期,需要在本地进行 schema 表的 prepare 工作: mnesia_schema.erl prepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional};
  • 36. OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), InitBy = schema_prepare, GoodRes = {Modified, Commit#commit{schema_ops = lists:reverse(Ops)}, DumperMode}, case DumperMode of optional -> dbg_out("Transaction log dump skipped (~p): ~w~n", [DumperMode, InitBy]); mandatory -> case mnesia_controller:sync_dump_log(InitBy) of dumped -> GoodRes; {error, Reason} -> mnesia:abort(Reason) end end, case Ops of [] -> ignore; _ -> mnesia_controller:wait_for_schema_commit_lock() end, GoodRes end. 注意此处,包含三个主要分支: 1. 若操作中不包含任何 schema 操作,则不进行任何动作,仅返回{false, 原 Commit 内容, optional},这适用于 majority 类表的操作 2. 若操作通过 prepare_ops 判定后,如果包含这些操作:rec,announce_im_running, sync_trans , create_table , delete_table , add_table_copy , del_table_copy , change_table_copy_type,dump_table,add_snmp,transform,merge_schema,有可能 但不一定需要进行 prepare,prepare 动作包括各类操作自身的一些内容记录,以及 sync 日志,这适用于出现上述操作的时候 3. 若操作通过 prepare_ops 判定后,仅包含其它类型的操作,则不作任何动作,仅返回{true, 原 Commit 内容, optional},这适用于较小的 schema 操作,此处的 change_table_majority 就属于这类操作
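顺带一提,上述第 2 类操作要求的同步日志 dump(mnesia_controller:sync_dump_log)在用户层也有对应的手工触发方式,示意如下(仅作参考):
%% 示意:手工把事务日志 dump 到各表的落盘文件(正常情况下返回 dumped)
mnesia:dump_log().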
  • 37. 6. 远程节点事务参与者第二阶段提交 precommit 响应 commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> … receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end; … 参与节点的 commit_participant 进程收到预提交消息后,同样记录第二阶段恢复日志 unclear, 并返回 acc_pre_commit 7. 请求节点事务发起者收到第二阶段提交 precommit 确认 发起节点收到所有参与节点的 acc_pre_commit 消息后: rec_acc_pre_commit([], Tid, Store, {Commit,OrigC}, Res, DumperMode, GoodPids, SchemaAckPids) -> D = Commit#commit.decision, case Res of do_commit ->
  • 38. prepare_sync_schema_commit(Store, SchemaAckPids), tell_participants(GoodPids, {Tid, committed}), D2 = D#decision{outcome = committed}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_commit}, [{tid, Tid}]), do_commit(Tid, Commit, DumperMode), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_commit}, [{tid, Tid}]), sync_schema_commit(Tid, Store, SchemaAckPids), mnesia_locker:release_tid(Tid), ?MODULE ! {delete_transaction, Tid}; {do_abort, Reason} -> tell_participants(GoodPids, {Tid, {do_abort, Reason}}), D2 = D#decision{outcome = aborted}, mnesia_recover:log_decision(D2), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_log_abort}, [{tid, Tid}]), do_abort(Tid, OrigC), ?eval_debug_fun({?MODULE, rec_acc_pre_commit_done_abort}, [{tid, Tid}]) end, Res. prepare_sync_schema_commit(_Store, []) -> ok; prepare_sync_schema_commit(Store, [Pid | Pids]) -> ?ets_insert(Store, {waiting_for_commit_ack, node(Pid)}), prepare_sync_schema_commit(Store, Pids). 发起节点在本地记录参与 schema 操作的节点,用于崩溃恢复过程,然后向所有参与节点 commit_participant 进程发送 committed,通知其进行最终提交,此时发起节点可以进行本地 提交,记录第二阶段恢复日志 committed,本地提交通过 do_commit 完成,然后同步等待参 与节点的 schema 操作提交结果,若没有 schema 操作,则可以立即返回,此处需要等待: sync_schema_commit(_Tid, _Store, []) -> ok; sync_schema_commit(Tid, Store, [Pid | Tail]) -> receive {?MODULE, _, {schema_commit, Tid, Pid}} -> ?ets_match_delete(Store, {waiting_for_commit_ack, node(Pid)}), sync_schema_commit(Tid, Store, Tail); {mnesia_down, Node} when Node == node(Pid) -> ?ets_match_delete(Store, {waiting_for_commit_ack, Node}), sync_schema_commit(Tid, Store, Tail) end.
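sync_schema_commit 体现的是一个常见模式:在选择性接收里同时等待参与者的确认消息和对应节点的 down 事件,避免参与节点宕机导致发起者永久阻塞。下面是一个脱离 mnesia 的简化示意(函数名与消息格式均为笔者自拟,放入任意模块即可编译):
%% 示意:逐个等待参与者进程的 ack;若先收到该进程所在节点的 down 消息,则放弃等待它
wait_acks(_Ref, []) -> ok;
wait_acks(Ref, [Pid | Rest]) ->
    receive
        {ack, Ref, Pid} ->
            wait_acks(Ref, Rest);
        {node_down, Node} when Node =:= node(Pid) ->
            wait_acks(Ref, Rest)
    end.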
• 39. 8. 远程节点事务参与者第三阶段提交 commit 响应 参与节点 commit_participant 进程收到提交消息后: commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> … receive {Tid, pre_commit} -> D = C#commit.decision, mnesia_recover:log_decision(D#decision{outcome = unclear}), ?eval_debug_fun({?MODULE, commit_participant, pre_commit}, [{tid, Tid}]), Expect_schema_ack = C#commit.schema_ops /= [], reply(Coord, {acc_pre_commit, Tid, self(), Expect_schema_ack}), receive {Tid, committed} -> mnesia_recover:log_decision(D#decision{outcome = committed}), ?eval_debug_fun({?MODULE, commit_participant, log_commit}, [{tid, Tid}]), do_commit(Tid, C, DumperMode), case Expect_schema_ack of false -> ignore; true -> reply(Coord, {schema_commit, Tid, self()}) end, ?eval_debug_fun({?MODULE, commit_participant, do_commit}, [{tid, Tid}]); … end; … 参与节点的 commit_participant 进程收到 committed 提交消息后,同样记录第二阶段恢复日志 committed,通过 do_commit 进行本地提交后,若有 schema 操作,则向发起节点返回 schema_commit,否则完成事务。 9. 第三阶段提交 commit 的本地提交过程 do_commit(Tid, C, DumperMode) -> mnesia_dumper:update(Tid, C#commit.schema_ops, DumperMode), R = do_snmp(Tid, C#commit.snmp),
  • 40. R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R), R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2), R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3), mnesia_subscr:report_activity(Tid), R4. 这里仅关注对于 schema 表的更新,同时需要注意,这些更新操作会同时发生在发起节点与 参与节点中。 对于 schema 表的更新包括: 1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为设置的值,同时更 新表的 where_to_wlock 属性 2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性 3. 在 schema 的 ets 表中,记录表的 cstruct 4. 在 schema 的 dets 表中,记录表的 cstruct 更新过程如下: mnesia_dumper.erl update(_Tid, [], _DumperMode) -> dumped; update(Tid, SchemaOps, DumperMode) -> UseDir = mnesia_monitor:use_dir(), Res = perform_update(Tid, SchemaOps, DumperMode, UseDir), mnesia_controller:release_schema_commit_lock(), Res. perform_update(_Tid, _SchemaOps, mandatory, true) -> InitBy = schema_update, ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), opt_dump_log(InitBy); perform_update(Tid, SchemaOps, _DumperMode, _UseDir) -> InitBy = fast_schema_update, InPlace = mnesia_monitor:get_env(dump_log_update_in_place), ?eval_debug_fun({?MODULE, dump_schema_op}, [InitBy]), case catch insert_ops(Tid, schema_ops, SchemaOps, InPlace, InitBy,
• 41. mnesia_log:version()) of {'EXIT', Reason} -> Error = {error, {"Schema update error", Reason}}, close_files(InPlace, Error, InitBy), fatal("Schema update error ~p ~p", [Reason, SchemaOps]); _ -> ?eval_debug_fun({?MODULE, post_dump}, [InitBy]), close_files(InPlace, ok, InitBy), ok end. insert_ops(_Tid, _Storage, [], _InPlace, _InitBy, _) -> ok; insert_ops(Tid, Storage, [Op], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), ok; insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver >= "4.3"-> insert_op(Tid, Storage, Op, InPlace, InitBy), insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver); insert_ops(Tid, Storage, [Op | Ops], InPlace, InitBy, Ver) when Ver < "4.3" -> insert_ops(Tid, Storage, Ops, InPlace, InitBy, Ver), insert_op(Tid, Storage, Op, InPlace, InitBy). … insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy); … 对于 change_table_majority 操作,其本身的格式为: {op, change_table_majority, 表的新 cstruct 组成的 proplist, OldMajority, Majority} 此处将 proplist 形态的 cstruct 转换为 record 形态的 cstruct,然后进行真正的设置 mnesia_controller.erl change_table_majority(Cs) -> W = fun() -> Tab = Cs#cstruct.name, set({Tab, majority}, Cs#cstruct.majority), update_where_to_wlock(Tab)
  • 42. end, update(W). update_where_to_wlock(Tab) -> WNodes = val({Tab, where_to_write}), Majority = case catch val({Tab, majority}) of true -> true; _ -> false end, set({Tab, where_to_wlock}, {WNodes, Majority}). 该处做的更新主要为:在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 majority 属性为 设置的值,同时更新表的 where_to_wlock 属性,重设 majority 部分 mnesia_dumper.erl … insert_op(Tid, _, {op, change_table_majority,TabDef, _OldAccess, _Access}, InPlace, InitBy) -> Cs = mnesia_schema:list2cs(TabDef), case InitBy of startup -> ignore; _ -> mnesia_controller:change_table_majority(Cs) end, insert_cstruct(Tid, Cs, true, InPlace, InitBy); … insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab. 除了在 mnesia 全局变量 ets 表 mnesia_gvar 中更新表的 where_to_wlock 属性外,还要更新 其 cstruct 属性,及由此属性导出的其它属性,另外,还需要更新 schema 的 ets 表中记录的 表的 cstruct mnesia_schema.erl insert_cstruct(Tid, Cs, KeepWhereabouts) -> Tab = Cs#cstruct.name, TabDef = cs2list(Cs), Val = {schema, Tab, TabDef}, mnesia_checkpoint:tm_retain(Tid, schema, Tab, write), mnesia_subscr:report_table_event(schema, Tid, Val, write), Active = val({Tab, active_replicas}),
  • 43. case KeepWhereabouts of true -> ignore; false when Active == [] -> clear_whereabouts(Tab); false -> ignore end, set({Tab, cstruct}, Cs), ?ets_insert(schema, Val), do_set_schema(Tab, Cs), Val. do_set_schema(Tab) -> List = get_create_list(Tab), Cs = list2cs(List), do_set_schema(Tab, Cs). do_set_schema(Tab, Cs) -> Type = Cs#cstruct.type, set({Tab, setorbag}, Type), set({Tab, local_content}, Cs#cstruct.local_content), set({Tab, ram_copies}, Cs#cstruct.ram_copies), set({Tab, disc_copies}, Cs#cstruct.disc_copies), set({Tab, disc_only_copies}, Cs#cstruct.disc_only_copies), set({Tab, load_order}, Cs#cstruct.load_order), set({Tab, access_mode}, Cs#cstruct.access_mode), set({Tab, majority}, Cs#cstruct.majority), set({Tab, all_nodes}, mnesia_lib:cs_to_nodes(Cs)), set({Tab, snmp}, Cs#cstruct.snmp), set({Tab, user_properties}, Cs#cstruct.user_properties), [set({Tab, user_property, element(1, P)}, P) || P <- Cs#cstruct.user_properties], set({Tab, frag_properties}, Cs#cstruct.frag_properties), mnesia_frag:set_frag_hash(Tab, Cs#cstruct.frag_properties), set({Tab, storage_properties}, Cs#cstruct.storage_properties), set({Tab, attributes}, Cs#cstruct.attributes), Arity = length(Cs#cstruct.attributes) + 1, set({Tab, arity}, Arity), RecName = Cs#cstruct.record_name, set({Tab, record_name}, RecName), set({Tab, record_validation}, {RecName, Arity, Type}), set({Tab, wild_pattern}, wild(RecName, Arity)), set({Tab, index}, Cs#cstruct.index), %% create actual index tabs later set({Tab, cookie}, Cs#cstruct.cookie), set({Tab, version}, Cs#cstruct.version), set({Tab, cstruct}, Cs), Storage = mnesia_lib:schema_cs_to_storage_type(node(), Cs), set({Tab, storage_type}, Storage),
  • 44. mnesia_lib:add({schema, tables}, Tab), Ns = mnesia_lib:cs_to_nodes(Cs), case lists:member(node(), Ns) of true -> mnesia_lib:add({schema, local_tables}, Tab); false when Tab == schema -> mnesia_lib:add({schema, local_tables}, Tab); false -> ignore end. do_set_schema 更新由 cstruct 导出的各项属性,如版本,cookie 等 mnesia_dumper.erl insert_cstruct(Tid, Cs, KeepWhereabouts, InPlace, InitBy) -> Val = mnesia_schema:insert_cstruct(Tid, Cs, KeepWhereabouts), {schema, Tab, _} = Val, S = val({schema, storage_type}), disc_insert(Tid, S, schema, Tab, Val, write, InPlace, InitBy), Tab. disc_insert(_Tid, Storage, Tab, Key, Val, Op, InPlace, InitBy) -> case open_files(Tab, Storage, InPlace, InitBy) of true -> case Storage of disc_copies when Tab /= schema -> mnesia_log:append({?MODULE,Tab}, {{Tab, Key}, Val, Op}), ok; _ -> dets_insert(Op,Tab,Key,Val) end; false -> ignore end. dets_insert(Op,Tab,Key,Val) -> case Op of write -> dets_updated(Tab,Key), ok = dets:insert(Tab, Val); … end. dets_updated(Tab,Key) -> case get(mnesia_dumper_dets) of undefined -> Empty = gb_trees:empty(),
• 45. Tree = gb_trees:insert(Tab, gb_sets:singleton(Key), Empty), put(mnesia_dumper_dets, Tree); Tree -> case gb_trees:lookup(Tab,Tree) of {value, cleared} -> ignore; {value, Set} -> T = gb_trees:update(Tab, gb_sets:add(Key, Set), Tree), put(mnesia_dumper_dets, T); none -> T = gb_trees:insert(Tab, gb_sets:singleton(Key), Tree), put(mnesia_dumper_dets, T) end end. 更新 schema 的 dets 表中记录的表 cstruct。 综上所述,对于 schema 表的变更,或者 majority 类的表,其事务提交过程为三阶段,同时有良好的崩溃恢复检测。 schema 表的变更涉及对多处地方的更新,包括: 1. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的相应属性为设置的值 2. 在 mnesia 全局变量 ets 表 mnesia_gvar 中,记录表的 cstruct,并记录由 cstruct 导出的各 个属性 3. 在 schema 的 ets 表中,记录表的 cstruct 4. 在 schema 的 dets 表中,记录表的 cstruct 5. majority 事务处理 majority 事务总体与 schema 事务处理过程相同,只是在 mnesia_tm:multi_commit 的提交过 程中,不调用 mnesia_schema:prepare_commit/3、mnesia_tm:prepare_sync_schema_commit/2 修改 schema 表,也不调用 mnesia_tm:sync_schema_commit 等待第三阶段同步提交完成。
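为直观说明 majority 的判定规则,可以用下面这个示意函数来表达(笔者自拟,语义与上文 have_majority 的检查一致):
%% 示意:存活的副本节点数必须严格大于全部副本节点数的一半,等于一半亦不满足
has_majority(AliveNodes, AllCopies) ->
    Alive = [N || N <- AllCopies, lists:member(N, AliveNodes)],
    length(Alive) * 2 > length(AllCopies).
%% 例:3 副本存活 2 个 -> true;4 副本仅存活 2 个 -> false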
  • 46. 6. 恢复 mnesia 的连接协商过程用于在启动时,结点间交互状态信息: 整个协商包括如下过程: 1. 节点发现,集群遍历 2. 节点协议版本检查 3. 节点 schema 合并 4. 节点 decision 通告与合并 5. 节点数据重新载入与合并 1. 节点协议版本检查+节点 decision 通告与合并 mnesia_recover.erl connect_nodes(Ns) -> %%Ns 为要检查的节点 call({connect_nodes, Ns}). handle_call({connect_nodes, Ns}, From, State) -> %% Determine which nodes we should try to connect AlreadyConnected = val(recover_nodes), {_, Nodes} = mnesia_lib:search_delete(node(), Ns), Check = Nodes -- AlreadyConnected, %%开始版本协商 case mnesia_monitor:negotiate_protocol(Check) of busy -> %% monitor is disconnecting some nodes retry %% the req (to avoid deadlock). erlang:send_after(2, self(), {connect_nodes,Ns,From}), {noreply, State}; [] -> %% No good noodes to connect to! %% We can't use reply here because this function can be
  • 47. %% called from handle_info gen_server:reply(From, {[], AlreadyConnected}), {noreply, State}; GoodNodes -> %% GoodNodes 是协商通过的节点 %% Now we have agreed upon a protocol with some new nodes %% and we may use them when we recover transactions mnesia_lib:add_list(recover_nodes, GoodNodes), %%协议版本协商通过后,告知这些节点本节点曾经的历史事务 decision cast({announce_all, GoodNodes}), case get_master_nodes(schema) of [] -> Context = starting_partitioned_network, %%检查曾经是否与这些节点出现过分区 mnesia_monitor:detect_inconcistency(GoodNodes, Context); _ -> %% If master_nodes is set ignore old inconsistencies ignore end, gen_server:reply(From, {GoodNodes, AlreadyConnected}), {noreply,State} end; handle_cast({announce_all, Nodes}, State) -> announce_all(Nodes), {noreply, State}; announce_all([]) -> ok; announce_all(ToNodes) -> Tid = trans_tid_serial(), announce(ToNodes, [{trans_tid,serial,Tid}], [], false). announce(ToNodes, [Head | Tail], Acc, ForceSend) -> Acc2 = arrange(ToNodes, Head, Acc, ForceSend), announce(ToNodes, Tail, Acc2, ForceSend); announce(_ToNodes, [], Acc, _ForceSend) -> send_decisions(Acc). send_decisions([{Node, Decisions} | Tail]) -> %%注意此处,decision 合并过程是一个异步过程 abcast([Node], {decisions, node(), Decisions}), send_decisions(Tail); send_decisions([]) ->
• 48. ok. 遍历所有协商通过的节点,告知其本节点的历史事务 decision 下列流程位于远程节点中,远程节点将被称为接收节点,而本节点将称为发送节点 handle_cast({decisions, Node, Decisions}, State) -> mnesia_lib:add(recover_nodes, Node), State2 = add_remote_decisions(Node, Decisions, State), {noreply, State2}; 接收节点的 mnesia_recover 在收到这些广播来的 decision 后,进行比较合并。 decision 有多种类型,用于事务提交的为 decision 结构和 transient_decision 结构 add_remote_decisions(Node, [D | Tail], State) when is_record(D, decision) -> State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2); add_remote_decisions(Node, [C | Tail], State) when is_record(C, transient_decision) -> D = #decision{tid = C#transient_decision.tid, outcome = C#transient_decision.outcome, disc_nodes = [], ram_nodes = []}, State2 = add_remote_decision(Node, D, State), add_remote_decisions(Node, Tail, State2); add_remote_decisions(Node, [{mnesia_down, _, _, _} | Tail], State) -> add_remote_decisions(Node, Tail, State); add_remote_decisions(Node, [{trans_tid, serial, Serial} | Tail], State) -> %%对于发送节点传来的未决事务,接收节点需要继续询问其它节点 sync_trans_tid_serial(Serial), case State#state.unclear_decision of undefined -> ignored; D -> case lists:member(Node, D#decision.ram_nodes) of true -> ignore; false -> %%若未决事务 decision 的发送节点不是内存副本节点,则接收节点将向其询 问该未决事务的真正结果 abcast([Node], {what_decision, node(), D}) end
  • 49. end, add_remote_decisions(Node, Tail, State); add_remote_decisions(_Node, [], State) -> State. add_remote_decision(Node, NewD, State) -> Tid = NewD#decision.tid, OldD = decision(Tid), %%根据合并策略进行 decision 合并,对于唯一的冲突情况,即接收节点提交事务,而 发送节点中止事务,则接收节点处也选择中止事务,而事务本身的状态将由检查点和 redo 日志进行重构 D = merge_decisions(Node, OldD, NewD), %%记录合并结果 do_log_decision(D, false, undefined), Outcome = D#decision.outcome, if OldD == no_decision -> ignore; Outcome == unclear -> ignore; true -> case lists:member(node(), NewD#decision.disc_nodes) or lists:member(node(), NewD#decision.ram_nodes) of true -> %%向其它节点告知本节点的 decision 合并结果 tell_im_certain([Node], D); false -> ignore end end, case State#state.unclear_decision of U when U#decision.tid == Tid -> WaitFor = State#state.unclear_waitfor -- [Node], if Outcome == unclear, WaitFor == [] -> %% Everybody are uncertain, lets abort %%询问过未决事务的所有参与节点后,仍然没有任何节点可以提供事务提交 结果,此时决定终止事务 NewOutcome = aborted, CertainD = D#decision{outcome = NewOutcome,
  • 50. disc_nodes = [], ram_nodes = []}, tell_im_certain(D#decision.disc_nodes, CertainD), tell_im_certain(D#decision.ram_nodes, CertainD), do_log_decision(CertainD, false, undefined), verbose("Decided to abort transaction ~p " "since everybody are uncertain ~p~n", [Tid, CertainD]), gen_server:reply(State#state.unclear_pid, {ok, NewOutcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome /= unclear -> %%发送节点知道事务结果,通告事务结果 verbose("~p told us that transaction ~p was ~p~n", [Node, Tid, Outcome]), gen_server:reply(State#state.unclear_pid, {ok, Outcome}), State#state{unclear_pid = undefined, unclear_decision = undefined, unclear_waitfor = undefined}; Outcome == unclear -> %%发送节点也不知道事务结果,此时继续等待 State#state{unclear_waitfor = WaitFor} end; _ -> State end. 合并策略: merge_decisions(Node, D, NewD0) -> NewD = filter_aborted(NewD0), if D == no_decision, node() /= Node -> %% We did not know anything about this txn NewD#decision{disc_nodes = []}; D == no_decision -> NewD; is_record(D, decision) -> DiscNs = D#decision.disc_nodes -- ([node(), Node]), OldD = filter_aborted(D#decision{disc_nodes = DiscNs}), if
  • 51. OldD#decision.outcome == unclear, NewD#decision.outcome == unclear -> D; OldD#decision.outcome == NewD#decision.outcome -> %% We have come to the same decision OldD; OldD#decision.outcome == committed, NewD#decision.outcome == aborted -> %%decision 发送节点与接收节点唯一冲突的位置,即接收节点提交事务,而发 送节点中止事务,此时仍然选择中止事务 Msg = {inconsistent_database, bad_decision, Node}, mnesia_lib:report_system_event(Msg), OldD#decision{outcome = aborted}; OldD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; NewD#decision.outcome == aborted -> OldD#decision{outcome = aborted}; OldD#decision.outcome == committed, NewD#decision.outcome == unclear -> OldD#decision{outcome = committed}; OldD#decision.outcome == unclear, NewD#decision.outcome == committed -> OldD#decision{outcome = committed} end end. 2. 节点发现,集群遍历 mnesia_controller.erl merge_schema() -> AllNodes = mnesia_lib:all_nodes(), %%尝试合并 schema,合并完了后通知所有曾经的集群节点,与本节点进行数据转移 case try_merge_schema(AllNodes, [node()], fun default_merge/1) of ok -> %%合并 schema 成功后,将进行数据合并 schema_is_merged(); {aborted, {throw, Str}} when is_list(Str) -> fatal("Failed to merge schema: ~s~n", [Str]); Else -> fatal("Failed to merge schema: ~p~n", [Else]) end.
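上述 merge_schema 除了在节点启动时被调用之外,通常也会在运行期通过 mnesia:change_config(extra_db_nodes, Ns) 动态引入新节点时被触发。一个示意性的加入集群步骤如下(节点名为假设):
%% 示意:在新节点上执行,假设 'a@host' 已经运行着 mnesia
ok = mnesia:start(),
{ok, _Connected} = mnesia:change_config(extra_db_nodes, ['a@host']),        %% 触发连接协商与 schema 合并
{atomic, ok} = mnesia:change_table_copy_type(schema, node(), disc_copies).  %% 如需让本节点持有持久 schema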
  • 52. try_merge_schema(Nodes, Told0, UserFun) -> %%开始集群遍历,启动一个 schema 合并事务 case mnesia_schema:merge_schema(UserFun) of {atomic, not_merged} -> %% No more nodes that we need to merge the schema with %% Ensure we have told everybody that we are running case val({current,db_nodes}) -- mnesia_lib:uniq(Told0) of [] -> ok; Tell -> im_running(Tell, [node()]), ok end; {atomic, {merged, OldFriends, NewFriends}} -> %% Check if new nodes has been added to the schema Diff = mnesia_lib:all_nodes() -- [node() | Nodes], mnesia_recover:connect_nodes(Diff), %% Tell everybody to adopt orphan tables %%通知所有的集群节点,本节点启动,开始数据合并申请 im_running(OldFriends, NewFriends), im_running(NewFriends, OldFriends), Told = case lists:member(node(), NewFriends) of true -> Told0 ++ OldFriends; false -> Told0 ++ NewFriends end, try_merge_schema(Nodes, Told, UserFun); {atomic, {"Cannot get cstructs", Node, Reason}} -> dbg_out("Cannot get cstructs, Node ~p ~p~n", [Node, Reason]), timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); {aborted, {shutdown, _}} -> %% One of the nodes is going down timer:sleep(300), % Avoid a endless loop look alike try_merge_schema(Nodes, Told0, UserFun); Other -> Other end. mnesia_schema.erl merge_schema() -> schema_transaction(fun() -> do_merge_schema([]) end). merge_schema(UserFun) -> schema_transaction(fun() -> UserFun(fun(Arg) -> do_merge_schema(Arg) end) end). 可以看出 merge_schema 的过程也是放在一个 mnesia 元数据事务中进行的,这个事务的主
  • 53. 题操作包括: {op, announce_im_running, node(), SchemaDef, Running, RemoteRunning} {op, merge_schema, CstructList} 这个过程会与集群中的事务节点进行 schema 协商,检查 schema 是否兼容。 do_merge_schema(LockTabs0) -> %% 锁 schema 表 {_Mod, Tid, Ts} = get_tid_ts_and_lock(schema, write), LockTabs = [{T, tab_to_nodes(T)} || T <- LockTabs0], [get_tid_ts_and_lock(T,write) || {T,_} <- LockTabs], Connected = val(recover_nodes), Running = val({current, db_nodes}), Store = Ts#tidstore.store, %% Verify that all nodes are locked that might not be the %% case, if this trans where queued when new nodes where added. case Running -- ets:lookup_element(Store, nodes, 2) of [] -> ok; %% All known nodes are locked Miss -> %% Abort! We don't want the sideeffects below to be executed mnesia:abort({bad_commit, {missing_lock, Miss}}) end, %% Connected 是本节点的已连接节点,通常为当前集群中通信协议兼容的结点; Running 是本节点的当前 db_nodes,通常为当前集群中与本节点一致的结点; case Connected -- Running of %% 对于那些已连接,但是还未进行 decision 的节点,需要进行通信协议协商,然后进 行 decision 协商,这个过程实质上是一个全局拓扑下的节点发现过程(遍历算法) ,这个过 程由某个节点发起, [Node | _] = OtherNodes -> %% Time for a schema merging party! mnesia_locker:wlock_no_exist(Tid, Store, schema, [Node]), [mnesia_locker:wlock_no_exist( Tid, Store, T, mnesia_lib:intersect(Ns, OtherNodes)) || {T,Ns} <- LockTabs], %% 从远程结点 Node 处取得其拥有的表的 cstruct,及其 db_nodes RemoteRunning1 case fetch_cstructs(Node) of {cstructs, Cstructs, RemoteRunning1} ->
• 54. LockedAlready = Running ++ [Node], %% 取得 cstruct 后,通过 mnesia_recover:connect_nodes,与远程节点 Node 的集群中的每一个节点进行协商,协商主要包括检查双方的通信协议版本,并检查之前与这 些结点是否曾有过分区 {New, Old} = mnesia_recover:connect_nodes(RemoteRunning1), %% New 为 RemoteRunning1 中版本兼容的新结点,Old 为本节点原先的集群存活结点,来自于 recover_nodes RemoteRunning = mnesia_lib:intersect(New ++ Old, RemoteRunning1), if %% RemoteRunning = (New∪Old)∩RemoteRunning1 %% RemoteRunning ≠ RemoteRunning1 <=> %% New∪(Old∩RemoteRunning1) < RemoteRunning1 %%意味着 RemoteRunning1(远程节点 Node 的集群,也即此次探查的目标集群)中有部分节点不能与本节点相连 RemoteRunning /= RemoteRunning1 -> mnesia_lib:error("Mnesia on ~p could not connect to node(s) ~p~n", [node(), RemoteRunning1 -- RemoteRunning]), mnesia:abort({node_not_running, RemoteRunning1 -- RemoteRunning}); true -> ok end, NeedsLock = RemoteRunning -- LockedAlready, mnesia_locker:wlock_no_exist(Tid, Store, schema, NeedsLock), [mnesia_locker:wlock_no_exist(Tid, Store, T, mnesia_lib:intersect(Ns,NeedsLock)) || {T,Ns} <- LockTabs], NeedsConversion = need_old_cstructs(NeedsLock ++ LockedAlready), {value, SchemaCs} = lists:keysearch(schema, #cstruct.name, Cstructs), SchemaDef = cs2list(NeedsConversion, SchemaCs), %% Announce that Node is running %%开始 announce_im_running 的过程,向集群的事务节点通告本节点进入集群,同时告知本节点,集群事务节点在这个事务中会与本节点进行 schema 合并 A = [{op, announce_im_running, node(), SchemaDef, Running, RemoteRunning}],
  • 55. do_insert_schema_ops(Store, A), %% Introduce remote tables to local node %%make_merge_schema 构造一系列合并 schema 的 merge_schema 操作,在提 交成功后由 mnesia_dumper 执行生效 do_insert_schema_ops(Store, make_merge_schema(Node, NeedsConversion, Cstructs)), %% Introduce local tables to remote nodes Tabs = val({schema, tables}), Ops = [{op, merge_schema, get_create_list(T)} || T <- Tabs, not lists:keymember(T, #cstruct.name, Cstructs)], do_insert_schema_ops(Store, Ops), %%Ensure that the txn will be committed on all nodes %%向另一个可连接集群中的所有节点通告本节点正在加入集群 NewNodes = RemoteRunning -- Running, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs), {merged, Running, RemoteRunning}; {error, Reason} -> {"Cannot get cstructs", Node, Reason}; {badrpc, Reason} -> {"Cannot get cstructs", Node, {badrpc, Reason}} end; [] -> %% No more nodes to merge schema with not_merged end. announce_im_running([N | Ns], SchemaCs) -> %%与新的可连接集群的节点经过协商 {L1, L2} = mnesia_recover:connect_nodes([N]), case lists:member(N, L1) or lists:member(N, L2) of true -> %%若协商通过,则这些节点就可以作为本节点的事务节点了,注意此处,这个修改是 立即生效的,而不会延迟到事务提交 mnesia_lib:add({current, db_nodes}, N), mnesia_controller:add_active_replica(schema, N, SchemaCs);
  • 56. false -> %%若协商未通过,则中止事务,此时会通过 announce_im_running 的 undo 动作,将新 加入的事务节点全部剥离 mnesia_lib:error("Mnesia on ~p could not connect to node ~p~n", [node(), N]), mnesia:abort({node_not_running, N}) end, announce_im_running(Ns, SchemaCs); announce_im_running([], _) -> []. schema 操作在三阶段提交时,mnesia_tm 首先要进行 prepare: mnesia_tm.erl multi_commit(asym_trans, Majority, Tid, CR, Store) -> … SchemaPrep = (catch mnesia_schema:prepare_commit(Tid, Local, {coord, WaitFor})), … mnesia_schema.erl prepare_commit(Tid, Commit, WaitFor) -> case Commit#commit.schema_ops of [] -> {false, Commit, optional}; OrigOps -> {Modified, Ops, DumperMode} = prepare_ops(Tid, OrigOps, WaitFor, false, [], optional), … end. prepare_ops(Tid, [Op | Ops], WaitFor, Changed, Acc, DumperMode) -> case prepare_op(Tid, Op, WaitFor) of … {false, optional} -> prepare_ops(Tid, Ops, WaitFor, true, Acc, DumperMode) end; prepare_ops(_Tid, [], _WaitFor, Changed, Acc, DumperMode) -> {Changed, Acc, DumperMode}. prepare_op(_Tid, {op, announce_im_running, Node, SchemaDef, Running, RemoteRunning}, _WaitFor) -> SchemaCs = list2cs(SchemaDef), if Node == node() -> %% Announce has already run on local node
  • 57. ignore; %% from do_merge_schema true -> %% If a node has restarted it may still linger in db_nodes, %% but have been removed from recover_nodes Current = mnesia_lib:intersect(val({current,db_nodes}), [node()|val(recover_nodes)]), NewNodes = mnesia_lib:uniq(Running++RemoteRunning) -- Current, mnesia_lib:set(prepare_op, {announce_im_running,NewNodes}), announce_im_running(NewNodes, SchemaCs) end, {false, optional}; 此处可以看出,在 announce_im_running 的 prepare 过程中,要与远程未连接的节点进行协 商,协商通过后,这些未连接节点将加入本节点的事务节点集群 反之,一旦该 schema 操作中止,mnesia_tm 将进行 undo 动作: mnesia_tm.erl commit_participant(Coord, Tid, Bin, C0, DiscNs, _RamNs) -> ?eval_debug_fun({?MODULE, commit_participant, pre}, [{tid, Tid}]), case catch mnesia_schema:prepare_commit(Tid, C0, {part, Coord}) of {Modified, C = #commit{}, DumperMode} -> %% If we can not find any local unclear decision %% we should presume abort at startup recovery case lists:member(node(), DiscNs) of false -> ignore; true -> case Modified of false -> mnesia_log:log(Bin); true -> mnesia_log:log(C) end end, ?eval_debug_fun({?MODULE, commit_participant, vote_yes}, [{tid, Tid}]), reply(Coord, {vote_yes, Tid, self()}), receive {Tid, pre_commit} -> … receive {Tid, committed} -> … {Tid, {do_abort, _Reason}} ->