9. Learning the encoder and decoder
• skill-encoder q_φ(z | τ) and skill-decoder p_θ(a_t, …, a_{t+H−1} | s_t, z),
  where z ∈ 𝒵 is a skill, s ∈ 𝒮 is a state, a ∈ 𝒜 is an action, and τ is a trajectory (a state-action sequence).
• Update p_θ and q_φ by maximizing the loss function (the ELBO of a β-VAE with a Gaussian prior, weighted by P_η):
  𝔼_{τ∼𝒟, z∼q_φ(z|τ)} [ P_η(τ) ( ℒ_reconstruction + β ⋅ ℒ_regularization ) ]
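The weighted objective above can be sketched in plain Python for a toy 1-D Gaussian case. This is a minimal illustration, not the paper's implementation: it assumes a diagonal-Gaussian decoder with fixed output scale, a standard-normal prior, and treats the trajectory weight P_η(τ) as a given scalar `p_eta`; all function names are hypothetical.

```python
import math

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under N(mu, sigma^2): the reconstruction term."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def kl_gaussian(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ): regularization toward the Gaussian prior."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

def weighted_beta_vae_loss(actions, recon_mu, recon_sigma,
                           z_mu, z_sigma, p_eta, beta):
    """Per-trajectory objective: P_eta(tau) * (L_reconstruction + beta * L_regularization).

    actions    : the action sequence a_t, ..., a_{t+H-1} from trajectory tau
    recon_mu   : decoder means p_theta(a | s_t, z) for each action
    z_mu,z_sigma: encoder posterior q_phi(z | tau) parameters
    p_eta      : scalar trajectory weight P_eta(tau)
    """
    recon = sum(gaussian_nll(a, m, recon_sigma)
                for a, m in zip(actions, recon_mu))   # L_reconstruction
    kl = kl_gaussian(z_mu, z_sigma)                   # L_regularization
    return p_eta * (recon + beta * kl)
```

In practice the expectation over τ ∼ 𝒟 and z ∼ q_φ(z|τ) is approximated by mini-batch sampling with the reparameterization trick; this sketch only shows how the weight P_η(τ) scales the per-trajectory β-VAE term.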
10. Learning ℛ_η
• Update η by minimizing the loss function (binary cross-entropy):
  ℒ(η) = −𝔼_{(τ¹, τ²)} [ log P_η(τ¹ ≻ τ²) ],
  where P_η(τ¹ ≻ τ²) = exp ℛ_η(τ¹) / ( exp ℛ_η(τ¹) + exp ℛ_η(τ²) ) is the preference distribution of the Bradley-Terry model.
• [Note] The operator A ≻ B means that A is preferred over B.
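The Bradley-Terry cross-entropy objective above can be sketched as follows. This is an illustrative assumption-laden sketch: it takes scalar trajectory scores r = ℛ_η(τ) as given (no network), and `pairs` is a hypothetical list of labelled preferences in which the first element is the preferred trajectory's score.

```python
import math

def bt_preference_prob(r1, r2):
    """Bradley-Terry probability P(tau1 > tau2) given scalar scores
    r1 = R_eta(tau1) and r2 = R_eta(tau2).
    exp(r1) / (exp(r1) + exp(r2)) rewritten as a sigmoid of the difference."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

def bce_preference_loss(pairs):
    """Binary cross-entropy over labelled preference pairs.

    pairs: iterable of (r_preferred, r_other); the label is always
    'first trajectory preferred', so the loss is -log P(preferred > other).
    """
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for r_pref, r_other in pairs:
        total -= math.log(bt_preference_prob(r_pref, r_other) + eps)
    return total / len(pairs)
```

Minimizing this loss in η pushes ℛ_η to assign higher scores to preferred trajectories: when the preferred score dominates, P_η(τ¹ ≻ τ²) → 1 and the loss → 0.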