3. ILSVRC image classification: top-5 error by year

   Team           Year   Top-5 error
   AlexNet        2012   15.3%
   Clarifai       2013   11.2%
   VGG            2014   7.32%
   GoogLeNet      2014   6.67%
   ResNet         2015   3.57%
   ResNet+        2016   2.99%
   SENet          2017   2.25%
   Human expert   -      5.1%
14. • Winning model of ILSVRC 2012
• Rectified Linear Units (ReLU)
• Local Response Normalization (LRN)
• Dropout (in the fully connected layers)
• Pre-training
A. Krizhevsky, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012.
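As a rough illustration, the ReLU nonlinearity and (inverted) dropout listed above can be sketched in a few lines of NumPy. This is a minimal sketch, not AlexNet's actual implementation:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- avoids the saturation of sigmoid/tanh units
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # Inverted dropout on a fully connected activation: each unit is dropped
    # with probability p during training; scaling by 1/(1-p) keeps the
    # expected activation unchanged, so nothing special is needed at test time.
    if not train or p == 0.0:
        return x
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))           # negative inputs clamp to 0
print(dropout(relu(x)))  # roughly half the units zeroed during training
```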
15. • Runner-up of ILSVRC 2014
• Simple model design
• Unifies all convolution filter sizes to 3x3
• Doubles the number of convolution filters after each pooling layer
• Staged training (adds convolutional layers and fine-tunes)
K. Simonyan, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR, 2015.
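The 3x3 design choice can be checked numerically: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters. A small sketch, assuming stride-1 convolutions and equal input/output channel counts:

```python
def conv_params(k, c_in, c_out, bias=True):
    # Parameter count of a single k x k convolution layer.
    return k * k * c_in * c_out + (c_out if bias else 0)

def receptive_field(kernel_sizes):
    # Receptive field of a stack of stride-1 convolutions.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

c = 256
print(receptive_field([3, 3]), 2 * conv_params(3, c, c))  # 5, ~1.18M params
print(receptive_field([5]), conv_params(5, c, c))         # 5, ~1.64M params
```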
16. • Winning model of ILSVRC 2014
• Inception module that applies filters of different sizes in parallel
• Global Average Pooling (GAP)
• Auxiliary loss
[Figure: Inception module — the previous layer feeds four parallel branches (1x1 convolutions; 1x1 followed by 3x3 convolutions; 1x1 followed by 5x5 convolutions; 3x3 max pooling followed by 1x1 convolutions), and the branch outputs are concatenated. Legend: convolution, pooling, softmax, other.]
C. Szegedy, "Going Deeper with Convolutions", CVPR, 2015.
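A toy sketch of the branch-and-concatenate idea behind the Inception module. To keep it short, the 3x3 and 5x5 convolutions are stubbed out as 1x1 channel maps, and the channel counts are illustrative, not GoogLeNet's actual ones:

```python
import numpy as np

def conv1x1(x, c_out, rng):
    # A 1x1 convolution is a per-pixel linear map over channels.
    w = rng.standard_normal((x.shape[-1], c_out)) * 0.01
    return x @ w

def inception_module(x, rng):
    # Simplified Inception: parallel branches at different filter sizes,
    # each preceded by a 1x1 "bottleneck", concatenated along channels.
    # (The 3x3/5x5 convs are stubbed as 1x1 maps in this sketch.)
    b1 = conv1x1(x, 64, rng)                     # 1x1 branch
    b2 = conv1x1(conv1x1(x, 96, rng), 128, rng)  # 1x1 -> "3x3" branch
    b3 = conv1x1(conv1x1(x, 16, rng), 32, rng)   # 1x1 -> "5x5" branch
    b4 = conv1x1(x, 32, rng)                     # "pool" -> 1x1 branch
    return np.concatenate([b1, b2, b3, b4], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))  # H x W x C feature map
y = inception_module(x, rng)
print(y.shape)  # (28, 28, 256): 64 + 128 + 32 + 32 channels
```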
55. • More training data improves accuracy
• How do we obtain more data?
• Collect more images ourselves, or use a large-scale dataset
• Data augmentation
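A minimal sketch of two classic label-preserving augmentations, random horizontal flip and random crop. The image and crop sizes here are arbitrary choices for illustration:

```python
import numpy as np

def random_augment(img, rng):
    # Random horizontal flip plus random crop (24x24 from a 28x28 image):
    # each call returns a slightly different view of the same image, so one
    # labeled example yields many training samples.
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal flip
    h, w = img.shape
    ch, cw = 24, 24
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28))
crops = [random_augment(img, rng) for _ in range(4)]
print([c.shape for c in crops])  # four 24x24 views of one image
```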
[Figure: prediction accuracy vs. size of training data — deep learning keeps improving as the training set grows, while other machine learning tools plateau. Supercomputing power is the engine of deep learning, while big data is its fuel.]
Table 3. Average Precision @ IoU threshold of 0.5 on the PASCAL VOC 2007 test set. [Caption truncated in extraction; table rows missing.]

#Iters on JFT-300M   #Epochs   mAP@[.5,.95]
12M                  1.3       35.0
24M                  2.6       36.1
36M                  4         36.8
Table 4. mAP@[.5,.95] on COCO minival* with a JFT-300M checkpoint trained from scratch for different numbers of epochs.

#Iters on ImageNet   #Epochs   mAP@[.5,.95]
100K                 3         22.2
200K                 6         25.9
400K                 12        27.4
5M                   150       34.5
Table 5. mAP@[.5,.95] on COCO minival* with an ImageNet checkpoint trained for different numbers of epochs.
[Figure 4: Object detection performance when initial checkpoints are pre-trained on different subsets of JFT-300M from scratch. x-axis: number of examples (10-300 million, log scale); y-axis: mean AP — mAP@[.5,.95] on COCO minival* (left) and mAP@.5 on PASCAL VOC 2007 test (right). Curves shown with and without fine-tuning.]
Number of classes   mAP@[.5,.95]
1K ImageNet         31.2
18K JFT             31.9
Table 6. Object detection performance in mean AP@[.5,.95] on the COCO minival* set. We compare checkpoints pre-trained on 30M JFT images where labels are limited to the 1K ImageNet classes, and on 30M JFT images covering all 18K JFT classes.
[Truncated excerpt: Figure 5 and the surrounding text study the impact of model capacity, comparing ResNet-50, ResNet-101, and ResNet-152 checkpoints pre-trained on the 300M images, with hyperparameters matched to the ImageNet models. Section 5.3 reports semantic segmentation on the PASCAL VOC 2012 benchmark, evaluated by the standard mean intersection-over-union, using a DeepLab model [4] based on ResNet-101.]
C. Sun, "Revisiting Unreasonable Effectiveness of Data in Deep Learning Era", ICCV, 2017.
73. • Predicts vehicle behavior with Convolutional Social Pooling
• LSTM encoder-decoder model
• Outputs the parameters of a probability distribution over maneuvers (lane change, keep speed, accelerate, decelerate)
M. Trivedi, "Convolutional Social Pooling for Vehicle Trajectory Prediction", CVPR-WS, 2018.
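The discrete-maneuver output described above amounts to a softmax head over maneuver classes. The sketch below is a hypothetical stand-in, not the paper's code: the hidden size, the weights, and the `maneuver_probabilities` helper are all illustrative:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical decoder head: the LSTM decoder's hidden state is mapped to a
# probability distribution over the discrete maneuvers named on the slide.
MANEUVERS = ["lane change", "keep speed", "accelerate", "decelerate"]

def maneuver_probabilities(hidden, w, b):
    return softmax(hidden @ w + b)

rng = np.random.default_rng(0)
hidden = rng.standard_normal(32)  # decoder hidden state (assumed size)
w = rng.standard_normal((32, len(MANEUVERS))) * 0.1
b = np.zeros(len(MANEUVERS))
p = maneuver_probabilities(hidden, w, b)
print(dict(zip(MANEUVERS, np.round(p, 3))))  # probabilities sum to 1
```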
74. • Trajectory prediction using a GAN
• Generator: generates trajectories
• Discriminator: classifies generated trajectories as real or fake
[Figure panels: "People Merging", "Group Avoiding", "Person Following"; sample columns (1)-(4).]
Figure 6: Examples of diverse predictions from our model. Each row shows a different set of observed trajectories; columns
show four different samples from our model for each scenario which demonstrate different types of socially acceptable
behavior. BEST is the sample closest to the ground-truth; in SLOW and FAST samples, people change speed to avoid
collision; in DIR samples people change direction to avoid each other. Our model learns these different avoidance strategies
in a data-driven manner, and jointly predicts globally consistent and socially acceptable trajectories for all people in the scene.
We also show some failure cases in supplementary material.
A. Gupta, "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks", CVPR, 2018.
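The generator/discriminator split above follows the standard GAN objective: the discriminator is trained to score real trajectories high and generated ones low, while the generator is trained to fool it. A toy NumPy sketch with made-up discriminator scores (the `bce` helper and all numbers are illustrative, not the paper's losses):

```python
import numpy as np

def bce(pred, target):
    # Binary cross-entropy on discriminator scores in (0, 1).
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Toy stand-ins: scores a discriminator might assign to real trajectories
# and to trajectories sampled from the generator.
d_real = np.array([0.9, 0.8, 0.95])  # D(real path)  -> should be near 1
d_fake = np.array([0.2, 0.3, 0.1])   # D(G(noise))   -> should be near 0

d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))  # train D
g_loss = bce(d_fake, np.ones(3))                             # train G to fool D
print(round(d_loss, 3), round(g_loss, 3))
```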
91. • Performs detection and segmentation simultaneously on top of an RPN-based object detector
• Introduces RoI Align
Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1)
                     backbone                 AP     AP50   AP75   APS    APM    APL
MNC [7]              ResNet-101-C4            24.6   44.3   24.8   4.7    25.9   43.6
FCIS [20] +OHEM      ResNet-101-C5-dilated    29.2   49.5   -      7.1    31.3   50.0
FCIS+++ [20] +OHEM   ResNet-101-C5-dilated    33.6   54.5   -      -      -      -
Mask R-CNN           ResNet-101-C4            33.1   54.9   34.8   12.1   35.6   51.1
Mask R-CNN           ResNet-101-FPN           35.7   58.0   37.8   15.5   38.1   52.4
Mask R-CNN           ResNeXt-101-FPN          37.1   60.0   39.4   16.9   39.9   53.5
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [7] and FCIS [20] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip testing, and OHEM [29]. All entries are single-model results.
[…] evaluated using mask IoU. As in previous work [3, 21], we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k subset of val images (minival). We also report results on test-dev [22], which has no disclosed labels. Upon publication, we will upload our full results on test-std to the public leaderboard, as recommended.
4.1. Main Results
We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC [7] and […] expect many such improvements to be applicable to ours. Mask R-CNN outputs are visualized in Figures 2 and 4. Mask R-CNN achieves good results even under challenging conditions. In Figure 5 we compare our Mask R-CNN baseline and FCIS+++ [20]. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.
4.2. Ablation Experiments
We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.
K. He, "Mask R-CNN", ICCV, 2017.
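The RoI Align idea highlighted on the slide is bilinear sampling at continuous sub-pixel locations, instead of the coordinate rounding used by RoI Pooling. A simplified sketch with one bilinear sample per output bin (the paper averages four samples per bin):

```python
import numpy as np

def bilinear(fmap, y, x):
    # Bilinearly interpolate feature map `fmap` (H x W) at continuous (y, x).
    # RoI Align samples at exact sub-pixel locations rather than snapping
    # coordinates to the integer grid.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx +
            fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align(fmap, box, out=2):
    # Pool an `out` x `out` grid from box = (y1, x1, y2, x2) in feature-map
    # coordinates, sampling one point at each bin center.
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out, (x2 - x1) / out
    return np.array([[bilinear(fmap, y1 + (i + 0.5) * bh, x1 + (j + 0.5) * bw)
                      for j in range(out)] for i in range(out)])

fmap = np.arange(16.0).reshape(4, 4)
print(roi_align(fmap, (0.5, 0.5, 2.5, 2.5)))  # 2x2 pooled features
```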
96. • Considers the accuracy of all tasks
• Assigns a weight w_t to each subtask's training loss function
• Base task: the task with the smallest and most stable training loss
• Subtasks: every task other than the base task
• Suppressing the spread of training losses across tasks stabilizes training and improves accuracy
E = \frac{1}{M} \sum_{m=1}^{M} \left( \| L_{f,m} - O_{f,m} \|_2^2 + \sum_{t \neq f}^{T} w_t \, \| L_{t,m} - O_{t,m} \|_2^2 \right)

(M: number of training samples, T: number of tasks, L: training label, O: network output, f: base task)
H. Fukui, "Multiple Facial Attributes Estimation based on Weighted Heterogeneous Learning", ACCV-WS, 2016.
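A direct NumPy transcription of the loss above, assuming one (M, d_t) label/output array per task. The function name and data layout are illustrative, not the paper's code:

```python
import numpy as np

def weighted_hetero_loss(labels, outputs, w, f=0):
    # E = (1/M) * sum_m ( ||L_f,m - O_f,m||^2
    #                     + sum_{t != f} w_t * ||L_t,m - O_t,m||^2 )
    # labels, outputs: one (M, d_t) array per task; w: per-task weights
    # (w[f] is ignored -- the base task enters with weight 1).
    M = labels[0].shape[0]
    total = 0.0
    for t, (L, O) in enumerate(zip(labels, outputs)):
        sq = np.sum((L - O) ** 2, axis=1).sum()  # summed squared L2 error
        total += sq if t == f else w[t] * sq
    return total / M

# Toy check: base task fits perfectly, one subtask is off by 1 everywhere.
labels = [np.zeros((2, 1)), np.zeros((2, 1))]
outputs = [np.zeros((2, 1)), np.ones((2, 1))]
print(weighted_hetero_loss(labels, outputs, w=[1.0, 0.5]))  # 0.5 * 2 / 2 = 0.5
```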
97. • Compute a stability score for each task t
• Computed from the training loss observed when task t is trained with a single-task CNN:
  N_t = \mu_t + 3\sigma_t
  (\mu_t: mean of task t's training loss, \sigma_t: its standard deviation)
• Base task selection: the task f with the smallest stability score N_t
• The weight is the ratio of stability scores relative to the base task:
  w_t = \frac{N_f}{N_t}
H. Fukui, "Facial Image Analysis by CNN with Heterogeneous Learning", IWAIT, 2017.
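The stability and weight computation can be sketched as follows, assuming the formulas reconstructed above, N_t = mu_t + 3*sigma_t and w_t = N_f / N_t. The function name and data layout are illustrative:

```python
import numpy as np

def task_weights(loss_histories):
    # loss_histories[t]: training-loss curve of task t when trained alone
    # with a single-task CNN. Stability N_t = mu_t + 3*sigma_t; the base
    # task f minimizes N_t; weights w_t = N_f / N_t, so the most stable
    # task gets weight 1 and noisier tasks are down-weighted.
    mu = np.array([np.mean(h) for h in loss_histories])
    sigma = np.array([np.std(h) for h in loss_histories])
    N = mu + 3.0 * sigma
    f = int(np.argmin(N))
    return N[f] / N, f

# Toy curves: task 0 is low and flat (stable), task 1 is high and noisy.
hist = [np.array([0.10, 0.10, 0.10]), np.array([1.0, 2.0, 3.0])]
w, f = task_weights(hist)
print(f, np.round(w, 3))  # base task 0; the noisy task gets a small weight
```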