3. ILSVRC image classification: top-5 error by year

   Team           Year   Top-5 error
   AlexNet        2012   15.3%
   Clarifai       2013   11.2%
   VGG            2014   7.32%
   GoogLeNet      2014   6.67%
   ResNet         2015   3.57%
   ResNet+        2016   2.99%
   SENet          2017   2.25%
   Human expert   -      5.1%
14. • Winning model of ILSVRC 2012
• Rectified Linear Units (ReLU)
• Local Response Normalization (LRN)
• Dropout (in the fully connected layers)
• Pre-training
A. Krizhevsky, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS, 2012.
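As a rough illustration, the ReLU nonlinearity and (inverted) dropout listed above can be sketched in a few lines of NumPy. This is a minimal sketch, not AlexNet's actual implementation:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) -- avoids the saturation of sigmoid/tanh units
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # Inverted dropout on a fully connected activation: each unit is dropped
    # with probability p during training; scaling by 1/(1-p) keeps the
    # expected activation unchanged, so nothing special is needed at test time.
    if not train or p == 0.0:
        return x
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))           # negative inputs clamp to 0
print(dropout(relu(x)))  # roughly half the units zeroed during training
```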
15. • Runner-up of ILSVRC 2014
• Simple model design
• Unifies all convolution filter sizes to 3x3
• Doubles the number of convolution filters after each pooling layer
• Staged training (adds convolutional layers and fine-tunes)
K. Simonyan, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR, 2015.
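The 3x3 design choice can be checked numerically: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters. A small sketch, assuming stride-1 convolutions and equal input/output channel counts:

```python
def conv_params(k, c_in, c_out, bias=True):
    # Parameter count of a single k x k convolution layer.
    return k * k * c_in * c_out + (c_out if bias else 0)

def receptive_field(kernel_sizes):
    # Receptive field of a stack of stride-1 convolutions.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

c = 256
print(receptive_field([3, 3]), 2 * conv_params(3, c, c))  # 5, ~1.18M params
print(receptive_field([5]), conv_params(5, c, c))         # 5, ~1.64M params
```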
16. • Winning model of ILSVRC 2014
• Inception module that applies filters of different sizes in parallel
• Global Average Pooling (GAP)
• Auxiliary loss
[Figure: Inception module — the previous layer feeds four parallel branches (1x1 convolutions; 1x1 followed by 3x3 convolutions; 1x1 followed by 5x5 convolutions; 3x3 max pooling followed by 1x1 convolutions), and the branch outputs are concatenated. Legend: convolution, pooling, softmax, other.]
C. Szegedy, "Going Deeper with Convolutions", CVPR, 2015.
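A toy sketch of the branch-and-concatenate idea behind the Inception module. To keep it short, the 3x3 and 5x5 convolutions are stubbed out as 1x1 channel maps, and the channel counts are illustrative, not GoogLeNet's actual ones:

```python
import numpy as np

def conv1x1(x, c_out, rng):
    # A 1x1 convolution is a per-pixel linear map over channels.
    w = rng.standard_normal((x.shape[-1], c_out)) * 0.01
    return x @ w

def inception_module(x, rng):
    # Simplified Inception: parallel branches at different filter sizes,
    # each preceded by a 1x1 "bottleneck", concatenated along channels.
    # (The 3x3/5x5 convs are stubbed as 1x1 maps in this sketch.)
    b1 = conv1x1(x, 64, rng)                     # 1x1 branch
    b2 = conv1x1(conv1x1(x, 96, rng), 128, rng)  # 1x1 -> "3x3" branch
    b3 = conv1x1(conv1x1(x, 16, rng), 32, rng)   # 1x1 -> "5x5" branch
    b4 = conv1x1(x, 32, rng)                     # "pool" -> 1x1 branch
    return np.concatenate([b1, b2, b3, b4], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))  # H x W x C feature map
y = inception_module(x, rng)
print(y.shape)  # (28, 28, 256): 64 + 128 + 32 + 32 channels
```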
55. • More training data improves accuracy
• How do we obtain more data?
• Collect more images ourselves, or use a large-scale dataset
• Data augmentation
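A minimal sketch of two classic label-preserving augmentations, random horizontal flip and random crop. The image and crop sizes here are arbitrary choices for illustration:

```python
import numpy as np

def random_augment(img, rng):
    # Random horizontal flip plus random crop (24x24 from a 28x28 image):
    # each call returns a slightly different view of the same image, so one
    # labeled example yields many training samples.
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal flip
    h, w = img.shape
    ch, cw = 24, 24
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28))
crops = [random_augment(img, rng) for _ in range(4)]
print([c.shape for c in crops])  # four 24x24 views of one image
```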
[Figure: prediction accuracy vs. size of training data — deep learning keeps improving as the training set grows, while other machine learning tools plateau. Supercomputing power is the engine of deep learning, while big data is its fuel.]
Table 3. Average Precision @ IoU threshold of 0.5 on the PASCAL VOC 2007 test set. [Caption truncated in extraction; table rows missing.]

#Iters on JFT-300M   #Epochs   mAP@[.5,.95]
12M                  1.3       35.0
24M                  2.6       36.1
36M                  4         36.8
Table 4. mAP@[.5,.95] on COCO minival* with a JFT-300M checkpoint trained from scratch for different numbers of epochs.

#Iters on ImageNet   #Epochs   mAP@[.5,.95]
100K                 3         22.2
200K                 6         25.9
400K                 12        27.4
5M                   150       34.5
Table 5. mAP@[.5,.95] on COCO minival* with an ImageNet checkpoint trained for different numbers of epochs.
[Figure 4: Object detection performance when initial checkpoints are pre-trained on different subsets of JFT-300M from scratch. x-axis: number of examples (10-300 million, log scale); y-axis: mean AP — mAP@[.5,.95] on COCO minival* (left) and mAP@.5 on PASCAL VOC 2007 test (right). Curves shown with and without fine-tuning.]
Number of classes   mAP@[.5,.95]
1K ImageNet         31.2
18K JFT             31.9
Table 6. Object detection performance in mean AP@[.5,.95] on the COCO minival* set. We compare checkpoints pre-trained on 30M JFT images where labels are limited to the 1K ImageNet classes, and on 30M JFT images covering all 18K JFT classes.
[Truncated excerpt: Figure 5 and the surrounding text study the impact of model capacity, comparing ResNet-50, ResNet-101, and ResNet-152 checkpoints pre-trained on the 300M images, with hyperparameters matched to the ImageNet models. Section 5.3 reports semantic segmentation on the PASCAL VOC 2012 benchmark, evaluated by the standard mean intersection-over-union, using a DeepLab model [4] based on ResNet-101.]
C. Sun, "Revisiting Unreasonable Effectiveness of Data in Deep Learning Era", ICCV, 2017.
73. • Predicts vehicle behavior with Convolutional Social Pooling
• LSTM encoder-decoder model
• Outputs the parameters of a probability distribution over maneuvers (lane change, keep speed, accelerate, decelerate)
M. Trivedi, "Convolutional Social Pooling for Vehicle Trajectory Prediction", CVPR-WS, 2018.
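The discrete-maneuver output described above amounts to a softmax head over maneuver classes. The sketch below is a hypothetical stand-in, not the paper's code: the hidden size, the weights, and the `maneuver_probabilities` helper are all illustrative:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical decoder head: the LSTM decoder's hidden state is mapped to a
# probability distribution over the discrete maneuvers named on the slide.
MANEUVERS = ["lane change", "keep speed", "accelerate", "decelerate"]

def maneuver_probabilities(hidden, w, b):
    return softmax(hidden @ w + b)

rng = np.random.default_rng(0)
hidden = rng.standard_normal(32)  # decoder hidden state (assumed size)
w = rng.standard_normal((32, len(MANEUVERS))) * 0.1
b = np.zeros(len(MANEUVERS))
p = maneuver_probabilities(hidden, w, b)
print(dict(zip(MANEUVERS, np.round(p, 3))))  # probabilities sum to 1
```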
74. • Trajectory prediction using a GAN
• Generator: generates trajectories
• Discriminator: classifies generated trajectories as real or fake
[Figure panels: "People Merging", "Group Avoiding", "Person Following"; sample columns (1)-(4).]
Figure 6: Examples of diverse predictions from our model. Each row shows a different set of observed trajectories; columns
show four different samples from our model for each scenario which demonstrate different types of socially acceptable
behavior. BEST is the sample closest to the ground-truth; in SLOW and FAST samples, people change speed to avoid
collision; in DIR samples people change direction to avoid each other. Our model learns these different avoidance strategies
in a data-driven manner, and jointly predicts globally consistent and socially acceptable trajectories for all people in the scene.
We also show some failure cases in supplementary material.
A. Gupta, "Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks", CVPR, 2018.
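The generator/discriminator split above follows the standard GAN objective: the discriminator is trained to score real trajectories high and generated ones low, while the generator is trained to fool it. A toy NumPy sketch with made-up discriminator scores (the `bce` helper and all numbers are illustrative, not the paper's losses):

```python
import numpy as np

def bce(pred, target):
    # Binary cross-entropy on discriminator scores in (0, 1).
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Toy stand-ins: scores a discriminator might assign to real trajectories
# and to trajectories sampled from the generator.
d_real = np.array([0.9, 0.8, 0.95])  # D(real path)  -> should be near 1
d_fake = np.array([0.2, 0.3, 0.1])   # D(G(noise))   -> should be near 0

d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))  # train D
g_loss = bce(d_fake, np.ones(3))                             # train G to fool D
print(round(d_loss, 3), round(g_loss, 3))
```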
91. • Performs detection and segmentation simultaneously on top of an RPN-based object detector
• Introduces RoI Align
Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1)
                     backbone                 AP     AP50   AP75   APS    APM    APL
MNC [7]              ResNet-101-C4            24.6   44.3   24.8   4.7    25.9   43.6
FCIS [20] +OHEM      ResNet-101-C5-dilated    29.2   49.5   -      7.1    31.3   50.0
FCIS+++ [20] +OHEM   ResNet-101-C5-dilated    33.6   54.5   -      -      -      -
Mask R-CNN           ResNet-101-C4            33.1   54.9   34.8   12.1   35.6   51.1
Mask R-CNN           ResNet-101-FPN           35.7   58.0   37.8   15.5   38.1   52.4
Mask R-CNN           ResNeXt-101-FPN          37.1   60.0   39.4   16.9   39.9   53.5
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [7] and FCIS [20] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip testing, and OHEM [29]. All entries are single-model results.
[…] evaluated using mask IoU. As in previous work [3, 21], we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k subset of val images (minival). We also report results on test-dev [22], which has no disclosed labels. Upon publication, we will upload our full results on test-std to the public leaderboard, as recommended.
4.1. Main Results
We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC [7] and […] expect many such improvements to be applicable to ours. Mask R-CNN outputs are visualized in Figures 2 and 4. Mask R-CNN achieves good results even under challenging conditions. In Figure 5 we compare our Mask R-CNN baseline and FCIS+++ [20]. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.
4.2. Ablation Experiments
We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.
K. He, "Mask R-CNN", ICCV, 2017.
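The RoI Align idea highlighted on the slide is bilinear sampling at continuous sub-pixel locations, instead of the coordinate rounding used by RoI Pooling. A simplified sketch with one bilinear sample per output bin (the paper averages four samples per bin):

```python
import numpy as np

def bilinear(fmap, y, x):
    # Bilinearly interpolate feature map `fmap` (H x W) at continuous (y, x).
    # RoI Align samples at exact sub-pixel locations rather than snapping
    # coordinates to the integer grid.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx +
            fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align(fmap, box, out=2):
    # Pool an `out` x `out` grid from box = (y1, x1, y2, x2) in feature-map
    # coordinates, sampling one point at each bin center.
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out, (x2 - x1) / out
    return np.array([[bilinear(fmap, y1 + (i + 0.5) * bh, x1 + (j + 0.5) * bw)
                      for j in range(out)] for i in range(out)])

fmap = np.arange(16.0).reshape(4, 4)
print(roi_align(fmap, (0.5, 0.5, 2.5, 2.5)))  # 2x2 pooled features
```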
96. • Considers the accuracy of all tasks
• Assigns a weight w_t to each subtask's training loss function
• Base task: the task with the smallest and most stable training loss
• Subtasks: every task other than the base task
• Suppressing the spread of training losses across tasks stabilizes training and improves accuracy
E = \frac{1}{M} \sum_{m=1}^{M} \left( \| L_{f,m} - O_{f,m} \|_2^2 + \sum_{t \neq f}^{T} w_t \, \| L_{t,m} - O_{t,m} \|_2^2 \right)

(M: number of training samples, T: number of tasks, L: training label, O: network output, f: base task)
H. Fukui, "Multiple Facial Attributes Estimation based on Weighted Heterogeneous Learning", ACCV-WS, 2016.
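A direct NumPy transcription of the loss above, assuming one (M, d_t) label/output array per task. The function name and data layout are illustrative, not the paper's code:

```python
import numpy as np

def weighted_hetero_loss(labels, outputs, w, f=0):
    # E = (1/M) * sum_m ( ||L_f,m - O_f,m||^2
    #                     + sum_{t != f} w_t * ||L_t,m - O_t,m||^2 )
    # labels, outputs: one (M, d_t) array per task; w: per-task weights
    # (w[f] is ignored -- the base task enters with weight 1).
    M = labels[0].shape[0]
    total = 0.0
    for t, (L, O) in enumerate(zip(labels, outputs)):
        sq = np.sum((L - O) ** 2, axis=1).sum()  # summed squared L2 error
        total += sq if t == f else w[t] * sq
    return total / M

# Toy check: base task fits perfectly, one subtask is off by 1 everywhere.
labels = [np.zeros((2, 1)), np.zeros((2, 1))]
outputs = [np.zeros((2, 1)), np.ones((2, 1))]
print(weighted_hetero_loss(labels, outputs, w=[1.0, 0.5]))  # 0.5 * 2 / 2 = 0.5
```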
97. • Compute a stability score for each task t
• Computed from the training loss observed when task t is trained with a single-task CNN:
  N_t = \mu_t + 3\sigma_t
  (\mu_t: mean of task t's training loss, \sigma_t: its standard deviation)
• Base task selection: the task f with the smallest stability score N_t
• The weight is the ratio of stability scores relative to the base task:
  w_t = \frac{N_f}{N_t}
H. Fukui, "Facial Image Analysis by CNN with Heterogeneous Learning", IWAIT, 2017.
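The stability and weight computation can be sketched as follows, assuming the formulas reconstructed above, N_t = mu_t + 3*sigma_t and w_t = N_f / N_t. The function name and data layout are illustrative:

```python
import numpy as np

def task_weights(loss_histories):
    # loss_histories[t]: training-loss curve of task t when trained alone
    # with a single-task CNN. Stability N_t = mu_t + 3*sigma_t; the base
    # task f minimizes N_t; weights w_t = N_f / N_t, so the most stable
    # task gets weight 1 and noisier tasks are down-weighted.
    mu = np.array([np.mean(h) for h in loss_histories])
    sigma = np.array([np.std(h) for h in loss_histories])
    N = mu + 3.0 * sigma
    f = int(np.argmin(N))
    return N[f] / N, f

# Toy curves: task 0 is low and flat (stable), task 1 is high and noisy.
hist = [np.array([0.10, 0.10, 0.10]), np.array([1.0, 2.0, 3.0])]
w, f = task_weights(hist)
print(f, np.round(w, 3))  # base task 0; the noisy task gets a small weight
```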