PR-325
Sunghoon Joo, Samsung SDS
2021. 06. 13.
1. Research Background
Pre-training mechanism for cross-modality tasks
• Combining self-attention with self-supervision makes it possible to learn context from very large amounts of data without labels. This yields strong performance not only in language and vision individually but also in cross-modality tasks (Visual Question Answering, image captioning, image retrieval).
Visual Question Answering (VQA) text-to-image retrieval
Park, Gwangbeen, and Woobin Im. arXiv:1612.08354
Cross-modality learning in vision and language
Timeline: CNN features → Visual Genome dataset released → Faster R-CNN used as the visual backbone → Transformers learn dense connections between the two modalities
• CNN features pre-trained for visual classification are used as the visual representation
Ben-Younes, Hedi, et al. "Mutan: Multimodal tucker fusion for visual question answering." ICCV. (2017)
https://visualgenome.org/
Krishna, Ranjay, et al. "Visual genome: Connecting
language and vision using crowdsourced dense image
annotations." IJCV. (2017)
* PR-012 (By JinWon lee)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn:
Towards real-time object detection with region
proposal networks. NIPS. (2015)
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and
visual question answering." CVPR. (2018)
Faster R-CNN
With self-attention, the embedding of each token in the input sentence can be computed while taking the other tokens into account
https://jalammar.github.io/illustrated-transformer/
Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language
tasks." ECCV. (2020)
• The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections
Self-attention in Transformer
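As a rough illustration of how self-attention connects tokens within and across modalities, the sketch below runs single-head scaled dot-product attention over a joint sequence of text and visual tokens. The embedding size, token counts, and random projection matrices are assumptions made for the example; it is not the implementation of any model cited in these slides.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (n, d) array of token embeddings.
    Returns the attended outputs (n, d_v) and the attention weights (n, n).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8                                    # hypothetical embedding size
text_tokens = rng.normal(size=(5, d))    # stand-in for 5 word tokens
visual_tokens = rng.normal(size=(3, d))  # stand-in for 3 visual features

# Joint sequence: every token can attend to tokens of both modalities.
joint = np.concatenate([text_tokens, visual_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(joint, w_q, w_k, w_v)

# attn[:5, :5] holds text-to-text (intra-domain) weights,
# attn[:5, 5:] holds text-to-visual (inter-domain) weights.
```

Because the two modalities share one sequence, the same attention matrix carries both the intra-domain and the inter-domain connections.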
Get better joint representation for vision and language tasks
• Two-stream neural network approach
- Visual and language information are processed separately by a two-stream neural network and then merged through transformer layers
- Example: ViLBERT
• Single-stream neural network approach
- Sentence embedding features and bounding box features are combined, and BERT is then used to learn their bi-directional joint distribution
Lu, J. et al., ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NIPS (2019)
• Single-stream example: VL-BERT
Su, Weijie, et al. "Vl-bert: Pre-training of generic visual-linguistic representations." ICLR 2020.
Limitations of methods that use region-based visual feature extractors
• The claim: because the visual features are extracted for a specific task (object detection), they cannot capture image features sufficient for language understanding
• Loss of visual information: object shapes, relational information where objects overlap, and cues from the background or the overall mood of the image
Objective & Approach
• We step out of the bounding box to make the full power of visual information in images for
vision and language learning.
• We propose Pixel-BERT that learns to align image pixels with text to build a more thorough
semantic embedding between visual and textual information.
Pipeline components:
- Word-level token embedding based on BERT
- CNN that takes image pixels as input for visual embedding learning
- Multi-modal transformers for jointly learning
- Task (VQA, retrieval, …)
2. Methods
Approach
1) Pre-training Pixel-BERT
2) Training for downstream tasks
Architectures
① We learn from pixels to represent an image instead of using bounding boxes.
② Randomly sample feature pixels (100 features) during pre-training, for robustness and lower computation cost.
③ The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections.
Figure labels: Tokenize; ResNet-50 or ResNeXt-152 backbone, pre-trained on ImageNet; 100 features
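The random pixel sampling step (100 features during pre-training) can be sketched as follows. This is a minimal illustration: the feature-map shape, the function name `sample_feature_pixels`, and sampling without replacement are assumptions for the example, not details confirmed by the slides.

```python
import numpy as np


def sample_feature_pixels(feature_map, num_samples=100, rng=None):
    """Randomly keep up to `num_samples` spatial positions of a CNN feature map.

    feature_map: (C, H, W) array from the visual backbone.
    Returns a (k, C) array with k = min(num_samples, H * W),
    one row per sampled feature pixel.
    """
    rng = np.random.default_rng() if rng is None else rng
    c, h, w = feature_map.shape
    pixels = feature_map.reshape(c, h * w).T         # (H*W, C)
    k = min(num_samples, h * w)                      # small maps: keep everything
    idx = rng.choice(h * w, size=k, replace=False)   # assumed: no replacement
    return pixels[idx]


# Hypothetical backbone output: 2048 channels on a 20 x 20 grid.
fmap = np.random.default_rng(0).normal(size=(2048, 20, 20))
sampled = sample_feature_pixels(fmap, num_samples=100, rng=np.random.default_rng(1))
```

With a 20 x 20 grid, the Transformer receives 100 visual tokens instead of 400, which is where the computation saving comes from.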
Pre-trained by MLM and ITM tasks
• MS-COCO captions
Chen, Xinlei, et al. arXiv:1504.00325 (2015).
• Visual Genome (VG)
• Masked Language Modeling (MLM)
- Because visual features are used to predict the masked tokens, the model is encouraged to learn a mapping between the two modalities
- Randomly mask language tokens with a probability of 0.15
- The same approach is applied in UNITER (2020, ECCV), VL-BERT (2020, ICLR), and LXMERT (2019, EMNLP)
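The masking step of MLM (each language token masked with probability 0.15) might look like the sketch below. It is a simplified illustration: every selected token is replaced by a `[MASK]` placeholder, and the function name and label convention are assumptions made for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask each token independently with probability `mask_prob`.

    Returns (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere, so a loss would cover only
    the masked slots.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


caption = "the man at bat readies to swing at the pitch".split()
masked, labels = mask_tokens(caption, mask_prob=0.15, rng=random.Random(0))
```

The model would then predict the tokens stored in `labels` from the remaining text together with the visual features.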
• Image-Text Matching (ITM)
- A binary classification task: predict whether the sentence fed to the Transformer correctly describes the image fed with it
- Equal numbers of positive and negative (unmatched image-sentence pairs) samples are used
- Example captions: "The man at bat readies to swing at the pitch while the umpire looks on." (True); "A man taking a picture behind the girl" (False)
• Pre-training setup
- For CNN: SGD with learning rate 1e-2 and weight decay 5e-4
- For Transformer: AdamW with learning rate 1e-4 and weight decay 1e-2
- Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs.
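The balanced positive/negative sampling used for ITM can be sketched as below. The image ids, the third caption, and the function name `build_itm_pairs` are hypothetical, and drawing the negative caption uniformly from other images is an assumption, since the slides only state that equal numbers of positives and negatives are used.

```python
import random


def build_itm_pairs(pairs, rng=None):
    """Build a balanced ITM set from matched (image_id, caption) pairs.

    For every positive pair (label 1), one negative pair (label 0) is made
    by swapping in a caption that belongs to a different image.
    """
    rng = rng or random.Random()
    samples = []
    for image_id, caption in pairs:
        samples.append((image_id, caption, 1))
        # Candidate negatives: captions attached to any other image.
        others = [c for i, c in pairs if i != image_id]
        samples.append((image_id, rng.choice(others), 0))
    rng.shuffle(samples)
    return samples


# Hypothetical image ids; the first two captions come from the slide.
pairs = [
    ("img_001", "The man at bat readies to swing at the pitch while the umpire looks on."),
    ("img_002", "A man taking a picture behind the girl"),
    ("img_003", "A dog sleeping on a couch"),
]
samples = build_itm_pairs(pairs, rng=random.Random(0))
```

Each sample is an (image, sentence, label) triple that a binary classifier on top of the joint representation would be trained on.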
3. Experimental Results
Downstream task - Visual Question Answering (VQA)
Figure: Question [SEP] Image → Pretrained Pixel-BERT → [CLS] → 0 / 1
Results table labels: CNN feature, no transformer; Faster-RCNN, no transformer; Faster-RCNN, transformer, pretrained model; ResNet-50; ResNeXt-152
• Learning image representations at the pixel level helps improve performance
• 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 256
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs.
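The fine-tuning schedule (18 epochs, learning rate decayed by 10 at the 12th and 16th epochs) corresponds to a simple step decay, sketched here. The base learning rate is a hypothetical value because the slides do not give one, and counting epochs from 1 is an assumption.

```python
def lr_at_epoch(base_lr, epoch, milestones=(12, 16), factor=10.0):
    """Step decay: divide the learning rate by `factor` at each milestone.

    Assumes epochs are counted from 1 and that a decay applies from the
    milestone epoch onward (the slides do not specify the indexing).
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** num_decays)


base_lr = 1e-4  # hypothetical fine-tuning learning rate, not from the slides
schedule = [lr_at_epoch(base_lr, e) for e in range(1, 19)]  # 18 epochs
```

Under these assumptions, epochs 1 to 11 use the base rate, epochs 12 to 15 use one tenth of it, and epochs 16 to 18 use one hundredth.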
Downstream task - Natural Language for Visual Reasoning for Real (NLVR2)
https://lil.nlp.cornell.edu/nlvr/
• 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 128
• Learning rate decay: by a factor of 10 at the 12th and 16th epochs.
• Q [SEP] Image1 and Q [SEP] Image2 are each fed to Pixel-BERT, giving the [CLS] outputs p1^cls and p2^cls
• The two outputs are concatenated (Concat.) and classified as 0 / 1 with a cross-entropy loss
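The NLVR2 head shown in the figure (two [CLS] outputs, concatenation, a 0 / 1 classifier trained with cross-entropy) can be sketched as follows. Random vectors stand in for the Pixel-BERT outputs, and the sizes, the single linear layer, and the function names are assumptions made for the example.

```python
import numpy as np


def nlvr2_logits(p1_cls, p2_cls, weight, bias):
    """Concatenate the two [CLS] embeddings and apply a linear classifier."""
    joint = np.concatenate([p1_cls, p2_cls], axis=-1)   # (batch, 2 * d)
    return joint @ weight + bias                        # (batch, 2)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


rng = np.random.default_rng(0)
batch, d = 4, 16                         # hypothetical sizes
p1 = rng.normal(size=(batch, d))         # stand-in for [CLS] of (Q, Image1)
p2 = rng.normal(size=(batch, d))         # stand-in for [CLS] of (Q, Image2)
weight = rng.normal(size=(2 * d, 2)) * 0.01
bias = np.zeros(2)
labels = np.array([0, 1, 1, 0])          # 0 = false, 1 = true

logits = nlvr2_logits(p1, p2, weight, bias)
loss = cross_entropy(logits, labels)
```

Running the same encoder once per image and merging only at the [CLS] level lets the model handle a two-image input without changing the encoder itself.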
• Pixel-BERT also performs well on the NLVR2 task, which takes two images as input
Downstream task - Image-Text Retrieval
• The query (text) and an image are fed into Pixel-BERT to compute a relevance score, and the top-ranked results are returned
• Compared with Unicoder-VL and UNITER, Pixel-BERT shows meaningful improvements on all subtasks
• The IR subtask requires understanding a global description of the image; in this respect, Pixel-BERT has the advantage of learning attention between image pixels and language
Results tables: 1K testing results on Flickr30K; 5-fold 1K testing and 5K testing results on MS-COCO
Table label: 12-Layer Transformer
TR: Image-to-text retrieval. Given an image, retrieve the matching text
IR: Text-to-image retrieval. Given text, retrieve the matching image
R@k: Recall at k. Indicates how many of the truly relevant items are included when the model returns k results.
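Ranking by relevance score and computing R@k as defined above can be sketched in a few lines. The image ids and scores are hypothetical, and the helper names are chosen for the example only.

```python
def rank_by_score(scores):
    """Return candidate ids sorted by descending relevance score."""
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)


# Hypothetical relevance scores for one text query over five images.
scores = {"img_a": 0.12, "img_b": 0.91, "img_c": 0.40, "img_d": 0.77, "img_e": 0.05}
ranked = rank_by_score(scores)
r_at_1 = recall_at_k(ranked, {"img_d"}, k=1)
r_at_5 = recall_at_k(ranked, {"img_d"}, k=5)
```

In a benchmark, this per-query value would be averaged over all queries to give the reported R@1, R@5, and R@10 figures.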
Ablation Study
1) ITM, MLM pre-training task
2) Random pixel sampling method
3) Using a stronger visual backbone
• The results confirm that the pre-training tasks and random pixel sampling each contribute to downstream performance
• Using a stronger visual backbone model is expected to further improve Pixel-BERT's performance
Visualization
• Pixel-BERT can learn region-level visual representations well through cross-modality learning
Figure note: parts that were difficult to represent with bounding-box models
4. Conclusions
1. Proposed Pixel-BERT, which combines a CNN-based visual encoder with a multimodal Transformer
2. Built a Pixel-BERT-based pre-training model and verified its performance on downstream tasks
• Masked language model and image-text matching are two tasks designed for pre-training.
• Evaluated on four downstream vision and language tasks, achieving the best performance on most of them
3. Proposed "a random pixel sampling mechanism" for robustness
4. Future work
• Pre-training Pixel-BERT on the Conceptual Caption Dataset
• Investigating whether self-supervised tasks can be applied to Pixel-BERT
Thank you
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Inefficiency of pixel feature embedding
• Pixel-BERT: randomly selected 100 features (2048 dim)
• Oscar: region feature is a P-dimensional vector (i.e., P = 2048); region position z is an R-dimensional vector (i.e., R = 4 or 6)
Pixel-BERT
Oscar Single Tesla P100 (16GB)
PR-285 Leveraging Semantic and Lexical Matching to Improve the Recall of Docu...
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationPR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
 
PR-246: A deep learning system for differential diagnosis of skin diseases
PR-246: A deep learning system for differential diagnosis of skin diseasesPR-246: A deep learning system for differential diagnosis of skin diseases
PR-246: A deep learning system for differential diagnosis of skin diseases
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
 
PR-218: MFAS: Multimodal Fusion Architecture Search
PR-218: MFAS: Multimodal Fusion Architecture SearchPR-218: MFAS: Multimodal Fusion Architecture Search
PR-218: MFAS: Multimodal Fusion Architecture Search
 
PR-203: Class-Balanced Loss Based on Effective Number of Samples
PR-203: Class-Balanced Loss Based on Effective Number of SamplesPR-203: Class-Balanced Loss Based on Effective Number of Samples
PR-203: Class-Balanced Loss Based on Effective Number of Samples
 
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
 
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representati...
 
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
PR-159 : Synergistic Image and Feature Adaptation: Towards Cross-Modality Dom...
 

Dernier

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 

Dernier (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 

[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

  • 3. 1. Research Background: Pre-training mechanism for cross-modality tasks. Combining self-attention with self-supervision has made it possible to learn context from very large amounts of data without labeling. This approach performs well not only in language and vision individually but also in cross-modality domains (Visual Question Answering, image captioning, image retrieval). Figure examples: Visual Question Answering (VQA); text-to-image retrieval (Park, Gwangbeen, and Woobin Im. arXiv:1612.08354).
  • 4. 1. Research Background: Cross-modality learning in vision and language. Pre-trained CNN features built for visual classification are used as the visual representation (Ben-Younes, Hedi, et al. "Mutan: Multimodal tucker fusion for visual question answering." ICCV. (2017)). Timeline stages: use of CNN features; release of the Visual Genome dataset; use of Faster R-CNN as the visual backbone; learning dense connections between the two modalities with Transformers.
  • 5. 1. Research Background: Cross-modality learning in vision and language. Release of the Visual Genome dataset (https://visualgenome.org/; Krishna, Ranjay, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." IJCV. (2017)).
  • 6. 1. Research Background: Cross-modality learning in vision and language. Use of Faster R-CNN as the visual backbone (* PR-012, by JinWon Lee). References: Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. (2015); Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering." CVPR. (2018).
  • 7. 1. Research Background: Cross-modality learning in vision and language. Self-attention in Transformer: self-attention lets the embedding of each token in the input sentence be computed with the other tokens taken into account (https://jalammar.github.io/illustrated-transformer/). The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections (Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for vision-language tasks." ECCV. (2020)).
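As a rough illustration of the self-attention idea on this slide, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes queries, keys, and values are all the raw token vectors (no learned projections, a single head), which is a simplification and not the Transformer used in the talk; the names `softmax`, `self_attention`, and the toy `tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of n token vectors, each of dimension d.
    Assumption: no learned projection matrices and a single head.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # New embedding: attention-weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
attended = self_attention(tokens)
```

Each output vector is a weighted mix of all input vectors, which is what "computed with the other tokens taken into account" means in practice.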
  • 8. 1. Research Background: Get better joint representation for vision and language tasks. Two-stream neural network approach: visual and language information are each processed by their own stream, and the two streams are then merged through transformer layers. Example: ViLBERT (Lu, J. et al., Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NIPS (2019)). Single-stream neural network approach: sentence embedding features and bounding box features are combined, and BERT is then trained on them to learn a bi-directional joint distribution.
  • 9. 1. Research Background: Get better joint representation for vision and language tasks. Two-stream neural network approach: visual and language information are each processed by their own stream, and the two streams are then merged through transformer layers. Single-stream neural network approach: sentence embedding features and bounding box features are combined, and BERT is then trained on them to learn a bi-directional joint distribution. Example: VL-BERT (Su, Weijie, et al. "Vl-bert: Pre-training of generic visual-linguistic representations." ICLR 2020).
  • 10. 1. Research Background: Limitations of methods that use region-based visual feature extractors. The visual features are extracted for one specific task, object detection, so the argument is that they cannot supply image features sufficient for language understanding. Loss of visual information: object shapes, relational information that appears where objects overlap, and information available from the background or the mood of the image.
  • 11. 1. Research Background: Objective & Approach. We step out of the bounding box to make the full power of visual information in images for vision and language learning. We propose Pixel-BERT that learns to align image pixels with text to build a more thorough semantic embedding between visual and textual information. Components: word-level token embedding based on BERT; a CNN that takes image pixels as input for visual embedding learning; multi-modal transformers for joint learning; task heads (VQA, retrieval, …).
  • 13. 2. Methods: Approach. 1) Pre-training Pixel-BERT. 2) Training for downstream tasks.
  • 14. 2. Methods: Architectures. ① We learn from pixels to represent an image instead of using bounding boxes; the visual backbone is ResNet-50 or ResNeXt-152, pre-trained on ImageNet. ② Feature pixels are randomly sampled (100 features) during pre-training, for robustness and lower computation cost. ③ The Transformer's self-attention and feed-forward networks form dense intra-domain and inter-domain connections. The text input is tokenized.
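The random pixel sampling step in ② can be sketched as follows. This is a minimal illustration in plain Python, assuming the CNN feature map has already been flattened into a list of per-position feature vectors; the function name `sample_pixel_features`, the toy map size, and the choice to keep sampled positions in their original order are assumptions, not details from the slides.

```python
import random


def sample_pixel_features(features, k=100, training=True, rng=None):
    """Randomly keep k feature vectors during pre-training.

    features: list of per-position feature vectors (flattened H*W map).
    When training is False, or the map has k or fewer positions, every
    feature is kept. Assumption: sampling is without replacement and
    the original spatial order of the kept positions is preserved.
    """
    rng = rng or random.Random()
    if not training or len(features) <= k:
        return list(features)
    idx = sorted(rng.sample(range(len(features)), k))
    return [features[i] for i in idx]


# Toy feature map: 20 x 20 positions, 4 dimensions each for brevity
# (the slides mention 2048-dimensional features).
feature_map = [[float(i)] * 4 for i in range(20 * 20)]
sampled = sample_pixel_features(feature_map, k=100, rng=random.Random(0))
```

Sampling a fixed number of positions bounds the sequence length seen by the Transformer, which is where the computation saving comes from.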
  • 15. 2. Methods: pre-trained by MLM and ITM tasks. Masked Language Modeling (MLM): because visual features are used to predict the masked tokens, the model is encouraged to learn a mapping between the two modalities. Language tokens are randomly masked with a probability of 0.15. The same method is used in UNITER (2020, ECCV), VL-BERT (2020, ICLR), and LXMERT (2019, EMNLP). Pre-training datasets: MS-COCO captions (Chen, Xinlei, et al. arXiv:1504.00325 (2015)) and Visual Genome (VG).
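The masking step of MLM can be sketched as below. This is a simplified illustration, assuming every selected token is replaced by a `[MASK]` string; other replacement schemes exist, and the slides only state the 0.15 probability. The helper name `mask_tokens` and the whitespace tokenization are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, prob=0.15, rng=None):
    """Mask each token independently with probability prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None elsewhere, so a loss would only be
    computed where a token was hidden.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


caption = "the man at bat readies to swing at the pitch".split()
masked, labels = mask_tokens(caption, prob=0.15, rng=random.Random(0))
```

In the cross-modal setting, the model would predict each hidden token from the remaining text together with the image features.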
  • 16. 2. Methods: pre-trained by MLM and ITM tasks. Image-Text Matching (ITM): a binary classification task that predicts whether the sentence fed to the Transformer correctly describes the image fed in with it. Equal numbers of negative (unmatched image-sentence pairs) and positive samples are used. Pre-training setup: for the CNN, SGD with learning rate 1e-2 and weight decay 5e-4; for the Transformer, AdamW with learning rate 1e-4 and weight decay 1e-2. Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs. Example captions: "The man at bat readies to swing at the pitch while the umpire looks on." and "A man taking a picture behind the girl"; labels shown: True, False.
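One way to build ITM training examples with equal numbers of positives and negatives is sketched below. It assumes negatives are made by pairing an image with the caption of a different, randomly chosen pair; the slides do not say how negatives are drawn, so that choice, the function name `build_itm_examples`, and the image identifiers are all hypothetical.

```python
import random


def build_itm_examples(pairs, rng=None):
    """Create one positive and one negative example per (image, caption) pair.

    Positive: the original pair, label 1.
    Negative: the same image with a caption taken from another pair, label 0.
    Assumption: captions from other pairs do not describe this image.
    Requires at least two pairs.
    """
    rng = rng or random.Random()
    examples = []
    for i, (image, caption) in enumerate(pairs):
        examples.append((image, caption, 1))
        j = rng.choice([n for n in range(len(pairs)) if n != i])
        examples.append((image, pairs[j][1], 0))
    return examples


pairs = [
    ("img_baseball", "The man at bat readies to swing at the pitch while the umpire looks on."),
    ("img_camera", "A man taking a picture behind the girl"),
]
examples = build_itm_examples(pairs, rng=random.Random(0))
```

A binary classifier on top of the joint representation would then be trained to output the label of each example.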
  • 17. 2. Methods: pre-trained by MLM and ITM tasks. Masked Language Modeling (MLM): because visual features are used to predict the masked tokens, the model is encouraged to learn a mapping between the two modalities; language tokens are randomly masked with a probability of 0.15. Image-Text Matching (ITM): equal numbers of negative (unmatched image-sentence pairs) and positive samples are used. Pre-training datasets: MS-COCO captions (Chen, Xinlei, et al. arXiv:1504.00325 (2015)) and Visual Genome (VG). Pre-training setup: for the CNN, SGD with learning rate 1e-2 and weight decay 5e-4 to optimize the CNN backbone; for the Transformer, AdamW with learning rate 1e-4 and weight decay 1e-2. Pre-training: 64 NVIDIA Tesla V100 GPUs with a batch size of 4096 samples for 40 epochs.
  • 19. 3. Experimental Results: Downstream task - Visual Question Answering (VQA). Input to the pretrained Pixel-BERT: [CLS] Question [SEP] Image; output: 0 / 1. Training: 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 256. Learning rate decay: by a factor of 10 at the 12th and 16th epochs. Groups in the results table: CNN feature, no transformer; Faster-RCNN, no transformer; Faster-RCNN, transformer, pretrained model; Pixel-BERT with ResNet-50 and with ResNeXt-152. Learning image representations at the pixel level helps improve performance.
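The step decay described on this slide (divide the learning rate by 10 at the 12th and 16th epochs) can be written as a small function. This is a sketch under assumptions: epochs are counted from 1, the decay takes effect from the milestone epoch onward, and the base rate of 1e-4 is only a placeholder because the slide does not state the fine-tuning learning rate. The name `lr_at_epoch` is hypothetical.

```python
def lr_at_epoch(base_lr, epoch, milestones=(12, 16), factor=10.0):
    """Learning rate for a 1-indexed epoch under step decay.

    The rate is divided by factor once for every milestone that has
    been reached (assumption: decay applies from the milestone epoch on).
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** passed)


# Placeholder base rate; 18 epochs as on the slide.
schedule = [lr_at_epoch(1e-4, epoch) for epoch in range(1, 19)]
```

With these assumptions the schedule holds the base rate for epochs 1 to 11, one tenth of it for epochs 12 to 15, and one hundredth for epochs 16 to 18.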
  • 20. 3. Experimental Results: Downstream task - Natural Language for Visual Reasoning for Real (NLVR2) (https://lil.nlp.cornell.edu/nlvr/). The query Q is paired with each of the two images: [CLS] Q [SEP] Image1 and [CLS] Q [SEP] Image2 are each passed through Pixel-BERT, giving the [CLS] outputs p1 and p2. The two outputs are concatenated and classified (0 / 1) with a cross-entropy loss. Training: 18 epochs on 16 NVIDIA Tesla V100 GPUs with batch size 128. Learning rate decay: by a factor of 10 at the 12th and 16th epochs.
  • 21. 3. Experimental Results: Downstream task - Natural Language for Visual Reasoning for Real (NLVR2). Pixel-BERT also performs well on the NLVR2 task, which takes two images as input.
  • 22. 3. Experimental Results: Downstream task - Image-Text Retrieval. The query (text) and an image are fed to Pixel-BERT, a relevance score is computed, and the top-ranked results are returned. Compared with Unicoder-VL and UNITER, Pixel-BERT shows a meaningful improvement on every subtask. The IR subtask requires understanding a global description of the image, and here Pixel-BERT has the advantage of learning attention between image pixels and language. Result tables: 1K testing results on Flickr30K; 5-fold 1K testing and 5K testing results on MS-COCO (12-layer Transformer). TR: image-to-text retrieval, searching for suitable text given an image. IR: text-to-image retrieval, searching for suitable images given text. R@k: recall at k, which indicates how many of the actually relevant items are included when the model returns k results.
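The R@k metric can be sketched as follows. This version follows a convention common in retrieval benchmarks, counting a query as a hit when at least one relevant item appears in its top k results and averaging over queries; the slides do not spell out the exact variant, so treat that choice as an assumption. The function name `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(ranked_results, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_results: dict mapping query -> list of items, best first.
    relevant: dict mapping query -> set of ground-truth items.
    """
    hits = 0
    for query, ranking in ranked_results.items():
        if any(item in relevant[query] for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_results)


# Toy example: two text queries, three candidate images each.
ranked_results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"b"}, "q2": {"f"}}

r_at_1 = recall_at_k(ranked_results, relevant, 1)  # 0.0
r_at_2 = recall_at_k(ranked_results, relevant, 2)  # 0.5
r_at_3 = recall_at_k(ranked_results, relevant, 3)  # 1.0
```

In this toy data, neither query has its relevant item ranked first, one has it within the top 2, and both have it within the top 3.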
  • 23. 3. Experimental Results: Ablation Study. Compared settings: 1) ITM and MLM pre-training tasks; 2) the random pixel sampling method; 3) use of a stronger visual backbone. The results confirm that the pre-training tasks and random pixel sampling each contribute to better performance on the subtasks. Using a stronger visual backbone model is expected to improve Pixel-BERT further.
  • 24. 3. Experimental Results: Visualization. Pixel-BERT can well learn the visual representation in region level with cross-modality learning. The visualization highlights areas that were difficult to represent with bounding-box models.
  • 26. 4. Conclusions. 1. Pixel-BERT is proposed, combining a CNN-based visual encoder with a multimodal Transformer. 2. A pre-training model based on Pixel-BERT is built and its performance on downstream tasks is verified: masked language modeling and image-text matching are the two tasks designed for pre-training, and on four downstream vision and language tasks it achieves the best performance on most of them. 3. "A random pixel sampling mechanism" is proposed for robustness. 4. Future work: pre-training Pixel-BERT on the Conceptual Caption Dataset; investigating whether self-supervised tasks can be applied to Pixel-BERT.
  • 28. 4. Conclusions. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
  • 29. 4. Conclusions: Inefficiency of pixel feature embedding. Pixel-BERT: randomly selected 100 features (2048 dim). Oscar: each region feature is a P-dimensional vector (i.e., P = 2048), and the region position z is an R-dimensional vector (i.e., R = 4 or 6). Pixel-BERT and Oscar are compared on a single Tesla P100 (16GB).