SlideShare une entreprise Scribd logo
1  sur  30
3-step Parallel Corpus Cleaning using
Monolingual Crowd Workers
Toshiaki Nakazawa, Sadao Kurohashi
(Kyoto University)
Hayato Kobayashi, Hiroki Ishikawa
Manabu Sassano
(Yahoo Japan Corporation)
20/05/2015@PACLING2015
Parallel Corpora
• Essential resources for almost all MT systems
• The quality and quantity greatly affect the
translation quality
• Can be automatically constructed from
existing resources
– Europarl, patent families, Wikipedia…
• Need to manually construct it for domains
which do not have enough existing resources
2
Quality of Parallel Corpus
• Translation flaws are inevitable even thought
the professionals translate
– Homer nods (弘法にも筆の誤り in Japanese)
• The number of flaws might be reduced by
reviewing the whole corpus, but impossible
– The size of the parallel corpus is usually very big
– Very costly if we ask professionals to modify
3
This Work
• Detect and edit the translation flaws in the
existing manually-translated parallel corpus in
effective and cheap way
• Use crowdsourcing in 3-steps
1. Fluency Judgement
2. Edit of Unnatural Sentences
3. Verification of Edits
• The workers can be monolingual
4
Outline
• Motivation
• Brief introduction of collaborative research
between Yahoo Japan and Kyoto University
• 3-step parallel corpus cleaning
• Experiments
– Parallel corpus cleaning
– Translation
• Conclusion and Future Work
5
BRIEF INTRODUCTION OF
COLLABORATIVE RESEARCH
6
Collaborative Research Between Yahoo
Japan and Kyoto University
• Goal: Improve the Chinese-to-Japanese
translation for E-commerce site
• Task:
– Develop a corpus-based MT system
– Construct a parallel corpus of EC-site, especially
fashion domain
7
Fashion-domain EC-site Parallel Corpus
• 1.2M sentences (Zh: 6.3M, Ja: 8.7M words)
• Manually translated from fashion item pages
of Chinese EC-site (taobao) into Japanese
• Most of the sentences were translated by
Chinese native speakers (through the
translation company)
– Found many translation flaws in the Japanese
translations
8
[FDEC Corpus]
Mother-tongue Principle
9
http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html
“A translator should, as far as possible, translate
into his own mother tongue or into a language
of which he or she has a mastery equal to that of
his or her mother tongue.”
Recommendation on the Legal Protection of Translators and Translations and the
Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
Source Natives vs. Target Natives
• Pros and cons for source and target native
speakers
• Target natives for translation modification
[Albrecht+, 2009]
10
Source Natives Target Natives
Background knowledge
about the input sentence
High Medium/Low
Fluency and grammatical
correctness of the
output sentence
Medium/Low High
Examples of Translation Flaws
• Insertion
Ja: 随意にに1種類だけ注文
(order one type at at your own will)
• Remaining Chinese character
Ja: 元气あふれるという効果があります
Ref: 元気あふれるという効果があります
• Unnatural
Ja: お手入れの時、電源を切れ、 プラグを抜いてください。
(when cleaning, turn the power off, please pull out the plug)
11
气 気
Hanzi Kanji
Other Translation Flaws
• Omission (not translated)
Zh: 看看有没有其他合适的商品
Ja: 看看有没有その他合適的商品
• Mistranslation
Zh:加湿器功能 (functions of humidifier)
Ja: 除湿器の機能 (functions of dehumidifier)
Zh: 木耳
12
our framework cannot fix these kinds of flaws
frillwood ear mushroomskirt with wood ear mushroom?
3-STEP
PARALLEL CORPUS CLEANING
13
3-steps of Cleaning
1. Fluency Judgement
– detects the translation flaws
2. Edit of Unnatural Sentences
– edit the translated sentences
3. Verification of Edits
– check if the edited translation is better than the
original one
14
Step 1: Fluency Judgement
• Task: judge if the sentences are natural and
grammatically correct
• Only showing the translated (target, Japanese)
sentences
15
e.g. 随意にに1種類だけ注文
Is this sentence natural and
grammatically correct?
No!!
No!!
Step 2: Edit of Unnatural Sentences
• Task: edit the unnatural translated sentences
• only showing the translated sentences, or
show the source sentence as well for the
reference
16
e.g. 随意にに1種類だけ注文(随便拍下一种)
Please modify this
sentence to be natural and
grammatically correct
随意に1種類だけ注文
Nothing to modify!
Step 3: Verification of Edits
• Task: judge if the edited translation is better
than the original one
• This step is important to further improve the
quality of the outcome because the edits are
not necessarily correct
17
e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文
Is the right one more
natural and grammatically
correct than the left one?
Yes!!
Yes!!
CORPUS CLEANING EXPERIMENTS
18
Crowdsourcing Service in the World
19
http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
• Several styles of crowdsourcing tasks such as
Yes/No questions and free writings
• The service is run in Japan; therefore most of
the workers are Japanese
• Not able to select the workers by their abilities
• The workers in our experiments do not
necessarily understand Chinese
– perhaps almost all of them does not
20
Crowdsourcing
http://crowdsourcing.yahoo.co.jp
Step 1: Fluency Judgement
• 358,085 sentences from the FDEC corpus with
length between 10 and 130 characters
• Only Japanese sentences are shown
• Asked 5 different workers for each question
21
# unnatural 5 4 3 2 1 0
# sents.
ratio
13,056
(3.6%)
35,048
(9.8%)
60,200
(16.8%)
83,150
(23.2%)
93,187
(26.0%)
73,444
(20.5%)
30%!
Step 2: Edit of Unnatural Sentences
• 47,420 sentences which were judged as
unnatural by 4 or more workers in Step 1
• Original Chinese sentence is also shown
• Asked 3 different workers for each question
22
# edits 3 2 1 0
# sents.
ratio
3,755
(7.9%)
12,498
(26.4%)
18,289
(38.6%)
12,878
(27.2%)
Step 3: Verification of Edits
• 54,550 edits which were generated in Step 2
• Original Chinese sentence is also shown
• Asked 5 different workers for each question
23
# better 5 4 3 2 1 0
# sents.
ratio
25,053
(45.9%)
16,478
(30.2%)
7,706
(14.1%)
3,338
(6.1%)
1,462
(2.7%)
513
(0.9%)
Translation Experiment
• Dataset: whole FDEC Corpus
– Cleaned: verified to be better by the majority
• Decoder: KyotoEBMT [Richardson+, 2014]
• Evaluation: BLEU
24
# sentences Original Cleaned
Train 1,220,597 1,256,908
Dev 11,186 11,489
Test 11,200 11,495
Experimental Results
• Corpus cleaning contributes to improve the
translation quality!
• Cleaning the Dev and Test sets has bad effect
on translation quality…
25
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
Natural, but Incorrect/Unequal
• Reviewed 100 edits which are judged to be
more natural than the original sentence by 5
workers
• Found 3 types of inequalities
1. deletion of symbols (8 cases)
2. omission (13 cases)
3. mistranslation (5 cases)
see the proceedings for detailed examples
26
Experimental Results
• The inequalities have bad effect on the
automatic evaluation scores because they
suppose the content of the input and output
are strictly equal
27
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
Crowdsourcing Cost
• Cost for cleaning 6.8M words used in the
experiments
28
Professional* Our Work
Fee 40 million JPY 2.6 million JPY
Time 1700 days 186 hours
* These values are estimated from
http://www.editage.com
Conclusion and Future Work
• Proposed a framework of cleaning existing
parallel corpora efficiently and cheaply
– 3-step monolingual crowdsourcing
– Improved the fluency of the sentences
• Future work
– How to reduce the inequalities of the edits?
– How to improve the correctness of the translation
by monolingual workers?
29
30

Contenu connexe

En vedette

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsMatthew Lease
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationToshiaki Nakazawa
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Toshiaki Nakazawa
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the LoopLora Aroyo
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみたToshiaki Nakazawa
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析Toshiaki Nakazawa
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT descriptionToshiaki Nakazawa
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep LearningYuta Kikuchi
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017Toshiaki Nakazawa
 

En vedette (12)

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to Ethics
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine Translation
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the Loop
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみた
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT description
 
NLP2017 NMT Tutorial
NLP2017 NMT TutorialNLP2017 NMT Tutorial
NLP2017 NMT Tutorial
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers

Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.pptALFAFAAMIN
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationFederal Communicators Network
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Ron Martinez
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Antonio Toral
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Ron Martinez
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your SkillsetEliana Lobo
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationAashna Phanda
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...Lifeng (Aaron) Han
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorMark Matsuno
 
MT and Post Editing in master's level translation education
MT and Post Editing in master's level translation education MT and Post Editing in master's level translation education
MT and Post Editing in master's level translation education Jakub Absolon
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers (20)

1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.ppt
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 Presentation
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your Skillset
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translation
 
Modality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine TranslationModality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine Translation
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
 
Machine translator Introduction
Machine translator IntroductionMachine translator Introduction
Machine translator Introduction
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editor
 
MT and Post Editing in master's level translation education
MT and Post Editing in master's level translation education MT and Post Editing in master's level translation education
MT and Post Editing in master's level translation education
 

Dernier

Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxpriyankatabhane
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clonechaudhary charan shingh university
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxPayal Shrivastava
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxpriyankatabhane
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxpriyankatabhane
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and momentdonamiaquintan2
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosZachary Labe
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and AnnovaMansi Rastogi
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsDobusch Leonhard
 
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDivyaK787011
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterHanHyoKim
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsCharlene Llagas
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxJosielynTars
 
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11GelineAvendao
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 

Dernier (20)

AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptx
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptx
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and moment
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenarios
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annova
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and Functions
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
How we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptxHow we decide powerpoint presentation.pptx
How we decide powerpoint presentation.pptx
 
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11
WEEK 4 PHYSICAL SCIENCE QUARTER 3 FOR G11
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 

3-step parallel corpus cleaning using monolingual crowd workers

  • 1. 3-step Parallel Corpus Cleaning using Monolingual Crowd Workers Toshiaki Nakazawa, Sadao Kurohashi (Kyoto University) Hayato Kobayashi, Hiroki Ishikawa Manabu Sassano (Yahoo Japan Corporation) 20/05/2015@PACLING2015
  • 2. Parallel Corpora • Essential resources for almost all MT systems • The quality and quantity greatly affect the translation quality • Can be automatically constructed from existing resources – Europarl, patent families, Wikipedia… • Need to manually construct it for domains which do not have enough existing resources 2
  • 3. Quality of Parallel Corpus • Translation flaws are inevitable even thought the professionals translate – Homer nods (弘法にも筆の誤り in Japanese) • The number of flaws might be reduced by reviewing the whole corpus, but impossible – The size of the parallel corpus is usually very big – Very costly if we ask professionals to modify 3
  • 4. This Work • Detect and edit the translation flaws in the existing manually-translated parallel corpus in effective and cheap way • Use crowdsourcing in 3-steps 1. Fluency Judgement 2. Edit of Unnatural Sentences 3. Verification of Edits • The workers can be monolingual 4
  • 5. Outline • Motivation • Brief introduction of collaborative research between Yahoo Japan and Kyoto University • 3-step parallel corpus cleaning • Experiments – Parallel corpus cleaning – Translation • Conclusion and Future Work 5
  • 7. Collaborative Research Between Yahoo Japan and Kyoto University • Goal: Improve the Chinese-to-Japanese translation for E-commerce site • Task: – Develop a corpus-based MT system – Construct a parallel corpus of EC-site, especially fashion domain 7
  • 8. Fashion-domain EC-site Parallel Corpus • 1.2M sentences (Zh: 6.3M, Ja: 8.7M words) • Manually translated from fashion item pages of Chinese EC-site (taobao) into Japanese • Most of the sentences were translated by Chinese native speakers (through the translation company) – Found many translation flaws in the Japanese translations 8 [FDEC Corpus]
  • 9. Mother-tongue Principle 9 http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html “A translator should, as far as possible, translate into his own mother tongue or into a language of which he or she has a mastery equal to that of his or her mother tongue.” Recommendation on the Legal Protection of Translators and Translations and the Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
  • 10. Source Natives vs. Target Natives • Pros and cons for source and target native speakers • Target natives for translation modification [Albrecht+, 2009] 10 Source Natives Target Natives Background knowledge about the input sentence High Medium/Low Fluency and grammatical correctness of the output sentence Medium/Low High
  • 11. Examples of Translation Flaws • Insertion Ja: 随意にに1種類だけ注文 (order one type at at your own will) • Remaining Chinese character Ja: 元气あふれるという効果があります Ref: 元気あふれるという効果があります • Unnatural Ja: お手入れの時、電源を切れ、 プラグを抜いてください。 (when cleaning, turn the power off, please pull out the plug) 11 气 気 Hanzi Kanji
  • 12. Other Translation Flaws • Omission (not translated) Zh: 看看有没有其他合适的商品 Ja: 看看有没有その他合適的商品 • Mistranslation Zh:加湿器功能 (functions of humidifier) Ja: 除湿器の機能 (functions of dehumidifier) Zh: 木耳 12 our framework cannot fix these kinds of flaws frillwood ear mushroomskirt with wood ear mushroom?
  • 14. 3-steps of Cleaning 1. Fluency Judgement – detects the translation flaws 2. Edit of Unnatural Sentences – edit the translated sentences 3. Verification of Edits – check if the edited translation is better than the original one 14
  • 15. Step 1: Fluency Judgement • Task: judge if the sentences are natural and grammatically correct • Only showing the translated (target, Japanese) sentences 15 e.g. 随意にに1種類だけ注文 Is this sentence natural and grammatically correct? No!! No!!
  • 16. Step 2: Edit of Unnatural Sentences • Task: edit the unnatural translated sentences • only showing the translated sentences, or show the source sentence as well for the reference 16 e.g. 随意にに1種類だけ注文(随便拍下一种) Please modify this sentence to be natural and grammatically correct 随意に1種類だけ注文 Nothing to modify!
  • 17. Step 3: Verification of Edits • Task: judge if the edited translation is better than the original one • This step is important to further improve the quality of the outcome because the edits are not necessarily correct 17 e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文 Is the right one more natural and grammatically correct than the left one? Yes!! Yes!!
  • 19. Crowdsourcing Service in the World 19 http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
  • 20. • Several styles of crowdsourcing tasks such as Yes/No questions and free writings • The service is run in Japan; therefore most of the workers are Japanese • Not able to select the workers by their abilities • The workers in our experiments do not necessarily understand Chinese – perhaps almost all of them does not 20 Crowdsourcing http://crowdsourcing.yahoo.co.jp
  • 21. Step 1: Fluency Judgement • 358,085 sentences from the FDEC corpus with length between 10 and 130 characters • Only Japanese sentences are shown • Asked 5 different workers for each question 21 # unnatural 5 4 3 2 1 0 # sents. ratio 13,056 (3.6%) 35,048 (9.8%) 60,200 (16.8%) 83,150 (23.2%) 93,187 (26.0%) 73,444 (20.5%) 30%!
  • 22. Step 2: Edit of Unnatural Sentences • 47,420 sentences which were judged as unnatural by 4 or more workers in Step 1 • Original Chinese sentence is also shown • Asked 3 different workers for each question 22 # edits 3 2 1 0 # sents. ratio 3,755 (7.9%) 12,498 (26.4%) 18,289 (38.6%) 12,878 (27.2%)
  • 23. Step 3: Verification of Edits • 54,550 edits which were generated in Step 2 • Original Chinese sentence is also shown • Asked 5 different workers for each question 23 # better 5 4 3 2 1 0 # sents. ratio 25,053 (45.9%) 16,478 (30.2%) 7,706 (14.1%) 3,338 (6.1%) 1,462 (2.7%) 513 (0.9%)
  • 24. Translation Experiment • Dataset: whole FDEC Corpus – Cleaned: verified to be better by the majority • Decoder: KyotoEBMT [Richardson+, 2014] • Evaluation: BLEU 24 # sentences Original Cleaned Train 1,220,597 1,256,908 Dev 11,186 11,489 Test 11,200 11,495
  • 25. Experimental Results • Corpus cleaning contributes to improve the translation quality! • Cleaning the Dev and Test sets has bad effect on translation quality… 25 Train Original Cleaned Cleaned Cleaned Dev Original Original Cleaned Cleaned Test Original Original Original Cleaned BLEU 21.39 21.69 21.34 21.12
  • 26. Natural, but Incorrect/Unequal • Reviewed 100 edits which are judged to be more natural than the original sentence by 5 workers • Found 3 types of inequalities 1. deletion of symbols (8 cases) 2. omission (13 cases) 3. mistranslation (5 cases) see the proceedings for detailed examples 26
  • 27. Experimental Results • The inequalities have bad effect on the automatic evaluation scores because they suppose the content of the input and output are strictly equal 27 Train Original Cleaned Cleaned Cleaned Dev Original Original Cleaned Cleaned Test Original Original Original Cleaned BLEU 21.39 21.69 21.34 21.12
  • 28. Crowdsourcing Cost • Cost for cleaning 6.8M words used in the experiments 28 Professional* Our Work Fee 40 million JPY 2.6 million JPY Time 1700 days 186 hours * These values are estimated from http://www.editage.com
  • 29. Conclusion and Future Work • Proposed a framework of cleaning existing parallel corpora efficiently and cheaply – 3-step monolingual crowdsourcing – Improved the fluency of the sentences • Future work – How to reduce the inequalities of the edits? – How to improve the correctness of the translation by monolingual workers? 29
  • 30. 30

Notes de l'éditeur

  1. proverb
  2. imperative
  3. mù’ěr
  4. The bilingual workers would edit the translations more precisely with the reference source sentence, and monolingual workers just ignore them