State of the Machine Translation by Intento (November 2017)


About
• At Intento, we want to make Machine Intelligence
services easy to discover, choose and use.
• So far, evaluation is the most problematic part: to
compare different services, one needs to sign a lot of
contracts and integrate a lot of APIs.
• We deliver this overview report for FREE. To get the
full report or evaluate on your own dataset, contact us.
• Also, check out our Natural Language Understanding
Benchmark. NLU may help you automate workflows
beyond automated translation.
© Intento, Inc. November 2017

Overview
• 11 Machine Translation Services
• 35 Language Pairs
• Sections: Translation Quality, Language Coverage, Developer Experience, Miscellaneous
Get the full version of this report

Changes since July 2017
• +2 vendors: DeepL (beta), SAP (beta)
• +21 language pairs
• Detailed performance analysis
• Developer experience comparison

Machine Translation Services* Compared
• Baidu Translate API
• DeepL API (beta)
• Google Cloud Translation API
• GTCom YeeCloud MT
• IBM Watson Language Translator
• Microsoft Translator Text API
• PROMT Cloud API
• SAP Translation Hub (beta)
• SDL Cloud Machine Translation
• Systran REST Translation API
• Yandex Translate API
* We have evaluated general purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide
web-based, on-premise or custom MT engines, which may differ on all aspects from what we’ve evaluated.

Translation Quality
Evaluation Methodology
Overall Performance
Available MT Quality
Price vs. Performance

Evaluation methodology (I)
• Translation quality is evaluated by computing the LEPOR
score between reference translations and the MT output
(Slide 9).
• Currently, our goal is to evaluate the performance of
translation between the most popular languages (Slide
10).
• We use public datasets from StatMT/WMT and
CASMACAT News Commentary (Slide 11).
• We have performed a LEPOR metric convergence analysis
to identify the minimal viable number of segments in the
dataset. See Slide 12 for some details.

Evaluation methodology (II)
• We consider MT service A more performant than B for the
language pair C if:
- the mean LEPOR score of A is greater than that of B for
the pair C, and
- the lower bound of the 95% LEPOR confidence interval of A
is greater than the upper bound of the LEPOR confidence
interval of B for the pair C. See Slide 12 for an example.
• Different language pairs (and different datasets) impose
different translation complexity. To compare the overall
MT performance of different services, we regularise
LEPOR scores across all language pairs (see Appendix A
for more details).
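The comparison rule above can be sketched in Python. This is a minimal illustration, not the report's implementation: the report does not say how the confidence interval is computed, so the sketch assumes a normal approximation over per-segment scores, and the names `mean_ci95` and `more_performant` are hypothetical.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) for a list of per-segment scores.

    Assumption: a normal-approximation 95% interval around the mean;
    the report does not state which interval method it uses.
    """
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def more_performant(scores_a, scores_b):
    """True if A beats B on the same language pair under both conditions."""
    mean_a, low_a, _ = mean_ci95(scores_a)
    mean_b, _, high_b = mean_ci95(scores_b)
    # Condition 1: higher mean. Condition 2: the intervals do not overlap.
    return mean_a > mean_b and low_a > high_b
```

Under this rule, two services whose intervals overlap are treated as tied for that pair, even when their means differ.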

LEPOR score
• LEPOR: an automatic machine translation evaluation metric
that considers an enhanced Length Penalty, n-gram
Position difference Penalty and Recall
• In our evaluation, we used hLEPORA v.3.1:
- best metric from the ACL-WMT 2013 contest
- https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
- https://github.com/aaronlifenghan/aaron-project-lepor
• Like BLEU, but better

Language Pairs
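We focus on the en-P1, P1-en and P1-P1 (partially) pairs.
Language groups by web popularity*:
P1 - ≥ 2.0% websites
P2 - 0.5%-2% websites
P3 - 0.1-0.3% websites
P4 - <0.1% websites
* https://w3techs.com/technologies/overview/content_language/all
Evaluated pairs (rows: source language, columns: target language):
    en  ru  ja  de  es  fr  pt  it  zh  cs  tr  fi  ro
en   -   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
ru   ✓   -   .   ✓   ✓   ✓   ✓   .   .   .   .   .   .
ja   ✓   .   -   .   .   ✓   .   .   .   .   .   .   .
de   ✓   .   ✓   -   .   .   .   .   .   .   .   .   .
es   ✓   .   .   .   -   .   .   .   ✓   .   .   .   .
fr   ✓   ✓   .   .   ✓   -   .   .   .   .   .   .   .
pt   ✓   .   .   .   .   .   -   .   .   .   .   .   .
it   ✓   .   .   .   .   .   ✓   -   .   .   .   .   .
zh   ✓   .   .   .   .   .   .   ✓   -   .   .   .   .
cs   ✓   .   .   .   .   .   .   .   .   -   .   .   .
tr   ✓   .   .   .   .   .   .   .   .   .   -   .   .
fi   ✓   .   .   .   .   .   .   .   .   .   .   -   .
ro   ✓   .   .   .   .   .   .   .   .   .   .   .   -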

Datasets
• WMT-2013 (translation task, news domain)
- en-es, es-en
- fr-en, en-fr
- cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
- zh-en, en-zh
• NewsCommentary-2011
- en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it

LEPOR Convergence
We used 1440 - 3000 sentences per language pair. In all cases it’s clear that
the metric stabilises and adding more from the same domain won’t change the
outcome.
[Figure: regularised hLEPOR scores vs. number of sentences. Left panel: aggregated across all language pairs (aggregated mean with confidence interval). Right panel: examples for individual language pairs.]
Detailed data on each language pair is provided in the full report
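A convergence check of this kind can be sketched as follows. This is only an illustration under assumptions: the report does not describe its exact procedure, so the sketch tracks the running mean and a normal-approximation 95% half-width as more sentences are added; `running_mean_ci` and `step` are hypothetical names.

```python
import math
import statistics


def running_mean_ci(scores, step=100):
    """Return (n, mean, half_width) for growing prefixes of the score list.

    Assumption: scores are per-sentence metric values in evaluation order,
    and the 95% half-width uses a normal approximation.
    """
    points = []
    for n in range(step, len(scores) + 1, step):
        sample = scores[:n]
        mean = statistics.fmean(sample)
        half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
        points.append((n, mean, half_width))
    return points
```

The metric can be considered stable once the half-width stops shrinking noticeably and the mean stays inside the previous interval as further sentences are added.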

Overall Performance
35 language pairs, 1440-3000 sentences per pair
[Chart: overall performance per provider, with variance among language pairs. Annotated levels: >70% and <40%.]

Available MT Quality
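Number of top-performing MT providers per language pair (rows: source language, columns: target language):
    en  ru  ja  de  es  fr  pt  it  zh  cs  tr  fi  ro
en   -   4   7   3   2   7   1   2   6   2   4   5   2
ru   2   -   .   3   3   4   4   .   .   .   .   .   .
ja   1   .   -   .   .   4   .   .   .   .   .   .   .
de   8   .   2   -   .   .   .   .   .   .   .   .   .
es   7   .   .   .   -   .   .   .   4   .   .   .   .
fr   6   1   .   .   8   -   .   .   .   .   .   .   .
pt   4   .   .   .   .   .   -   .   .   .   .   .   .
it   7   .   .   .   .   .   2   -   .   .   .   .   .
zh   6   .   .   .   .   .   .   4   -   .   .   .   .
cs   2   .   .   .   .   .   .   .   .   -   .   .   .
tr   2   .   .   .   .   .   .   .   .   .   -   .   .
fi   1   .   .   .   .   .   .   .   .   .   .   -   .
ro   4   .   .   .   .   .   .   .   .   .   .   .   -
Legend:
• Cell shading: maximal achieved hLEPOR score (scale from 30% to 70%)
• Cell number: no. of top-performing MT providers
• Cell marker: minimal price for this quality, per 1M characters:
- $$$ ≥$20
- $$ $10-15
- $ <$10
[Chart: per-cell shading and price markers are shown in the original slide.]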

Sample pair analysis: en-pt
LEPOR score   Providers           Price range (per 1M characters)
77%           Google              $20
72%           Yandex, Microsoft   $4.5-$15
70%           Baidu, SDL, IBM     $8-$21
62%           Systran, PROMT      $3-$8
BEST QUALITY: Google
BEST PRICE: PROMT
PRICE & QUALITY: Microsoft
All 35 pairs are available in the full report

Price vs. Performance
As of November 2017
[Chart: providers plotted by affordability and performance. Region labels: COST-EFFECTIVE, ACCURATE. Callouts: FREE (BETA), NOT SET YET.]
• Performance: regularised hLEPOR score aggregated across all language pairs in the dataset
• Affordability = 1/price, using public volume-based pricing tiers
Legend
• performance range:
- regularised average
- max across all pairs
- min across all pairs
• price range

Language Coverage
Supported and Unique per Provider
Coverage by Language Popularity

Language coverage
Supported language pairs per provider (total):
Google        10 712
Yandex         8 556
Microsoft      3 422
Baidu            756
Systran          111
SDL MT           104
PROMT             88
SAP               47
DeepL             42
IBM Watson        20
GTCom              6
Unique language pairs are those supported exclusively by one provider.
[Chart, log scale from 1 to 10 000: total and unique supported language pairs per provider. Unique counts shown: 2, 1, 2, 1, 54, 920, 1 246, 3 202.]
Detailed data on the supported languages is provided in the full report

Language popularity
Language groups by web popularity*:
P1: en, ru, ja, de, es, fr, pt, it, zh
P2: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
P3: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
P4 (<0.1% websites): hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, …
A total of 29070 pairs are possible; 12989 are supported across all providers.
* https://w3techs.com/technologies/overview/content_language/all
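The totals quoted on this slide can be sanity-checked with a little arithmetic. The sketch below assumes that a "pair" is an ordered (source, target) pair of distinct languages; the figure of 171 languages is an inference from 171 × 170 = 29070 and is not stated in the report.

```python
def ordered_pairs(n_languages):
    """Number of ordered (source, target) pairs of distinct languages."""
    return n_languages * (n_languages - 1)


possible = 29070   # from the slide
supported = 12989  # from the slide

# Assumption: 171 languages in total (inferred, not stated in the report).
assert ordered_pairs(171) == possible

coverage = supported / possible
print(f"{coverage:.1%}")  # prints 44.7%
```

The resulting share is in line with the overall coverage figure of about 45% given in the report.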

Language coverage
Supported language pairs by popularity
45%* of possible language pairs are supported.
      P1    P2    P3    P4
P1  100%  100%   94%   63%
P2  100%  100%   94%   63%
P3   94%   94%   88%   59%
P4   63%   63%   59%   31%
* up from 44% in July 2017 as we better distinguish variations of the Chinese language
Language coverage by service provider
[Charts: language coverage for each of the following providers.]
• Google Cloud Translation API
• Microsoft Translator Text API
• Yandex Translate API
• Systran REST Translation API
• SDL Cloud Machine Translation
• PROMT Cloud API
• IBM Watson Language Translation
• Baidu Translate API
• GTCom YeeCloud API
• DeepL API
• SAP Translation Hub
Detailed data on the supported languages is provided in the full report

Developer Experience (DX)
Evaluation Methodology
DX Charts

Evaluation methodology (I)
Here we evaluate overall service organisation from the following angles:
• Product - Support of Machine Translation features desired for using the API in various MT scenarios
• Design - Overall API design and technical convenience
• Documentation - How well the API is documented
• Onboarding - How easy it is to integrate and start using the API
• Commercial - Flexibility of the commercial terms
• Implementation - Important low-level features of the API
• Maintenance - Convenience of getting information about API changes for ongoing support
• Reliability - Various technical issues we’ve encountered
Some references:
• http://talks.kinlane.com/apistrat/api101/index.html#/14
• https://mathieu.fenniak.net/the-api-checklist/
• https://www.slideshare.net/jmusser/ten-reasons-developershateyourapi
• https://restfulapi.net/richardson-maturity-model/
• https://github.com/shieldfy/API-Security-Checklist
• https://nordicapis.com/why-api-developer-experience-matters-more-than-ever/
• http://www.drdobbs.com/windows/measuring-api-usability/184405654?pgno=1

Evaluation methodology (II)
Product
• Translation domains
• Translation engines
• Language autodetect
• Glossaries
• TM Support
• Custom engines
• Bulk mode
• Formatted text
• XLIFF support
Design
• Authentication
• Use of SSL
• Quota info
• Domain info
• Balance info
• Self-sufficient
• Intuitive
• Versioning
• Bulk mode
• Task-invocation ratio
• I/O Structure
Documentation
• List of endpoints
• User documentation
• Supported languages
• Quotas
• Response codes
• Error codes
• Error messages
• API explorer
• API console
• Number of docs
• HTML doc
Onboarding
• Self-registration
• Self-issued keys
• Self-payment
• Free / Trial plan
• Sandbox
• Test console
• Github repo
• Code libraries
• SDK / PDK
• Sample code
• Direct support
• Ticket system
• Self-support
• Tutorial
• FAQ / KB
• Starter package
Commercial
• Public pricing
• Pay as you go
• Post-paid
• Volume discounts
• Payment systems
• Billing history
Implementation
• API spec
• Data compression
• Supports JSON
• Negotiable content
• Unicode support
• Error codes
• Error messages
Maintenance
• News source
• Subscription news
• Versioning
• Changelog
• Release notes
• Roadmap
• Status dashboard
• Developer dashboard
• Exportable logs
Reliability
• Uptime
• Sporadic errors
• Bugs
• Performance issues
• Status dashboard
• Outage alerts

Developer experience by service provider
Available in the full report

Miscellaneous
API change frequency

Change frequency
API & Documentation changes since July 2017:
SDL Language Cloud 31
Google Translate API 14
IBM Watson Translator 10
Microsoft Translator API 7
SAP Translation Hub 6

Detailed version of this report
• We give this overview version for free.
• The full evaluation report contains:
- Detailed best-deal analysis for each of the 35 language
pairs
- Developer experience analysis for each of the 11 MT
providers
• Also, by ordering the full report you support our
ongoing evaluation of Cloud MT.
• To get the full report, reach us at hello@inten.to

Intento Service Platform
• Discover the best service providers for your AI task
• Evaluate performance on your own data at a fraction of the potential cost saving
• Access any provider with no effort using our Single API

Intento
https://inten.to
Konstantin Savenkov
CEO Intento, Inc.

<ks@inten.to>

Appendix A
Overall performance of the MT services across many language
pairs is computed in the following way:
1. [Standardisation] We compute the mean language-standardised
LEPOR score (z-score) for each provider.
2. [Scale adjustment] We restore the original scale by
multiplying the mean z-score of each MT provider by the global
LEPOR standard deviation and adding the global mean LEPOR score.
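The two steps can be sketched in Python as follows. This is one reading of the procedure, not the report's code: it assumes that "language-standardised" means each score is standardised within its language pair (using the mean and population standard deviation over providers for that pair) and that the global statistics are taken over all provider/pair scores. The function name and the input layout are hypothetical.

```python
import statistics
from collections import defaultdict


def regularised_scores(scores):
    """scores maps (provider, language_pair) to a mean LEPOR score."""
    by_pair = defaultdict(list)
    for (_, pair), value in scores.items():
        by_pair[pair].append(value)
    pair_stats = {
        pair: (statistics.fmean(values), statistics.pstdev(values))
        for pair, values in by_pair.items()
    }

    # Step 1: z-score within each language pair, then average per provider.
    z_by_provider = defaultdict(list)
    for (provider, pair), value in scores.items():
        mean, std = pair_stats[pair]
        z_by_provider[provider].append((value - mean) / std if std > 0 else 0.0)

    # Step 2: restore the original scale with the global mean and deviation.
    all_values = list(scores.values())
    global_mean = statistics.fmean(all_values)
    global_std = statistics.pstdev(all_values)
    return {
        provider: statistics.fmean(z) * global_std + global_mean
        for provider, z in z_by_provider.items()
    }
```

Standardising per pair first keeps an easy pair, where every provider scores high, from dominating the aggregate of a provider that happens to support it.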
