>> Human parity and automatic translation: is that even possible? I can tell you that machine translation surpassed this human's ability to translate between Estonian and Korean a long, long time ago. But it turns out that surpassing, or even meeting, the quality of a professional human translator is quite a challenge, and has really been the holy grail of machine translation research for the last 60 years or so.
>> Until now.
>> Until today, or maybe a couple of weeks ago, that's right. Today we have Hany Hassan Awadalla and Christian Federmann reporting on the project to achieve human parity. That's parity with the kind of human who is a professional translator. Hany and Christian both work in the Microsoft Translator team. That's the team that builds the translation service that powers translations in Microsoft products, and in many third-party products.
Hany is a principal research scientist in the team, focused on advancing the foundational technologies that form the core of the translation engine, the piece that we call the decoder. His interests are in the areas of machine learning and deep learning applied to machine translation, natural language processing, speech translation, and semi-supervised machine learning. He completed his PhD at Dublin City University after finishing his master's at Cairo University. He started working at IBM in 1996, where he was also a member of the IBM team working on machine translation. He has been a member of the Translator team since he joined Microsoft in 2010. He's responsible for the major breakthroughs that enabled us to ship speech translation in Skype in 2014, namely the technology we call "TrueText," turning what a person said into what the person actually meant to say, and making machine translation friendlier to social media and colloquial utterances.
Christian is a senior program manager in the Translator team. He owns evaluation methodology, and he owns orchestrating the actual evaluations with human judges. He's the gatekeeper for quality for every release and every update the team makes to the production systems. He defined what human parity means for translation in concrete, provable terms, and was responsible for gathering results in a meaningful and fully traceable way, because you can assume that making such a claim invites a lot of scrutiny from other people working in the field. Christian finished his PhD in machine translation at Saarland University in Saarbrücken in 2013. He has been leading MT evaluations since that time, including defining and managing the evaluation for industry-wide MT competitions, namely the Workshop on Statistical Machine Translation and the International Workshop on Spoken Language Translation. Christian is also the author of the open-source tool Appraise, which is used for running all of these evaluations. You can ask him after the talk how to make use of this tool and how to contribute. Now let's start: Found in Translation, Achieving Human Parity on Chinese to English News Translation.
>> Thanks Keith. Good morning everyone, thanks for being here today. Christian and I would like to introduce you to our journey trying to achieve human parity on Chinese to English news translation. This was a wonderful collaboration between our Microsoft team in Redmond and the MSRA Deep Learning and NLC teams in China. The paper is on arXiv, and many outstanding researchers collaborated on that effort. So, first, the internal project name, or code name, for that project was "Project Babel," and yeah, you can imagine how Babel relates to achieving such high quality of translation.
But we started at the beginning, trying to answer two main questions. The first question: is machine translation quality now high enough, or approximating human parity? Or how far are we? The second, which is the more fundamental question: how can we measure that? In other fields like speech recognition, for example, it is easier to measure human parity; in translation, it is much, much harder to measure the quality by anything other than human evaluation.
We will cover in the talk how we achieved human parity and how we measured it. There were claims around 2016 that the new wave of neural machine translation was approximating human parity. Now, we can be sure that we are really achieving that human parity. I like that sketchy timeline for machine translation from Chris Manning, 2016. It's very sketchy, as you can imagine, because it is not actually based on real numbers on the MT quality side.
But machine translation has been a dream for decades. Starting in the '40s and '50s, people started to look at machine translation as a problem that can be solved by computers. Around the '60s, people stopped looking at it due to some skepticism about whether computers can do reasoning. Still, as of today, we don't do any reasoning, but we're getting better at translation.
So, around the early '90s, there was a breakthrough in translation when statistical machine translation was introduced, based on information theory. From that time, machine translation gained a lot of momentum, mainly with the introduction of phrase-based translation in the middle of the last decade. That enabled us and all the other players in online machine translation, Google and others, to ship online translation systems that are really scalable and of good quality. Eventually, the research community started to add syntax-based approaches to phrase-based translation. But it was still quite noticeable that it is machine translation, not human translation.
Then around 2016, the new wave of neural machine translation started to become mainstream, and at that time we started seeing good momentum for neural machine translation. As of today, we can modify Chris Manning's chart a little bit by adding our milestone of achieving human parity, and we still think there are a lot of remaining problems to be solved in machine translation to reach better quality. Okay. So, we will have an overview of the current state of the art in neural machine translation, then we'll go through our contributions in this work. Sorry, I keep reading the transcription for debugging purposes, but it looks okay.
So, the main approach for neural machine translation is called the encoder-decoder. In the old days, when we started doing phrase-based translation, we used to split or chunk the input sentence into pieces. These pieces were called phrases, and we would translate each phrase on its own, and then eventually use language models to put the pieces together. As you can imagine, this introduces a lot of disfluency into the translation, and that was the main characteristic of phrase-based translation.
The encoder-decoder framework for neural machine translation was first proposed around 2014 by two different groups. It mainly tries to achieve a full representation of the input sentence and use that representation for decoding the output word by word. For the encoder, a lot of approaches have been proposed, but the mainstream components all follow the encoder-decoder architecture. As you can imagine, that is at least more appealing from a human cognition point of view, because when you're trying to understand or translate a sentence, you really look at the full sentence, trying to understand the meaning and get a representation from there. That is the actual representation of the encoder.
The first mainstream neural machine translation systems were all based on the encoder-decoder framework and tried to represent the input with a recurrent network. In a recurrent network, you go through the sentence time step by time step, trying to learn a full representation of the sentence. Then, by the end of the sentence, you have a state that you can use to generate the decoding step by step. It was very surprising to the whole machine translation community that this worked at the beginning, and yeah, it's very simple; it's just a bilingual language model, a simple end-to-end model that produces translations. It works quite well for shorter sentences, but not for longer ones. For the longer sentences- I can imagine that you are not seeing the title of the slide; I told you the transcription would get in the way. What do we actually do? Okay.
>> You can move it to the bottom and-
>> It's okay. So, mainly- how do you move it from here?
>> [inaudible] Left and right, top right.
>> [inaudible].
>> Yes. Okay. So, the main breakthrough in neural machine translation was the introduction of the attention-based model. Attention-based models are like the sequence-to-sequence encoder-decoder model, but with attention, so they can capture longer-range dependencies. Instead of depending on the last state to represent the whole sentence, we have a soft distribution over all possible states. Mainly, we can query the similarity between the current word we are trying to translate and all the other words in the encoder sequence, and we end up with soft weights that we can normalize into a soft distribution over all possible states. That gives us a flexible representation of the input sentence based on which word you are translating. That is really what enabled neural machine translation to work across all the different approaches.
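To make the soft-distribution idea concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single decoder query; the array shapes, names, and toy values are illustrative only and not taken from any production system.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Scaled dot-product attention for one decoder query.

    query:  (d,)   representation of the word currently being translated
    keys:   (n, d) encoder states, one per source word
    values: (n, d) usually the same encoder states
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # similarity to every source word
    weights = softmax(scores)            # soft distribution over source positions
    return weights @ values, weights     # weighted summary of the source, plus the weights

# Toy example: 5 source positions, dimension 8.
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(5, 8))
dec_query = rng.normal(size=(8,))
context, weights = attention(dec_query, enc_states, enc_states)
print(weights.round(3), weights.sum())   # the weights sum to 1.0
```

Depending on which word is being translated, the query changes and the soft weights shift toward different source positions, which is exactly the flexibility described above.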
So, the mainstream approach now for sequence-to-sequence modeling is the RNN-based approach. It is a very efficient representation; our products are actually based on it currently, and it gets a lot of benefit from attention for long-range dependencies. Though, it has a couple of limitations: it is very hard to parallelize, and it has to go through sequential computation at each time step, which also makes it harder to capture long-term dependencies. Some other approaches started to appear to try to overcome such limitations, mainly convolutional approaches, from Facebook and from Google with ByteNet and WaveNet. WaveNet is again an encoder-decoder-style architecture that is now the new state of the art for text-to-speech workloads.
Moreover, a newer approach started to depend only on attention, which is the "Attention Is All You Need" model, the Transformer, proposed by the Google Brain team. Here, we just remove the whole constraint of sequential operation and move everything to attention. Instead of having a recurrent network to model the encoder and decoder, we use what's called self-attention. In that, each word can attend to the other words in the sequence. In this nice visualization of the attention model, you can see the encoder going step by step, calculating the attention over the whole sequence on the source side. Then at decoding time, it starts calculating the attention between the source and the target as well as between different target positions, and keeps generating in that way. You use only the last layer of the encoder, and the decoder attends only to its left side. As you can see here, it is very easy to parallelize all those operations because there is no sequential dependency; it can be just two matrices that, when multiplied together, speed up the training process massively.
So, let us take a closer look at the Transformer model, because that was the main component used in the human parity project. Starting from the encoder, we have input embeddings, and since the Transformer model doesn't know anything about position, we add positional encoding to that as well. Then, step by step, we have layers of multi-head attention over all positions of the input, along with residual connections and normalization, followed by a feed-forward network. All of this represents one single layer, and we have many such layers. On the decoder side, we follow exactly the same architecture, though here it is masked, because we don't want to look at the right side, only the left side. Then we have multi-head attention between the source and the target, followed by a feed-forward network. All of this, once again, is one layer only, and you have something like six layers of those. The Transformer model now represents the state of the art in neural machine translation, and we are using it in most of the systems that we are reporting today.
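As a rough illustration of the layer structure just described, here is a small NumPy sketch of a Transformer-style encoder stack (sinusoidal positions, multi-head self-attention, residual connections, layer normalization, and a feed-forward network). The dimensions, random weights, and head counts are toy values for illustration only, not the configuration used in the project.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def positional_encoding(n, d):
    # Sinusoidal positions, since the model itself is order-agnostic.
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def multi_head_self_attention(x, heads, rng):
    # Fresh random projections per call: a sketch of the shapes, not trained weights.
    n, d = x.shape
    dh = d // heads
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, dh)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        att = softmax(q @ k.T / np.sqrt(dh))   # every position attends to every other position
        outputs.append(att @ v)
    Wo = rng.normal(scale=d ** -0.5, size=(d, d))
    return np.concatenate(outputs, axis=-1) @ Wo

def encoder_layer(x, heads, d_ff, rng):
    # Sub-layer 1: multi-head self-attention with residual connection + layer norm.
    x = layer_norm(x + multi_head_self_attention(x, heads, rng))
    # Sub-layer 2: position-wise feed-forward network with residual connection + layer norm.
    d = x.shape[-1]
    W1 = rng.normal(scale=d ** -0.5, size=(d, d_ff))
    W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d))
    return layer_norm(x + np.maximum(x @ W1, 0.0) @ W2)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 16))              # 7 embedded input tokens, dimension 16
x = tokens + positional_encoding(7, 16)
for _ in range(6):                             # a stack of six layers, as mentioned in the talk
    x = encoder_layer(x, heads=4, d_ff=64, rng=rng)
print(x.shape)
```

The decoder side follows the same pattern, with the self-attention masked to the left and an extra attention sub-layer over the encoder output.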
So, we will cover the areas that we explored during that project. You can imagine that we explored a lot of other areas, but here we are covering the ones that worked; a lot of the approaches we tried didn't actually work. So, we explored two main directions here. First, how we can exploit the duality of the machine translation problem. The duality of the machine translation problem means that we have two different problems that can really help each other. The first is translating from, for example, Chinese to English; the other is translating from English to Chinese. If we can get the systems to help each other refine their translations, that is the duality of the translation problem. We explore that duality in many components and areas of our system, as we'll show later. To give some contrast between duality and back translation, which is widely used in the research community (and we still use back translation as well):
In back translation, you start with bilingual training data and train a reverse system, from English to Chinese for example, use monolingual data to generate synthetic training data, and combine that with the original bilingual data to train your system. In contrast to the dual learning approach, back translation has some limitations. First, whatever back translation produces gets propagated into your system, and you don't know how to judge whether it's good or bad; in the dual problem, the other system can guide you on whether it's good or bad. Second, in back translation you only depend on the target-side monolingual corpus, not the source side; in dual learning, we can depend on both [inaudible]. For sure, dual learning can be used in both supervised and semi-supervised settings. In our work, we deploy and utilize both back translation and dual learning to improve our system.
The first use of dual learning, we use in an unsupervised way. In that setup, we have monolingual corpora on both the source and target sides, and we can use a lot of feedback signals during the dual learning loop to learn how to improve the system. First, we can start from an English sentence here and move it to Chinese using the first system, then we bring it back using Chinese-to-English translation. Then we have the original sentence and the round-trip translation. We can calculate the reconstruction cost, and that gives a feedback signal to the model about how good the whole process is. Second, since we have monolingual corpora on both the source and target sides, we can use language models to get a feedback signal on how good those translations are. We keep doing that during our learning process to improve both models together.
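A minimal, runnable sketch of one unsupervised dual-learning step is below. The translator and language-model classes are trivial placeholders invented purely for illustration; in the real setup both systems are neural models, and the combined reward drives policy-gradient updates rather than just being returned.

```python
class ToyTranslator:
    """Placeholder translation model; a real system would be a neural MT model."""
    def translate(self, sentence):
        return "<translation of: %s>" % sentence
    def log_prob(self, target, source):
        # Placeholder score standing in for log P(target | source).
        return -abs(len(target) - len(source)) / 10.0

class ToyLanguageModel:
    """Placeholder language model scoring the fluency of a sentence."""
    def log_prob(self, sentence):
        return -len(sentence.split()) / 5.0

def dual_learning_step(en_sentence, en2zh, zh2en, zh_lm, alpha=0.5):
    zh_hyp = en2zh.translate(en_sentence)              # forward pass: English -> Chinese
    lm_reward = zh_lm.log_prob(zh_hyp)                 # signal 1: target-side fluency
    rec_reward = zh2en.log_prob(en_sentence, zh_hyp)   # signal 2: can we reconstruct the original?
    reward = alpha * lm_reward + (1.0 - alpha) * rec_reward
    # In the real setup, this reward drives a policy-gradient update of both models.
    return reward

print(dual_learning_step("the cat sat on the mat",
                         ToyTranslator(), ToyTranslator(), ToyLanguageModel()))
```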
The second view of dual learning is the dual supervised learning approach. In that task, we don't have a monolingual corpus, but we still have two systems trained on the bilingual corpus. The main objective of that model is to try to minimize the gap, the difference, between the joint probabilities of the sentences in the source and the target, whether it is computed as x to y or as y to x. That is very helpful in refining the translation models as well, and in the human parity project we deployed both the supervised and unsupervised variants in the same way, in the same setup. We do that online during training: for each mini-batch, we pick one mini-batch from one direction, another from the other direction, one mini-batch from the monolingual corpus X, and another from the monolingual corpus Y, and keep training and improving the models.
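That agreement term can be sketched as a simple penalty on the gap between the two factorizations of the joint probability, since P(x)P(y|x) and P(y)P(x|y) must match. The function and numbers below are illustrative only.

```python
def dual_supervised_penalty(log_px, log_py_given_x, log_py, log_px_given_y):
    """Squared gap between the two factorizations of the joint probability.

    Probability theory requires P(x)P(y|x) == P(y)P(x|y); this penalty is added
    to both translation losses so that the x->y and y->x models agree.
    """
    gap = (log_px + log_py_given_x) - (log_py + log_px_given_y)
    return gap ** 2

# Toy numbers: the two factorizations disagree slightly, so the penalty is non-zero.
print(dual_supervised_penalty(-9.1, -4.2, -8.7, -5.0))   # roughly (0.4)**2 = 0.16
```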
Another approach for utilizing the duality of the machine translation problem: instead of doing it in an online fashion like we did with the dual learning approach, we do it in batch mode. In other words, we train baseline systems, here and there, using only the parallel data. Using those systems, we can take the monolingual corpus on the source side and the monolingual corpus on the target side to generate synthetic parallel data for the boosted systems, and that comes with scores on how confident we are in the translations. We then train newer systems in the next iteration for both directions, and keep iterating four or five times over that to improve our systems. As you can imagine, it is very similar to dual learning, but it is not done in online mode; it is done in more of an expectation-maximization fashion, where you retrain the whole systems and refine the probabilities at each iteration. There is some theoretical foundation for that from the EM perspective, and the second point is that you are propagating the gradient over a very large batch rather than doing it in online mode. Both approaches were tried and both pushed the quality further in utilizing the duality of the machine translation problem.
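Here is a runnable outline of that batch-mode, EM-style loop, with trivial placeholder train and translate functions standing in for full NMT training and decoding runs; the data, iteration count, and naming are made up for illustration.

```python
def train(pairs):
    """Placeholder for a full NMT training run; returns a dummy 'model'."""
    return {"pairs_seen": len(pairs)}

def translate(model, sentences):
    """Placeholder for decoding with a trained model (model unused here)."""
    return ["<hyp for: %s>" % s for s in sentences]

bitext = [("zh %d" % i, "en %d" % i) for i in range(100)]   # real parallel data (zh, en)
mono_zh = ["zh mono %d" % i for i in range(50)]
mono_en = ["en mono %d" % i for i in range(50)]

zh2en = train(bitext)
en2zh = train([(e, z) for z, e in bitext])
for _ in range(4):                                  # a few EM-style iterations
    # Each direction translates the other side's monolingual data, yielding
    # synthetic source sentences paired with real target sentences.
    synth_en = translate(zh2en, mono_zh)            # real zh -> synthetic en
    synth_zh = translate(en2zh, mono_en)            # real en -> synthetic zh
    zh2en = train(bitext + list(zip(synth_zh, mono_en)))
    en2zh = train([(e, z) for z, e in bitext] + list(zip(synth_en, mono_zh)))
print(zh2en, en2zh)
```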
So far, we covered dual learning and joint training. The second problem we focused on is the bias of left-to-right decoding. During translation, we usually translate from left to right, and that causes a problem in that we propagate errors. At training time, we see the true examples; during decoding, we see the expected or predicted examples. If we produce an error at the beginning, it will be amplified quickly toward the end, especially if the sentences are quite long. Again, we explored two different approaches to overcome this problem.
The first is: what if we do multi-pass decoding? Rather than decoding the sentence once, what about decoding it twice? That is very consistent with what humans actually do, because when you are trying to translate or understand a sentence, you write it once, revise it a second time, revise it a third time, and so on. How do we actually do that? We do it with two-pass decoding. First, we run the encoder as usual, and then we run a first-pass decoder; the full system is trained end to end to produce the first-pass output. Then we take the output of the first-pass decoder, augmented with the output of the encoder, and pass that to a second-pass decoder. The second-pass decoder now has a full view of the source and a rough estimate of the full sentence in the target, and that can help a lot in getting a better-quality translation.
In more concrete terms, here is the formalization of the model. We start with the encoder; when we finish the encoder, that's the last layer of the encoder, of a Transformer model for example, and then we start decoding the first pass. Rather than predicting the actual words, we take the last layer of the decoder, before the softmax, and feed that through a gate into another decoder, as well as concatenating the encoder output into that second-pass decoder. In the Transformer model, the second-pass decoder no longer has to follow the constraint of looking only to the left and not to the right, because here you have a full view of the rough sentence from the first pass and of the full source sentence, so you can do full attention on both source and target to revise your translation. That has proved to be a very good approach to get better quality.
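A small NumPy sketch of the second-pass idea: the revision step attends over the concatenation of the encoder states and the first-pass decoder states, with no left-to-right mask. The shapes and random values are illustrative only, not the project's actual architecture.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def attend(queries, memory):
    """Full (unmasked) attention of each query over a memory of states."""
    d = queries.shape[-1]
    return softmax(queries @ memory.T / np.sqrt(d)) @ memory

rng = np.random.default_rng(0)
d = 16
enc_states = rng.normal(size=(9, d))     # encoder output for 9 source tokens
first_pass = rng.normal(size=(11, d))    # first-pass decoder states (pre-softmax), one per draft token

# Second pass: every position may look at the whole rough draft plus the full
# source, so the usual left-to-right mask is no longer needed.
memory = np.concatenate([enc_states, first_pass], axis=0)
second_pass_queries = rng.normal(size=(11, d))
revised_states = attend(second_pass_queries, memory)
print(revised_states.shape)              # (11, 16): refined states for the final output
```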
The second approach: since the deliberation network is a two-pass approach, you have to run two decoders at runtime. We can think of another way of doing this, simply trying to train the model in a way that respects the agreement between the right-to-left pass and the left-to-right pass. We do that very similarly to what we did with joint training, but instead of doing joint training on source and target, we do it on a right-to-left model and a left-to-right model. The left-to-right model is the usual model we train; the right-to-left model is the other model that tries to predict the English from the last word to the first, for example. If we can get the two models to agree, then we get better agreement between the right-to-left and left-to-right translations, and that can help in solving such problems.
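A toy sketch of the agreement regularizer: both models score the same (source, target) pair, and a penalty on their disagreement is added to each model's usual training loss. The 0.1 weight and the numbers below are illustrative, not the values used in the project.

```python
def agreement_penalty(l2r_logprob, r2l_logprob):
    """Squared disagreement between the left-to-right and right-to-left models'
    scores for the same (source, target) pair; in principle both should assign
    the same probability to the same full sentence."""
    return (l2r_logprob - r2l_logprob) ** 2

# The penalty is added to each model's usual cross-entropy loss during training.
penalty = agreement_penalty(-12.4, -13.0)
loss_l2r = 3.2 + 0.1 * penalty
loss_r2l = 3.5 + 0.1 * penalty
print(penalty, loss_l2r, loss_r2l)
```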
Now, we turn our attention to increasing the amount of training data. Using the above approaches, we ended up with many different approaches, many different systems. Still, up to that point, we were using the standard WMT training data along with monolingual data through back translation, dual learning, and joint training, ending up with around 35 million sentences. We thought we might scale up our approach to more data, but with that comes more noisy data from the web crawl, so we turned our focus to how to pick cleaner data to help our models.
For that, we proposed a new approach for multilingual sentence representation using a multilingual system. We end up with a sentence representation for each sentence in our various corpora, hundreds of millions of sentences, and then we measure the proximity, or similarity, between the source sentence and the target sentence. Based on that similarity, we can decide whether the two sentences are good translations of each other or not. Simply, we do that using a sequence-to-sequence model again, but this time the model is trained as a multilingual system; in our case, it's bilingual only. The input can be either English or Chinese, and the output can be either Chinese or English. The system doesn't know which language the input sentence is in, and it can always produce the correct output. That is the multilingual system. It is of slightly lower quality than a dedicated one-directional system, but it works quite well to give us a representation for a given sentence regardless of the language. We end up with essentially the same representation for a sentence, regardless of whether it is in Chinese or English. Okay, how do we utilize such a system? We run only the encoder part of the jointly trained multilingual system. Given an English sentence or a Chinese sentence, we average over the context representations for each word to get the sentence vectors for calculating the cosine similarity between the two sentences. It proved to be a very good approach, even for detecting bad translations, partial translations, and machine-translated data, because some of the machine-translated data from very old systems is quite bad and quite harmful for the system. So, we ended up filtering out that data, and in the end, we get cleaner data to feed our system.
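The filtering step can be sketched as follows: average the per-word context vectors from the shared encoder into a sentence vector for each side, then keep the pair only if the cosine similarity clears a threshold. The encoder outputs below are random stand-ins, and the 0.7 threshold is an illustrative value, not the one used in the project.

```python
import numpy as np

def sentence_vector(context_vectors):
    """Average the per-word context vectors from the shared bilingual encoder."""
    return context_vectors.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_pair(src_ctx, tgt_ctx, threshold=0.7):
    """Keep a crawled sentence pair only if the two sides are close in the
    shared embedding space (threshold is an illustrative value)."""
    return cosine(sentence_vector(src_ctx), sentence_vector(tgt_ctx)) >= threshold

# Random stand-ins for encoder outputs of a 12-word Chinese sentence and a
# 15-word English sentence; a real run would use the bilingual encoder.
rng = np.random.default_rng(0)
zh_ctx = rng.normal(size=(12, 512))
en_ctx = sentence_vector(zh_ctx) + 0.1 * rng.normal(size=(15, 512))  # roughly "parallel"
print(keep_pair(zh_ctx, en_ctx))
```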
We ended up with many systems now. As you can imagine, we tried joint training, we tried dual learning, we tried two-pass decoding and agreement regularization, and we have many systems to try. So now we do system combination via reranking at the sentence level. For that, we use features to re-score the candidates from the different systems, and rerank those candidates to pick the best one. We train the weights for this reranking on a held-out dev set, and we use the many features listed here. Among the features we are using: the original system score and which system produced which translation; a regular language model, a 5-gram model trained on the English news data; re-scoring both the input and the translation using another right-to-left system, to see how good or bad the translation is given the right-to-left constraint; then we do the same with English to Chinese, the reverse direction; and we repeat the same kind of feature, but rather than going through a system to score both translations, we use the sentence vector representation to measure how good or bad we are doing using that similarity score.
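A minimal sketch of the sentence-level reranking: each candidate carries its feature scores, a weight vector combines them, and the highest-scoring candidate wins. The feature names, weights, and numbers here are invented for illustration; the real weights are trained on the held-out set.

```python
# Illustrative feature weights; the real weights are tuned on a held-out set.
WEIGHTS = {
    "system_score": 1.0,   # score of the system that produced the candidate
    "lm_score": 0.6,       # 5-gram English LM trained on news data
    "r2l_score": 0.4,      # re-scoring with a right-to-left system
    "en2zh_score": 0.4,    # re-scoring with the reverse-direction system
    "sent_sim": 0.8,       # sentence-vector similarity to the source
}

def rerank(candidates):
    def combined(cand):
        return sum(WEIGHTS[name] * value for name, value in cand["features"].items())
    return max(candidates, key=combined)

candidates = [
    {"text": "hypothesis A", "features": {"system_score": -3.1, "lm_score": -42.0,
                                          "r2l_score": -3.4, "en2zh_score": -3.0, "sent_sim": 0.82}},
    {"text": "hypothesis B", "features": {"system_score": -2.9, "lm_score": -40.5,
                                          "r2l_score": -3.1, "en2zh_score": -2.8, "sent_sim": 0.88}},
]
print(rerank(candidates)["text"])   # hypothesis B wins with these made-up scores
```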
So, we now move to our experimental results, starting by comparing two or three baselines. The first baseline is the Sogou system, which was the best system in the WMT evaluation last year. Our baseline Transformer model gets us around 24 BLEU points; adding back translation, we get 25.5. Then we start trying our different approaches proposed above, plus combinations of them all. As you can see, across the board we are getting almost one and sometimes two BLEU points over the baseline plus back translation, which is quite a large improvement. Although these results are state of the art compared to the last WMT evaluation, they are not good enough to achieve human parity yet.
Then, there is the setup of the data selection, filtering more data and moving toward larger models. This model, as I said, is a Transformer model with back translation, and when we use a very large Transformer model, with something like 8,000 dimensions for the feed-forward network, we get a slightly better improvement from such a large system. Then we start adding more data to see how far we can get, going from 35 million sentences to 50 million and so on. As you can see in the third part of the table, we get a lot of improvement using the data selection approach here, and we vary that with different dropout ratios, because dropout can be very effective when training very large systems. So, as you can imagine, we ended up with many, many systems that we need to combine together, and as I said, the system combination is just sentence-level combination. We start combining systems from different categories here, and we ended up combining a lot of variations of the systems. The last combo systems here are the best performing, giving the best BLEU score on that test set.
Okay, so far we have reviewed the approaches we introduced, and we have reviewed neural machine translation. Before moving on to Christian to review the human parity evaluation, here are some of the research areas that are still worth investigating, that we didn't touch, and that we think will have a huge impact in the near future. Low-resource languages, for sure; real-world problems like named entities, domain adaptation, and topic adaptation. All our work here is at the sentence level; we don't get at all into document-level or context-level translation, like if you have a dialogue between two speakers: how to get the context into the game, how to get the document-level information into the game. Multimodality: text, speech, video, and a lot of signal coming from all those modalities can be used to improve translation quality. We reviewed the different approaches, the different architectures for neural MT. We still believe there is a lot of room for mixing and matching those approaches, and Transformer, CNN, and RNN can play together in a much more efficient way to get better systems. Finally, returning to the '60s point that we're not doing any reasoning: we may still want to do some structured understanding of what we are translating, rather than depending on pattern-based translation as we are doing now. So now, I'll hand over to Christian, who will review the human evaluation.
>> Thank you, Hany. Test, test. Yes, that works. All right. We'll take questions at the end of the talk when we're done with the human-eval stuff. So, the basic premise of the whole human parity project was that, given the NMT advances, we now see MT quality which is so much improved compared to the '60s, or the '90s, or the 2000s, that it seemed to be a good idea to check whether we can best Google's 2016 near-parity claim by actually achieving proper human parity.
What do you need to do that? First, you need to define what exactly you understand by human parity. The obvious first thing is to do this in a direct, equivalence-based fashion, where you basically say: when a bilingual human assigns a quality score, on whatever metric, to machine translation output and to human output, and these are equivalent in terms of quality, then at that point your machine has achieved human parity. That is a good first draft, but it's really hard to measure and determine what equivalence of translation quality is. So instead of doing that, we inverted the whole thing and went for an indirect, difference-based approach, because that is something we can measure statistically, by testing whether there's a significant difference between the two systems; if there is not, given enough sample points, we can then deduce that we have equivalence and hence parity. That somewhat flips the point of proof and burden of proof in our favor a little bit, because if we are simply unable to reliably determine an existing difference between the two systems, the human and the machine translation system, then by our base assumption we still have human parity; that is a point of criticism we ended up being flagged for. But at least this allows us, given a somewhat reliable scoring metric, to measure it and to provably claim we have parity based on the sample data points we collected.
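The indirect, difference-based check can be sketched as a straightforward significance test on segment-level human scores. The synthetic scores and the specific test below (a two-sided Mann-Whitney U via SciPy) are an illustrative choice, not necessarily the exact procedure used in the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic segment-level direct-assessment scores, for illustration only.
rng = np.random.default_rng(0)
mt_scores = rng.normal(loc=69.0, scale=15.0, size=1500).clip(0, 100)
human_scores = rng.normal(loc=68.6, scale=15.0, size=1500).clip(0, 100)

stat, p_value = mannwhitneyu(mt_scores, human_scores, alternative="two-sided")
if p_value < 0.05:
    print("significant difference (p=%.3f): no parity claim" % p_value)
else:
    print("no significant difference (p=%.3f): parity under the indirect definition" % p_value)
```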
So yeah, that's what I just said. Effectively, we jump from this equivalence-based direct approach, which would allow us to imply or deduce human parity, to one where the absence of a significant difference allows us to claim human parity. This trick worked out surprisingly well, as we will see in the following slides.
An important thing to frame: the whole discussion is all based on assumptions, because some of the main feedback points we received after the parity announcement did not necessarily come from the research field, where everybody believes, okay, yes, you can sample data to form test sets and then take segment scores and aggregate them into a joint metric of system quality for translation. But then you have end users, you have translators and people who believe that machines cannot really do proper translation. For them, you have to state explicitly that we have to assume this is possible, because otherwise the whole framework falls apart. Lastly, of course, without any reliable metric for measuring quality, the house of cards falls apart and we cannot really measure this. Luckily, we have several such metrics, so we can proceed.
Second, it was important, and we took some care in the paper writing, to make sure that we do not claim any superiority. So we are not yet talking about superhuman translation; we're barely touching the first steps of human parity for MT. This is different from ASR, where you can measure the human level of quality and where some of the existing systems actually achieve superhuman performance. Second, it gives you no guarantees whatsoever that the translation output of a human parity system is necessarily error-free or always mostly right. Humans make translation errors, and our machine translation systems are not much different in that respect; I guess we might even make more errors, but yeah, it's not necessarily error-free. Lastly, given the fact that all of this is based on evaluation on test sets and samples, of course, any of the results only correspond to the respective sample. So they need not generalize to the full set of potential scenarios, domains, and problems, though of course, using random sampling, we hope that for a lot of those scenarios our results would still apply.
So, that's the basis of the whole thing in terms of defining it, and that for the first time. Previously, nobody ever dared talk about human parity, because that immediately triggered knee-jerk reflexes from everybody outside of your own research group, telling you you're nuts and this is not possible. So, we tried to start off with a very, hopefully, agreeable definition of what human parity could be, then reformulate it in a way where we can measure it, so that now people can take our annotation output and check whether it actually fulfills our definition. Then at some point, they might concede: okay, you have parity for that sample, or not. So, that's the whole idea.
I talked about reliable scoring metrics. So the question is: what can we do to measure the translation quality of our human translation system and our machine translation system? One idea is that everybody uses BLEU. BLEU has stood the test of time and, despite all the criticism, to this day everybody uses it for tuning, for development, for evaluation. When I started working on MT evaluation, 10 years back, it was more or less not so common that people ran a human eval. So, my main mission is to bring human eval to every MT experiment in this world; back then everybody just reported BLEU. There was a short period of time where people diverged a bit, METEOR got some attention, TER. There's a nice set of metrics you could use, but at the end of the day everybody goes back to using BLEU. It's nice and simple, and people tend to trust it.
So of course, given our definition of human parity, what we could do is simply use BLEU as our reliable scoring metric: take MT output, take human output, see how they compare against the BLEU test set, and then decide, okay, that's human parity or not. What would we need? We would need high-quality references, because otherwise, of course, the whole argument of human parity becomes somewhat weakened.
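For reference, computing BLEU for MT output and for a human translation against the same reference is a one-liner with a tool like sacrebleu; the sentences below are made up, and sacrebleu is assumed to be installed.

```python
import sacrebleu

refs = [["The central bank raised interest rates on Tuesday."]]      # one reference stream
mt_out = ["The central bank increased interest rates on Tuesday."]   # machine output
human_out = ["On Tuesday, the central bank lifted interest rates."]  # human translation

print("MT BLEU:    %.1f" % sacrebleu.corpus_bleu(mt_out, refs).score)
print("Human BLEU: %.1f" % sacrebleu.corpus_bleu(human_out, refs).score)
```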
We wanted to base the whole Chinese to English news translation parity project on what had been done at last year's WMT, because that's state of the art and accepted in the research community, and it was the right language pair. Sadly, what we did not have on the WMT17 test set front were super reliable, high-quality references. That was the first time we ever had Chinese in the WMT news shared task, and some of the references were just not that great. There's a subset of the data which later was replaced with an improved reference, but the trustworthiness of the WMT17 reference translation was not that great. So we decided, okay, in order to make this whole thing fly, what we first need to do is create additional references.
That meant we actually tried to measure two different things. One: we created a post-editing-based reference based on crowd-sourcing. That's a quick and dirty way of getting an additional human-level, human-generated reference, but of course it will not be the same level of quality compared to a full, from-scratch human translation, which is why we created the latter, the HT reference, as well. So we could compare how the translation quality of a from-scratch human reference differs from post-editing quality. It's of course quite expected that they differ, because post-editors are not paid the same and will try to optimize by being faster, so there should be differences. Now, having those two different human systems, we looked at how things look when we actually compute the BLEU scores.
So, what you see here, or what you can't see because of the subtitles on top, would be Combo-6, which refers to the table Hany presented just before. Combo-6, Combo-5, and Combo-4 are our best combined systems from the parity project. Then there's Sogou, which is the winning system from last year's WMT17. Then there are two online systems which happened to be part of WMT and also part of our evaluation campaigns throughout the parity project. Now, this list of systems is ordered by decreasing quality according to all the human evaluation we did for the final paper. That means Combo-6, Combo-5, and Combo-4 are better than Sogou, in this case significantly, and all four of them are better than the online systems. Now, the interesting thing is that the only reference set for BLEU which captures that is the HT reference, because it gives us more or less the same scores for the combo systems, which makes sense; Sogou is a little worse, and then the online systems would be flipped according to BLEU, but that roughly matches expectations.
Then, if you look at online-B against the post-editing and the WMT references, something is severely odd, because based on those scores we would assume online-B should be really, really, really great. Yet consistently, in every campaign we ran across the parity project, 15 or so campaigns since last year, the two online systems always ended up in the last cluster. They never ever jumped into a position where they would even come close to being second worst, right? This was always really bad. The interesting thing here is to see a system achieving 46, 48 BLEU points but being completely and consistently penalized by human annotators. So at that point, we decided a human BLEU score is not what we need for the parity project. Instead, we really want to focus on asking bilingual speakers, and not only ask them to compare output against references, because references might even introduce bias, as we saw with online-B. So, we actually opted for source-based human evaluation.
So, we actually opted for source-based human evaluation.
So, quick summary of what that means for
our measuring of human parity so now we talked about
definition and why blue is not a metric for that.
So, in order to proceed for parity
what we needed was this reliable score metric.
Now, that we already
took the WMT test set from
last year's WMT we designed okay look at
the human evaluation we performed there and adopt
what they did which
is basically called a direct assessment.
It's a technique where you
take a pair of candidate translation and
some reference or translation hint and you ask
people to assign an absolute score between 0 and 100,
collect enough of those scores and then over time you
can identify spammers and random clickers,
eliminate them from your pool of workers
and collecting enough data allows you
later to determine system level quality
for your set of systems.
So that's adopting the state of the art from WMT17, and we then had our little requirement of going source-based, which is where we adapted the recipe to follow what we did last December for the IWSLT workshop, where we flipped from reference-based direct assessment to source-based direct assessment. The obvious caveat here is that you now need bilingual annotators, which are harder to find and more expensive to pay for; in the context of our project, that was luckily not an issue. It is an issue for WMT and IWSLT, because there, if your crowd pricing jumps from a couple of dollars per task to 10 times that, you will not end up getting enough data points, so the whole campaign is at risk.
Last but not least, an important distinction between the previous campaigns and our human parity approach is that we wanted to make sure that really nobody could claim that our results are just random and we just got majorly lucky. That means that instead of just following the standard assessment approach, which is that you take a random sample of segments for system A, B, C, D, whatever, collect scores for all those segments, and aggregate a system score based on those, we actually enforced that for every system we looked at, we compare the same segments. So, the set of segments for all the systems is frozen and completely identical. That's a difference from standard direct assessment, where segments are just randomly sampled, so that over time, given enough segments, you end up with a fair representation of system quality; in our case it is completely redundant, so that for every segment we can compare how all the systems fared. Then, to make it even more redundant, we not only had a single annotator per segment, but we enforced at least three of those, and, as you'll see in a little while in the evaluation design for the final parity campaign, we tripled that again, so the final data we report results on actually has at least nine annotators per segment and system for a lot of segments. So, highly redundant and, we believe, highly reliable. That's why we put out all the data publicly, so people can use it to check and verify the results.
For the final evaluation, an interesting twist on the evaluation campaign design is that campaigns or shared tasks like WMT and IWSLT operate on a once-a-year basis. They get a dump of systems and outputs, they are only concerned with finding this one ranking of systems in the matter of a couple of weeks, and then they declare a winner. Then everything is forgotten about, and next year there's a completely different set of participants and a different campaign. In our scenario, we actually needed some level of continuity between the campaigns, because we wanted to inform the system builders on our team how we fared compared to the other systems, so that we could measure progress over time. That means we had to introduce some continuity between events in the form of systems we always evaluated. That also made the task harder, because we always had to carry six systems which we kept stable across evaluation campaigns, which meant we basically wasted annotation mass on those systems, but at least it meant we could now fairly compare things over time.
The rest is just small stuff. So, we tested on a lot of different subsets. We ended up collecting roughly the same number of data points as was collected for WMT17, so it's a large-scale eval just for the purpose of our parity announcement, and due to the different subsets, we covered nearly half of the 2,000 segments of WMT's newstest2017 test set, which is also a lot.
This is the direct assessment-based setup. The example just shows you the English reference-based version from WMT last year; in our case, the reference would be the Chinese source text, and it's literally a very simple task. This is a lot faster, and it's a lot easier for people in terms of cognitive load, compared to the relative ranking task we did in the 2008 to 2013 range for WMT.
The nice bit is that you can enforce reliability of scores by adding spiked quality-control data. What we literally do is: we have some source and candidate translation which gets scored, whatever, and then we take the same candidate translation and artificially destroy its contents by chopping off a phrase of a certain length relative to the target length and swapping in something from a completely different system on a different segment, which is completely unrelated. So we know this cannot be the same level of quality; it is expected to be really, really much worse. Then, if you collect enough of those and run significance testing, you end up with a good metric which allows you to quickly eliminate random clickers and spammers.
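A sketch of that quality-control idea: build an artificially degraded twin of a candidate by chopping off a phrase and splicing in text from an unrelated segment, then check per annotator that genuine items score significantly higher than degraded ones. The degradation recipe, synthetic scores, and test below are illustrative, not the exact campaign procedure.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def degrade(candidate, unrelated, fraction=0.3):
    """Chop off roughly `fraction` of the candidate and splice in words from a
    completely unrelated segment, producing a guaranteed-worse item."""
    words = candidate.split()
    cut = max(1, int(len(words) * fraction))
    return " ".join(words[:-cut] + unrelated.split()[:cut])

print(degrade("the central bank raised interest rates on tuesday",
              "heavy rain is expected across the region this weekend"))

# Per-annotator check on synthetic scores: genuine items vs. their degraded twins.
rng = np.random.default_rng(1)
genuine = rng.normal(70, 12, size=60).clip(0, 100)
degraded = rng.normal(45, 12, size=60).clip(0, 100)   # what a reliable annotator would produce
_, p = mannwhitneyu(genuine, degraded, alternative="greater")
print("reliable annotator" if p < 0.05 else "possible random clicker")
```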
So, as I mentioned, we have this enforced segment overlap, so again, reliable scores, as much as we could make them. In terms of the eval campaigns, the rough timeline was that we conducted monthly large-scale evals for the current best research systems we had, plus our set of framing systems to make things continuous over time. Then the final round of evaluation campaigns happened in February and March this year, when we approached parity, found in one of our final evals that we were there, and then set up the big evaluation campaign described in the paper to be able to announce the whole thing. In terms of scale, it's literally WMT17 plus-plus, with a lot of redundancy, so it is as large-scale as anybody has done it. So, we can skip over that.
Let's just look over this. This is just a quick look at the evolution over time. The first eval, from November, was just the very first attempt at having a research system in the Babel parity project, and we just compared it against the online systems and our references. Then over time we had larger sets of systems and saw how the whole thing fared. Noteworthy is how the online systems always end up dead last, which confirms results from last year's WMT, where they also did not fare well in the human eval, so that's consistent. Then the interesting thing is how the research systems, over time, more and more closely approach the quality of the human references. What's missing here is, say, the watershed moment at the end of February, in a one-off eval, when we figured out that our combo systems had actually jumped into the same result cluster, and that was the point in time when we decided, okay, now we need to set up the whole thing and see whether we can measure the parity finding and prove it. The tables you saw there are the result clusters, in WMT lingo, based on pairwise comparison of systems. It's not very intuitive and it's always confusing, so we have a much nicer representation coming up next.
In a nutshell, what those tables mean is that each of the systems in a higher cluster, say the first, rank-one cluster, significantly outperforms all the systems in lower-ranked clusters. That means some system can be significantly better than some other systems, but within the same cluster you have no idea whether it's better or not, because within that same cluster you have the underlying assumption that if I cannot prove a significant difference, then it is equivalent, by our definition number 2. Then you have the z-score and the raw score, which are just used to provide an ordering, because people like to see stuff in some ordered fashion, but they don't really tell you much. Specifically, the differences and deltas between the standardized scores, the z-scores, don't tell you anything, and the rest is just the average score the system achieved on this 0-to-100 scale. That tells you a bit, but it's also hard to tell whether the 0.5 between Combo-6 and Reference-HT means something or not, and then, considering the post-editing reference is still in the same cluster but has significantly lower raw and z-scores, the whole thing works as a means of properly computing result rankings, but it's hard to interpret for human beings.
So what we do here is, for the rest, I'll just show you the final results for the parity claim for our best system against the respective runner-up systems, and give you a graphical way of understanding the whole thing by moving from the result clusters, which contain just all the systems, to a pairwise, density-based representation. This now looks a lot nicer and is, hopefully, easier to interpret. On the X-axis are just the scores ranging from 0 to 100, so that's literally the range of the slider the humans used to assign quality, and then you have an overlaid version of the histograms of the score bins, and then you have the corresponding density functions for the two systems. What you see here is that the blue graph will always be our best combination system from the parity project, and the gray one will always be a competing system. Here it's Sogou, the winner from WMT17. The dashed lines are the mean scores for the two systems, and you see that for Combo-6 we have roughly 69 as the mean score, while Sogou is at 62. So there's an openly visible gap between the two systems, and you also see, based on the histogram and on the shape of the two density curves, that the Sogou curve in gray obviously has more area in the lower part of the whole graph, whereas the blue curve of the combo system is at a lower level there, right? This gives you some understanding: if I now tell you that, based on this, we claim our system is better than the Sogou system, this is a much nicer way of framing it than this stupid cluster table where you don't understand any of this. Yes?
>> On segments where both you and Sogou don't do that well-
>> Yes.
>> Sixty or below, something like this. Did you look at what the correlation between the systems is? Are we bad on the same sentences, or are we bad on different things?
>> We didn't look, but you can; all the data is there. Or I can, yes. As Hany said, there's an open research question for a lot of things, and that's one of the questions we will look at. Technically, we have the same pairwise data, so that is something I'm looking at a lot, but we didn't do it explicitly for this yet.
So now, jumping back and forth between Sogou and the WMT reference. What you see here is, where is it, that's Sogou again, and that's WMT. The gray curve moves a little bit, but not much; the mean score is literally a little worse than for Sogou, but again, this clearly shows why they end up in the lower cluster and we end up in a higher cluster. So again, it's more understandable how our system is better. Now it gets interesting: now we jump to the cluster where post-editing, human translation, and our system all end up together. That's the one where everything is indistinguishable; we don't really know what that looks like, and now it gets much more interesting. You see directly that the mean scores jump to being very close, and you also see that the shapes of the curves somewhat converge, in a sense, right? That explains in a graphical way what it means to be indistinguishable. The main thing, at the end: compared to the human reference, we literally have nearly perfectly overlapping curves, and except for some noise in the histograms, these systems seem to perform at roughly the same level of quality.
Now, an important point, and all of you brought this up: we don't necessarily know whether the assigned scores in the respective bins correspond to the same segments. So there might be some difference there, but considering that we collected tens of thousands of data points, the main take-home message here would be that, to the best of our data collection abilities at this point in time, given the constraints of having the human evaluators around, etc., these systems are not significantly different, and hence, in our indirect fashion, they are equivalent, and that's why we end up having human parity.
So that was the final thing we then reported, plus a nicer graphical way of explaining those wonky tables. Following Hany's approach here, let's also talk about what we could be doing next. Of course, we released all the data publicly, which is not so common in the industry, so we encourage people to look at the data and tell us where we messed up so that we can improve on that. Improving the translation quality is something where Hany has a ton of ideas; NMT architecture work and modeling can do a lot. Most importantly, people complained about us not being able to consider context in any way, since we just have this sentence-level approach. An interesting comment online was that it might now be somewhat possible to look at how MT compares to humans of a certain certifiable level of translation ability for some scenario. We could build a test set or find annotators who have a certain certified ability, and then we can compare our [inaudible]. Is our MT, I don't know, like first grade, or fourth grade? I don't know; I mean, that's an approach or a direction we could look into, and it might now be worth it.
So, is this parity? By this statistical definition, we have enough data points to back up that this actually is parity; not near parity, it's parity. It's only the first step on our trajectory towards reaching fuller, more omnipresent human parity for MT, and there will be lots of new languages, domains, scenarios, and architectures we can try, pushing towards ever-increasing, human-parity-like quality. All right, and with that, we conclude. I invite Hany back up, and we'll take any questions that you might have. Thank you.