Tip:
Highlight text to annotate it
X
Voiceover: Cryptographic hash functions
are basically fundamental building blocks
that are used within many cryptographic algorithms
and protocols,
and they have a number of very important
applications in the context of information security
as a whole.
Now, some of the more common algorithms
in this category that are known
as cryptographic has functions include:
things like MD5, and also,
it has some predecessors like MD4 and others,
as well as algorithms like SHA-256,
and actually, SHA-256 is preceded
by other algorithms like SHA-1 and so on,
and also there are some algorithms
that maybe you've heard of,
or maybe that are a bit lesser known,
but things like RIPEMD, and BLAKE, and Skein,
and others.
Now, cryptographic hash functions are basically
used as critical building blocks in many applications,
and really the first motivating application,
the first historical application
of these types of hash functions
was in the context of what's known
as a digital signature,
and digital signatures are used in so many different
cryptographic applications today.
They're the cornerstone of many
ecommerce protocols.
They are used in doing things like bitcoin generation
and so on.
Cryptographic has functions are also used
in things like message authentication protocols,
in pseudorandom number generation
and password security,
even encryption to some degree.
In fact, aside from their use in digital signatures,
these hash functions are also used in other places
in the bitcoin protocol as well.
First of all, let me talk about
what a cryptographic hash function actually is,
and of course, as the name implies,
the first thing it is,
it's a hash function.
And by hash function,
I basically mean that it will take input.
It's a mathematical function,
a transformation, if you will,
that takes a particular input,
and we call this input a message,
and the message can be of arbitrary length,
and the hash function basically applies
a mathematical transformation,
or maybe a set of mathematical transformations
to this input to produce a single output,
and we typically call this output a digest,
although, sometimes you will see
the output referred to as a tag, or as a hash,
or as a fingerprint,
but digest is sort of the most common nomenclature.
In fact, MD5, which was
one of the earlier hash functions,
stands for message digest 5,
and MD4 was message digest 4,
and so on, and so forth.
Now, the message, as I mentioned briefly
can be of arbitrary size.
It can be as long as you want,
or as short as you want,
but the output,
the size of the digest or the size of the tag,
is going to be fixed in length,
and for example, in the context of a hash function
like, let's say, SHA-256,
the digest will actually be exactly 256 bits in length.
It's going to have a fixed-length output,
but an arbitrary-length input.
The other thing I want to point out
about these cryptographic hash functions
is that the function here is a deterministic function,
and by that, I mean
that the output will always be the same
for a given input,
so if you have a given input,
you're going to see the exact same output.
You won't have a situation in which, maybe,
a given input will have two different possible outputs.
It's going to be consistent.
Now, traditional hash functions have been
used in computer science for quite some time,
and they are used in many
computing applications,
so, for example, you may have heard of something
like a hash function used in something
called a hash table,
but the kinds of hash functions that are used
in hash tables, and I want to stress this,
they aren't necessarily the same
as cryptographic hash functions.
The qualifier "cryptographic" here is very important,
and it usually means or implies that
that hash function has to have a certain set
of critical design goals and maybe some
particular properties in mind
that make it suitable for use
in applications that use cryptography
or cryptographic applications,
areas where perhaps security or privacy
or confidentiality or authentications
are a serious concern.
First and foremost,
maybe in describing some of these properties
is that, and maybe this almost goes without saying,
one of the first properties you want
of a cryptographic hash function
is that it should be computationally efficient,
and by that, I mean that it shouldn't take a lot of time
to compute the output from a given input.
If you're given a message
and you want to apply this set of transformations
to that message to get a digest,
that set of transformation should not take a long time
to implement on a computer.
It should be very fast, or relatively fast.
It almost goes without saying,
but I think it's important to stress and point it out
because I've seen people come up
with grossly inefficient hash functions sometimes,
and those would not be considered suitable
in the context of when typical
cryptographic hash functions are used
for cryptographic applications.
The second property you typically need
in the context,
and this is especially in the context of digital signing,
is that you want it to be the case
that it's hard to find two inputs that actually map
to the same output,
and I mean two distinct inputs
whose corresponding digest is identical.
This property is typically referred to as
collision resistance.
That means it's hard to find a colliding pair of input,
so in other words, if you have two inputs
and let's say you have a message M1,
and you have a message M2,
their output under the application
of the hash function should not be the same.
You won't ever have it be the same that the,
you won't even have it be the case, rather,
that the output of M1 and M2
under an application of the hash function
is the same.
They should never be the same.
It should always be different.
Now, I should take a step back here
and point out that of course in practice,
given that messages can be of arbitrary size
and given that the input or the output, rather,
is a fixed size,
it's not mathematically possible to guarantee
that the output will always be different
for two distinct messages,
but what you typically want is not that the outputs
are necessarily different,
but that it's hard to find two distinct messages
that produce the same output.
We know they exist by virtue of the fact
that there are many messages that can be hash,
and only a finite, small number,
relatively small number compared to
the number of messages,
a smaller number of possible digest values,
but aside from that consideration,
it should be hard. It should take a long time,
and by long, I mean an astromonically long time
to find two distinct messages that would result
in the same output
under the application of the hash function.
Now, the third thing that I want to point out
is that in many cases, you might want,
also, in the context of a hash function,
for the hash function to be able to hide
information about the inputs.
In other words,
given the output,
it should be hard to glean anything useful
or interesting about the input.
Anything, any relevant detail,
and I don't just mean the whole input,
but maybe even something as simple as
was the input an even number or an odd number?
I mean, that should be the kind of thing
that should be hard to infer when you see the output,
even something as simple as the,
the evenness or oddness of the input.
Now, a fourth property I want to point out
is that you typically want the output to be
well distributed.
In other words,
the output should look random.
In other words, it should look
like a set of coin flips took place,
not that there was a predictable way
in which the output was created.
Really, what that means is that,
and you can think of it maybe more conceptually as
it's almost as if you flipped a bunch of coins
to get to the output.
It should look just that random.
And so what you can maybe think of
cryptographic hash functions as,
as it's, perhaps, maybe the mathematical equivalent
or mathematical analog of a meat grinder of sort.
It can take inputs and apply
these mathematical transformations to them
such that the output looks,
for all intents and purposes,
completely random and unrelated
to the original input.
I do want to make a few quick remarks
about these particular properties,
and first of all, they are interrelated.
For example, if you have a situation
where the outputs, let's say,
really appear to bear no relationship to the input
and the outputs also look random,
then that will, in fact, give you to a large degree,
a lot of the collision resistance properties
because just the fact that you can't
predict the output
and the fact that it hides all this information
implies that it's going to be hard to find
two inputs that are distinct
that map to the same output.
And so sometimes, you get one property
in exchange for the others.
The second thing I want to point out
is that typically, these properties in practice,
or maybe in the underlying mathematics,
are things that you hope for
but you can't always guarantee
that they'll always hold,
and it may be entirely possible that
you could design a hash function
that you think is completely collision resistant,
but someone might come along a year from now
and come up with a more clever way for finding
a collision.
Maybe they'll find a clever shortcut
that does not involve doing a brute force search
of any sort.
It turns out that cryptographers,
for better or worse currently do not have
any mathematical techniques.
They haven't developed techniques for being able
to work around some of these limitations.
So we often do take it on face value
that these schemes are secure
based on how long they've been around.
Now, I also want to point out,
and the last thing I want to point out,
is that this treatment that I've given
is not meant to be mathematically rigorous
by any stretch of the imagination.
There are far more precise ways to describe
these underlying properties,
but my hope, instead, is that
this video gives you, maybe, a bit of a flavor
for what is required of a cryptographic hash function
without maybe getting bogged down in some of the
mathematical minutia and formalism.