Bitcoin - Cryptographic hash function

Voiceover: Cryptographic hash functions are basically fundamental building blocks that are used within many cryptographic algorithms and protocols, and they have a number of very important applications in the context of information security as a whole. Now, some of the more common algorithms in this category that are known as cryptographic has functions include: things like MD5, and also, it has some predecessors like MD4 and others, as well as algorithms like SHA-256, and actually, SHA-256 is preceded by other algorithms like SHA-1 and so on, and also there are some algorithms that maybe you've heard of, or maybe that are a bit lesser known, but things like RIPEMD, and BLAKE, and Skein, and others. Now, cryptographic hash functions are basically used as critical building blocks in many applications, and really the first motivating application, the first historical application of these types of hash functions was in the context of what's known as a digital signature, and digital signatures are used in so many different cryptographic applications today. They're the cornerstone of many ecommerce protocols. They are used in doing things like bitcoin generation and so on. Cryptographic has functions are also used in things like message authentication protocols, in pseudorandom number generation and password security, even encryption to some degree. In fact, aside from their use in digital signatures, these hash functions are also used in other places in the bitcoin protocol as well. First of all, let me talk about what a cryptographic hash function actually is, and of course, as the name implies, the first thing it is, it's a hash function. And by hash function, I basically mean that it will take input. It's a mathematical function, a transformation, if you will, that takes a particular input, and we call this input a message, and the message can be of arbitrary length, and the hash function basically applies a mathematical transformation, or maybe a set of mathematical transformations to this input to produce a single output, and we typically call this output a digest, although, sometimes you will see the output referred to as a tag, or as a hash, or as a fingerprint, but digest is sort of the most common nomenclature. In fact, MD5, which was one of the earlier hash functions, stands for message digest 5, and MD4 was message digest 4, and so on, and so forth. Now, the message, as I mentioned briefly can be of arbitrary size. It can be as long as you want, or as short as you want, but the output, the size of the digest or the size of the tag, is going to be fixed in length, and for example, in the context of a hash function like, let's say, SHA-256, the digest will actually be exactly 256 bits in length. It's going to have a fixed-length output, but an arbitrary-length input. The other thing I want to point out about these cryptographic hash functions is that the function here is a deterministic function, and by that, I mean that the output will always be the same for a given input, so if you have a given input, you're going to see the exact same output. You won't have a situation in which, maybe, a given input will have two different possible outputs. It's going to be consistent. Now, traditional hash functions have been used in computer science for quite some time, and they are used in many computing applications, so, for example, you may have heard of something like a hash function used in something called a hash table, but the kinds of hash functions that are used in hash tables, and I want to stress this, they aren't necessarily the same as cryptographic hash functions. The qualifier "cryptographic" here is very important, and it usually means or implies that that hash function has to have a certain set of critical design goals and maybe some particular properties in mind that make it suitable for use in applications that use cryptography or cryptographic applications, areas where perhaps security or privacy or confidentiality or authentications are a serious concern. First and foremost, maybe in describing some of these properties is that, and maybe this almost goes without saying, one of the first properties you want of a cryptographic hash function is that it should be computationally efficient, and by that, I mean that it shouldn't take a lot of time to compute the output from a given input. If you're given a message and you want to apply this set of transformations to that message to get a digest, that set of transformation should not take a long time to implement on a computer. It should be very fast, or relatively fast. It almost goes without saying, but I think it's important to stress and point it out because I've seen people come up with grossly inefficient hash functions sometimes, and those would not be considered suitable in the context of when typical cryptographic hash functions are used for cryptographic applications. The second property you typically need in the context, and this is especially in the context of digital signing, is that you want it to be the case that it's hard to find two inputs that actually map to the same output, and I mean two distinct inputs whose corresponding digest is identical. This property is typically referred to as collision resistance. That means it's hard to find a colliding pair of input, so in other words, if you have two inputs and let's say you have a message M1, and you have a message M2, their output under the application of the hash function should not be the same. You won't ever have it be the same that the, you won't even have it be the case, rather, that the output of M1 and M2 under an application of the hash function is the same. They should never be the same. It should always be different. Now, I should take a step back here and point out that of course in practice, given that messages can be of arbitrary size and given that the input or the output, rather, is a fixed size, it's not mathematically possible to guarantee that the output will always be different for two distinct messages, but what you typically want is not that the outputs are necessarily different, but that it's hard to find two distinct messages that produce the same output. We know they exist by virtue of the fact that there are many messages that can be hash, and only a finite, small number, relatively small number compared to the number of messages, a smaller number of possible digest values, but aside from that consideration, it should be hard. It should take a long time, and by long, I mean an astromonically long time to find two distinct messages that would result in the same output under the application of the hash function. Now, the third thing that I want to point out is that in many cases, you might want, also, in the context of a hash function, for the hash function to be able to hide information about the inputs. In other words, given the output, it should be hard to glean anything useful or interesting about the input. Anything, any relevant detail, and I don't just mean the whole input, but maybe even something as simple as was the input an even number or an odd number? I mean, that should be the kind of thing that should be hard to infer when you see the output, even something as simple as the, the evenness or oddness of the input. Now, a fourth property I want to point out is that you typically want the output to be well distributed. In other words, the output should look random. In other words, it should look like a set of coin flips took place, not that there was a predictable way in which the output was created. Really, what that means is that, and you can think of it maybe more conceptually as it's almost as if you flipped a bunch of coins to get to the output. It should look just that random. And so what you can maybe think of cryptographic hash functions as, as it's, perhaps, maybe the mathematical equivalent or mathematical analog of a meat grinder of sort. It can take inputs and apply these mathematical transformations to them such that the output looks, for all intents and purposes, completely random and unrelated to the original input. I do want to make a few quick remarks about these particular properties, and first of all, they are interrelated. For example, if you have a situation where the outputs, let's say, really appear to bear no relationship to the input and the outputs also look random, then that will, in fact, give you to a large degree, a lot of the collision resistance properties because just the fact that you can't predict the output and the fact that it hides all this information implies that it's going to be hard to find two inputs that are distinct that map to the same output. And so sometimes, you get one property in exchange for the others. The second thing I want to point out is that typically, these properties in practice, or maybe in the underlying mathematics, are things that you hope for but you can't always guarantee that they'll always hold, and it may be entirely possible that you could design a hash function that you think is completely collision resistant, but someone might come along a year from now and come up with a more clever way for finding a collision. Maybe they'll find a clever shortcut that does not involve doing a brute force search of any sort. It turns out that cryptographers, for better or worse currently do not have any mathematical techniques. They haven't developed techniques for being able to work around some of these limitations. So we often do take it on face value that these schemes are secure based on how long they've been around. Now, I also want to point out, and the last thing I want to point out, is that this treatment that I've given is not meant to be mathematically rigorous by any stretch of the imagination. There are far more precise ways to describe these underlying properties, but my hope, instead, is that this video gives you, maybe, a bit of a flavor for what is required of a cryptographic hash function without maybe getting bogged down in some of the mathematical minutia and formalism.