The following hash function is based on a user's public key modulus and
on the basic function ModExp(A,B,C) which calculates A**B mod C, creating
a 256 bit hash value and using ModExp() on every single character
in the message text as a round function.
LargePrime=106603488380168454820927220360012878679207958575989291522270608237193062808643
Ho = Modulus mod LargePrime
Hi+1 = (Hi + 19) XOR ModExp((Hi + Ho), (ord(character i) + 19) , LargePrime)
As the basic function ModExp() is multiplicative, the hash function has to
be designed to be multiplication free (see [And-93]), i.e it should
be computational infeasible to find any x, y, z so that hash(x) * hash(y)
equals hash(z). Therefore ModExp() will be used in feed-forward mode
and its output is XORed with the previous hash value, which depends on
the chaining-variable Hi and one single
character of the message. Adding the small prime to the new hash will
contribute to breaking the multiplicative homomorphism of ModExp().
To avoid modular exponentiation with a zero exponent (which would result in
a chaining-variable of 1) a small prime (currently 19) is introduced to
increase the exponent accordingly. In the next round the new hash value
will be the base for exponentiation with the following character as
an exponent. As it is inevitable that with a very small probability
the output of ModExp() is either 0 or 1, in both cases the next
character being processed would be ignored. Including the lower 256 bits
of the public key modulus in the base of the exponentiation will prevent
such an interruption of the expontentation chain.
The hash function above can be coded with very few lines in the Python language
def ModExp (Base, Exp, Mod) : Hash = 1 X = Exp Factor = Base while X > 0 : Remainder = X % 2 X = X / 2 if Remainder == 1: Hash = Hash * Factor % Mod Factor = Factor * Factor % Mod return Hash def create_hash(Message, Modulus) : Prime = 1066034883801684548209272203600128786792079585759892915222706082 37193062808643L # second factor from RSA-155 challenge , factored Aug. 1999 SmallPrime = 19 LastHash = 0L Hash = HashModulus = Modulus % Prime for Character in Message : LastHash = Hash + SmallPrime Power = ModExp(Hash+HashModulus, ord(Character)+SmallPrime, Prime) Hash = (Power ^ LastHash) % Prime return Hash
What would it take to make the hash invertible?
For very small plaintexts of only a few characters it would simply be more efficient to start an exhaustive search on the plaintext than to calculate roots mod prime to recover the chaining-variable Hi. During the processing of every single character an amount of information is thrown away while calculating the modular exponentiation and mixing the result with the hashvalue of the previous round will make it computationally infeasible to recover the chaining-variables from the final output, given the length of a 256 bit hashvalue.
How hard is it to find a preimage for a given hash value?
Firstly, let us look at the effects of a possible manipulation of input bits across the border between two adjacent characters. A manipulation of a single character cannot be compensated by another in a deliberate way, because the alteration of the first character will give a different remainder (mod prime) which is not only masked with the old chaining variable but also the result is used as the new base in the next round, propagating this change through the rest of the chain. It would be easier to try a compensation with the following character than using anyone further down the road, but as the opponent knows Hi she can calculate the effects of an increased or decreased exponent, but then she has to compensate this effect using the new base with a different exponent for which she has only 255 different choices, which reduces the probability of a successful compensation (the new message should give the same hash value as the old one) drastically.
But is there a possible compensation by toggling some bits inside a given character?
Changing bits inside a character will result in an additional factor basek (where k is a small integer smaller than 256) which will not be devided by the prime used in the hash function. Because of the fact that the base of the exponentiation is always smaller than twice the prime there are no common factors and the additional factor cannot be a multiple of the prime. Thus the additional factor can never result in the same output of ModExp() and will always lead to a different hash value. The same argument holds for bit manipulations making the exponent smaller. As the new and old outputs of ModExp() differ only by a factor (basek), the production of the same reminder would require that the prime divides the factor (basek) which is impossible because the base is smaller than twice the prime.
Giving a satisfying answer to this question is almost impossible as significant statistical analysis of the hash function's behaviour regarding the clustering of its output values cannot be performed due to the sheer amount of possible hash values. Although it is questionable whether one can extrapolate empirical results that have been found using much smaller primes to reduce the amount of hash values to the full 256 bit hash space, some of the following observations indicate the hash function's ability to scatter its output values:
After having performed a series of statistical tests on the hash function with smaller primes and different input material I will try to draw conclusions out of the results hoping that they shed some light on the quality of the hash function. I would like to present the details of my tests in a separate paper explaining only the most significant results here.
To reduce the amount of hash values small primes (31, 97, 997, 9973 and 32999) were used and the input data was taken from an entropy file made out of 1110000 bytes from /dev/random. Every 10 bytes were hashed and the number of collisions for each hash value was recorded. The data showed that there were some hash values that attracted more collisions than the average ones but the ones who attracted more collisions exceeded the average ones by a factor of 3 to 4 only. Thus the most probable hash value was no more than 4 times more likely to have a collision than the majority of others. The number of hash values that never got a hit grew bigger the larger the prime and consequently the smaller the average value of collisions became. With the prime 32999 and 111000 random input texts 1,5 percent of all hash values did never appear while an average number of 3,2 collisions was found on most of the hashes.
The second result revealed from larger input texts. Always using the prime 32999 I found that with over a million inputs of 10 bytes random data the maximum of collisions was only twice as high as the average while with only 111000 inputs the maximum was 4 times the average. One would expect that with more and more data fed into the hash function the difference between the maximum and the average number of collisions will slowly disappear. Using ordinary text files taken from the manual section of Linux as inputs I produced almost the same results (for around 150000 inputs) with 80 character text strings as with 10 bytes of random data.
These observations are preliminary and should be taken with great caution but for now I do not see any sign that the hash function does not scatter its output values properly.
The fact, that by chance two different messages (m1 and m2) can hash to the same value when hashed under two different public key moduli cannot be exploited for cheating as long as the correct public keys were used for verification or decryption. Talking about the hash of a message is meaningless unless you know which key the hash was created for, so we note hash(message, key) instead of hash(message) in PCP. There is another advantage of the hash function's dependence on a public key modulus that comes into play when we look at the way in which the secret key is protected for storage on the local computer system (see the relevant chapter for details).
PCP uses two different secret keys, one for signing (a long-term high security key) and one for encryption (which can be considerably shorter than the signing key) and even if you use the same passphrase to protect them the different moduli will assure that each secret key will be protected with a different storage key, because hash(passphrase, signingkey) differs from hash(passphrase, encryptionkey).
- [ And-93 ]
- Ross Anderson : The Classification of Hash Functions
in: Codes and Ciphers (proceedings of fourth IMA Conference on Cryptography and Coding, December 1093), published by IMA (1995) pp 83-93