Remarks on Security

The following hash function is based on a user's public key modulus and on the basic function ModExp(A, B, C), which calculates A**B mod C. It produces a 256-bit hash value, applying ModExp() to every single character of the message text as a round function.

    LargePrime = 106603488380168454820927220360012878679207958575989291522270608237193062808643

As the basic function ModExp() is multiplicative, the hash function has to be designed to be multiplication free (see [And-93]), i.e. it should be computationally infeasible to find any x, y, z such that hash(x) * hash(y) equals hash(z). Therefore ModExp() is used in feed-forward mode and its output is XORed with the previous hash value. The chain starts from the chaining variable H_0 and proceeds character by character:

    H_0     = Modulus mod LargePrime
    H_{i+1} = (H_i + 19) XOR ModExp(H_i + H_0, ord(character_i) + 19, LargePrime)
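The multiplicative property that motivates this design can be checked directly with Python's built-in pow() (a minimal illustration, not part of PCP; the sample bases and exponent are arbitrary):

```python
# Modular exponentiation is multiplicative in its base:
# (a^e mod p) * (b^e mod p) mod p == (a*b)^e mod p.
# Without the XOR feed-forward, a hash built on ModExp() alone
# would inherit this relation between hash values.
p = 106603488380168454820927220360012878679207958575989291522270608237193062808643
a, b, e = 123456789, 987654321, 19

lhs = (pow(a, e, p) * pow(b, e, p)) % p
rhs = pow(a * b, e, p)
assert lhs == rhs  # holds for any a, b, e
```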

The hash function above can be coded in very few lines of Python:

    def ModExp(Base, Exp, Mod):
        Hash = 1
        X = Exp
        Factor = Base
        while X > 0:
            Remainder = X % 2
            X = X // 2
            if Remainder == 1:
                Hash = Hash * Factor % Mod
            Factor = Factor * Factor % Mod
        return Hash

    def create_hash(Message, Modulus):
        # second factor from the RSA-155 challenge, factored Aug. 1999
        Prime = 106603488380168454820927220360012878679207958575989291522270608237193062808643
        SmallPrime = 19
        LastHash = 0
        Hash = HashModulus = Modulus % Prime
        for Character in Message:
            LastHash = Hash + SmallPrime
            Power = ModExp(Hash + HashModulus, ord(Character) + SmallPrime, Prime)
            Hash = (Power ^ LastHash) % Prime
        return Hash
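A quick usage sketch follows (hedged: the modulus below is an arbitrary placeholder, not a real RSA modulus, and Python's built-in pow() stands in for ModExp(), which computes the same thing):

```python
PRIME = 106603488380168454820927220360012878679207958575989291522270608237193062808643
SMALL_PRIME = 19

def create_hash(message, modulus):
    # H_0 = modulus mod LargePrime, then one ModExp round per character
    h = h0 = modulus % PRIME
    for ch in message:
        last = h + SMALL_PRIME
        h = (pow(h + h0, ord(ch) + SMALL_PRIME, PRIME) ^ last) % PRIME
    return h

modulus = 2**255 + 1236431  # placeholder for a public key modulus
digest = create_hash("attack at dawn", modulus)

assert 0 <= digest < PRIME                               # output lies in the hash range
assert digest == create_hash("attack at dawn", modulus)  # deterministic
assert create_hash("", modulus) == modulus % PRIME       # empty message yields H_0
```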

*What would it take to make the hash invertible?*

For very small plaintexts of only a few characters it would simply be
more efficient to start an exhaustive search on the plaintext than to
calculate roots mod prime to recover the chaining variable
H_{i}. During the processing of every single character
an amount of information is thrown away in the modular
exponentiation, and mixing the result with the hash value of
the previous round makes it computationally infeasible to recover
the chaining variables from the final output, given the length of
a 256-bit hash value.
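The first point, that exhaustive search beats algebraic inversion for tiny plaintexts, can be replayed with toy parameters (a sketch: the prime 32999 and the modulus 123457 are small illustrative values chosen for speed, not part of PCP):

```python
# Toy illustration: with a very short plaintext, simply trying all
# candidate messages recovers a preimage almost instantly.
import itertools
import string

TOY_PRIME = 32999
SMALL = 19

def toy_hash(message, modulus):
    # same round function as create_hash(), with a toy prime
    h = h0 = modulus % TOY_PRIME
    for ch in message:
        last = h + SMALL
        h = (pow(h + h0, ord(ch) + SMALL, TOY_PRIME) ^ last) % TOY_PRIME
    return h

target = toy_hash("ok", 123457)
preimage = next("".join(pair)
                for pair in itertools.product(string.printable, repeat=2)
                if toy_hash("".join(pair), 123457) == target)
assert toy_hash(preimage, 123457) == target
```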

*How hard is it to find a preimage for a given hash value?*

Firstly, let us look at the effects of a possible manipulation of input bits
across the border between two adjacent characters. A manipulation
of a single character cannot be compensated by another one in a deliberate
way, because the alteration of the first character gives a different
remainder (mod prime) which is not only masked with the old chaining
variable but also used as the new base in the next round,
propagating the change through the rest of the chain. It would be easier
to attempt a compensation with the immediately following character than
with any character further down the road; but although the opponent
knows H_{i} and can calculate the effect of an increased or decreased
exponent, she then has to compensate this effect using the new base with a
different exponent, for which she has only 255 different choices.
This drastically reduces the probability of a successful compensation
(the new message giving the same hash value as the old one).

*But is there a possible compensation by toggling some bits inside a given
character?*

Changing bits inside a character results in an additional factor
base^{k} (where k is a small integer below 256) which is not
divisible by the prime used in the hash function: because the
base of the exponentiation is always smaller than twice the
prime, the two have no common factors, so the additional factor cannot be
a multiple of the prime.
Thus the additional factor can never result in the same output of ModExp()
and will always lead to a different hash value. The same argument holds
for bit manipulations that make the exponent smaller. As the new and old
outputs of ModExp() differ only by a factor base^{k}, producing
the same remainder would require that the prime divides
the factor base^{k}, which is impossible because the base is
smaller than twice the prime.
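The base^{k} argument can be checked with small numbers (a sketch; the toy prime, base, exponent and k are arbitrary choices):

```python
# Changing the exponent from e to e+k multiplies the output of
# ModExp() by an additional factor base^k (mod prime). As long as
# the prime does not divide the base, that factor is nonzero mod
# the prime, matching the divisibility argument above.
p = 997            # toy prime
base, e, k = 123, 40, 3

old = pow(base, e, p)
new = pow(base, e + k, p)
assert new == old * pow(base, k, p) % p   # outputs differ exactly by base^k
assert pow(base, k, p) != 0               # p does not divide base^k
```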

*Does the hash function scatter its output values properly?*

Giving a satisfying answer to this question is almost impossible, as a significant statistical analysis of the hash function's behaviour regarding the clustering of its output values cannot be performed, due to the sheer number of possible hash values. Although it is questionable whether empirical results obtained with much smaller primes (used to reduce the number of hash values) can be extrapolated to the full 256-bit hash space, the following observations indicate the hash function's ability to scatter its output values:

After performing a series of statistical tests on the hash function with smaller primes and different input material, I will try to draw conclusions from the results, hoping that they shed some light on the quality of the hash function. I would like to present the details of my tests in a separate paper and explain only the most significant results here.

To reduce the number of hash values, small primes (31, 97, 997, 9973 and 32999) were used, and the input data was taken from an entropy file of 1110000 bytes read from /dev/random. Every 10 bytes were hashed and the number of collisions for each hash value was recorded. The data showed that some hash values attracted more collisions than average, but those exceeded the average by a factor of only 3 to 4. Thus the most probable hash value was no more than 4 times more likely to receive a collision than the majority of others. The number of hash values that never got a hit grew with the size of the prime, and consequently the average number of collisions shrank. With the prime 32999 and 111000 random input texts, 1.5 percent of all hash values never appeared, while an average of 3.2 collisions was found on most of the hashes.

The second result emerged from larger input sets. Again using the prime 32999, I found that with over a million inputs of 10 bytes of random data the maximum number of collisions was only twice the average, while with only 111000 inputs the maximum was 4 times the average. One would expect that, as more and more data is fed into the hash function, the difference between the maximum and the average number of collisions will slowly disappear. Using ordinary text files taken from the manual section of Linux as inputs, I obtained almost the same results (for around 150000 inputs) with 80-character text strings as with 10 bytes of random data.
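The experiment can be re-run in a few lines (a sketch under assumptions: os.urandom() stands in for the /dev/random entropy file, a smaller sample count is used for speed, and the modulus 123457 is an arbitrary toy value):

```python
import os
from collections import Counter

TOY_PRIME = 32999   # the largest of the small test primes above
SMALL = 19

def toy_hash(data, modulus):
    # byte-oriented variant of create_hash() with a toy prime
    h = h0 = modulus % TOY_PRIME
    for byte in data:
        last = h + SMALL
        h = (pow(h + h0, byte + SMALL, TOY_PRIME) ^ last) % TOY_PRIME
    return h

N = 20000           # fewer samples than the 111000 used above
counts = Counter(toy_hash(os.urandom(10), 123457) for _ in range(N))

never_hit = TOY_PRIME - len(counts)   # hash values that never appeared
peak = max(counts.values())           # hits on the most-collided hash value
average = N / TOY_PRIME               # expected hits per hash value
```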

These observations are preliminary and should be taken with great caution, but for now I see no sign that the hash function fails to scatter its output values properly.

The fact that, by chance, two different messages (m1 and m2) can hash to the same value when hashed under two different public key moduli cannot be exploited for cheating, as long as the correct public keys are used for verification or decryption. Talking about the hash of a message is meaningless unless you know which key the hash was created for, so in PCP we write hash(message, key) instead of hash(message). There is another advantage of the hash function's dependence on a public key modulus that comes into play when we look at the way in which the secret key is protected for storage on the local computer system (see the relevant chapter for details).

PCP uses two different secret keys: one for signing (a long-term, high-security key) and one for encryption (which can be considerably shorter than the signing key). Even if you use the same passphrase to protect them, the different moduli ensure that each secret key is protected with a different storage key, because hash(passphrase, signingkey) differs from hash(passphrase, encryptionkey).
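This key separation can be sketched directly (hedged: both moduli below are hypothetical placeholders, and pow() stands in for ModExp(); with the full 256-bit prime, a collision between the two derived storage keys is astronomically unlikely):

```python
PRIME = 106603488380168454820927220360012878679207958575989291522270608237193062808643
SMALL = 19

def create_hash(message, modulus):
    # same round structure as the listing above
    h = h0 = modulus % PRIME
    for ch in message:
        last = h + SMALL
        h = (pow(h + h0, ord(ch) + SMALL, PRIME) ^ last) % PRIME
    return h

signing_modulus = 2**255 + 117     # hypothetical long signing-key modulus
encryption_modulus = 2**200 + 33   # hypothetical shorter encryption-key modulus

storage_key_1 = create_hash("same passphrase", signing_modulus)
storage_key_2 = create_hash("same passphrase", encryption_modulus)
assert storage_key_1 != storage_key_2   # different moduli give different storage keys
```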

- [And-93]
Ross Anderson: The Classification of Hash Functions.
In: Codes and Ciphers (Proceedings of the Fourth IMA Conference on Cryptography and Coding, December 1993), published by IMA (1995), pp. 83-93.